A Robust Energy/Area-Efficient Forwarded-Clock Receiver With All-Digital Clock and Data Recovery in 28-nm CMOS for High-Density Interconnects
IEEE Transactions on Very Large Scale Integration (VLSI) Systems

About

Authors
Shuai Chen, Hao Li, Patrick Yin Chiang
Year
2015
DOI
10.1109/tvlsi.2015.2409987
Subject
Hardware and Architecture / Electrical and Electronic Engineering / Software

Text

This article has been accepted for inclusion in a future issue of this journal. Content is final as presented, with the exception of pagination.

IEEE TRANSACTIONS ON VERY LARGE SCALE INTEGRATION (VLSI) SYSTEMS

A Robust Energy/Area-Efficient Forwarded-Clock Receiver With All-Digital Clock and Data Recovery in 28-nm CMOS for High-Density Interconnects

Shuai Chen, Member, IEEE, Hao Li, Student Member, IEEE, and Patrick Yin Chiang, Member, IEEE

Abstract— This paper presents a robust energy/area-efficient receiver fabricated in a 28-nm CMOS process. The receiver consists of eight data lanes plus one forwarded-clock lane supporting the HyperTransport standard for high-density chip-to-chip links. The proposed all-digital clock and data recovery (ADCDR) circuit, which is well suited to today’s CMOS process scaling, enables the receiver to achieve low power and area consumption. After lock-in, the ADCDR can enter open-loop operation to save power and avoid clock dithering. Moreover, to compensate for the open loop, a phase tracking procedure is proposed that enables the ADCDR to track the phase drift caused by voltage and temperature variations. Furthermore, the all-digital delay-locked loop circuit integrated in the ADCDR can generate accurate multiphase clocks with the proposed calibrated locking algorithm in the presence of process variations. The precise multiphase clocks are essential for half-rate sampling and Alexander-type phase detection. Measurement results show that the receiver can operate at a data rate of 6.4 Gbit/s with a bit error rate <10⁻¹², consuming 7.5 mW per lane (1.2 pJ/bit) under a 0.85-V power supply. With the ADCDR’s phase tracking, the receiver performs better in jitter tolerance and achieves a 500-kHz bandwidth, which is high enough to track the phase drift. The receiver core occupies an area of 0.02 mm² per lane.

Index Terms— All-digital clock and data recovery (ADCDR), delay-locked loop (DLL), forwarded-clock (FC) receiver, high-density interconnect, jitter tolerance, multicore processor, process variation, voltage and temperature drift.

I. INTRODUCTION

HIGH-DENSITY forwarded-clock (FC) links have been widely used in multicore processor interfaces for chip-to-chip interconnects, such as the Quick-Path Interconnect (QPI), HyperTransport, and DDR standards [1]–[3]. To meet the demand for massive data throughput in high-performance computing systems, the bandwidth of these links has been increasing aggressively in recent years, as plotted in Fig. 1(a). For example, the latest 64-lane QPI interface published in [4] provides an aggregate bandwidth of up to 1 Tbit/s for enterprise processors. Moreover, according to the ITRS roadmap [5], the data rate per I/O pin will continue to rise by ∼1.2× every year. However, the ever-growing data rate contributes to excessive I/O power consumption, which increases the cost of thermal dissipation and can even damage the chip. Therefore, improving power efficiency, alongside jitter performance, has become the major concern of SerDes design.

Manuscript received November 30, 2014; revised January 19, 2015; accepted February 17, 2015. This work was supported in part by the National Science and Technology Major Project of China under Grant 2009ZX01028-002-003 and in part by the National Natural Science Foundation of China under Grant 61221062.

S. Chen is with the Institute of Computing Technology, Chinese Academy of Sciences, Beijing 100190, China (e-mail: shuaichen.1986@gmail.com).

H. Li is with the School of Electrical Engineering and Computer Science, Oregon State University, Corvallis, OR 97331 USA (e-mail: freman231@gmail.com).

P. Y. Chiang is with the School of Electrical Engineering and Computer Science, Oregon State University, Corvallis, OR 97331 USA, and also with the State Key Laboratory of ASIC, Fudan University, Shanghai 200032, China (e-mail: pchiang@eecs.oregonstate.edu).

Color versions of one or more of the figures in this paper are available online at http://ieeexplore.ieee.org.

Digital Object Identifier 10.1109/TVLSI.2015.2409987

Fig. 1. FC links. (a) Data-rate-per-pin trend. (b) Overview.

1063-8210 © 2015 IEEE. Personal use is permitted, but republication/redistribution requires IEEE permission.

See http://www.ieee.org/publications_standards/publications/rights/index.html for more information.

Fig. 2. Proposed 8 + 1-lane FC receiver.

An overview of the FC link is shown in Fig. 1(b). A dedicated clock is delivered from the transmitter to the receiver and shared by multiple data lanes. Although the additional clock lane consumes extra pins, area, and power, all of these can be amortized among many data lanes. In addition, this source-synchronous architecture has the advantages of correlated jitter tracking and low-complexity implementation, which benefit jitter tolerance and power saving [6]–[9]. The clock and data recovery (CDR) circuit in the FC receiver aligns the received clock with the data for error-free sampling, and it has become the most critical and power/area-hungry component in the receiver.
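The amortization argument for the shared clock lane can be illustrated with a short Python sketch. The function name `clock_overhead_per_data_lane` and the parameter `clock_lane_cost` are hypothetical, and the default assumption that the clock lane costs as much as one data lane (in pins, area, or power) is ours for illustration, not a figure from the measurements.

```python
def clock_overhead_per_data_lane(n_data_lanes: int,
                                 clock_lane_cost: float = 1.0) -> float:
    """Share of one forwarded-clock lane's cost carried by each data lane.

    clock_lane_cost is expressed in units of one data lane's cost
    (pins, area, or power). The default of 1.0 is an illustrative
    assumption, not a measured value.
    """
    if n_data_lanes < 1:
        raise ValueError("need at least one data lane")
    return clock_lane_cost / n_data_lanes


if __name__ == "__main__":
    # The overhead shrinks as more data lanes share the single clock lane.
    for lanes in (1, 2, 4, 8, 16):
        overhead = clock_overhead_per_data_lane(lanes)
        print(f"{lanes:2d} data lanes -> {overhead:.1%} overhead per data lane")
```

Under that equal-cost assumption, an 8 + 1-lane arrangement carries an overhead of 1/8, or 12.5%, per data lane.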

Half-rate and quarter-rate clocking architectures are widely adopted in FC receivers because they mitigate jitter amplification in lossy channels. In such receivers, multiphase clocks are necessary for the subrate bang-bang (BB) (Alexander-type) CDR and for data sampling [10]. For example, in a half-rate BB CDR, a pair of orthogonal clocks is required to detect the phase relationship between the received data and the clock. A phase-locked loop (PLL) or a delay-locked loop (DLL) can be used to generate the required clocks. Although PLLs perform better in high-frequency jitter filtering, DLLs are often preferred because of their ease of implementation, good stability, and small area/power penalties [11]–[13].
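The decision rule of a BB (Alexander-type) phase detector can be sketched in a few lines of Python. This is a behavioral illustration of the textbook rule, not the circuit used in this receiver. It assumes that one clock of the orthogonal pair produces the data samples and the other produces the edge samples taken midway between them. The names `Vote`, `alexander_pd`, and `net_vote`, and the sign convention (negative means the clock is early), are hypothetical choices.

```python
from enum import Enum
from typing import Sequence


class Vote(Enum):
    EARLY = -1  # edge clock fired before the data transition
    HOLD = 0    # no data transition, so no phase information
    LATE = 1    # edge clock fired after the data transition


def alexander_pd(d_prev: int, edge: int, d_next: int) -> Vote:
    """Classify the clock phase from two data samples and the edge sample between them."""
    if d_prev == d_next:
        return Vote.HOLD
    # With a transition present, the edge sample matches the older bit
    # when the clock is early and the newer bit when the clock is late.
    return Vote.EARLY if edge == d_prev else Vote.LATE


def net_vote(data: Sequence[int], edges: Sequence[int]) -> int:
    """Sum the votes over a run of samples; edges[k] lies between data[k] and data[k + 1]."""
    if len(data) != len(edges) + 1:
        raise ValueError("expected exactly one edge sample between each pair of data samples")
    return sum(alexander_pd(data[k], edges[k], data[k + 1]).value
               for k in range(len(edges)))


if __name__ == "__main__":
    print(alexander_pd(0, 0, 1))  # edge still shows the old bit
    print(alexander_pd(0, 1, 1))  # edge already shows the new bit
    print(net_vote([0, 1, 1, 0], [0, 1, 1]))
```

A digital loop filter would typically accumulate such votes and nudge the sampling phase in the opposite direction. How the accumulated vote is mapped to a phase step is a design choice left outside this sketch.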