# NOVEL ADAPTIVE KEEPER LBL TECHNIQUE FOR LOW POWER AND HIGH PERFORMANCE REGISTER FILES

Na Gong<sup>1</sup>, Geng Tang<sup>1</sup>, Jinhui Wang<sup>2</sup> and Ramalingam Sridhar<sup>1</sup> <sup>1</sup>University at Buffalo, State University of New York, Buffalo, NY, USA <sup>2</sup>VLSI and System Lab, Beijing University of Technology, Beijing, China {nagong,rsridhar}@buffalo.edu

# ABSTRACT

This paper develops a novel adaptive keeper local bit line (LBL) technique to achieve low-power, high-performance register file design. To avoid increasing the hardware implementation overhead, the proposed technique employs a clock-combined unit to generate the body voltage of the keeper. We evaluate the effectiveness of the proposed technique in a two-cycle 64-entry  $\times$  32b register file designed for 8GHz operation in a 1V, 32nm high-K metal-gate technology. HSPICE simulation results show that, compared to the traditional register file design, the delay time is reduced by 29% and the power consumption is reduced by 36.1%-46.2%, depending on the number of read ports. Moreover, the proposed technique shows good robustness to noise and process variations.

### I. INTRODUCTION

As a fundamental part of a microprocessor, register files are typically on the critical path, and their access time limits the maximum achievable operating frequency. In addition to access time, the power consumption of register files is an important parameter because of their highly frequent accesses and multiple ports. Therefore, low-power, high-performance register file design is an important issue in modern microprocessors. To speed up operation, wide fan-in domino circuits are used for the local (LBL) and global bit lines (GBL) in register files [4]. However, because of the parallel pull-down network in wide domino circuits, robustness to noise, leakage current, and process variations has become a great concern, especially in the design of the LBL, which usually has a larger fan-in than the GBL. A conventional solution to improve robustness is to upsize the PMOS keeper transistor to obtain a strong keeper. However, a strong keeper increases the bit-line contention current during evaluation, thereby increasing the evaluation delay time and power consumption.

To solve this problem, researchers have proposed several adaptive keeper techniques. In [7], a variable threshold voltage keeper is designed to improve the evaluation speed and robustness and to reduce power consumption. In [8], a forward body bias keeper technique is introduced to enhance the robustness to noise at the cost of a speed penalty. In [9], another adaptive keeper technique is presented to trade off power/speed-efficient operation against robustness to noise and process variation. However, all of these adaptive keeper techniques require additional power supplies, which rely either on integrating DC-DC converters on die or on adding extra pins to connect off-chip power supplies. As a consequence, these techniques are realized with increased layout area and implementation overhead.

In this paper, a novel adaptive keeper LBL technique is proposed to achieve low power and high performance without increasing hardware complexity. We apply it to a two-cycle 64-entry  $\times$  32b register file designed for 8GHz operation in a 1V, 32nm high-K metal-gate technology. Our simulation results show that the proposed technique achieves good characteristics in terms of speed, power, noise immunity, design complexity, and uncertainty parameter of the register file.

# **II. PROPOSED ADAPTIVE KEEPER TECHNIQUE**

In this section, the working principle and implementation scheme of the proposed adaptive keeper technique are analyzed.

# A. Novel Adaptive Keeper LBL Technique

Fig. 1 shows the proposed adaptive keeper LBL technique. It adopts a high- $V_t$  keeper with adaptive body bias voltage ( $V_{b\_keeper}$ ). The principle of the proposed technique is shown in Fig. 2. In the traditional design with a low- $V_t$  keeper, a strong keeper is applied in both the pre-charge (CLK=0) and evaluation (CLK=1) stages, as shown in Fig. 2 (a).

This helps the LBL achieve good robustness, but the evaluation speed is degraded by the large contention current of the keeper. As an alternative, a weak high-V<sub>t</sub> keeper can be used to achieve high evaluation speed at the cost of robustness, as shown in Fig. 2 (b). Our proposed design is based on a high-V<sub>t</sub> keeper with adaptive body bias voltage, and its working principle is shown in Fig. 2 (c). In the pre-charge stage, the PMOS pre-charger is 'ON' and the dynamic node is charged to  $V_{dd}$ . In this phase, the high-V<sub>t</sub> keeper is operated with zero body bias and works as a weak keeper. This does not degrade the robustness during the pre-charge stage, since the dynamic node voltage is maintained by the 'ON' pre-charger. When the clock signal goes high, the LBL enters its evaluation stage. At the beginning of evaluation, if there is a conductive pull-down path, the dynamic node discharges to zero. During this short transition, a weak keeper with zero body bias is applied to reduce the contention current, thereby achieving high speed and low short-circuit power consumption. After the transition finishes, a strong keeper with forward body bias (FBB) voltage (Gndh) is applied to increase the keeper current and thereby improve the robustness to noise. Note that the FBB voltage Gndh must be limited: if the keeper is too strongly forward biased, it produces enough drain-to-body diode current to oppose the drain current, which lowers the voltage of the evaluation node and ends the enhancement of noise immunity [8-9].
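The effect of FBB on keeper strength can be illustrated with the textbook body-effect model, in which forward biasing the source-body junction lowers the threshold voltage magnitude. The following is a minimal sketch under stated assumptions: the body-effect coefficient `gamma`, the surface potential term `two_phi_f`, and the high-V<sub>t</sub> PMOS threshold `VT_HIGH` are illustrative values that are not given in this paper, and the function name is hypothetical.

```python
import math


def pmos_vt_magnitude(vt0, v_body, v_source, gamma=0.2, two_phi_f=0.8):
    """Return |V_t| of a PMOS under body bias (textbook body-effect model).

    gamma and two_phi_f are illustrative assumptions, not values from the paper.
    """
    # Body tied to the source means zero body bias; pulling the body
    # below the source forward-biases the source-body junction of a PMOS.
    v_bs = v_body - v_source
    # Clamp so the square root stays real under large forward bias.
    arg = max(two_phi_f + v_bs, 0.0)
    return vt0 + gamma * (math.sqrt(arg) - math.sqrt(two_phi_f))


VDD = 1.0       # supply voltage used in the paper
GNDH = 0.49     # FBB body level, taken as the reported lower bound of V_b_keeper
VT_HIGH = 0.65  # assumed |V_t| of the high-V_t PMOS keeper (hypothetical)

vt_weak = pmos_vt_magnitude(VT_HIGH, v_body=VDD, v_source=VDD)     # zero body bias
vt_strong = pmos_vt_magnitude(VT_HIGH, v_body=GNDH, v_source=VDD)  # forward body bias

print(f"|Vt| with zero body bias: {vt_weak:.3f} V")
print(f"|Vt| with FBB:            {vt_strong:.3f} V")
```

With zero body bias the keeper retains its high threshold (weak keeper); with the body at Gndh the threshold magnitude drops, so the same transistor conducts more current (strong keeper).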

#### B. Clock-combined FBB generator

To implement the proposed technique without increasing hardware complexity, we propose a clock-combined unit to generate  $V_{b\_keeper}$ , as shown in Fig. 3. The resulting  $V_{b\_keeper}$  waveform is shown in Fig. 2 (d). First, a signal is extracted from the delayed clock stage between  $\Phi LBL$  (clock signal of LBL) and  $\Phi GBL$  (clock signal of GBL) to meet the timing constraint. Next, a PMOS transistor (PMOS adjuster) is used to adjust the amplitude of  $V_{b\_keeper}$  to avoid large drain-to-body diode current. Accordingly, the obtained  $V_{b\_keeper}$  ranges from  $|V_{thp}|$  (the threshold voltage of the adjuster) to V<sub>dd</sub>. This implementation scheme is based on the existing delayed clock stage between LBL and GBL in register files, and it does not need an extra body bias voltage generator in addition to the core supply voltage. Also, a clock-combined unit can be shared among multiple LBLs, so the cost of the adjuster is negligible. Therefore, our implementation scheme achieves a significant improvement in terms of system complexity.



Figure 1: Adaptive keeper LBL based register files



Figure 2: Keepers in different designs



Figure 3: Clock-combined unit based FBB generator

Note that the only difference between our implementation scheme (Fig. 2 (d)) and its working principle (Fig. 2 (c)) is that the  $V_{b\_keeper}$  obtained from the clock-combined unit has the same period as the clock signal, so the FBB voltage is held for a period of time (the pre-charge transition time) after the LBL enters the pre-charge stage. Accordingly, during the pre-charge transition, a low-V<sub>t</sub> strong keeper generates a large pull-up current and charges the dynamic node together with the pre-charger, so the clock-combined implementation scheme further improves the charging speed in the pre-charge stage.
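The timing behavior described above can be sketched as an idealized model in which the keeper body voltage is an inverted, delayed copy of the clock whose low level is clamped to |V<sub>thp</sub>|. This is only an illustration under stated assumptions: the ideal square-wave clock, the 20 ps delay, and the function names are hypothetical, while V<sub>dd</sub> = 1V, the 0.49V lower level, and the 8 GHz clock (125 ps period) follow the values reported in this paper.

```python
VDD = 1.0    # supply voltage (from the paper)
VTHP = 0.49  # |V_thp| of the PMOS adjuster, the reported lower level of V_b_keeper


def clk(t, period):
    """Ideal clock: 1 (evaluation) in the first half period, 0 (pre-charge) after."""
    return 1 if (t % period) < period / 2 else 0


def vb_keeper(t, period, delay):
    """Keeper body voltage: inverted delayed clock, low level clamped to |V_thp|."""
    return VTHP if clk(t - delay, period) == 1 else VDD


def keeper_mode(t, period, delay):
    """'strong' while the body is forward biased, 'weak' at zero body bias."""
    return "strong" if vb_keeper(t, period, delay) < VDD else "weak"


PERIOD_PS = 125.0  # 8 GHz clock
DELAY_PS = 20.0    # hypothetical delay of the tapped clock stage

for t in (5.0, 40.0, 70.0, 100.0):
    phase = "evaluation" if clk(t, PERIOD_PS) else "pre-charge"
    print(f"t = {t:5.1f} ps  {phase:10s}  keeper = {keeper_mode(t, PERIOD_PS, DELAY_PS)}")
```

In this model the keeper is weak at the start of evaluation, strong for the rest of evaluation, still strong during the first `DELAY_PS` of pre-charge (the pre-charge transition time), and weak afterwards.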

# **III. SIMULATION RESULTS AND DISCUSSIONS**

To evaluate the proposed adaptive keeper LBL technique, three 64-entry  $\times$  32b register files are designed with different techniques: a basic register file with low-V<sub>t</sub> keeper, a register file with high-V<sub>t</sub> keeper, and a novel register file with the proposed adaptive keeper. All three register files are designed for 8 GHz operation at 1V and are simulated in a 32 nm high-K/metal-gate technology [10] (V<sub>t</sub> of low-V<sub>t</sub> transistors: V<sub>tnlow</sub> =  $|V_{tplow}|$  = 0.49V; V<sub>t</sub> of high-V<sub>t</sub> transistors: V<sub>tnhigh</sub> = 0.65V; V<sub>DD</sub> = 1V). To compare the speed, power consumption, and robustness to noise and process variations fairly, all transistors in the three register files have the same size.

# A. Speed

Fig. 4 shows the LBL output waveform in the three register files with two read ports. The clock-combined unit works well and generates an effective  $V_{b\_keeper}$  to implement the proposed technique: its amplitude ranges from 0.49V to 1V, and it meets the timing requirement to adaptively adjust the keeper as shown in Fig. 2 (d). Fig. 4 also compares the LBL delay time in the three register files, measured from 50% of the signal swing at the clock signal to 50% of the signal swing at the LBL output. As observed, the proposed adaptive keeper technique shows a significant improvement in delay time. Compared to the basic design with low-V<sub>t</sub> keeper, it achieves a 29% reduction in evaluation delay time. This is because, at the beginning of the evaluation stage, the weak keeper in our design decreases the contention current of the keeper, as shown in Fig. 5. In addition, our design shows less delay during the pre-charge stage than the design with high- $V_t$  keeper. This is due to the large pull-up current generated by the strong keeper in our design at the start of the pre-charge stage. In short, our design has superior speed characteristics in both rise time and fall time.

Figure 4: Comparison of the rise time delay and fall time delay of the output of LBL with clock signal in three register files
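The 50%-to-50% delay definition used in this subsection can be sketched as a post-processing step on sampled waveforms. The code below is an assumption-laden illustration, not the measurement setup used for the reported results: the function names are hypothetical, the crossing time is found by linear interpolation between samples, and the waveforms are made-up data.

```python
def crossing_time(times, values, level):
    """Return the first time the sampled waveform crosses `level`, or None.

    Uses linear interpolation between adjacent samples.
    """
    for i in range(1, len(times)):
        v0, v1 = values[i - 1], values[i]
        if v0 != v1 and (v0 - level) * (v1 - level) <= 0:
            frac = (level - v0) / (v1 - v0)
            return times[i - 1] + frac * (times[i] - times[i - 1])
    return None


def delay_50(times, v_clk, v_out, swing=1.0):
    """Delay from the 50% point of the clock to the 50% point of the output."""
    t_clk = crossing_time(times, v_clk, 0.5 * swing)
    t_out = crossing_time(times, v_out, 0.5 * swing)
    if t_clk is None or t_out is None:
        raise ValueError("waveform does not cross the 50% level")
    return t_out - t_clk


# Made-up samples (time in ps, voltage in V): clock rises, output falls later.
times = [0.0, 10.0, 20.0, 30.0, 40.0]
v_clk = [0.0, 1.0, 1.0, 1.0, 1.0]
v_out = [1.0, 1.0, 1.0, 0.0, 0.0]

print(f"evaluation delay = {delay_50(times, v_clk, v_out):.1f} ps")
```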

#### B. Power Consumption

Fig. 6 shows the power consumption comparison of the different register files. Owing to the lower sub-threshold leakage current of the high- $V_t$  keeper [11], the power consumption of the register file with high- $V_t$  keeper and of our design decreases by up to 51.2% and 40.2%, respectively, compared to the basic design. Also, since more LBLs are used as the number of read ports increases, the power savings increase accordingly.

# C. Noise Immunity

In this paper, the unity noise gain (UNG) metric, which captures the critical input noise strength [12], is used to characterize the noise immunity. UNG is defined as the amplitude of input noise  $V_{noise}$  that causes a noise pulse of equal amplitude at the output node. Accordingly, all the inputs of the LBL are driven by noise pulses with the same duration of 100 ps, and the amplitude is varied to obtain the UNG. Since the noise immunity of the LBL depends strongly on the keeper size, keepers with different size ratios (*K*) are considered:

$$K = \frac{(W/L)_{keeper\_transistor}}{(W/L)_{evaluation\_transistor}}$$

where W and L are the width and length of the transistors, respectively.

When the dynamic node is held in a fixed high state because all pull-down paths are turned 'OFF' in the evaluation stage, the adaptive keeper technique creates a low-V<sub>t</sub> strong keeper to increase the keeper current and improve the noise immunity. Therefore, as shown in Fig. 7, the UNG of our design is larger than that of the high-V<sub>t</sub> design but slightly smaller than that of the basic design. Moreover, *K* determines the holding strength of the keeper. As *K* increases, the noise immunity improves, which agrees with the simulation results in Fig. 7. Note that even when *K*=1, the UNG of our design equals 453 mV, which is sufficient to withstand most noise disturbances in practice.
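The UNG definition used in this subsection amounts to finding the input noise amplitude at which the output noise amplitude equals the input. One way to sketch this search is bisection over the amplitude. The code below is illustrative only: `toy_output_noise` is a hypothetical stand-in for a circuit simulation of the LBL, and its trip point and gain are arbitrary values, not results from this paper.

```python
def find_ung(output_noise, lo, hi, tol=1e-9):
    """Bisection for the input amplitude v at which output_noise(v) == v.

    Requires output_noise(lo) < lo (noise attenuated) and
    output_noise(hi) > hi (noise amplified).
    """
    if output_noise(lo) - lo >= 0 or output_noise(hi) - hi <= 0:
        raise ValueError("bracket must satisfy out(lo) < lo and out(hi) > hi")
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        if output_noise(mid) - mid < 0:
            lo = mid  # still attenuating: UNG lies above mid
        else:
            hi = mid  # amplifying: UNG lies at or below mid
    return 0.5 * (lo + hi)


def toy_output_noise(v_in, v_trip=0.4, gain=5.0, vdd=1.0):
    """Hypothetical output noise amplitude; a real flow would run a simulation."""
    return min(vdd, max(0.0, gain * (v_in - v_trip)))


ung = find_ung(toy_output_noise, lo=0.01, hi=0.9)
print(f"UNG of the toy model = {ung * 1000:.1f} mV")
```

In a real evaluation, each call to the output-noise function would correspond to one transient simulation with 100 ps noise pulses on all LBL inputs, repeated for each keeper ratio *K*.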

#### D. Robustness to variations

As the CMOS process advances continually, scaling has resulted in a significant increase in the variations of the process parameters, including gate length ( $L_{gate}$ ), threshold voltage ( $V_t$ ), and gate oxide thickness ( $t_{ox}$ ) [13]. Also, the environment variations Therefore, it is necessary to evaluate the effectiveness of our proposed technique under process variations. According to the latest International Technology Roadmap for Semiconductors (ITRS) [14],  $L_{gate}$ ,  $V_t$ , and  $t_{ox}$  are assumed to have normal Gaussian statistical distributions with a three sigma (3 $\sigma$ ) fluctuation of 12%, 40%, 5%, and 10%, respectively. 1000 Monte Carlo simulations are performed to analyze the Power-Delay Product (PDP) of the three register files under process variations.
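The Monte Carlo procedure can be sketched as follows: each parameter is drawn from a Gaussian whose standard deviation is one third of its 3σ fluctuation, a PDP value is obtained for every sample, and the average (A), standard deviation (SD), and uncertainty parameter (SD/A) are then computed. This is a minimal sketch under stated assumptions: `toy_pdp` is a hypothetical analytical stand-in for a circuit simulation, and the nominal values and 3σ fractions in the example are placeholders rather than the settings used in this paper.

```python
import random
import statistics


def monte_carlo_pdp(pdp_model, nominal, three_sigma_frac, runs=1000, seed=0):
    """Return (A, SD, SD/A) of the PDP over Gaussian parameter samples."""
    rng = random.Random(seed)
    samples = []
    for _ in range(runs):
        # A 3-sigma fluctuation of x% corresponds to sigma = nominal * x / 3.
        point = {
            name: rng.gauss(value, value * three_sigma_frac[name] / 3.0)
            for name, value in nominal.items()
        }
        samples.append(pdp_model(point))
    average = statistics.mean(samples)
    sd = statistics.stdev(samples)
    return average, sd, sd / average


def toy_pdp(p):
    """Hypothetical PDP model; a real flow would simulate the register file."""
    return p["l_gate"] * p["t_ox"] / (1.0 - p["v_t"])


# Placeholder nominal values and 3-sigma fractions (assumptions for illustration).
nominal = {"l_gate": 32.0, "v_t": 0.49, "t_ox": 1.0}
three_sigma = {"l_gate": 0.10, "v_t": 0.10, "t_ox": 0.05}

a, sd, uncertainty = monte_carlo_pdp(toy_pdp, nominal, three_sigma)
print(f"A = {a:.3f}, SD = {sd:.3f}, SD/A = {uncertainty:.4f}")
```

A smaller SD/A indicates that the PDP is less sensitive to parameter variations, which is the sense in which the uncertainty parameter is used in this paper.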











Figure 7: UNG in three register files with different K

The PDP distribution curves of the three register files under process variation are shown in Fig. 8, and the average value (A), standard deviation (SD), and uncertainty parameter (SD/A) of the PDP of the LBL in the different register files are listed in Table 1. Under process variations, our design reduces the average PDP by up to 42.54% and 61.69% compared to the basic design and the high-V<sub>t</sub> keeper design, respectively. This is close to the PDP reduction at the nominal design corner. As also observed, our proposed technique shows the best robustness to parameter variations: the uncertainty parameter of our design is decreased by up to 53.92% and 81.49% compared to the other two designs. This is because the adaptive keeper technique uses FBB to decrease the drain-to-body voltage; this decreases the drain-to-channel junction depletion width and hence reduces DIBL. Smaller DIBL essentially decreases the sensitivity of V<sub>t</sub> to channel length, thereby improving the robustness to process variation.



Figure 8: PDP distribution under process variations

Table 1 Average value (A), standard deviation (SD), and robustness parameter of PDP in three register files

| PDP under process variation | Basic | High V<sub>t</sub> | Our    | Reduction vs. Basic | Reduction vs. High V<sub>t</sub> |
|-----------------------------|-------|--------------------|--------|---------------------|----------------------------------|
| A                           | 1     | 1.4997             | 0.5746 | 42.54%              | 61.69%                           |
| SD                          | 1     | 3.7341             | 0.2648 | 73.52%              | 92.91%                           |
| SD/A                        | 1     | 2.4900             | 0.4608 | 53.92%              | 81.49%                           |
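The reduction columns of Table 1 follow from the listed values as 1 minus the ratio of our design's value to the reference value. A short check, with the function and variable names chosen here only for illustration:

```python
def reduction_pct(ours, reference):
    """Percentage reduction of `ours` relative to `reference`."""
    return round((1.0 - ours / reference) * 100.0, 2)


# (basic, high-Vt, our design) values as listed in Table 1.
table1 = {
    "A": (1.0, 1.4997, 0.5746),
    "SD": (1.0, 3.7341, 0.2648),
    "SD/A": (1.0, 2.4900, 0.4608),
}

reductions = {
    row: (reduction_pct(ours, basic), reduction_pct(ours, high_vt))
    for row, (basic, high_vt, ours) in table1.items()
}

for row, (vs_basic, vs_high_vt) in reductions.items():
    print(f"{row:5s} vs. basic: {vs_basic:.2f}%   vs. high-Vt: {vs_high_vt:.2f}%")
```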

#### E. Overall Electrical quality

Table 2 Performance comparison of different register files

| Register Files      | Delay time | Power  | PDP    | Noise Immunity | Uncertainty Parameter | OEQ    |
|---------------------|------------|--------|--------|----------------|-----------------------|--------|
| Basic               | Worst      | Worst  | Medium | Best           | Medium                | Worst  |
| Our design          | Best       | Medium | Best   | Medium         | Best                  | Best   |
| High V<sub>t</sub>  | Best       | Best   | Worst  | Worst          | Worst                 | Medium |

To this point, we can see that the different techniques rank differently depending on the design metric, as listed in Table 2. In this subsection, a comprehensive design metric is evaluated to characterize the overall electrical quality (OEQ) of the different register files.

$$OEQ = \frac{\text{Noise Immunity} \times \text{Speed}}{\text{Power Consumption} \times \text{Uncertainty Parameter}}$$

Based on this quality metric, the OEQ value of our design is  $7.5 \times$  and  $13 \times$  that of the basic design and the design with high-V<sub>t</sub> keeper, respectively. Therefore, our design has the highest overall electrical quality.
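The OEQ metric can be evaluated directly from normalized quantities. The sketch below only illustrates the arithmetic of the definition: the function names are hypothetical and the input values are arbitrary placeholders, not the measured data behind the ratios reported above.

```python
def oeq(noise_immunity, speed, power, uncertainty):
    """Overall electrical quality: higher is better."""
    return (noise_immunity * speed) / (power * uncertainty)


def oeq_ratio(design, reference):
    """OEQ of `design` relative to `reference` (both are dicts of metrics)."""
    return oeq(**design) / oeq(**reference)


# Placeholder values normalized to a reference design (assumptions only).
reference = {"noise_immunity": 1.0, "speed": 1.0, "power": 1.0, "uncertainty": 1.0}
candidate = {"noise_immunity": 1.0, "speed": 2.0, "power": 0.5, "uncertainty": 0.5}

print(f"relative OEQ = {oeq_ratio(candidate, reference):.1f}x")
```

Because noise immunity and speed appear in the numerator while power and uncertainty appear in the denominator, a design can offset a small loss in noise immunity with gains in the other three metrics.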

#### **IV. CONCLUSION**

This paper proposes a novel adaptive keeper LBL technique and applies it to a 64-entry  $\times$  32b register file designed for 8 GHz operation in a 1V, 32 nm high-K metal-gate technology. Unlike existing work, the proposed technique employs the clock signal to create the FBB voltage of the keeper, achieving a low-power, high-performance LBL design without the hardware and layout area penalty of a complex FBB generator. Compared to the basic design, our design reduces the evaluation delay time and power consumption by 29% and 40.2%, respectively. Also, under process variations, the proposed technique reduces the PDP of the register files by up to 42.54% and 61.69%, while reducing the uncertainty parameter by up to 53.92% and 81.49%, compared to the basic design and the high-V<sub>t</sub> keeper design, respectively. Additionally, our analysis shows that the proposed design slightly reduces the noise immunity of the LBL. However, even with the smallest keeper (K=1), the UNG of our design equals 453 mV, which is sufficient to withstand most noise disturbances in practice. Finally, the register file with the proposed technique shows the highest overall electrical quality.

# REFERENCES

1. G. S. Ditlow, et al., "A 4R2W Register File for a 2.3 GHz Wire-Speed POWER<sup>™</sup> Processor with Double Pumped Write Operation", International Solid-State Circuits Conference, pp. 256-258, February 2011.
2. S. Rusu, et al., "A 45 nm 8-Core Enterprise Xeon Processor", IEEE Journal of Solid-State Circuits, Vol. 45, No. 1, pp. 7-14, January 2010.
3. B. Stackhouse, et al., "A 65 nm 2-Billion Transistor Quad-Core Itanium Processor", *IEEE Journal of Solid-State Circuits*, Vol. 44, No. 1, pp. 18-31, January 2009.
4. Ataur R. et al., "Bit-Line Organization in Register Files for Low-Power and High-Performance Applications", International Conference on Electrical and Computer Engineering, pp. 505-508, October 2006.
5. P. Gronowski, "Issues in dynamic logic design," in Design of High-Performance Microprocessor Circuits, A. Chandrakasan, W. J. Bowhill, and F. Fox, Eds. Piscataway, NJ: IEEE Press, 2001, ch. 8, pp. 140-157.
6. M. Anders, et al., "Robustness of sub-70 nm dynamic circuits: Analytical techniques and scaling trends," Symposium on VLSI Circuit, pp. 23-24, June 2001.
7. Volkan Kursun, et al., "Domino Logic with Variable Threshold Voltage Keeper", IEEE Transaction on very large scale integration system (VLSI), Vol. 11, No. 6, pp. 1080-1093, December 2003.
8. V. Kursun and E. G. Friedman, "Forward Body Biased Keeper for Enhanced Noise Immunity in Domino Logic Circuits," Proceedings of the IEEE International Symposium on Circuits and Systems, Vol. 2, pp. 917-920, May 2004.
9. Na Gong, et al., "Robustness aware high performance high fan-in domino OR logic design", *Journal of Semiconductors*, Vol. 30, No. 6, pp. 065005-1-065005-4, June 2009.
10. PTM Model. http://www.eas.asu.edu/~ptm
11. Na Gong and R. Sridhar, "Optimization and Predication of Leakage Current Characteristics in Wide Domino OR Gates Under PVT Variation", IEEE international SOC conference, pp. 19-24, September 2010.
12. Hassan Mostafa, "Novel Timing Yield Improvement Circuits for High-Performance Low-Power Wide Fan-In Dynamic OR Gates", *IEEE Transaction on circuits and system-I: Regular paper*, Vol. 58, No. 10, October 2011.
13. ITRS 2010. http://public.itrs.net/
14. Na Gong, et al., "Analysis and optimization of leakage current characteristics in sub-65 nm dual Vt footed domino circuits", *Microelectronics Journal*, Vol. 39, No. 9, pp. 1149-1155, September 2008.