## Clock-biased local bit line for high performance register files

## N. Gong, J. Wang, S. Jiang and R. Sridhar

A clock-biased local bit line (CB-LBL) scheme is presented to achieve high access speed and energy-efficient operation for register files. Simulation results in 32 nm high-k/metal gate CMOS predictive technology show a 49.5% read latency reduction and 39.4% power savings, with less than 5% impact on noise immunity and less than 1% area overhead. Additionally, CB-LBL achieves 99.8% parametric yield, compared to 47.6% for the conventional LBL design, while remaining scalable.

*Introduction:* In modern microprocessors, the register file read stage is usually on the critical path, and its read latency limits the maximum achievable operating frequency [1-4]. Accordingly, wide fan-in dynamic circuits are used for the local bit lines (LBLs) and global bit lines (GBLs) of register files to speed up the read operation. However, LBL design faces two main challenges: (1) the bit line structure accounts for up to 70% of the dynamic power consumption of register files, so the power efficiency of the LBL has become a major concern; (2) aggressive CMOS technology scaling, together with increasing process variations, has adversely affected LBL yield. Since the PMOS keeper can usually be redesigned with low overhead, many researchers have focused on LBL design with effective keeper techniques. However, state-of-the-art LBL schemes do not solve these issues [1-4]. In this Letter, a clock-biased LBL (CB-LBL) scheme is proposed for high performance register files. The novelty of our contribution is that the proposed CB-LBL provides a significant read performance boost, power savings and parametric yield improvement, while maintaining similar noise immunity and low design complexity.

*Proposed CB-LBL:* The bit line topology of a 64-entry × 32b register file, shown in Fig. 1*a*, consists of 8-way LBLs; each LBL supports a single-ended read with eight cells and is followed by a 2-way merge and a 4-way GBL. Unlike a conventional LBL design with a standard threshold voltage ($V_{th}$) PMOS keeper, CB-LBL employs a high-$V_{th}$ PMOS keeper in low-power (LP) technology whose body bias voltage ($V_{bk}$) has the same period as the LBL clock signal (CLK\_LBL). In the precharge stage (CLK\_LBL = 0), the LP PMOS pre-charger is 'ON' and charges the dynamic node of the LBL to $V_{dd}$. When CLK\_LBL goes high, the LBL enters the evaluation stage. At the start of evaluation, $V_{bk}$ is kept at $V_{dd}$ and the LP keeper acts as a weak keeper, reducing the contention current. Therefore, if there is a conductive pull-down path, the dynamic node discharges to ground faster and with lower short-circuit current. After this switching process is completed, $V_{bk}$ is reduced to $V_{bk\_min}$, and the resulting forward body bias (FBB) turns the keeper into a strong keeper that improves robustness to noise and leakage current.
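The three operating phases described above can be summarised in a small behavioural sketch. This is only an illustration of the sequencing, not a circuit model: the function name `lbl_phase` and the phase labels are hypothetical, and the 0.56 V value used in the loop is the 32 nm $V_{bk\_min}$ reported in the Results.

```python
def lbl_phase(clk_lbl: int, vbk: float, vdd: float = 1.0) -> str:
    """Classify the CB-LBL operating phase (hypothetical helper, behavioural only)."""
    if clk_lbl == 0:
        # Precharge: the LP pre-charger is ON and pulls the dynamic node to Vdd.
        return "precharge"
    if vbk >= vdd:
        # Start of evaluation: keeper body held at Vdd, so the high-Vth keeper is weak.
        return "evaluate, weak keeper"
    # Later in evaluation: Vbk below Vdd forward-biases the keeper body (FBB).
    return "evaluate, strong keeper (FBB)"


# Walk through one cycle: precharge, early evaluation, late evaluation.
for clk, vbk in [(0, 1.0), (1, 1.0), (1, 0.56)]:
    print(clk, vbk, lbl_phase(clk, vbk))
```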



Fig. 1 Schematic of CB-LBL technique (Fig. 1a), BBG (Fig. 1b)

Our body bias generator (BBG) of $V_{bk}$ uses a 'clock-delay-unit-combined' scheme, as shown in Fig. 1*b*. First, an inverted signal $V_{delayed}$ is extracted from the clock delay unit between CLK\_LBL and CLK\_GBL, allowing a fast switching process with a weak keeper. Next, an LP PMOS transistor (adjuster) is connected to raise the minimum value of $V_{delayed}$ from ground to $V_{bk\_min}$, providing noise tolerance with a strong keeper once the switching process is completed. Note that $V_{bk\_min}$ cannot be too low; otherwise the keeper becomes strongly forward biased, which degrades the noise immunity [5]. Since $V_{bk\_min}$ depends on the size and $V_{th}$ of the adjuster, the adjuster must be sized carefully to obtain the optimal $V_{bk\_min}$. Our BBG scheme only inserts a PMOS transistor in the existing delayed clock stage between LBL and GBL, without the need for an extra body bias voltage generator. As a result, our BBG scheme significantly reduces system complexity compared to previous body biasing techniques [3, 4].
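As a rough illustration of the resulting $V_{bk}$ waveform, the sketch below models $V_{bk}$ as an idealised, inverted and delayed copy of CLK\_LBL whose low level is clamped at $V_{bk\_min}$. Several things here are assumptions rather than values from this Letter: the function name `vbk_ideal`, the 50% duty cycle, the 20 ps delay and the instantaneous edges. The 125 ps period corresponds to the 8 GHz clock and the 0.56 V floor to the 32 nm design point given in the Results.

```python
def vbk_ideal(t_ps, period_ps=125.0, delay_ps=20.0, vdd=1.0, vbk_min=0.56):
    """Idealised Vbk at time t_ps (hypothetical model with instantaneous edges).

    CLK_LBL is assumed low (precharge) in the first half of each period and
    high (evaluation) in the second half.
    """
    # Position within the period of the delayed copy of CLK_LBL.
    phase = (t_ps - delay_ps) % period_ps
    delayed_clk_high = phase >= period_ps / 2
    # Inversion: delayed clock high gives the low level of Vbk, which the
    # adjuster holds at vbk_min instead of ground.
    return vbk_min if delayed_clk_high else vdd


# CLK_LBL rises at 62.5 ps; Vbk stays at Vdd until the delay has elapsed.
for t in (30.0, 70.0, 100.0):
    print(t, vbk_ideal(t))
```

In this model the keeper sees $V_{bk} = V_{dd}$ for the first `delay_ps` of evaluation (weak keeper) and $V_{bk\_min}$ afterwards (strong keeper), which is the behaviour the BBG is meant to produce.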

*Results:* To validate the effectiveness of the proposed CB-LBL scheme, two 2-read, 1-write ported 64-entry × 32b register files (without error correction code bits), one with the conventional LBL scheme and one with the CB-LBL scheme, were designed in 32 nm high-k/metal gate CMOS predictive technology for 8 GHz operation at 1 V and 110°C. The parasitic RC data are based on scaling/extrapolation from a TSMC 180 nm process and the *ITRS* Roadmap [6].

Fig. 2*a* compares the LBL output waveforms of the two register files and shows that our BBG scheme works well: $V_{bk}$ ranges from 0.56 V ($V_{bk\_min}$) to 1 V and results in a faster switching process. Compared to the conventional design, CB-LBL achieves a 49.5% reduction in read access latency (low-to-high). Also, owing to its lower sub-threshold leakage current and smaller short-circuit current, CB-LBL decreases the power consumption of the register file from 3.91 to 2.37 mW, a 39.4% power reduction. Note that since more LBLs are used as the number of read ports increases, the power efficiency of the CB-LBL scheme is more pronounced for register files with more read ports.
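The reported percentages follow directly from the figures quoted above. The short check below recomputes them; the variable names are ours, and the normalised values are the ones that appear again in the comparison with prior art.

```python
p_conventional_mw = 3.91   # register file power, conventional LBL
p_cb_lbl_mw = 2.37         # register file power, CB-LBL

power_saving = (p_conventional_mw - p_cb_lbl_mw) / p_conventional_mw
normalised_power = p_cb_lbl_mw / p_conventional_mw
normalised_latency = 1.0 - 0.495   # 49.5% latency reduction

print(f"power saving:       {power_saving:.1%}")        # 39.4%
print(f"normalised power:   {normalised_power:.1%}")    # 60.6%
print(f"normalised latency: {normalised_latency:.1%}")  # 50.5%
```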



Fig. 2 Comparison of LBL output waveform (Fig. 2a), statistical result under process variations (Fig. 2b)

Based on a rigorous Monte Carlo statistical approach, we simulated 5000 chips considering random variations in gate length, $V_{th}$ and gate oxide thickness with $3\sigma = 12$, 40 and 5% [6], respectively. As shown in Fig. 2*b*, CB-LBL results in $\sim 4\times$ higher process variation immunity ($\mu/\sigma$). For a delay constraint of $D_{max} = 1.1 \times D_{norm}$ and a power constraint of $P_{max} = 1.05 \times P_{norm}$, CB-LBL improves the parametric yield from 47.6 to 99.8%. This is because CB-LBL uses FBB to decrease the drain-to-body voltage and the drain-induced barrier lowering (DIBL) effect, thereby reducing the sensitivity of $V_{th}$ to process variations. Also, we used the unity noise gain (UNG) metric in [3] to characterise the noise immunity. A noise pulse with 50 ps duration (which is larger than the gate delay) is applied to all inputs of the LBL and its amplitude is varied. The UNG values obtained for the basic design and our design are 0.66 and 0.63 V, respectively. Accordingly, CB-LBL achieves similar (within 5%) noise immunity to the conventional LBL design.
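The parametric yield criterion above can be expressed as a simple pass/fail count over Monte Carlo samples. The sketch below is a minimal illustration under loud assumptions: the function names are hypothetical, and the delay and power samples are synthetic, independent Gaussians with made-up spreads, not the circuit simulation data behind Fig. 2*b*. Only the constraint factors (1.1 and 1.05) and the sample count (5000) come from the text.

```python
import random


def parametric_yield(samples, d_norm, p_norm, d_factor=1.1, p_factor=1.05):
    """Fraction of (delay, power) samples with D <= D_max and P <= P_max."""
    d_max = d_factor * d_norm
    p_max = p_factor * p_norm
    passing = sum(1 for d, p in samples if d <= d_max and p <= p_max)
    return passing / len(samples)


def synthetic_chips(n, d_norm, p_norm, sigma_rel, seed=0):
    """Synthetic stand-in for simulated chips (assumption: independent Gaussians)."""
    rng = random.Random(seed)
    return [(rng.gauss(d_norm, sigma_rel * d_norm),
             rng.gauss(p_norm, sigma_rel * p_norm)) for _ in range(n)]


# Two illustrative populations: a wide spread and one four times tighter.
wide = synthetic_chips(5000, 1.0, 1.0, sigma_rel=0.08)
tight = synthetic_chips(5000, 1.0, 1.0, sigma_rel=0.02)
print("wide spread yield: ", parametric_yield(wide, 1.0, 1.0))
print("tight spread yield:", parametric_yield(tight, 1.0, 1.0))
```

The point of the toy populations is qualitative: with fixed $D_{max}$ and $P_{max}$, a tighter distribution (higher $\mu/\sigma$) puts more samples inside both constraints.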

Fig. 3 shows the layout design of the two LBL schemes based on conservative MOSIS deep sub-micrometre design rules [7]. As shown, our BBG scheme does not introduce any area penalty (Fig. 3*c*). Compared to the basic LBL design (Fig. 3*a*), the area of CB-LBL (Fig. 3*b*) is increased from $688 \times 1792 \text{ nm}^2$ to $1264 \times 1184 \text{ nm}^2$, resulting in $\sim 21\%$ area overhead. According to CACTI 5.3 [8], the LBL accounts for about 1.37% of the total register file area for a 32 nm 2-read, 1-write ported 64-entry × 32b register file. Thus, the area overhead of CB-LBL is less than 1% of the whole register file. In addition, we explored the design space for CB-LBL in 22 and 16 nm high-k/metal gate technologies. As observed in Table 1, CB-LBL achieves a similar improvement in power efficiency and access speed in each technology. Also, CB-LBL consistently outperforms the conventional design in terms of process variation immunity and enhances the parametric yield effectively, with similar robustness to noise.
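The area figures can be checked with a few lines of arithmetic. The sketch assumes that every LBL grows by the same ratio and that the rest of the register file is unchanged; the variable names are ours.

```python
basic_nm2 = 688 * 1792      # conventional LBL layout area
cb_lbl_nm2 = 1264 * 1184    # CB-LBL layout area

lbl_overhead = cb_lbl_nm2 / basic_nm2 - 1.0   # relative growth of one LBL
lbl_share = 0.0137                            # LBL share of register file area (1.37%)
file_overhead = lbl_overhead * lbl_share      # growth of the whole register file

print(f"LBL area overhead:           {lbl_overhead:.1%}")   # 21.4%
print(f"register file area overhead: {file_overhead:.2%}")  # 0.29%
```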



Fig. 3 Layout design of conventional LBL (Fig. 3a), CB-LBL (Fig. 3b) and our BBG scheme of  $V_{bk}$  (Fig. 3c)

Table 1: Design space of our LBL scheme in scaled technologies

| Process | Design | V<sub>dd</sub> (V) | Latency | Power | UNG (V) | Area (nm<sup>2</sup>) | Area penalty<sup>1</sup> | Yield (%) | Size of adjuster | V<sub>bk</sub> (V) |
|---------|--------|--------------------|---------|-------|---------|-----------------------|--------------------------|-----------|------------------|--------------------|
| 22 nm | Basic<sup>2</sup> | 1 | 1 | 1 | 0.63 | 473 × 1232 | - | 32.3 | - | - |
| 22 nm | Our<sup>3</sup> | 1 | 59% | 61% | 0.61 | 869 × 814 | ≤1% | 93.6 | 44/22 | 0.53 |
| 16 nm | Basic | 0.8 | 1 | 1 | 0.55 | 344 × 896 | - | 26.8 | - | - |
| 16 nm | Our | 0.8 | 57% | 64% | 0.53 | 632 × 592 | ≤1% | 85.1 | 32/16 | 0.42 |

<sup>1</sup> Area overhead of whole register files

<sup>2</sup> Basic LBL

<sup>3</sup> Our proposed CB-LBL
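To make the scaling trend in Table 1 easier to read, the sketch below converts the normalised entries into reductions and ratios. The numbers are transcribed from the table; the dictionary layout and the helper `summarise` are hypothetical, and latency and power are taken as normalised to the basic design (= 1).

```python
# Values transcribed from Table 1 (basic design first, CB-LBL second in each pair).
TABLE1 = {
    "22 nm": {"latency": 0.59, "power": 0.61, "ung_v": (0.63, 0.61),
              "area_nm": ((473, 1232), (869, 814)), "yield_pct": (32.3, 93.6)},
    "16 nm": {"latency": 0.57, "power": 0.64, "ung_v": (0.55, 0.53),
              "area_nm": ((344, 896), (632, 592)), "yield_pct": (26.8, 85.1)},
}


def summarise(row):
    """Derive relative changes of CB-LBL against the basic design."""
    (bw, bh), (ow, oh) = row["area_nm"]
    ung_basic, ung_ours = row["ung_v"]
    y_basic, y_ours = row["yield_pct"]
    return {
        "latency_reduction": 1.0 - row["latency"],
        "power_saving": 1.0 - row["power"],
        "ung_drop": (ung_basic - ung_ours) / ung_basic,
        "lbl_area_ratio": (ow * oh) / (bw * bh),
        "yield_gain_points": y_ours - y_basic,
    }


for node, row in TABLE1.items():
    print(node, {k: round(v, 3) for k, v in summarise(row).items()})
```

In both nodes the UNG drop stays below 5% and the LBL area ratio is about 1.21, in line with the 32 nm results.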

*Conclusion:* In this Letter, we propose a novel CB-LBL design for low power and high performance register files. Table 2 compares the performance of CB-LBL with the state of the art. Our design exhibits the highest read access performance, except for [4], which, however, is realised with a considerable power penalty ($\sim 4.2\times$). CB-LBL also demonstrates the best power efficiency and process variation immunity and the lowest hardware complexity, with good noise immunity and scalability.

Table 2: Comparison with prior art

|                                | ISSCC'10[1]             | ISSCC'11[2]              | TVLSI'10[3]               | TVLSI'11[4]             | This work                 |
|--------------------------------|-------------------------|--------------------------|---------------------------|-------------------------|---------------------------|
| LBL<br>schemes                 | Programmable<br>keeper  | Conditional keeper       | Coupled<br>keeper         | Rate sensing<br>keeper  | Clock<br>biased<br>keeper |
| Size and<br>ports              | 64 entries<br>×32b,1R1W | 144 entries<br>×78b,2R1W | -                         | 32 entries<br>×8b, 2R1W | 64 entries<br>×32b,2R1W   |
| Latency <sup>2</sup>           | 76%                     | -                        | 80%                       | 53%                     | 50.5%                     |
| Power <sup>2</sup>             | -                       | -                        | 80%                       | 4.2 ×                   | 60.6%                     |
| Noise<br>immunity <sup>2</sup> | -                       | -                        | Same (by resizing)        | 1.6 ×                   | Similar<br>(≤5%)          |
| Additional<br>hardware         | Delay<br>chain          | Delay<br>chain           | Sensor, bias<br>circuitry | Bias<br>circuitry       | PMOS<br>adjuster          |
| Area/gate                      | -                       | -                        | >1.15 ×<sup>1</sup>       | 1.70 ×                  | 1.21 ×                    |
| Process<br>immunity            | -                       | -                        | ~1.42 ×                   | -                       | ~4 ×                      |
| Scalability                    | _                       | -                        | _                         | -                       | Good                      |
| Technology                     | 32nm                    | 45nm                     | 90nm                      | 130nm                   | 32nm                      |

<sup>1</sup> This value (1.15×) does not include area of sensor and bias circuitry

<sup>2</sup> Normalised value of different LBL designs to conventional design

© The Institution of Engineering and Technology 2012

10 March 2012

doi: 10.1049/el.2012.0039

One or more of the Figures in this Letter are available in colour online.

N. Gong, S. Jiang and R. Sridhar (*University at Buffalo, State University of New York, Buffalo, NY, USA*)

E-mail: rsridhar@buffalo.edu

J. Wang (VLSI and System Lab, Beijing University of Technology, Beijing, People's Republic of China)

## References

- 1 Agarwal, A., *et al.*: 'A 320mV-to-1.2V on-die fine-grained reconfigurable fabric for DSP/media accelerators in 32nm CMOS'. Proc. ISSCC 2010, San Francisco, CA, USA, 2010, pp. 328–330
- 2 Ditlow, G.S., *et al.*: 'A 4R2W register file for a 2.3GHz wire-speed POWER™ processor with double-pumped write operation'. Proc. ISSCC 2011, San Francisco, CA, USA, 2011, pp. 256–258
- 3 Dadgour, H.F., and Banerjee, K.: 'A novel variation-tolerant keeper architecture for high-performance low-power wide fan-in dynamic OR gates', *IEEE Trans. Very Large Scale Integr. (VLSI) Syst.*, 2010, **18**, (11), pp. 1567–1577
- 4 Jeyasingh, R., *et al.*: 'Adaptive keeper design for dynamic logic circuits using rate sensing technique', *IEEE Trans. Very Large Scale Integr. (VLSI) Syst.*, 2011, **19**, (2), pp. 295–304
- 5 Luo, H., *et al.*: 'Bulk-compensated technique and its application to subthreshold ICs', *Electron. Lett.*, 2010, **46**, (16), pp. 1105–1106
- 6 ITRS 2009/2010. http://public.itrs.net/
- 7 MOSIS deep design rules. http://www.mosis.com/
- 8 HP Laboratories: 'Cacti 5.1: A Tool to Model Large Caches', 2008