# PNS-FCR: Flexible Charge Recycling Dynamic Circuit Technique for Low-Power Microprocessors

Jinhui Wang, Member, IEEE, Na Gong, Member, IEEE, and Eby G. Friedman, Fellow, IEEE

Abstract—Due to the superior speed and area characteristics, dynamic circuits are widely applied in data paths and other time critical components in modern microprocessors. The high switching activity of dynamic circuits, however, consumes significant power. In this paper, a p-type/n-type dynamic circuit selection (PNS) algorithm and a flexible charge recycling (FCR) design methodology are proposed to achieve high power efficiency in data paths. The effects of technology scaling, data path width, design complexity, clock skew, and environmental conditions are discussed. Simulation results show that the power consumption of an arithmetic and logic unit (ALU) with the proposed PNS-FCR can be reduced by up to 60% as compared with a conventional ALU. An 8-bit ALU test circuit has also been manufactured based on a 0.35-μm Global Foundries technology, demonstrating the power and area efficiency of the proposed methodology.

*Index Terms*—Application conditions, charge recycling, low power, n-type dynamic circuit, p-type dynamic circuit, technology scaling.

#### I. Introduction

Over the past four decades, the number of transistors in a chip has grown continuously [1], [2]. With an increasing transistor density, the power consumption of microprocessors has become a major design issue for a wide range of applications, from ultralow power medical sensors to high performance microprocessors in leading servers [3]–[5]. As a fundamental part of modern microprocessors, data paths perform computing operations, typically along the critical path. The operating speed of the data paths usually determines the achievable operating frequency of the entire microprocessor. At the same time, the data path is one of the most active components and consumes a significant

Manuscript received April 6, 2014; revised September 18, 2014, November 7, 2014, and February 24, 2015; accepted March 31, 2015. This work was supported in part by the Plan Program of Beijing Education Science and Technology Committee, Beijing Municipal Commission of Education under Grant JC002999201301, in part by the Ph.D. Programs Foundation through the Ministry of Education, China, under Grant 20121103120018, in part by the National Natural Science Foundation of China under Grant 60976028 and Grant 61204040, in part by the North Dakota Experimental Program to Stimulate Competitive Research under Grant FAR0023939, Grant FAR0024038, and Grant FAR0022051, and in part by the Beijing Municipal Natural Science Foundation under Grant 4152004.

- J. Wang and N. Gong are with the Department of Electrical and Computer Engineering, North Dakota State University, Fargo, ND 58102 USA (e-mail: jinhui.wang.1@ndsu.edu; na.gong@ndsu.edu).
- E. G. Friedman is with the Department of Electrical and Computer Engineering, University of Rochester, Rochester, NY 14627 USA (e-mail: friedman@ece.rochester.edu).

Color versions of one or more of the figures in this paper are available online at http://ieeexplore.ieee.org.

Digital Object Identifier 10.1109/TVLSI.2015.2419255

share of the total power consumption. This situation is further exacerbated for those applications with an intensive computation, such as digital signal microprocessors and multimedia processors with multiple cores [6]. Hence, it is vital to achieve low power data paths in modern microprocessors.

Due to the superior speed and area characteristics, dynamic circuits are widely applied in data paths and other time critical paths [7], [8]. For example, in the 32-nm Intel Itanium microprocessor, code named Poulson, and the 32-nm AMD microprocessor, code named Bulldozer, the on-chip memory and arithmetic and logic unit (ALU) adopt n-type dynamic circuits to minimize latency [1], [7]. However, since the dynamic circuits are usually cascaded to form domino CMOS logic, each stage of dynamic logic requires a static CMOS inverter to ensure that all inputs to each stage are maintained low during the precharge phase [11]. This property makes synthesizing dynamic circuits with Computer Aided Design (CAD) tools more difficult than synthesizing static CMOS circuits. In addition, the varying characteristics of different types of dynamic circuits (n-type and p-type) increase the design complexity of a data path. Unfortunately, the existing solutions are not sufficient to solve these issues.

In this paper, a novel p-type/n-type dynamic circuit selection (PNS) algorithm and a flexible charge recycling (FCR) design methodology are proposed, referred to here as PNS-FCR, which targets low power data paths in modern microprocessors.

The primary contributions of this paper are as follows.

- 1) A novel PNS algorithm is presented to provide charge recycling and explore power saving opportunities for specific applications (Section III-B).
- 2) A design flow to achieve power efficient data paths is presented (Section III).
- 3) An analysis of power efficiency of the PNS-FCR is provided and an analytical model is described for estimating the power savings of PNS-FCR (Section III-D).
- 4) A comprehensive suite of simulations is discussed, evaluating the effects of technology scaling, data path width, design complexity, clock skew, and environmental conditions. These simulations demonstrate that PNS-FCR provides low design complexity, good design flexibility, and significant power savings, while achieving the targeted performance objectives of different applications (Section IV).



- 5) An ALU IC is described based on a 0.35-μm Global Foundries technology, demonstrating the power and area efficiency of PNS-FCR (Sections IV-D and IV-E).

Fig. 1. n-p dynamic circuits.

The rest of this paper is organized as follows. Relevant background and related work are introduced in Section II. The proposed PNS-FCR is presented in Section III. The evaluation results are provided in Section IV. Finally, the conclusion is drawn in Section V.

#### II. BACKGROUND AND RELATED WORK

#### A. n-p Dynamic Circuits

Dynamic circuits can be classified into two categories: 1) n-type and 2) p-type. The n-type dynamic circuits adopt high-speed nMOS transistors to achieve high performance. Alternatively, p-type dynamic circuits use slower pMOS transistors in the evaluation path; they are therefore slower, but more power efficient because pMOS transistors exhibit lower gate and subthreshold leakage currents [17].

The n-p dynamic circuit has been proposed as a race-free dynamic CMOS technique for pipelined circuits [21], [26]. As shown in Fig. 1, an n-p dynamic circuit is constructed of cascaded nMOS and pMOS dynamic logic networks. A clock signal (CLK) and the complement signal (CLKB) control the circuit operation, which is divided into precharge and evaluation phases. The precharge phase starts when the CLK signal switches low. Nodes N1 and N3 are precharged high, while nodes N2 and N4 are discharged low. As CLK rises high, the circuit evaluates nodes N1, N2, N3, and N4 according to the logic functions of the pull-up network or pull-down network. The n-p dynamic circuit has a lower intrinsic delay and requires less silicon area than static CMOS logic because the logic is more compact [27].

#### B. Related Work

Many techniques have been developed to achieve dynamic circuits for data paths. Gopalakrishnan and Katkoori [9] propose a binding algorithm-based framework for low leakage data paths. Although effective for some applications, this multiple threshold technique results in considerable speed loss and is therefore not suitable for high performance applications such as leading servers. A macrodriven data path design methodology has been developed in [10], which generates possible topologies for different macros. In addition, three methodologies for synthesizing dynamic circuits are presented in [11]–[13], but these systems only consider conventional n-type dynamic circuits, failing to include p-type dynamic circuits. In [14] and [15], crosstalk-aware and speed-aware synthesis methodologies are presented, respectively, but neither considers power efficiency. Finally, a dynamic data path is synthesized automatically in [16], but requires a significant silicon area. Moreover, none of these techniques effectively explores the potential for low power offered by combining different types of dynamic circuits.

Fig. 2. Proposed PNS-FCR methodology.

To optimize the power efficiency, the PNS-FCR combines PNS with the charge recycling technique. The charge recycling technique has been used in power-gating circuits and race-free pipelines to reduce power consumption [21], [33], [34]. Charge recycling between p-type and n-type dynamic circuits was previously used in [32]. In this paper, charge recycling is offered as a flexible option for additional power savings within a power efficient data path design methodology. As compared with [32], the charge recycling path is optimized to use two transistors to decrease power and to limit the influence of the threshold voltage, and a design flow with a flexible mechanism is proposed. The simulation and chip-test results show that the proposed PNS-FCR provides significant power savings, low design complexity, and good design flexibility. In addition, the presented design methodology can be extended to multiple logic styles, such as static CMOS, pass gate [26], transmission gate [26], and tristate gate [27].

#### III. PROPOSED AUTOMATED DESIGN METHODOLOGY

The proposed PNS-FCR, exploring power saving opportunities for data path circuits, is presented in this section. The three-step PNS-FCR design methodology is depicted in Fig. 2.

#### A. Gate Library

Since the regular modules of a data path (including the arithmetic unit, logic unit, and bit shift unit) are typically designed with basic gates, building a gate library is the initial step of PNS-FCR. Based on the two types of dynamic circuits, the basic gate library is designed to produce a data path, as shown in Fig. 2.

The gate library includes an AND gate, OR gate, XOR gate, shift gate, inverter, the Carry cell and Sum cell of a full adder, and other basic gates. Note that, to allow gates and modules to be exchanged conveniently, the two types of each gate occupy a similar layout area.

In the gate library, the delay (D) and power (P) of the gates are influenced by the operating conditions and the specific input/output switching. Therefore, to obtain reliable values, the delay and power of each gate are simulated at the Process, Voltage, and Temperature (PVT) corners, with the worst-case input, and with a fan-out of 4. For example, a two-input OR gate is designed with a fan-out of 4 and simulated using SPICE models that include PVT variations. When one input switches from 1 to 0 or from 0 to 1 while the other input remains at 0, the delay is $D_{1-0}$ and $D_{0-1}$, respectively, and the delay (D) is $\max\{D_{1-0}, D_{0-1}\}$. When both inputs simultaneously switch from 1 to 0 or from 0 to 1, the power is $P_{1-0}$ and $P_{0-1}$, respectively, and the power (P) is $\max\{P_{1-0}, P_{0-1}\}$. Moreover, the delay and power of the gates strongly depend on the threshold voltage $(V_{\rm th})$ and supply voltage $(V_{\rm dd})$. For a specific $V_{\rm th}$ and $V_{\rm dd}$, the power and delay of each gate are characterized. The delay (D) and power (P) relationship of the different types of gates can be expressed as

$$D_{\text{Ntype}}(V_{\text{th}}, V_{\text{dd}}) < D_{\text{Ptype}}(V_{\text{th}}, V_{\text{dd}}) \tag{1}$$

$$P_{\text{Ntype}}(V_{\text{th}}, V_{\text{dd}}) > P_{\text{Ptype}}(V_{\text{th}}, V_{\text{dd}}). \tag{2}$$
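The worst-case characterization described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function name `characterize_gate` and all numeric values (delays in ps, power in µW) are hypothetical placeholders standing in for SPICE results, not data from this paper.

```python
def characterize_gate(d_fall, d_rise, p_fall, p_rise):
    """Return the worst-case (delay, power) of one gate.

    d_fall, d_rise: delays for the 1->0 and 0->1 input transitions
                    (D_{1-0} and D_{0-1}).
    p_fall, p_rise: power for the 1->0 and 0->1 input transitions
                    (P_{1-0} and P_{0-1}).
    """
    return max(d_fall, d_rise), max(p_fall, p_rise)


# Hypothetical numbers for a two-input OR gate (not taken from the paper).
n_type = characterize_gate(d_fall=38.0, d_rise=42.0, p_fall=11.0, p_rise=12.5)
p_type = characterize_gate(d_fall=55.0, d_rise=61.0, p_fall=8.0, p_rise=9.0)

# Relations (1) and (2): the n-type gate is faster, the p-type gate uses less power.
assert n_type[0] < p_type[0]
assert n_type[1] > p_type[1]
print(n_type, p_type)
```

Each library entry then carries a single (D, P) pair per gate type, which is the form the selection step consumes.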

Note that each logic gate in the library includes input and output pins, as well as the position information for a potential FCR cell connection, which is discussed in detail in Section IV-B.

In addition to the basic logic gates, an FCR cell (including two transistors, a pin connecting to CLKB, and pins connecting to $N_n$ and $N_p$) is also built into the gate library for charge recycling, as shown in Fig. 2.

#### B. PN Selection Algorithm

Based on the gate library, the appropriate type of gate is selected to implement a data path, as shown in Fig. 2. To satisfy the performance requirements of different applications, a PNS algorithm is introduced based on the multidimensional multiple-choice 0-1 Knapsack problem (MMKP) [9], [18]. The delay of the critical path determines the performance of the data path. The gates in the critical path are required to meet the performance constraint, while the gates in the noncritical paths use a low power version to enhance power efficiency.

The proposed PNS algorithm behaves as follows. Assume n gates are in the critical path. In the gate library, each gate has two types. The delay and power of each type of gate are expressed as, respectively, $D_{ij}(V_{\rm th}, V_{\rm dd})$ and $P_{ij}(V_{\rm th}, V_{\rm dd})$ $(i = 1, 2, \ldots, n,\ j = 1, 2)$, where $D_{ij}(V_{\rm th}, V_{\rm dd}) > 0$ and $P_{ij}(V_{\rm th}, V_{\rm dd}) > 0$. Consider the delay constraint $D_c$ $(D_c > 0)$. The type of each of the n gates (each gate is selected from two types of gates) needs to be determined. The objective is to find a 0/1 matrix $[x_{ij}]$, $x_{ij} \in \{0, 1\}$, that satisfies $\sum \sum D_{ij}(V_{\rm th}, V_{\rm dd})x_{ij} \leq D_c$ and achieves the minimum value of $\sum \sum P_{ij}(V_{\rm th}, V_{\rm dd})x_{ij}$. Accordingly, a PNS can be formulated as an MMKP [18]

$$\min \sum_{i=1}^{n} \sum_{j=1}^{2} P_{ij}(V_{\text{th}}, V_{\text{dd}}) x_{ij} \tag{3}$$

$$\text{s.t. } \sum_{i=1}^{n} \sum_{j=1}^{2} D_{ij}(V_{\text{th}}, V_{\text{dd}}) x_{ij} \le D_c, \quad x_{ij} \in \{0, 1\},\ i \in [1, n],\ j \in [1, 2]. \tag{4}$$

Similar to MMKP, a PNS can be solved using a dynamic programming approach [18], branch and bound approach [19], as well as recent Graphics Processing Unit-based approaches [20].
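As an illustration of the dynamic programming route, the following Python sketch solves a small instance of (3) and (4). It is not the authors' implementation. It assumes that delays are integers (for example, picoseconds) and that exactly one type is chosen per gate, which the text implies but (4) does not state explicitly. The name `pns_select` and all numbers are hypothetical.

```python
def pns_select(delays, powers, d_c):
    """Pick one type per gate to minimize total power with total delay <= d_c.

    delays[i][j], powers[i][j]: delay and power of gate i in type j
    (j = 0 for n-type, j = 1 for p-type). Delays are integers (e.g., ps).
    Returns (total_power, choices), or None if the constraint cannot be met.
    """
    # best maps an accumulated delay to the cheapest (power, choices) reaching it.
    best = {0: (0.0, [])}
    for gate_delays, gate_powers in zip(delays, powers):
        nxt = {}
        for d, (p, picks) in best.items():
            for j in (0, 1):
                nd = d + gate_delays[j]
                if nd > d_c:
                    continue  # would violate the delay constraint in (4)
                cand = (p + gate_powers[j], picks + [j])
                if nd not in nxt or cand[0] < nxt[nd][0]:
                    nxt[nd] = cand
        best = nxt
    if not best:
        return None
    return min(best.values(), key=lambda entry: entry[0])


# Hypothetical three-gate critical path: (n-type, p-type) values per gate.
delays = [(10, 15), (12, 20), (8, 11)]
powers = [(5.0, 3.0), (6.0, 3.5), (4.0, 2.5)]
print(pns_select(delays, powers, d_c=40))
```

For this instance the cheapest feasible assignment uses p-type gates in the first and third positions and an n-type gate in the second. The table is indexed by accumulated delay, so the run time grows with the number of gates times $D_c$, which is the usual pseudo-polynomial behavior of knapsack-style dynamic programs.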

#### C. Flexible Charge Recycling Technique

A key design issue in low power data paths is exploring the choice of different power efficient n-type and p-type gates. Accordingly, the FCR is proposed to achieve high power efficiency, as shown in Fig. 3.

Consider a critical data path with two cascaded gates. The initial gate is an n-type dynamic gate, while the latter gate is a p-type dynamic gate. During the precharge stage (CLK = 0), the dynamic node of the n-type gate $N_n$ is precharged to $V_{\rm dd}$ through transistor $P_{\rm cn}$, while the dynamic node of the p-type gate $N_p$ is discharged to ground through transistor $N_{\rm cp}$. In the evaluation stage, provided that the necessary input combination is applied, $N_n$ is discharged to ground and $N_p$ is charged to $V_{\rm dd}$. Otherwise, the high state of $N_n$ and low state of $N_p$ are maintained until the next precharge stage. When the evaluation process completes with $N_n$ discharged from high to low and $N_p$ charged from low to high, both nodes consume dynamic power in the following precharge stage, since $N_n$ is charged from $V_{\rm dd}$ and $N_p$ is discharged to ground. If a switch is inserted between the two dynamic gates, $N_n$ is charged by $N_p$ through a charge recycling path, thereby reducing the dynamic power.

As an example, consider the zipper dynamic full adder shown in Fig. 3. When the input vector of the full adder is (1, 1, 0), (1, 0, 1), or (0, 1, 1), $N_n$ has been discharged to ground and $N_p$ has been charged to $V_{\rm dd}$ at the end of the evaluation stage, and the switch is turned on. A charge recycling path between $N_n$ and $N_p$ is thereby established. The voltage waveforms of $N_n$ and $N_p$ for full adders without and with the FCR cell are shown in Fig. 4. With the FCR cell, CLKB enables the recycling path in the precharge stage. Consequently, two sources, $V_{\rm dd}$ and $N_p$, charge $N_n$ simultaneously, which makes the precharge much faster than in the conventional circuit, where only the single supply $V_{\rm dd}$ charges $N_n$. As also observed in Fig. 4, the additional capacitance $C_r$ between the dynamic nodes $N_n$ and $N_p$ due to the added charge recycling path has a negligible effect on the evaluation speed. This is because $C_r$ is much smaller than $C_p$ and $C_n$. Accordingly, in the evaluation stage, the voltage waveforms of $N_n$ and $N_p$ without and with the FCR cell almost overlap, as shown in Fig. 4.

Fig. 3. FCR technique.

Fig. 4. Waveforms of CLK and dynamic nodes without and with FCR cell. (a) Waveform of CLK. (b) Waveforms of dynamic nodes without and with FCR cell.

To exploit the FCR, the charge recycling paths can be inserted between two independent gates as well as between two neighboring gates, as shown in Fig. 3. Accordingly, if r p-type gates and q n-type gates are selected for a critical path, $\min(r,q)$ charge recycling paths can be inserted to reduce power. Assuming that there is no charge recycling path in the noncritical paths, the power reduction factor $\eta$ can be expressed as

$$\eta = \frac{\sum_{i=1}^{\min(r,q)} \alpha_i P_i^{\text{re}}(V_{\text{th}}, V_{\text{dd}})}{\sum_{i=1}^n \sum_{j=1}^2 P_{ij}^{\text{cr}}(V_{\text{th}}, V_{\text{dd}}) x_{ij} + \sum_{i=1}^m P_i^{\text{uc}}(V_{\text{th}}, V_{\text{dd}})} \tag{5}$$

where $\alpha_i$ is the power reduction factor of each pair of gates (one n-type and one p-type) connected by a charge recycling path, and $P_i^{\text{re}}(V_{\text{th}}, V_{\text{dd}})$, $P_i^{\text{uc}}(V_{\text{th}}, V_{\text{dd}})$, and $P_{ij}^{\text{cr}}(V_{\text{th}}, V_{\text{dd}})$ are, respectively, the power consumed by each such pair of gates without the charge recycling path, the power consumption of each gate within a noncritical path, and the power consumption of each gate within a critical path, with m denoting the number of gates within the noncritical paths.
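To make the bookkeeping in (5) concrete, the sketch below evaluates $\eta$ for a small, made-up data path. The helper names `max_recycling_paths` and `power_reduction_factor`, the gate ordering, and every number are assumptions chosen for illustration only.

```python
def max_recycling_paths(types):
    """Return min(r, q) for a list of gate types, each 'p' or 'n'."""
    return min(types.count("p"), types.count("n"))


def power_reduction_factor(alpha, p_re, p_cr, p_uc):
    """Evaluate (5).

    alpha[i]: reduction factor of the i-th n-p gate pair with a recycling path.
    p_re[i]:  power of that pair without the recycling path.
    p_cr:     power of each selected gate in the critical path.
    p_uc:     power of each gate in the noncritical paths.
    """
    if len(alpha) != len(p_re):
        raise ValueError("one alpha per recycling path is required")
    saved = sum(a * p for a, p in zip(alpha, p_re))
    total = sum(p_cr) + sum(p_uc)
    return saved / total


# Hypothetical critical path with three n-type and two p-type gates.
types = ["n", "p", "n", "p", "n"]
paths = max_recycling_paths(types)        # two recycling paths are possible
p_cr = [5.0, 5.0, 10.0, 10.0, 10.0]       # selected critical-path gates
p_re = [p_cr[0] + p_cr[1], p_cr[2] + p_cr[3]]  # the two paired n-p gates
eta = power_reduction_factor([0.1, 0.2], p_re, p_cr, p_uc=[10.0])
print(paths, eta)
```

Because the denominator covers every gate in the data path while the numerator covers only the paired gates, $\eta$ is always smaller than the largest per-pair factor $\alpha_i$.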

#### D. Power Efficiency of PNS-FCR

An analytic model is provided for the power reduction factor $\alpha_i$. A pair of dynamic gates (one n-type dynamic gate and one p-type dynamic gate) is taken as an example, as shown in Fig. 3. During a clock cycle (including the evaluation and precharge phases), the total energy $E_t$, dissipated by one n-type dynamic gate and one p-type dynamic gate, is

$$E_t = C_p V_{\rm dd}^2 + C_n V_{\rm dd}^2. \tag{6}$$

Once a charge recycling path is determined, at the end of the evaluation phase, $C_p$ is charged to $V_{\rm dd}$ and $C_n$ is discharged to ground. When the precharge phase arrives, the charge recycling process is enabled. $C_n$ is charged by $C_p$ until the voltages of $N_p$ $(V_p)$ and $N_n$ $(V_n)$ reach $V_p = V_n + V_{\rm th}$, as shown in Fig. 3. The charges on $C_p$ and $C_n$ are, respectively, $Q_p$ and $Q_n$ [21]. Accordingly, the total charge Q is

$$Q = C_p V_{\rm dd} = Q_n + Q_p = C_p V_p + C_n V_n \tag{7}$$

$$Q = C_p V_{\rm dd} = C_p (V_n + V_{\rm th}) + C_n V_n. \tag{8}$$

From (7) and (8),  $V_n$  and  $V_p$  are, respectively

$$V_n = \frac{C_p}{C_n + C_p} (V_{\rm dd} - V_{\rm th}) \tag{9}$$

$$V_p = \frac{C_p}{C_n + C_p} V_{\text{dd}} + \frac{C_n}{C_n + C_p} V_{\text{th}}. \tag{10}$$

Thus, the energy dissipated by the n-type gate  $(E_n^{re})$  and p-type gate  $(E_p^{re})$  within a clock cycle is, respectively

$$E_n^{\text{re}} = \frac{C_n V_{\text{dd}}^2}{2} + \frac{V_{\text{dd}}}{2} \int_{V_n}^{V_{\text{dd}}} C_n dV \tag{11}$$

$$E_p^{\text{re}} = \frac{C_p V_{\text{dd}}^2}{2} + \frac{V_{\text{dd}}}{2} \int_0^{V_p} C_p dV. \tag{12}$$

In addition, when the recycling path operates, the energy dissipated to switch transistors M5 and M6 (shown in Fig. 3) ON and OFF is $E_r = C_r V_{\rm dd}^2$, where $C_r$ is the equivalent capacitance at the gates of M5 and M6. Expressing $C_r$ as a function of $C_p$ gives $C_r = tC_p$, where t is determined by the sizes of M5 and M6. Moreover, because of the charge redistribution effect, the parasitic capacitance of the charge recycling path to the grounded substrate, $C_{\rm pr}$, increases the static energy dissipation. In the Appendix, this energy dissipation is shown to be $E_{\rm pr} = C_{\rm pr} V_{\rm dd}^2$. Assume $C_{\rm pr}$ is also a function of $C_p$, expressed as $C_{\rm pr} = wC_p$. Here, w also depends on the sizes of M5 and M6.

Summing the energy of (11), (12),  $E_r$ , and  $E_{pr}$ , the total energy dissipated ( $E_t^{re}$ ) within a clock cycle when a recycling path is applied is

$$E_t^{\text{re}} = E_n^{\text{re}} + E_p^{\text{re}} + E_r + E_{\text{pr}} \tag{13}$$

where

$$\begin{aligned} E_{t}^{\text{re}} &= \frac{C_{p}^{2} + C_{n}^{2} + C_{p}C_{n}}{C_{p} + C_{n}} V_{\text{dd}}^{2} + \frac{C_{p}C_{n}}{C_{p} + C_{n}} V_{\text{dd}} V_{\text{th}} + C_{r}V_{\text{dd}}^{2} + C_{\text{pr}}V_{\text{dd}}^{2} \\ &= \frac{C_{p}^{2} + C_{n}^{2} + C_{p}C_{n}}{C_{p} + C_{n}} V_{\text{dd}}^{2} + \frac{C_{p}C_{n}}{C_{p} + C_{n}} V_{\text{dd}} V_{\text{th}} + (t + w)C_{p}V_{\text{dd}}^{2}. \end{aligned} \tag{14}$$

Thus, from (6) and (14), the energy reduction factor $\alpha_i$ is provided by

$$\alpha_i = \frac{E_t - E_t^{\text{re}}}{E_t} = \frac{C_p C_n}{(C_p + C_n)^2} \left( \frac{V_{\text{dd}} - V_{\text{th}}}{V_{\text{dd}}} \right) + \frac{(t+w)C_p}{(C_p + C_n)}. \tag{15}$$

Equation (15) shows that the energy reduction is maximized when $C_p = C_n$, and that charge recycling is more effective for a small $V_{\rm th}$. For example, in a 45-nm CMOS technology with $C_p = C_n$, $V_{\rm dd} = 1$ V, and $V_{\rm th} = 0.22$ V [22], $\alpha_i = 19.5\% + 0.5(t + w)$.
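The 19.5% figure in the 45-nm example can be checked numerically from the energy expressions alone. The sketch below evaluates (6) and (9) through (12) for one n-p gate pair and deliberately omits the switch overhead terms $E_r$ and $E_{\rm pr}$ (that is, it assumes $t = w = 0$). The function name `recycling_energy` and the 1-fF capacitance values are assumptions made for illustration.

```python
def recycling_energy(c_p, c_n, v_dd, v_th):
    """Energies of one n-p gate pair over a clock cycle.

    The overhead terms E_r and E_pr are omitted (t = w = 0 is assumed).
    Returns (E_t, E_t_re, alpha).
    """
    # Node voltages when charge sharing stops, from (9) and V_p = V_n + V_th.
    v_n = c_p / (c_n + c_p) * (v_dd - v_th)
    v_p = v_n + v_th
    # Energy without a recycling path, from (6).
    e_t = (c_p + c_n) * v_dd ** 2
    # Energy with a recycling path, from (11) and (12). The integral of a
    # constant capacitance reduces to C * (upper limit - lower limit).
    e_n_re = c_n * v_dd ** 2 / 2 + v_dd / 2 * c_n * (v_dd - v_n)
    e_p_re = c_p * v_dd ** 2 / 2 + v_dd / 2 * c_p * v_p
    e_t_re = e_n_re + e_p_re
    return e_t, e_t_re, (e_t - e_t_re) / e_t


# 45-nm example from the text: C_p = C_n, V_dd = 1 V, V_th = 0.22 V.
# The 1-fF capacitance is a placeholder; only the ratio C_p / C_n matters.
e_t, e_t_re, alpha = recycling_energy(1e-15, 1e-15, 1.0, 0.22)
print(f"alpha = {alpha:.3f}")
```

With equal capacitances the two nodes settle at 0.39 V and 0.61 V, and the result matches the first term of (15). Unequal capacitances lower the factor, which is consistent with the observation that the reduction peaks at $C_p = C_n$.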

Substituting (15) into (5), the power reduction factor $\eta$ is given by (16).

For the case where $C_p = C_n$, (16) takes the form in (17).

Note that the delay penalty of the evaluation stage due to inserting a switch is <2% [21]. The reason is shown in Fig. 4 and discussed in Section III-C.

#### E. Design Flow for a Data Path Based on PNS-FCR

As shown in Fig. 2, the design flow for a data path based on PNS-FCR is as follows:

- 1) First, the gate library based on p-type/n-type dynamic circuits is built. The two types of each gate occupy a similar layout area to avoid an area penalty.
- 2) Based on the gate library, the appropriate type of each gate is selected using PNS to implement the data path or critical path, satisfying the performance requirements of different applications.
- 3) Next, the FCR is utilized to achieve high power efficiency in the critical path by inserting charge recycling paths between two independent gates or two neighboring gates. Note that the FCR is a tradeoff between power, performance, and silicon area.
- 4) Then, the proposed PNS-FCR is applied to the noncritical paths. The critical path is typically much longer than the noncritical paths in the data path, and therefore, the gates in the noncritical paths employ the p-type for power efficiency. However, if a noncritical path formed entirely of p-type gates is slower than the critical path, n-type gates are inserted to meet the delay constraint based on PNS, and the FCR is then used to enhance the power efficiency.
- 5) Finally, the routing is completed manually or by CAD tools.

Note that, with PNS-FCR, the potential implemented design depends on the available gates in the library, and therefore, the designers need to adapt the gate library for target applications. In this paper, the function blocks are designed based on dynamic circuits; the gate library, therefore, includes two kinds of gates: 1) p-type dynamic gates and 2) n-type dynamic gates. The proposed PNS-FCR can be extended to include multiple logic styles. For example, if the gates in the library are designed in different logic styles, such as static CMOS, pass gate [26], transmission gate [26], tristate gate [27], n-type, and p-type dynamic logic, the gate selection is performed among the different logic styles to enhance the power efficiency while meeting the performance requirement, and the FCR is then applied between the selected n-type and p-type dynamic gates to further improve the power efficiency. In such an extended application, the effective transmission and interaction of signals (such as CLK) between different logic styles is a major implementation consideration.

$$\eta = \frac{\left[\frac{C_p C_n}{(C_p + C_n)^2} \left(\frac{V_{\text{dd}} - V_{\text{th}}}{V_{\text{dd}}}\right) + \frac{(t + w)C_p}{(C_p + C_n)}\right] \min(r, q) \sum_{i=1}^{\min(r, q)} P_i^{\text{re}}(V_{\text{th}}, V_{\text{dd}})}{\sum_{i=1}^n \sum_{j=1}^2 P_{ij}^{\text{cr}}(V_{\text{th}}, V_{\text{dd}}) x_{ij} + \sum_{i=1}^m P_i^{\text{uc}}(V_{\text{th}}, V_{\text{dd}})} \tag{16}$$

$$\eta = \frac{\left[0.25\left(1 - \frac{V_{\text{th}}}{V_{\text{dd}}} + 2t + 2w\right)\right] \min(r, q) \sum_{i=1}^{\min(r, q)} P_i^{\text{re}}(V_{\text{th}}, V_{\text{dd}})}{\sum_{i=1}^n \sum_{j=1}^2 P_{ij}^{\text{cr}}(V_{\text{th}}, V_{\text{dd}}) x_{ij} + \sum_{i=1}^m P_i^{\text{uc}}(V_{\text{th}}, V_{\text{dd}})} \tag{17}$$

TABLE I $\alpha_i$ and Delay Penalty in a 32-bit Ripple Carry Adder

| Tech. node | $\alpha_i$ | No. of CR paths $(N_{CR})$ | Size of CR transistors (M5 and M6) | Speed improvement | Area penalty |
|------------|------------|----------------------------|------------------------------------|-------------------|--------------|
| 65 nm      | 11.7%      | 32                         | W/L = 20, L = minimum              | 4.05%             | 4.9%         |
| 45 nm      | 7.3%       | 32                         | W/L = 15, L = minimum              | 3.03%             | 3.4%         |
| 32 nm      | 3.3%       | 32                         | W/L = 10, L = minimum              | 0.43%             | 2.3%         |

#### IV. EXPERIMENT RESULTS

#### A. Sizing Methodology

The size of the dynamic circuits is an important issue with PNS-FCR. Due to the tight delay constraints, the range of sizing is narrow in dynamic circuits [23]. Since the evaluation path is the critical path that determines the access time of a dynamic circuit, transistor sizing requires considerable care due to performance concerns.

- 1) The width of the transistors in the pull-down network (PDN) is determined by the method of logical effort [28], and the output static inverter is skewed to achieve a fast evaluation speed.
- 2) Sizing the footer and keeper requires a careful balance among the application-specific access time, noise margin, and power consumption. Because the width of the footer simultaneously influences the evaluation speed and clock load, it is typically in the range of one to four times. The keeper size is determined by the keeper ratio (K), as described by (18). As K increases, the large contention current generated by the strong keeper improves the noise immunity but increases the access time and power consumption. Therefore, to provide a fast evaluation speed with a reasonable noise margin, the keeper size is restricted to satisfy the condition 0.1 < K < 1 [37].

In the proposed gate library, considering the sensitivity of dynamic circuits to noise, leakage current, and charge sharing, the keeper is sized with K = 0.5. In addition, the footer width is set equal to the width of the transistors in the PDN, where

$$K = \frac{\mu_p(W/L)_{\text{keeper}}}{\mu_n(W/L)_{\text{PDN}}}. \tag{18}$$
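Equation (18) can be inverted to obtain the keeper size for a target keeper ratio. The short sketch below does this for K = 0.5. The helper names, the normalized mobility values (a ratio $\mu_n/\mu_p = 2$), and the PDN W/L of 8 are hypothetical and are not taken from the gate library of this paper.

```python
def keeper_ratio(mu_p, mu_n, wl_keeper, wl_pdn):
    """Keeper ratio K from (18)."""
    return (mu_p * wl_keeper) / (mu_n * wl_pdn)


def keeper_wl_for_ratio(k, mu_p, mu_n, wl_pdn):
    """Invert (18): the keeper W/L that yields a target K."""
    if not 0.1 < k < 1:
        raise ValueError("K is expected to satisfy 0.1 < K < 1")
    return k * mu_n * wl_pdn / mu_p


# Hypothetical, normalized mobilities; only the ratio mu_n / mu_p matters.
mu_p, mu_n = 1.0, 2.0
wl_pdn = 8.0  # assumed W/L of the PDN transistors
wl_keeper = keeper_wl_for_ratio(0.5, mu_p, mu_n, wl_pdn)
print(wl_keeper, keeper_ratio(mu_p, mu_n, wl_keeper, wl_pdn))
```

The range check mirrors the 0.1 < K < 1 restriction stated in the sizing rules, so an out-of-range target is rejected before any size is computed.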

In the following sections, the effectiveness of FCR in reducing power consumption is evaluated on a 32-bit adder. The power efficiency of PNS-FCR is evaluated on ALU benchmarks selected from ISCAS85 [29], [30] and 74X-Series benchmark suites [30]. The effectiveness of PNS-FCR on a 0.35- $\mu$ m CMOS test circuit is also evaluated.

#### B. Verification of FCR

To verify the effectiveness of FCR, a 32-bit ripple carry adder with clock-delay [35], which is usually employed along the critical path in data path, operating at 1 GHz for three different deep submicrometer technologies (65, 45, and 32 nm) has been evaluated. Each full adder with FCR includes one n-type Sum cell, one p-type Carry cell (zipper dynamic full adder), and one FCR cell to enhance power efficiency (Fig. 3). Accordingly, there are 32 charge recycling paths in a 32-bit ripple carry adder.

The simulation results are listed in Table I. According to (15), $\alpha_i$ is influenced by the device size of the charge recycling path. Therefore, the power distribution method [32] is used to find the optimal size of the transistors in the charge recycling path for the most effective FCR. As shown in Table I, with W/L of M5 and M6 in the charge recycling path set to 20, 15, and 10, $\alpha_i$ reaches 11.7%, 7.3%, and 3.3% in the 65-, 45-, and 32-nm technologies, respectively. The effectiveness of FCR is therefore influenced by technology scaling. This behavior occurs because, in a scaled technology, $V_{\rm dd}$ is reduced to maintain dynamic power within acceptable levels [24]. To satisfy performance requirements, $V_{\rm th}$ and the gate oxide thickness $(t_{\rm ox})$ of the transistors are also reduced as $V_{\rm dd}$ is lowered, leading to exponential growth in subthreshold and gate leakage currents [24]. As a result, the leakage power accounts for a larger proportion of the total power consumption, and the effectiveness of FCR, which primarily reduces switching power, degrades with technology scaling. Despite this characteristic, for a 32-nm ripple carry adder, power savings of 3.3% are achieved as compared with conventional dynamic circuits.

As discussed in Section III-C, the FCR introduces a small delay penalty in the evaluation stage, but greatly improves the speed in the precharge stage. Accordingly, the FCR enhances the overall speed of the adders by up to 4.05%, as listed in Table I. In addition, due to the switch insertion, the silicon area increases with the number of charge recycling paths. The silicon area of the adders is taken as the total transistor width of the circuit [36]. For the 32-bit ripple carry adders in the three deep submicrometer technologies, there are 32 charge recycling paths in total and the area overhead is, respectively, 4.9%, 3.4%, and 2.3%. Fig. 5 shows the layout of a 32-bit ripple carry adder based on the 65-nm Semiconductor Manufacturing International Corporation (SMIC) technology. The adder is formed by 32 zipper dynamic full adders, each of which includes a Carry cell, an FCR cell, and a Sum cell (Fig. 3). The FCR cell occupies a small portion of the layout of the entire adder. Note that Metal4 is used specifically for clock routing and is placed at a fixed position in each gate, as shown in Fig. 5. This is because, for a standard digital circuit, all of the gates are designed with the same height and can be placed in parallel to connect the power and ground lines. At the same time, the clock signals can be connected easily by extending the CLK and CLKB wires or by using Metal4 and vias, without additional complex routing.

Fig. 5. Layout of a 32-bit ripple carry adder based on 65-nm SMIC technology.

Note that the n-type dynamic gate and p-type dynamic gate include input and output pins, as well as the position information of the FCR. This position ($N_n$ and $N_p$) is simply the metal used as the dynamic nodes inside both the n-type and p-type dynamic gates, as illustrated from the layout and schematic perspectives in Fig. 6(a) and (b), respectively. Accordingly, even if no FCR cell is added, no floating metal is left in the layout.

Fig. 6. n-type dynamic gate and p-type dynamic gate with/without FCR. (a) Layout. (b) Schematic.

Fig. 7. Voltage waveforms of $N_n$ and $N_p$ considering clock skew. (a) CLK, (b) CLKB, (c) $N_n$, and (d) $N_p$ without and with clock skew.

Based on the above analysis, the FCR is a tradeoff among power consumption, performance, and silicon area, and must meet the design constraints of the target applications. In particular, if power consumption and speed are the top design priorities, FCR is preferred to achieve high power efficiency and high performance; alternatively, if the area constraint is tight and the power budget is large, the area overhead induced by FCR is a major consideration, and therefore, FCR can be skipped in the design flow.

Furthermore, since the operation of an adder strongly depends on CLK and CLKB, clock skew may influence the timing characteristics of the entire circuit. To evaluate the impact of clock skew, a 45-nm zipper dynamic full adder (Fig. 3) is taken as an example with a 1-GHz CLK and the input vectors (1, 1, 0), (1, 0, 1), and (0, 1, 1). The results are shown in Fig. 7. The clock skew increases the delay of the circuit and, in particular, induces a large precharge delay penalty at $N_n$ and $N_p$. This is because the inserted charge recycling paths are controlled by CLKB (Fig. 3). In the precharge stage, CLKB remains at 0 due to the clock skew, which cuts off the charge recycling path and diminishes the speedup benefit offered by the proposed FCR. In the worst case with an extremely large clock skew, there is not enough time for $N_p$ to finish the precharge and evaluation process, resulting in a logic error.

#### C. Application of PNS-FCR to ALU

A set of ALU benchmarks from the ISCAS85 and 74X-Series suites is used to verify the effectiveness and design flow of PNS-FCR. In modern microprocessors,



Fig. 8. 8-bit ALU.

the ALU is typically partitioned into functional modules and control blocks. To simplify the analysis, a standard ALU, as illustrated in Fig. 8, is used. The functional modules perform Boolean logic (such as AND, OR, and XOR) or arithmetic operations (such as Add and Shift), and are highly regular [25]. The delay of an ALU is determined by the critical paths. In an ALU, the adder is usually slower than the other modules, particularly when it is a ripple carry adder. ALUs with ripple carry adders operating at speeds ranging from 100 MHz to 1 GHz with a 3.3-V supply voltage have been developed in a $0.35-\mu m$ Global Foundries technology [31]. The delay $D_{ij}(V_{th}, V_{dd})$ and power $P_{ij}(V_{th}, V_{dd})$ are characterized in the gate library.

As compared with conventional ALUs, the normalized power savings of 4-, 8-, 9-, and 12-bit ALUs with the proposed technique under various conditions are shown in Fig. 9. The maximum operating frequency of the ALU is lower with longer data paths. For example, a 4-bit ALU can



Fig. 9. Normalized power savings of ALUs with PNS-FCR.

achieve 1 GHz. The 8- and 9-bit ALUs operate successfully at up to 500 MHz, and the maximum frequency of a 12-bit ALU is only 300 MHz. As also shown in Fig. 9, the power savings of the ALUs decrease gradually with increasing speed. An interesting observation is that the 4-bit ALU achieves the same power savings at all frequencies between 100 and 300 MHz. This behavior occurs because the p-type dynamic gates achieve the highest power efficiency while still operating at 300 MHz. Therefore, the 4-bit ALUs operating between 100 and 300 MHz all adopt p-type dynamic circuits. Similarly, for 8- and 9-bit ALUs, the circuits exhibit the same power efficiency at frequencies between 100 and 200 MHz.

Another observation illustrated in Fig. 9 is that the effect of the performance requirements on the power savings of a 4-bit ALU is more significant than for other ALU bit widths. This behavior occurs because the fewer gates along the critical path of a 4-bit ALU consume less power, which accounts for a lower proportion of the total power consumption. As shown in Fig. 9, over 41% power savings can be achieved in a 4-bit ALU with PNS-FCR. In particular, for a 100-MHz ALU with different bit widths, PNS-FCR can realize power reductions of up to 60%.

The type of full adder in the critical path of an ALU with PNS-FCR is listed in Table II. Each full adder consists of a Sum cell and a Carry cell. With one n-type and one p-type dynamic full adder in the critical path, two charge recycling paths can be applied to four cells. Considering the zipper dynamic full adder shown in Fig. 3, one charge recycling path can be inserted between the Sum cell and Carry cell. The number of charge recycling paths ( $N_{\rm CR}$ ), determined by the available n-type, p-type, and zipper modules, is

$$N_{\rm CR} = 2 \cdot \min(\text{num}(N), \text{num}(P)) + \text{num}(\text{Zipper}) \tag{19}$$

where $\text{num}(N)$, $\text{num}(P)$, and $\text{num}(\text{Zipper})$ are, respectively, the number of n-type, p-type, and zipper modules. For example, for a 12-bit ALU operating at 200 MHz, the critical path consists of six p-type, one zipper, and five n-type dynamic full adders. $N_{\rm CR}$ is therefore $2 \cdot \min(5, 6) + 1 = 11$.
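As an illustration of (19), the following minimal Python sketch recomputes $N_{\rm CR}$ for a few critical-path compositions copied from Table II. The function name `charge_recycling_paths` and the dictionary layout are assumptions made for this example and are not part of the PNS-FCR flow.

```python
def charge_recycling_paths(num_n: int, num_p: int, num_zipper: int) -> int:
    """Number of charge recycling paths N_CR according to (19).

    Each n-type/p-type full adder pair contributes two paths, and each
    zipper full adder contributes one path.
    """
    if min(num_n, num_p, num_zipper) < 0:
        raise ValueError("module counts must be non-negative")
    return 2 * min(num_n, num_p) + num_zipper


# Full-adder mixes along the critical path, copied from Table II:
# label -> ((n-type, p-type, zipper), N_CR listed in Table II)
examples = {
    "12-bit ALU at 200 MHz": ((5, 6, 1), 11),
    "8-bit ALU at 300 MHz": ((2, 2, 4), 8),
    "9-bit ALU at 300 MHz": ((5, 3, 1), 7),
    "4-bit ALU at 1 GHz": ((4, 0, 0), 0),
}

for label, (counts, listed) in examples.items():
    computed = charge_recycling_paths(*counts)
    print(f"{label}: N_CR = {computed} (Table II lists {listed})")
```

For each of these entries, the computed value agrees with the number in parentheses in Table II.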

As also observed in Table II, $N_{\rm CR}$ depends on the performance requirements of the application. For low frequency applications, all of the ALUs employ p-type dynamic circuits along the critical path. With increasing performance requirements, more n-type dynamic circuits are added to the critical path and more charge recycling paths are inserted to improve power efficiency, as indicated by (19). $N_{\rm CR}$ therefore increases until it reaches a maximum. Beyond this point, $N_{\rm CR}$ decreases with increasing frequency due to fewer available p-type dynamic circuits. When a 4-bit ALU operates at 1 GHz, the modules are all n-type and no charge recycling path is added, as listed in Table II. Note that, for wider ALUs, the design flow of PNS-FCR is the same as for 8- to 12-bit ALUs. The only difference is the width of the critical path. In addition, in wider ALUs, the critical path usually passes through the 32-bit adder, which has been discussed in Section IV-B.

#### D. Overall Effectiveness of PNS-FCR

For a more comprehensive analysis of PNS-FCR, design complexity should be considered. Based on (19) and Table II, $N_{\rm CR}$ for the different ALUs ranges from 0 to 11. A larger $N_{\rm CR}$ indicates more charge recycling paths, leading to greater design complexity and lower design efficiency. Accordingly, the design efficiency of an ALU with PNS-FCR is defined as $1-N_{\rm CR}/12$, which takes 12 levels from 1/12 to 12/12. For example, a 12-bit ALU operating at 200 MHz requires 11 charge recycling paths and therefore exhibits the highest design complexity and the lowest design efficiency $(1-N_{\rm CR}/12=1/12)$. For a 4-bit ALU operating at 1 GHz, no charge recycling paths are employed; the design complexity is therefore lowest and the design efficiency is $1-N_{\rm CR}/12=12/12=1$. A new performance metric is, therefore, introduced to characterize the overall effectiveness factor ($\lambda$) of an ALU with PNS-FCR

$$\lambda = f \cdot \underbrace{\left(1 - \frac{N_{\text{CR}}}{12}\right)}_{\text{design\_efficiency}} \cdot W_{\text{ALU}} \cdot \eta \tag{20}$$

where f,  $W_{ALU}$ , and  $\eta$  are, respectively, the operating frequency, the data width, and the power reduction factor of an ALU.  $\lambda$  for different ALUs are listed in Table III. Note that the 4-bit ALU operating at 1 GHz, an 8-bit ALU operating at 500 MHz, and a 9-bit ALU operating at 500 MHz achieve the highest  $\lambda$ . Since an 8-bit data path is widely applied in practical digital circuits, an 8-bit ALU operating at 500 MHz is taped out and discussed in the following section. The overall effectiveness factor ( $\lambda$ ) does not consider the silicon area and the clock routing, because both silicon area penalty and clock routing overhead due to PNS-FCR are much small and can be negligible, as discussed in Section IV-B.
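The metric in (20) can be sketched in a few lines of Python. This is a hedged illustration only: the function names are hypothetical, and two conventions are assumed rather than stated in the text, namely that $f$ is expressed in MHz and that $\eta$ is a fraction between 0 and 1. The value $\eta = 0.60$ used below is an assumption based on the up-to-60% reduction reported for 100-MHz ALUs.

```python
def design_efficiency(n_cr: int, levels: int = 12) -> float:
    """Design efficiency 1 - N_CR/12 used in (20)."""
    if not 0 <= n_cr < levels:
        raise ValueError("N_CR must lie between 0 and levels - 1")
    return 1.0 - n_cr / levels


def effectiveness_factor(freq_mhz: float, n_cr: int,
                         width_bits: int, eta: float) -> float:
    """Overall effectiveness factor lambda per (20).

    Assumed conventions: freq_mhz is in MHz and eta is the power
    reduction expressed as a fraction (e.g., 0.60 for 60%).
    """
    return freq_mhz * design_efficiency(n_cr) * width_bits * eta


# 4-bit and 8-bit ALUs at 100 MHz: N_CR = 0 (Table II), eta assumed 0.60.
lam_4bit = effectiveness_factor(100, n_cr=0, width_bits=4, eta=0.60)
lam_8bit = effectiveness_factor(100, n_cr=0, width_bits=8, eta=0.60)
print(f"4-bit ALU at 100 MHz: lambda = {lam_4bit:.0f}")
print(f"8-bit ALU at 100 MHz: lambda = {lam_8bit:.0f}")

# Design efficiency of the 12-bit ALU at 200 MHz (N_CR = 11).
print(f"design efficiency at N_CR = 11: {design_efficiency(11):.4f}")
```

Under these assumptions, the two 100-MHz cases evaluate to approximately 240 and 480, which agrees with the corresponding entries of Table III.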

#### E. ALU IC With PNS-FCR

An 8-bit ALU IC operating at 500 MHz has been manufactured in a 0.35- $\mu$ m Global Foundries technology. A microphotograph and the test Printed Circuit Board (PCB)

TABLE II
TYPE OF GATE ALONG THE CRITICAL PATH OF ALUS WITH PNS-FCR

| Frequency | Gate       | 4-bit ALU    | 8-bit ALU    | 9-bit ALU    | 12-bit ALU    |
|-----------|------------|--------------|--------------|--------------|---------------|
| 100 MHz   | Full adder | 4P (0)       | 8P (0)       | 9P (0)       | 12P (0)       |
|           | MUX        | P (0)        | P (0)        | P (0)        | P (0)         |
| 200 MHz   | Full adder | 4P (0)       | 8P (0)       | 8P,1N (2)    | 6P,1Z,5N (11) |
|           | MUX        | P (0)        | P (0)        | N (0)        | N (0)         |
| 300 MHz   | Full adder | 4P (0)       | 2P,4Z,2N (8) | 3P,1Z,5N (7) | 2P,10N (4)    |
|           | MUX        | P (0)        | P (0)        | N (0)        | N (0)         |
| 500 MHz   | Full adder | 2P,1Z,1N (3) | 1P,7N (2)    | 1Z,8N (1)    |               |
|           | MUX        | N (0)        | N (0)        | N (0)        |               |
| 800 MHz   | Full adder | 1P,3N (2)    |              |              |               |
|           | MUX        | N (0)        |              |              |               |
| 1 GHz     | Full adder | 4N (0)       |              |              |               |
|           | MUX        | N (0)        |              |              |               |

P: p-type dynamic gate; N: n-type dynamic gate; Z: zipper dynamic gate; (n): number of charge recycling paths.

TABLE III
OVERALL EFFECTIVENESS FACTOR OF ALUS WITH PNS-FCR

| Frequency | 4-bit ALU | 8-bit ALU | 9-bit ALU | 12-bit ALU |
|-----------|-----------|-----------|-----------|------------|
| 100 MHz   | 240       | 480       | 540       | 720        |
| 200 MHz   | 480       | 960       | 900       | 116        |
| 300 MHz   | 720       | 448       | 664       | 1368       |
| 500 MHz   | 810       | 1800      | 2393      |            |
| 800 MHz   | 1173      |           |           |            |
| 1 GHz     | 1679      |           |           |            |



Fig. 10. (a) Microphotograph of conventional 8-bit ALU (ALU\_N) and ALU based on PNS-FCR (ALU\_PNS-FCR) and (b) Test PCB.

of the conventional ALU based on n-type dynamic circuits (ALU\_N) and an ALU with the proposed PNS-FCR (ALU\_PNS-FCR) are shown in Fig. 10. The characteristics

TABLE IV
CHARACTERISTICS OF THE TEST CHIP

| Technology                           | 0.35-μm Global Foundries |
|--------------------------------------|--------------------------|
| Supply Voltage                       | 3.3 V                    |
| Die Size                             | 1.87 mm × 1.87 mm        |
| Package Type                         | LQFP64                   |
| Clock Frequency                      | 500 MHz                  |
| No. of IO                            | 60                       |
| Power Saving (ALU_PNS-FCR vs. ALU_N) | 31%                      |

of the test chip are listed in Table IV. The ALU\_PNS-FCR occupies an area similar to that of the conventional ALU, and the IOs of the two ALUs are placed at the same positions, thereby providing convenient port-to-port exchange. Due to the effects of the IOs, wires, and other peripheral circuits, the measured effectiveness of PNS-FCR is lower than in simulation. Despite this situation, the measurement results demonstrate that the ALU\_PNS-FCR reduces power consumption by 31% as compared with the ALU\_N, validating the ability of PNS-FCR to save power.

# V. CONCLUSION

A novel methodology is presented in this paper for designing dynamic circuits in the functional units of modern processors. The proposed PNS-FCR methodology achieves high power efficiency while satisfying specific timing constraints. The methodology has been validated on ISCAS85 and 74X-Series benchmark circuits. Simulation results show that the power consumption of 4-, 8-, 9-, and 12-bit ALUs can be reduced by 41% to 60%, depending on the operating frequency, as compared with a conventional ALU.



Fig. 11. Capacitance and voltage changes at the dynamic node of the n-type gate  $N_n$ .

In addition, a comprehensive suite of simulations is performed to evaluate the effects of technology scaling, data path width, design complexity, clock skew, and different application conditions. Finally, an 8-bit ALU IC manufactured in a 0.35-µm Global Foundries technology validates the power and area efficiency of PNS-FCR. This methodology can be extended to static CMOS, pass gate, transmission gate, tristate gate, and other logic families.

# APPENDIX STATIC ENERGY DISSIPATION OF THE CHARGE RECYCLING PATH

As shown in Fig. 3, when the charge recycling path is inserted between two independent gates or two neighboring gates, the parasitic capacitance of the charge recycling path to the grounded substrate, $C_{\rm pr}$, is added between node $N_n$ and ground and increases the static energy dissipation. The capacitance model in Fig. 11 indicates the capacitance and voltage changes at the dynamic node of the n-type gate, $N_n$. The charge redistribution at $N_n$ is as follows. Initially, $N_n$ is stable at $V_{\rm dd}$ and the charge at $N_n$ is $Q_n = C_n V_{\rm dd}$. After the charge recycling path is inserted, the charge on $C_n$ is redistributed between $C_n$ and $C_{\rm pr}$, and the voltage of node $N_n$ becomes $V_m$. Then

$$Q_n = C_n V_{\rm dd} = C_n V_m + C_{\rm pr} V_m \tag{21}$$

$$V_m = \frac{C_n V_{\rm dd}}{C_n + C_{\rm pr}}. \tag{22}$$

Finally, to keep $N_n$ at the original logic 1, $V_{\rm dd}$ recharges $N_n$ with a compensation current through the keeper transistor, which produces the static energy dissipation $E_{\rm pr}$. Thus

$$\begin{aligned}
E_{\rm pr} &= \int_{0}^{\infty} i_{\rm dd}(t) V_{\rm dd}\, dt = V_{\rm dd} \int_{0}^{\infty} (C_n + C_{\rm pr}) \frac{dV}{dt}\, dt \\
&= V_{\rm dd}(C_n + C_{\rm pr}) \int_{V_m}^{V_{\rm dd}} dV = V_{\rm dd}(C_n + C_{\rm pr})(V_{\rm dd} - V_m).
\end{aligned} \tag{23}$$

From (22) and (23),  $E_{pr}$  is derived as

$$E_{\rm pr} = C_{\rm pr} V_{\rm dd}^2. \tag{24}$$
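The step from (22) and (23) to (24) can be checked numerically. The short Python sketch below evaluates $V_m$ from (22) and $E_{\rm pr}$ from (23), and compares the result with the closed form in (24). The capacitance and supply values are hypothetical and chosen only for illustration; the function name is likewise an assumption.

```python
import math


def recycling_path_energy(c_n: float, c_pr: float, v_dd: float):
    """Return (V_m, E_pr) for the capacitance model of Fig. 11.

    c_n  : capacitance of the dynamic node N_n (F)
    c_pr : parasitic capacitance of the charge recycling path (F)
    v_dd : supply voltage (V)
    """
    v_m = c_n * v_dd / (c_n + c_pr)            # (22), charge sharing
    e_pr = v_dd * (c_n + c_pr) * (v_dd - v_m)  # (23), recharge to V_dd
    return v_m, e_pr


# Hypothetical values for illustration only (not from the paper).
c_n, c_pr, v_dd = 10e-15, 1e-15, 3.3

v_m, e_pr = recycling_path_energy(c_n, c_pr, v_dd)
closed_form = c_pr * v_dd ** 2                 # (24)

print(f"V_m  = {v_m:.3f} V")
print(f"E_pr = {e_pr:.4e} J, C_pr * V_dd^2 = {closed_form:.4e} J")
print("match:", math.isclose(e_pr, closed_form, rel_tol=1e-9))
```

The agreement follows algebraically: substituting (22) into (23) gives $V_{\rm dd} - V_m = V_{\rm dd} C_{\rm pr}/(C_n + C_{\rm pr})$, so the factor $(C_n + C_{\rm pr})$ cancels and $E_{\rm pr}$ depends only on $C_{\rm pr}$ and $V_{\rm dd}$.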

#### REFERENCES

- [1] R. Riedlinger et al., "A 32 nm, 3.1 billion transistor, 12 wide issue Itanium processor for mission-critical servers," IEEE J. Solid-State Circuits, vol. 47, no. 1, pp. 177-193, Jan. 2012.
- [2] J. L. Shin et al., "A 40 nm 16-core 128-thread SPARC SoC processor," IEEE J. Solid-State Circuits, vol. 46, no. 1, pp. 131-144, Jan. 2011.
- [3] Kathy-Farrel. (Dec. 11, 2012). Intel Xeon Processor E5-2600/4600 Product Family Technical Overview. [Online]. Available: https://software.intel.com/en-us/articles/intel-xeon-processor-e5-26004600-product-family-technical-overview
- [4] Chrisshore. (Oct. 17, 2013). ARMv7-A—Power to the People. [Online]. Available: http://community.arm.com/docs/DOC-7303
- [5] NXP. (Sep. 3, 2013). Robust Capacitive Touch Switches Survive Harsh Environments. [Online]. Available: http://www.nxp.com/news/pressreleases/2013/09/robust-capacitive-touch-switches-survive-harshenvironments.html
- [6] J. L. Shin et al., "The next-generation 64b SPARC core in a T4 SoC processor," in IEEE Int. Solid-State Circuits Conf. Dig. Tech. Papers, Feb. 2012, pp. 60-62.
- [7] H. McIntyre et al., "Design of the two-core x86-64 AMD 'Bulldozer' module in 32 nm SOI CMOS," IEEE J. Solid-State Circuits, vol. 47, no. 1, pp. 164-176, Jan. 2012.
- [8] M. Golden, S. Arekapudi, and J. Vinh, "40-entry unified out-of-order scheduler and integer execution unit for the AMD Bulldozer x86-64 core," in IEEE Int. Solid-State Circuits Conf. Dig. Tech. Papers, Feb. 2011, pp. 80-82.
- [9] C. Gopalakrishnan and S. Katkoori, "KnapBind: An area-efficient binding algorithm for low-leakage datapaths," in Proc. 21st Int. Conf. Comput. Design, Oct. 2003, pp. 430-435.
- [10] M. Nemani and V. Tiwari, "Macro-driven circuit design methodology for high-performance datapaths," in Proc. ACM/IEEE Design Autom. Conf., Jun. 2003, pp. 661-666.
- [11] K.-W. Kim, T. Kim, C. L. Liu, and S.-M. S. Kang, "Domino logic synthesis based on implication graph," IEEE Trans. Comput.-Aided Design Integr. Circuits Syst., vol. 21, no. 2, pp. 232–240, Feb. 2002.
- [12] G. De Micheli, "Performance-oriented synthesis of large-scale domino CMOS circuits," IEEE Trans. Comput.-Aided Design Integr. Circuits Syst., vol. 6, no. 5, pp. 751-765, Sep. 1987.
- [13] P. Patra, U. Narayanan, and T. Kim, "Phase assignment for synthesis of low-power domino circuits," Electron. Lett., vol. 37, no. 13, pp. 814-816, Jun. 2001.
- [14] Y.-Y. Liu and T. Hwang, "Crosstalk-aware domino-logic synthesis," IEEE Trans. Comput.-Aided Design Integr. Circuits Syst., vol. 26, no. 6, pp. 1115-1161, Jun. 2007.
- [15] T. J. Thorp, G. S. Yee, and C. M. Sechen, "Design and synthesis of dynamic circuits," IEEE Trans. Very Large Scale Integr. (VLSI) Syst., vol. 11, no. 1, pp. 141-149, Feb. 2003.
- [16] A. Chowdhary and R. K. Gupta, "A methodology for synthesis of data path circuits," IEEE Des. Test Comput., vol. 19, no. 6, pp. 90-100, Nov./Dec. 2002.
- [17] Z. Liu and V. Kursun, "PMOS-only sleep switch dual-threshold voltage domino logic in sub-65-nm CMOS technologies," IEEE Trans. Very Large Scale Integr. (VLSI) Syst., vol. 15, no. 12, pp. 1311-1319, Dec. 2007.
- [18] H. Kellerer, U. Pferschy, and D. Pisinger, Knapsack Problems. Berlin, Germany: Springer-Verlag, 2004.
- [19] M. E. Dyer, N. Kayal, and J. Walker, "A branch and bound algorithm for solving the multiple-choice knapsack problem," J. Comput. Appl. Math., vol. 11, no. 2, pp. 231-249, Oct. 1984.
- [20] B. Suri, U. D. Bordoloi, and P. Eles, "A scalable GPU-based approach to accelerate the multiple-choice knapsack problem," in Proc. Design Autom. Test Eur. Conf. Exhibit. (DATE), Mar. 2012, pp. 1126-1129.
- [21] K. Limniotis, Y. Tsiatouhas, T. Haniotakis, and A. Arapoyanni, "A design technique for energy reduction in NORA CMOS logic," IEEE Trans. Circuits Syst. I, Reg. Papers, vol. 53, no. 12, pp. 2647-2655, Dec. 2006.
- [22] V. Kursun and E. G. Friedman, "Sleep switch dual threshold voltage domino logic with reduced standby leakage current," IEEE Trans. Very Large Scale Integr. (VLSI) Syst., vol. 12, no. 5, pp. 485-496, May 2004.
- [23] J. Wang, N. Gong, L. Hou, X. Peng, S. Geng, and W. Wu, "Low power and high performance dynamic CMOS XOR/XNOR gate design," Microelectron. Eng., vol. 88, no. 8, pp. 2781-2784, Aug. 2011.
- [24] N. Gong, B. Guo, J. Lou, and J. Wang, "Analysis and optimization of leakage current characteristics in sub-65 nm dual $V_t$ footed domino circuits," Microelectron. J., vol. 39, no. 9, pp. 1149-1155, Sep. 2008.

- [25] K. Myny, E. van Veenendaal, G. H. Gelinck, J. Genoe, W. Dehaene, and P. Heremans, "An 8-bit, 40-instructions-per-second organic micro-processor on plastic foil," *IEEE J. Solid-State Circuits*, vol. 47, no. 1, pp. 284–291, Jan. 2012.
- [26] E. Salman and E. G. Friedman, High Performance Integrated Circuit Design. New York, NY, USA: McGraw-Hill, 2012.
- [27] M.-D. Ko, C.-W. Sohn, C.-K. Baek, and Y.-H. Jeong, "Study on a scaling length model for tapered tri-gate FinFET based on 3-D simulation and analytical analysis," *IEEE Trans. Electron Devices*, vol. 60, no. 9, pp. 2721–2727, Sep. 2013.
- [28] M. Alioto, G. Palumbo, and M. Pennisi, "Understanding the effect of process variations on the delay of static and domino logic," *IEEE Trans. Very Large Scale Integr. (VLSI) Syst.*, vol. 18, no. 5, pp. 697–710, May 2010.
- [29] M. C. Hansen, H. Yalcin, and J. P. Hayes, "Unveiling the ISCAS-85 benchmarks: A case study in reverse engineering," *IEEE Des. Test Comput.*, vol. 16, no. 3, pp. 72–80, Jul./Sep. 1999.
- [30] J. P. Hayes. (Apr. 6, 2013). ISCAS High-Level Models. [Online]. Available: http://web.eecs.umich.edu/~jhayes/iscas.restore/benchmark.html
- [31] (May 1, 2013). GlobalShuttle. [Online]. Available: http://www.globalfoundries.com/services/globalshuttle
- [32] J. Wang, N. Gong, S. Geng, L. Hou, W. Wu, and L. Dong, "Low power and high performance Zipper domino circuits with charge recycle path," in *Proc. IEEE 9th Int. Conf. Solid-State Integr.-Circuit Technol.*, Oct. 2008, pp. 2172–2175.
- [33] E. Pakbaznia, F. Fallah, and M. Pedram, "Charge recycling in power-gated CMOS circuits," *IEEE Trans. Comput.-Aided Design Integr. Circuits Syst.*, vol. 27, no. 10, pp. 1798–1811, Oct. 2008.
- [34] E. Pakbaznia, F. Fallah, and M. Pedram, "Charge recycling in MTCMOS circuits: Concept and analysis," in *Proc. 43rd ACM/IEEE Design Autom. Conf.*, Jul. 2006, pp. 97–102.
- [35] T. J. Thorp, G. S. Yee, and C. M. Sechen, "Design and Synthesis of Dynamic Circuits," *IEEE Trans. Very Large Scale Integr. (VLSI) Syst.*, vol. 11, no. 1, pp. 141–149, Feb. 2003.
- [36] G. Yang, Z. Wang, and S.-M. Kang, "Low power and high performance circuit techniques for high fan-in dynamic gates," in *Proc. IEEE 5th Int. Symp. Quality Electron. Design*, Mar. 2004, pp. 421–424.
- [37] N. Gong, J. Wang, and R. Sridhar, "Variation aware sleep vector selection in dual V<sub>t</sub> dynamic OR circuits for low leakage register file design," *IEEE Trans. Circuits Syst. I, Reg. Papers*, vol. 61, no. 7, pp. 1970–1983, Jul. 2014.



**Jinhui Wang** (M'13) received the B.E. degree in electrical engineering from Hebei University, Hebei, China, in 2004, and the Ph.D. degree in electrical engineering through a joint USA/China program between the University of Rochester and the Beijing University of Technology, in 2010.

He is currently an Assistant Professor with the Department of Electrical and Computer Engineering, North Dakota State University, Fargo, ND, USA. He has authored over 80 publications and six patents in the emerging semiconductor technologies. His

current research interests include low-power, high-performance, and variation-tolerant integrated circuit design, 3-D IC and EDA methodologies, and thermal issue solution in VLSI.



**Na Gong** (M'13) received the B.E. degree in electrical engineering and the M.E. degree in microelectronics from Hebei University, Hebei, China, and the Ph.D. degree in computer science and engineering from the State University of New York, Buffalo, NY, USA, in 2004, 2007, and 2013, respectively.

She is currently an Assistant Professor with the Department of Electrical and Computer Engineering, North Dakota State University, Fargo, ND, USA. Her current research interests include device-circuit-architecture-application co-design for nanoscale VLSI circuits and systems, power efficient and reliable electronics for mobile computing and high performance computing, and emerging memory technologies in computer systems.



**Eby G. Friedman** (S'78–M'79–SM'90–F'00) received the B.S. degree from Lafayette College, Easton, PA, USA, in 1979, and the M.S. and Ph.D. degrees from the University of California, Irvine, CA, USA, in 1981 and 1989, respectively, all in electrical engineering.

He was with Hughes Aircraft Company, Glendale, CA, USA, from 1979 to 1991, rising to the position of the Manager of the Signal Processing Design and Test Department, responsible for the design and test of high performance digital and analog ICs. He has

been with the Department of Electrical and Computer Engineering, University of Rochester, Rochester, NY, USA, since 1991, where he is currently a Distinguished Professor, and the Director of the High Performance VLSI/IC Design and Analysis Laboratory. He is also a Visiting Professor at the Technion—Israel Institute of Technology. His current research interests include high-performance synchronous digital and mixed-signal microelectronic design, and analysis with application to high speed portable processors and low power wireless communications. He has authored almost 500 papers and book chapters and 13 patents, and is the author or editor of 16 books in the fields of high-speed and low-power CMOS design techniques, 3-D design methodologies, high-speed interconnect, and the theory and application of synchronous clock and power distribution networks.

Dr. Friedman is the Editor-in-Chief of the Microelectronics Journal, a Member of the Editorial Boards of the Journal of Low Power Electronics and the Journal of Low Power Electronics and Applications, and a member of the Technical Program Committee of numerous conferences. He was the Editor-in-Chief and the Chair of the Steering Committee of the IEEE TRANSACTIONS ON VERY LARGE SCALE INTEGRATION (VLSI) SYSTEMS, the Regional Editor of the Journal of Circuits, Systems and Computers, a member of the Editorial Board of the PROCEEDINGS OF THE IEEE, the IEEE TRANSACTIONS ON CIRCUITS AND SYSTEMS II: ANALOG AND DIGITAL SIGNAL PROCESSING, the IEEE JOURNAL ON EMERGING AND SELECTED TOPICS IN CIRCUITS AND SYSTEMS, Analog Integrated Circuits and Signal Processing, and the Journal of Signal Processing Systems, a member of the Circuits and Systems Society Board of Governors, Program and Technical Chair of several IEEE conferences, and a recipient of the IEEE Circuits and Systems 2013 Charles A. Desoer Technical Achievement Award, a University of Rochester Graduate Teaching Award, and a College of Engineering Teaching Excellence Award. He is a Senior Fulbright Fellow.