NPCPL : Normal Process Complementary Pass Transistor Logic for Low Latency, High Throughput Designs

High throughput and low latency designs are required in modern high performance systems, especially for signal processing applications. Existing logic families cannot provide both simultaneously. We propose a Normal Process Complementary Pass Transistor Logic (NPCPL) which can be used as a universal logic to provide finest grain pipelining without affecting overall latency or increasing the area. It does not require any special process steps and hence can be realised in a normal process technology, as against the CPL proposed by Yano et al [2], which uses threshold voltage adjustment of selected devices. The design procedure is described for (a) low latency, (b) high throughput and (c) low area requirements. In addition to these advantages, it is envisioned that NPCPL designs can also be used to build ultra-high speed pipelined systems without pipelining latches, viz., wave pipelined digital systems, where the achievable throughput is beyond that permitted by the delay of a pipeline stage.


Introduction
High performance systems and a variety of real-time Digital Signal Processing systems derive their performance from VLSI solutions. Since fast arithmetic units are critical to all such high-performance applications, we focus attention on a logic family that realises adders and multipliers for the range of latency and throughput requirements appropriate for a particular DSP application. Addressing this issue, Yano et al [2] have offered CPL as a high-speed logic family to realise high-performance arithmetic units. Their approach, however, is constrained by the requirement of a specialised process, which in general may not always be easily accessible to a common designer. In this paper we take the cue from CPL and extend the logic family to support a variety of arithmetic and logic units which can be realised in a normal process. One novel feature of our approach is that the proposed logic family can be exploited to support designs with both low latency and high throughput simultaneously. Pipelining can be introduced to the finest grain without any significant area and latency overhead. This is in contrast to the conventional approach to the design of high throughput systems, where latency and area are traded off for high throughput and vice versa.

Complementary Pass Transistor Logic (CPL)
nMOS pass transistor logic offers advantages in all three performance metrics of VLSI, viz., area, speed and power dissipation.
In a pass transistor network, an input is steered through a chain of (n-1) pass transistors under the control of inputs $x_1, \ldots, x_{i+1}, \ldots, x_n$ to perform an n-input function $f(x_1, x_2, \ldots, x_n)$. This, along with low gate capacitances, reduces the delay. However, in pass transistor logic, the degraded voltage level, and hence the reduced noise margin, is prohibitive. Pasternak et al [3] and Jayasumana et al [12] have reported the use of pass transistor logic and attempted to solve the problems of degraded voltage level and noise margin. But neither of these methods is efficient enough to be used extensively in general.
In pass transistor logic, the basic building blocks are nMOS and pMOS transistors as in Figure 1. An nMOS (pMOS) transistor is a four-terminal device with terminals source, drain, gate and bulk. An input In1 at the drain is steered to the source by the input In2 at the gate. In nMOS (pMOS), the source and drain potentials $V_S$ and $V_D$ respectively are related by $V_D \ge V_S$ ($V_D \le V_S$). In nMOS (pMOS), when In1 and In2 are at logic '1' (logic '0'), the logic level of the output, Out, at the source terminal is degraded by the threshold voltage of the body-affected nMOS (pMOS) transistor.
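The level degradation described above can be illustrated with a minimal steady-state sketch. It assumes an idealised nMOS pass transistor whose source settles at the lower of the input level and one threshold below the gate level. The supply of 5 V and the body-affected threshold of 1 V are assumptions, chosen only because they give the 4 V degraded level quoted later in the paper; the function name is hypothetical.

```python
def nmos_pass_output(v_in, v_gate, v_th):
    """Steady-state source voltage of an idealised nMOS pass transistor.

    The source cannot rise above (v_gate - v_th), because the device
    cuts off once its gate-source voltage falls to the threshold.
    """
    return max(0.0, min(v_in, v_gate - v_th))


V_DD = 5.0    # assumed supply voltage
V_THN = 1.0   # assumed body-affected nMOS threshold

# A logic '0' passes undegraded; a logic '1' loses one threshold.
strong_zero = nmos_pass_output(0.0, V_DD, V_THN)
weak_one = nmos_pass_output(V_DD, V_DD, V_THN)

# Two series transistors with full-rail gates lose only one threshold ...
series_full_gate = nmos_pass_output(weak_one, V_DD, V_THN)

# ... but a gate driven by an already degraded '1' loses a second one.
degraded_gate = nmos_pass_output(V_DD, weak_one, V_THN)

print(strong_zero, weak_one, series_full_gate, degraded_gate)
```

The last case shows why the degraded level is restored by an inverter before it is allowed to drive the gate of another pass transistor.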
K. Yano et al [2] describe an excellent method of exploiting the advantages of pass transistors and surmounting the associated problems. Their methodol- CPL offers several advantages. The threshold-modified CPL reported in [2] has delays 2-2.5 times lower than fully complementary CMOS (FCC). The power dissipation is approximately 30% lower than that of FCC. However, CPL requires threshold voltage adjustment of the devices, which is the key to CPL design. Adjusting the threshold voltage of selected devices and maintaining an accurate threshold voltage requires specialised fabrication processes. In general, a common designer does not have easy access to such a process. Hence Fang Lu and H. Samueli [11] contend that CPL may not be useful for general purpose design. They propose a complex solution by way of an adaptively biased pseudo-nMOS logic (APNL).
In the following we demonstrate that CPL can indeed be used under normal process conditions, i.e., without threshold adjustment. We call this logic family NPCPL. Compared to CPL, NPCPL suffers some degradation of performance (in terms of speed and noise margin), yet it outperforms the other logic families. As we will elucidate later, NPCPL is best suited for both low latency and high throughput applications.

3
In CPL, a basic building block is a two-input pass transistor logic block which can be configured as AND/NAND, OR/NOR and XOR/XNOR modules. When such modules are combined to form arbitrary boolean functions, data lines are loaded heavily. While we retain the basic topology of CPL building blocks in NPCPL, we address the issue of loaded data lines. Figure 2 gives a schematic of such a module, and the various configurations are given in Table 1 (1) to ensure equal noise margin for low-to-high and high-to-low transitions. This is the key in NPCPL design. Without this adjustment, the noise margin high, $NM_H$, is severely degraded and may be reduced even to zero.
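How one two-input module yields AND/NAND, OR/NOR and XOR/XNOR purely by rewiring its inputs can be sketched behaviourally. Table 1 is not reproduced here, so the wiring below is an assumption based on the usual multiplexer identities (for example, AND is "pass a when b is 1, else pass b"); the function names are hypothetical and the model ignores voltage levels entirely.

```python
from itertools import product


def pass_pair(sel, when_sel, when_not_sel):
    """Two nMOS pass transistors: one gated by sel, one by its complement."""
    return when_sel if sel else when_not_sel


def cpl_module(kind, a, b):
    """Return the differential pair (f, f_bar) of a two-input module."""
    na, nb = 1 - a, 1 - b
    # Each entry: (data inputs of the true network, data inputs of the
    # complement network), both selected by b.
    wiring = {
        "AND": ((a, b), (na, nb)),    # f = b ? a : b,   f_bar = b ? ~a : ~b
        "OR": ((b, a), (nb, na)),     # f = b ? b : a,   f_bar = b ? ~b : ~a
        "XOR": ((na, a), (a, na)),    # f = b ? ~a : a,  f_bar = b ? a : ~a
    }
    (t_hi, t_lo), (c_hi, c_lo) = wiring[kind]
    return pass_pair(b, t_hi, t_lo), pass_pair(b, c_hi, c_lo)


for a, b in product((0, 1), repeat=2):
    print(a, b, cpl_module("AND", a, b), cpl_module("OR", a, b), cpl_module("XOR", a, b))
```

In this model the same four-transistor topology is reused for every function, which is the property the generic building block relies on.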
In CPL, threshold adjustment was necessary to handle degradation of voltage levels which affected the static power dissipation, speed of operation and noise margin. In the following we show how NPCPL can be used for high throughput and low latency applications while doing away with threshold voltage adjustment of selected devices. We give a design methodology by which NPCPL design can be steered to function optimally to meet the design objectives of
1. Low Latency
2. High Throughput
3. Low Area
We revisit the issues of static power dissipation and noise margin in the context of NPCPL design for low latency and high throughput applications in the subsequent sections. The issues of speed of operation, noise margin, static power dissipation etc. are related to the degraded voltage level for a logic '1' at the output of the nMOS pass transistor block. The proposed design procedure for NPCPL centres around preventing this voltage level from degrading any further beyond $(V_{DD} - V_{Thn})$. The optimisation, however, is process sensitive. For the rest of the paper all optimisation steps discussed are specific to the process we use. In particular, a process with $V_{Thn}(0) < |V_{Thp}(0)|$ is useful for decreasing static power dissipation, as illustrated later. However, our approach is general enough to be exploited in any process.

NPCPL for Low Latency Application
In a pass transistor configuration the delay increases as a quadratic function of the number of series pass transistors [13]. Hence, the logic should be partitioned into smaller blocks and the voltage level restored at intermediate points by inverters. We have experimentally found a critical path consisting of two pass transistors followed by an inverter to be optimal for each building block. This restricts the number of input variables for each block to three, which is sufficient to span a class of arithmetic functions. For example, in arithmetic circuits the basic building block is the full adder, which is a three-input logic function.
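The trade-off between chain length and restoring inverters can be sketched with a first-order Elmore delay model. This is only an illustration, not the simulation-based procedure used in the paper: it assumes identical pass transistors with normalised resistance and node capacitance of 1, and a hypothetical inverter delay of 1.5 in the same units. With that assumed ratio the minimum happens to fall at two transistors per block; a different ratio would move it.

```python
def chain_delay(n, r=1.0, c=1.0):
    """Elmore delay of n identical series pass transistors.

    Node k sees resistance k*r back to the driver, so the total is
    r*c*(1 + 2 + ... + n) = r*c*n*(n+1)/2, i.e. quadratic in n.
    """
    return r * c * n * (n + 1) / 2


def partitioned_delay(total, per_block, t_inv, r=1.0, c=1.0):
    """Delay of `total` pass transistors split into restored blocks.

    Each block is a short chain followed by one level restoring
    inverter with (assumed) delay t_inv.
    """
    blocks, rest = divmod(total, per_block)
    delay = blocks * (chain_delay(per_block, r, c) + t_inv)
    if rest:
        delay += chain_delay(rest, r, c) + t_inv
    return delay


T_INV = 1.5  # hypothetical inverter delay in units of r*c
delays = {k: partitioned_delay(8, k, T_INV) for k in range(1, 9)}
best = min(delays, key=delays.get)

for k, d in delays.items():
    print(f"{k} transistor(s) per block: delay {d}")
print("shortest delay with", best, "per block")
```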
A general pass transistor network consisting of N nMOS transistors (followed by the level restoring inverter) in the critical path is shown in Figure 3.
The voltage level of a logic '1' at the output of the nMOS pass block is given by Cascading many pass transistors in series causes the delay to increase drastically. On the other hand, if we restrict the chain to just one pass transistor, the overhead of the level restoring inverter may offset the advantages anticipated. Through simulation, we determined the optimal number of pass transistors in the series chain to be two. The size of the pass transistors must also be determined through simulation, since increasing the width of the transistors decreases the resistance but increases the capacitance. The performance is layout-specific too.
A degraded voltage level implies higher delay. Because the output voltage reaches $(V_{DD} - V_{Thn})$ only asymptotically, the n-transistor in the output inverter would not be fully turned 'ON' (it may enter the linear region instead of being in saturation). It then has to be provided with more drive, i.e., its width has to be increased compared to that of the p-transistor. Consequently, gate capacitances increase, and hence transistors larger than the minimum feature size (unit size transistor) contribute to delay and power dissipation. The issues of delay and power dissipation are discussed in detail in the following sections. In summary, logic level degradation can be minimised through the design practices enumerated below:
1. Restrict the number of series pass transistors to two,
2. Use properly sized pass transistors,
3. Draw the smallest diffusion lines while forming transistors,
4. Route gate signals in metal and change to polysilicon near the transistors.
Design practice (2) strikes a balance between the internal capacitances to be charged and the resistances of the charging path, thereby optimising the delay. Design practices (3) and (4) help reduce the overall delay.
The effect of parasitics is to degrade the voltage level at the output of the pass block, and hence to increase the time taken to reach the level $(V_{DD} - V_{Thn})$. The above thumb-rules ensure that this effect is minimised.

Static Power Dissipation
Static power dissipation is present in NPCPL designs. Careful design can, however, keep the dissipation within tolerable limits. The logic level '1' at the output of a pass block (feeding the level restoring inverter) is degraded to 4 volts, resulting in a power dissipation of 5 μW per NPCPL building block and 10 μW per full adder in the standby mode. It therefore follows that an NPCPL design with a logic complexity of 10,000 two-input logic gates (consisting of 80,000 transistors) will suffer a static power dissipation of 50 mW, and one with 10,000 full adders (consisting of 280,000 transistors) will dissipate 100 mW. For good NPCPL designs, however, the overall power dissipation is low compared to FCC.
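The scaling in this paragraph is simple multiplication, shown here as a short check. The per-unit figures are taken as microwatts, which is the unit the 50 mW and 100 mW totals imply; the function name is hypothetical.

```python
def static_power_mw(unit_count, per_unit_uw):
    """Total standby power in milliwatts for identical units."""
    return unit_count * per_unit_uw / 1000


P_BLOCK_UW = 5    # per two-input NPCPL building block (from the text)
P_ADDER_UW = 10   # per full adder (from the text)

gates_mw = static_power_mw(10_000, P_BLOCK_UW)
adders_mw = static_power_mw(10_000, P_ADDER_UW)

# Transistor counts quoted in the text imply the per-unit device budget.
transistors_per_gate = 80_000 // 10_000
transistors_per_adder = 280_000 // 10_000

print(gates_mw, adders_mw, transistors_per_gate, transistors_per_adder)
```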

Noise Margin
The noise margin of NPCPL is less than that of FCC by $V_{Thn}$. However, the price paid in terms of noise margin for the choice of a simple process (one that does not require threshold adjustment) is an acceptable and affordable trade-off. The overall degradation in the noise margin is not disastrous. The methodology followed here ensures that the degradation does not lead to malfunctioning. A full adder (in C3TU), which is a complex three-input NPCPL gate, clocked at 500 MHz exhibits a noise margin of 1 V, which is an acceptable figure. The number of series pass transistors is important in the context of noise margin: too many pass transistors in series decrease the noise margin further. With proper design, the noise margin is kept within tolerable limits.

NPCPL for High Throughput Application
High throughput applications derive their performance from pipelining. Pipelining is often restricted because, in a highly pipelined system, the area overhead of the pipelining latches and their associated delays far exceeds the area and the latency. In NPCPL, it is possible to design each pipeline stage to be as elementary as a generic NPCPL building block. Thus NPCPL permits us to exploit fine grain pipelining at no extra overhead of area and latency due to the latches. The overheads of the latches are subsumed in the inverters, as detailed below. In the 1.6μ C3TU process, it is possible, in principle, to have a pipeline stage delay as low as 0.6 ns and an overall pipelined system operating at 800 MHz using two-phase clocking. When maximising the throughput is the main criterion, as in a typical signal processing environment, the logic depth has to be minimum. Hence, two-input NPCPL logic blocks, with just one pass transistor followed by another clocked pass transistor in series feeding the level restoring inverter for the combinational block, are most suitable. As seen from Figure 4, two clocked pass transistors T1, T2 have been introduced between the two-input logic block and the inverter, so that each clocked transistor, along with the level restoring inverter, can serve as a dynamic latch. Thus a conventional two-phase clocking scheme (pp. 207-209 of [1]) can be used to clock the pipelining stages.
Since the combinational pass block generates differential outputs $f$ and $\bar{f}$, no additional inverters are necessary.
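The clocking arrangement and the quoted clock rate can be illustrated with a small behavioural sketch. It assumes that each pipeline stage evaluates during one clock phase, so that a clock period spans two stage delays, and it ignores any non-overlap time between the phases. The function names and the example stage functions are hypothetical. Under these assumptions a 0.6 ns stage delay gives a rate slightly above the 800 MHz quoted.

```python
def two_phase_clock_mhz(stage_delay_ns):
    """Clock rate in MHz if one stage evaluates per phase (assumption)."""
    return 1000.0 / (2.0 * stage_delay_ns)


def run_two_phase(stages, inputs):
    """Behavioural model of stages separated by dynamic latches.

    Even-numbered stages latch on phase 0, odd-numbered on phase 1,
    so neighbouring latches are never transparent together. One new
    input is applied per clock cycle; None marks an empty pipeline slot.
    """
    latches = [None] * len(stages)
    outputs = []
    for x in inputs:
        for phase in (0, 1):
            for i, stage in enumerate(stages):
                if i % 2 == phase:
                    src = x if i == 0 else latches[i - 1]
                    latches[i] = None if src is None else stage(src)
        outputs.append(latches[-1])
    return outputs


clock_mhz = two_phase_clock_mhz(0.6)

# Four hypothetical stages computing ((x + 1) * 2 - 3) ** 2.
example_stages = [lambda v: v + 1, lambda v: v * 2, lambda v: v - 3, lambda v: v * v]
results = run_two_phase(example_stages, [1, 2, 3, 4])

print(round(clock_mhz, 1), results)
```

In this model four stages give a latency of two clock cycles, after which one result emerges every cycle.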
Observe that in this approach the additional overhead of pipelining is that of a single clocked transistor at the end of every pass block. In contrast, fine grain pipelining in other logic families is prone to both area and latency penalties close to 50% of their unpipelined counterparts [14]. Figure 5 gives the schematic of a full adder, which is the basic building block for all types of arithmetic circuits, for high-throughput applications. It has two stages, each having a critical path of one pass transistor followed by a level restoring inverter. Introducing a clocked transistor before the inverter will render each stage into a pipeline stage.
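One possible two-stage decomposition of the full adder, with each stage a single level of pass selection, is sketched below. Figure 5 is not reproduced here, so this particular partition (a propagate signal in the first stage, then sum and carry selected in the second) is an assumption and not necessarily the circuit in the figure; the function names are hypothetical.

```python
from itertools import product


def mux(sel, when_one, when_zero):
    """One level of pass transistors selected by sel and its complement."""
    return when_one if sel else when_zero


def full_adder_two_stage(a, b, c):
    """Return (sum, carry) using two single-level pass stages."""
    # Stage 1: propagate signal p = a XOR b.
    p = mux(b, 1 - a, a)
    # Stage 2: sum = p XOR c; carry is c when a != b, otherwise a.
    # In a pipelined version a and c would be carried through stage 1
    # alongside p so that all three arrive together.
    total = mux(c, 1 - p, p)
    carry = mux(p, c, a)
    return total, carry


for a, b, c in product((0, 1), repeat=3):
    print(a, b, c, full_adder_two_stage(a, b, c))
```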

Static Power Dissipation
In the high throughput NPCPL system, the number of transistors in the series pass chain is two: one for implementing the combinatorial function and the other for clocking. Static power dissipation is comparable to that of the NPCPL design for low latency applications and follows from section 3.1.1.

Figure 5: A Full Adder for High Throughput Application

Noise Margin
Two pass transistors in series provide enough noise margin; this follows directly from the discussion in sections 3.1.1 and 3.1.2.

Optimisation for Area
When the goal is to optimise area, pass transistors are to be used extensively. The functions are implemented using a complex pass network, using either a tabular method or the Karnaugh map, following the methodology of Damu Radhakrishnan et al [13]. At the end of this network the level restoring inverter is added. Given sufficient time, the pass network output reaches $(V_{DD} - V_{Thn})$, which can then be restored to true logic values. This ensures low static power dissipation and adequate noise margin, as illustrated before.
Generally NPCPL employs two pass blocks to generate differential outputs. However, for low area NPCPL designs, it is advisable to use a single pass block to generate the complemented (true) output and invert it by the level-restoring inverter to get the true (complemented) output. Additional inverters are required for obtaining the complemented (true) outputs. A full adder which optimises area using the methodology stated above is shown in Figure 6. Note that it accepts differential inputs and generates differential outputs while using only one pass block for each of the sum and carry parts. The extra overhead of routing dual-rail signals is amortised by the low area requirement of a single pass block.
It takes an area of 63 × 57 μ², which is approximately 40% of that occupied by a high performance full adder in FCC and approximately 50% of that occupied by a low area full adder in FCC; yet it outperforms both FCC full adder circuits. It consumes 0.9 mW of power at 100 MHz, which is less than 50% of the power consumed by the FCC counterpart. Its static power dissipation is also very low: 1.7 μW. Note that the static power dissipation here is determined statistically, viz., depending upon the probability of SUM and CARRY being '1'.

Figure 6: A Full Adder for Low Area and Low Power Dissipation

The simple yet effective method of voltage level restoration induces minimum delay and makes NPCPL faster than any other nMOS pass transistor logic implementation (excepting, of course, CPL). Use of minimum sized transistors as pass elements decreases gate capacitance. All these contribute to the higher speed of operation of NPCPL.

Low Latency and High Throughput Operation: NPCPL can be employed to exploit low latency and high throughput simultaneously. Pipelining can be introduced to the finest grain without any significant increase in overall latency. This, along with the low delay of the logic blocks, makes NPCPL ideal for both low latency and high throughput applications simultaneously, as against any other logic where a high throughput is beset with high latency.

This is a serious constraint and often prohibits the use of wave pipelining in general. However, NPCPL, by virtue of its symmetric structure, is a promising logic family to be exploited for wave pipelining.

Area: NPCPL for low area applications offers a significant area advantage, as mentioned in section 3.3. In this case also the delay is less than that of FCC circuits, as may be seen from Table 2.
The NPCPL designs for high-speed operation offer moderate advantages in terms of area. The area advantage is not as drastic as the delay advantage in this case because of the extra interconnections required in dual-rail logic. A full adder implemented in NPCPL takes an area of 92 × 90 μ² while the same in FCC takes 100 × 100 μ².
Because of the dual-rail logic, NPCPL has balanced delay among different propagation paths. This is essential for achieving high throughput, and a desirable feature for a combinatorial design. NPCPL can implement the XOR function very efficiently, and hence is ideal for arithmetic circuits.

Sea-of-Gates Approach: NPCPL can be easily realised in sea-of-gates technology. Using the same generic NPCPL building block, different logic functions are easily realised by proper configuration of inputs. Hence, in a sea-of-gates approach, generic NPCPL building blocks can form the basic cells arrayed on a master die.

The Adaptively Biased Pseudo-nMOS Logic (APNL) [11] is good for non-pipelined cases, but it is not useful for pipelining. The DSL uses short channel nMOS devices and hence requires specialised process technology [11]; its speed derives partly from this factor. The pass transistor logic in NPCPL provides speedup from the inherent structure of logic realisation while keeping the area low. It is absolutely free from the charge sharing problem, which is a potential hazard in all dynamic logics. Its dynamic power dissipation is also lower than those of the others. The methodology proposed above allows NPCPL to be used under normal process conditions, thereby eliminating the specialised process requirements of CPL. As seen from the discussion in section 3.1.1, the static power dissipation in NPCPL is not a serious concern. Because of the low gate capacitances and low voltage swing (0 to $(V_{DD} - V_{Thn})$) at the internal nodes, NPCPL has a lower dynamic power dissipation than that of FCC. The contribution of static power dissipation to the overall power dissipation is well amortised by the low dynamic power dissipation figure of NPCPL.
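The effect of the reduced internal swing on dynamic power can be estimated with a first-order sketch. It assumes a node of capacitance C charged from the supply through an nMOS pass device up to the swing level, so that the energy drawn from the supply per cycle is C times the supply voltage times the swing. The 1 pF capacitance, 5 V supply, 4 V swing and 100 MHz rate are illustrative assumptions, and the function name is hypothetical. The sketch isolates the swing effect only; the lower gate capacitance mentioned above would reduce the figure further.

```python
def dynamic_power_mw(c_pf, v_dd, v_swing, f_mhz, activity=1.0):
    """First-order dynamic power in mW: activity * C * V_DD * V_swing * f."""
    watts = activity * (c_pf * 1e-12) * v_dd * v_swing * (f_mhz * 1e6)
    return watts * 1e3


V_DD = 5.0      # assumed supply voltage
V_SWING = 4.0   # assumed internal swing, one threshold below the supply
C_PF = 1.0      # hypothetical switched capacitance
F_MHZ = 100.0   # hypothetical switching rate

full_swing_mw = dynamic_power_mw(C_PF, V_DD, V_DD, F_MHZ)
reduced_swing_mw = dynamic_power_mw(C_PF, V_DD, V_SWING, F_MHZ)
ratio = reduced_swing_mw / full_swing_mw

print(round(full_swing_mw, 3), round(reduced_swing_mw, 3), round(ratio, 3))
```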
The only disadvantage of NPCPL is its relatively low noise margin as compared to static CMOS. However, it is a general observation that all the high-speed logic families have noise margins less than that of FCC, and NPCPL is no exception. Hence, an acceptable degradation in noise margin is a reasonable trade-off for speed of operation and elegant pipelining methods. We summarise the area, power and speed of full adders in NPCPL and FCC in the 1.6μ C3TU process in Table 2.

Conclusions and Further Research Directions
In this paper we have presented a Normal Process Complementary Pass Transistor Logic (NPCPL) for low latency and high throughput applications. We have shown that NPCPL offers the best speed of operation, comparable to that of CPL. It permits pipelining to the finest grain with negligible overhead of area and latency, as opposed to other logic families where an increase in pipelining throughput is encumbered with heavy area and latency penalties. With NPCPL, it is possible to exploit both latency and throughput simultaneously to the maximum realisable extent. Because of its modularity and higher logic functionality, NPCPL bears the potential for a sea-of-gates realisation. As a further research direction we are working towards exploiting NPCPL design for wave pipelining.