# LOW POWER ASYNCHRONOUS DOMINO LOGIC PIPELINE DESIGN STRATEGY BASED ON CRITICAL DATAPATH

Mrs.C. Aarthi, M. Karthika

Head of the Department, Dept of ECE, M.E Student

Sengunthar Engineering College, Tiruchengode

#### **ABSTRACT**

In today's world pipelining is a key element of high-performance design. Distributed synchronization is one of the key strengths and one of the major difficulties of synchronous pipelining. It automatically provides elasticity and on-demand power consumption. For that, this project presents a survey on high-throughput and ultra low-power asynchronous pipeline design method targeting to latch-free and extremely fine-grain design. Since they are asynchronous, these pipelines avoid problems related to high-speed clock distribution, such as clock power, clock skew, and rigidity in handling varied environments. The survey is mainly done on the data path logic. The data path may be single-rail, dual-rail or combination of the both logic. Asynchronous pipeline based on constructed critical data-path is combination of both the data path. Critical path compose of dual- rail logic and noncritical enables singlerail logic. Based on this critical data path, the handshake circuits are simplified, which offers the pipeline low power consumption as well as high throughput by reducing the overhead problems. This design is going to be implemented by TANNER EDA simulations model.

Intex items: Asynchronous pieline, Critical datapath, Domino logic

## I. INTRODUCTION

High performance energy efficient logic style is one of the famous research topics in the field of VLSI circuits because of the continuous demands of power, speed and area constraints. As transistor integration increases, new advances are established in the concerned design strategies and related tools are used to design the VLSI circuits, which results cost effective and low power high performance VLSI circuits.

Low-power VLSI circuit design is a dynamic research area driven by the growing reliance on battery-powered portable computing wireless communication products. In addition, it has become critical to the continued progress of high-performance and reliable microelectronics system. The earliest and still the most urgent demands for low power electronics originate from the requirements for the small size and weight, long operating life, utility and reliability of battery operated equipment such as wrist watches, pocket calculators and cellular phones, hearing aids. Implantable cardiac pacemakers and a myriad of portable military equipment used by individual foot soldiers. Perhaps no segment of the electronics industry has a growth potential as explosive as that of the personal digital assistant (PDA) which has been characterized as a combined pocket cellular phone, pager, e-mail terminal, fax, computer, address directory, etc.

The VLSI low power design problems can be broadly classified into two: analysis and optimization. Analysis problems are concerned about the accurate estimation of the power or International Journal of Advanced Research in Biology Engineering Science and Technology (IJARBEST) Vol. 2, Special Issue 10, March 2016

energy dissipation at different phases of the design process. The purpose is to increase confidence of the design with the assurance that the power consumption specifications are not violated. Evidently, an analysis technique differs in their accuracy and efficiency. The accuracy of analysis depends on the availability of the design information. In early design phases, emphasis to obtain power dissipation estimates rapidly with very little available information on the design. In these phases, less analysis results are expected and tolerated. As the design proceeds to reveal more lower- level details, a more accurate analysis can be performed. Here, better accuracy is demanded and longer analysis time is allowed. Analysis techniques also serve design, given an optimization goal, without violating design specifications. An automatic design optimization algorithm requires a fast analysis engine to evaluate the merits of the design choices. Manual optimization also demands a reliable analysis tool to provide accurate estimation of power dissipation.

A decision to apply a particular low power technique often involves trade-offs from different sources pulling in various directions. Major criteria to be considered are the impact to the circuit delay, which affects the performance and throughput of the chip, and the chip area, which directly translates to manufacturing costs. Other factors of chip design such as design cycle time, testability, quality, reliability, reusability, risk, etc., may all be affected by a particular design decision to achieve low requirement. Power efficiency cannot be achieved without yielding to one or more of these factors. The task of a design engineer is to carefully weigh each design choice within the specification constraints and select the best implementation.

# a)Pipelining

Pipelining is generally used to achieve high performances digital system design. It is classified in to two types as synchronous and asynchronous pipelining. In synchronous system, is a straight forward technique which is used to increase parallelism and hence boost system throughput. Synchronous pipelining design consists of complex functional blocks which are subdivided into smaller blocks; registers are inserted to separate functional blocks. The global clock is applied to all registers. But in asynchronous pipeline design, local clock is applied to all registers. This pipeline design send data from left to right and an acknowledge control signal is send from right to left that is bidirectional communication, which is implemented by handshaking protocol. Asynchronous pipeline is classified based on the data path logic as static and dynamic logic. Each class uses different approaches for control and data storage. Four important features in providing design flexibility and modularity for asynchronous design are as follow. First, in synchronous systems, all stages operate at the same fixed rate, and the worst-case stage delay must be less than the clock period. In Asynchronous system, all stages need not have equal delay. Due to dynamically varying delay asynchronous adder have data dependent problem. This should be avoided to improve average system latency and throughput. Second, asynchronous pipelines provide elasticity. Input data items arrive in irregular manner hence the spacing and throughput rate are determined dynamically. Third, asynchronous pipelines provide automatic flow control whereas synchronous pipeline have no flow control. Finally switching activity occurs only when data items are being processed, so dynamic power consumed only on demand. Thus asynchronous pipelining is more efficient than synchronous pipelining. Asynchronous pipelining uses four

phase dual rail protocol for handshaking and dual rail or single rail data path logic for valid data passage.

# B) Problems With Synchronous Approach

synchronous approach predominated, largely because it is easier to design chips in which things happen only when the clock ticks. As chips get bigger, faster and more complicated, distributing the clock signal around the chip becomes harder. Another drawback with clocked designs is that they waste a lot of energy, since even inactive parts of the chip have to respond to every clock tick. Clocked chips also produce electromagnetic emissions at their clock frequency, which can cause radio interference. Each tick must be long enough for signals to traverse even a chip's longest wires in one cycle. However, the tasks performed on parts of a chip that are close together finish well before a cycle but can't move on until the next tick. As chips get bigger and more complex, it becomes more difficult for ticks to reach all elements, particularly as clocks get faster.

In today's chips, the clock remains the key part of the action. As a microprocessor performs a given operation, electronic signals travel along microscopic strips of metal forking, intersecting again, and encountering logic gatesuntil they finally deposit the results of the computation in a temporary memory bank called a register. Let's say you want to multiply 4 by 6. If you could slow down the chip and peek into the register as this calculation was being completed, you might see the value changing many times, say, from 4 to 12 to 8, before finally settling down into the correct answer. That's because the signals transmitted to perform the operation travel along many different paths before arriving at the register; only after all signals have completed their journey is the correct value assured. The role of the clock is to

guarantee that the answer will be ready at a given time. The chip is designed so that even the slowest path through the circuit-the path with the longest wires and the most gates-is guaranteed to reach the register within a single clock-tick. The chip's clock is an oscillating crystal that vibrates at a regular frequency, depending on the voltage applied. This frequency is measured in gigahertz or megahertz. All the chip's work is synchronized via the clock, which sends its signals out along all circuits and controls the registers, the data flow, and the order in which the processor performs the necessary tasks. An advantage of synchronous chips is that the order in which signals arrive doesn't matter.

Signals can arrive at different times, but the register waits until the next clock tick before capturing them. As long as they all arrive before the next tick, the system can process them in the proper order. Designers thus don't have to worry about related issues, such as wire lengths, when working on chips. And it is easier to determine the maximum performance of a clocked system. With these systems, calculating performance simply involves counting the number of clock cycles needed to complete an operation

## c) Asynchronous Logic Circuits

As its name suggests, it does away with the cardinal rule of chip design: that everything marches to the beat of an oscillating crystal "clock". For a 1GHz chip, this clock ticks one billion times a second, and all of the chip's processing unit's co-ordinate their actions with these ticks to ensure that they remain in step. Asynchronous, or "clockless", designs, in contrast, allow different bits of a chip to work at different speeds, sending data to and from each other as and when appropriate.

International Journal of Advanced Research in Biology Engineering Science and Technology (IJARBEST) Vol. 2. Special Issue 10. March 2016



Figure 1. An abstracted view of synchronous (a) versus asynchronous (b) pipelines

# d) How Clock-Less Chip Works

There are no purely asynchronous chips yet. Instead, today's clockless processors are actually clocked processors with asynchronous elements. Clockless elements use perfect clock gating, in which circuits operate only when they have work to do, not whenever a clock ticks. Instead of clock-based synchronization, local handshaking controls the passing of data between logic modules. The asynchronous processor places the location of the stored data it wants to read onto the address bus and issues a request for the information. The memory reads the address off the bus, finds the information, and places it on the data bus. The memory then acknowledges that it has read the data. Finally, the processor grabs the information from the data bus.

# a)Simple, efficient design

Logic modules could be developed without regard to compatibility with a central clock frequency, which makes the design process easier. Also, because asynchronous processors don't need specially designed modules that all work at the same clock frequency, they can use standard components. This enables simpler, faster design and assembly. However, the recent use of both domino logic and the delayinsensitive mode in asynchronous processors has created a fast approach known as integrated pipelines mode.

Domino logic improves performance because a system can evaluate several lines of data at a time in one cycle, as opposed to the typical approach of handing one line in each cycle. Domino logic is also efficient because it acts only on data that has changed during processing, rather than acting on all data throughout the process. The delay-insensitive mode allows an arbitrary time delay for logic blocks. "Registers communicate at their fastest common speed. If one block is slow, the blocks that it communicates with slow down. This gives a system time to handle and validate data before passing it along, thereby reducing errors.

# b) Different styles

There are several styles of asynchronous design. Conventional chips represent the zeroes and ones of binary digits using low and voltages on a particular wire. One clock-less approach, called "dual rail", uses two wires for each bit. Sudden voltage changes on one of the wires represent a zero, and on the other wire a one. "Dual-rail" circuits use two wires giving the chip communications pathways, not only to send bits, but also to send "handshake" signals to indicate when work has been completed. Replacing the conventional system of digital logic with what he calls "null convention logic," a scheme that identifies not only "yes" and "no," but also "no answer yet" a convenient way for clock-less chips to recognize when an operation has not yet been completed. Another approach is called "bundled data". Low and high voltages on 32 wires are used to represent 32 bits, and a change in voltage on a 33rd wire indicates when the values on the other 32 wires are to be used.

# c) Asynchronous for Low Noise and Low Emission

Sub circuits of a system may interact in unintended and often subtle ways. For example,

a digital sub circuit generates voltage noise on the power-supply lines or induces currents in the silicon substrate. This noise may affect the performance of an analog-to-digital converter connected so as to draw power from the same source or that is integrated on the same substrate. Another example is that of a digital sub circuit that emits electromagnetic radiation at its clock frequency, and a radio receiver subcircuit that mistakes this radiation for a radio Due to the absence of a clock, asynchronous circuits may have better noise and EMC (Electro-Magnetic Compatibility) properties than synchronous circuits. This advantage can be appreciated by analyzing the supply current of a clocked circuit in both the time and frequency domains signal.

Circuit activity of a clocked circuit is usually maximal shortly after the productive clock edge. It gradually fades away and the circuit must become totally quiescent before the next productive clock edge. Viewed differently, the clock signal modulates the supply current as depicted schematically in Figure. Due to parasitic resistance and inductance in the on-chip and off-chip supply wiring this causes noise on the on-chip power and ground lines.

In this paper, our proposed pipeline reduces both the dual-rail encoding overhead in data paths and the detection overhead in hand- shake control logic by designing based on a constructed critical data path. A stable critical data path is constructed using redesigned dual-rail domino gates. By detecting the stable critical data path, a 1-bit completion detector is enough to get the correct handshake signal regardless of the data path width. Such design does not only greatly reduce the detection overhead but also partially maintains the good properties in the four-phase dual-rail protocol design. Moreover, the stable critical data path serves as a matching delay to solve the dual-rail encoding overhead

problem in data paths. With the help of the redesigned dual-rail domino gates, single-rail domino logic is successfully applied in noncritical data paths. As a result, the asynchronous domino logic pipeline has a small overhead in both handshake control logic and function block logic, which greatly improves the circuit efficiency.

This paper is organized as follows. Section II introduces the background of asynchronous domino logic pipeline. PS0 is introduced to demonstrate the advantages and problems of asynchronous domino logic pipeline based on dual-rail protocol. Several related designs are also simply introduced. Synchronizing logic gates (SLGs) and synchronizing logic gates with a latch function (SLGLs) are introduced to construct a stable critical data path. The robustness of the pipeline structure and the constructed critical data path is analyzed. Then, more complex pipeline structures are further discussed. Section III focuses on comparison of different parameters of the pipeline structures. Section IV presents the conclusion.

# II. BACKGROUND

Williams' PS0 pipeline, which is the starting point of the new pipeline design. Dual-rail is a commonly used method to execute asynchronous data-path. Each pipelining stage consists of a dual -rail data-path, functional block and completion detector. In dual rail, two wires indicate both, the value of the bit and also its validity. The completion detector generate local handshake signal to indicate the presence or absence of data at the outputs of the functional block. The encoding data 00 indicates spacer state, 11 is an unused state. The encoding of 10 and 01 correspond to valid data values 1 and 0 respectively. The protocol of PS0 is simple, single data flow starts with an empty pipeline; the complete cycle of events is as follow: F1 start to evaluate and then the data flow to F2. F2 evaluate and data flow to F3,

F2's completion detector detects completion of evaluation and sends a pre-charge signal to F1. The two main overhead problems of PS0 pipeline are detection overhead in headshake control logic and the dual rail encoding overhead in functional block logic. These problems are solved in four phase dual rail protocol pipeline style.



Figure 2 .Williams' PS0 pipeline design
1) Four-Phase Dual-Rail Protocol

William's PS0 is designed based on the fourphase dual-rail protocol of data transfer based on the four-phase dual-rail protocol, and four-phase dual-rail encoding. The four phase dual-rail encoding encodes a request signal into the data signal using two wires,  $(w_t, w_f)$ . The data value 0 is encoded as (0, 1), and value 1 is encoded as (1, 0); the spacer is encoded as (0, 0); (1, 1) is not used. When transferring the valid data, a spacer is inserted between them. A receiver can easily obtain the valid data by monitoring the two wires. This protocol is very robust since a sender and a receiver can communicate reliably regardless of delays in the combinational logic block and wires between them. The dualrail encoded data path is known as the delay-insensitive data path.

2) Data transfer Four-Phase Dual-Rail Protocol



Figure 3.Data transferring of four phase dual rail protocol

3) Structure of Four Phase Dual Rail PS0 Pipeline

In PSO, each pipeline stage is composed of a function block and a completion detector. Each function block is implemented using dual-rail domino logic. Each completion detector generates a local handshake signal to control the flow of data through the pipeline.

A two-input NOR gate serves as the 1-bit completion detector to generate a bit done signal by monitoring the outputs of dual-rail domino gate. To build a 2-bit completion detector, C-element is needed to combine the done signals. A full completion detector is informed by combining all bit done signals from the entire data paths with a tree of C-elements.



Figure 4. Structure of PS0 pipeline design

# *4) The protocol for PS0*

F(N) is precharged when F(N + 1) finishes evaluation. F(N) evaluates when F(N+1) finishes its reset, or precharge. if we observe a single data flow through an initially empty pipeline in which every pipeline stage is in evaluation phase, the complete cycle of events is as follows

- 1) F1 evaluates and data flow to F2.
- 2) F2 evaluates and data flow to F3. F2's completion detector detects completion of evaluation and sends a precharge signal to F1

- 3) F1 precharges and F3 evaluates.
- F3's completion detector detects completion of evaluation and sends a precharge signal to F2.
- 4) F2 precharges. F2's completion detector detects the completion of precharge and sends an evaluation signal (enable signal) to F1. The evaluation signal enables F1 to evaluate new data once again.

There are three evaluations, two completion detections, and one precharge in the complete cycle for a pipeline stage. The pipeline cycle time *T*cycle is

$$T$$
cycle =  $3t$ Eval +  $2t$ CD +  $t$ Prech

# 5)Delay Assumptions in PSO

#### Quasi-delay-insensitive (QDI)

Circuits partition wires into critical and noncritical categories. Designers of such circuits consider fanout in critical wires to be safe by assuming that the skew between wire branches is less than the minimum gate delay. Designers thus assume these wires, which of course must be constrained to physically lie within a small area of the chip, to be isochronic. In contrast, noncritical wires can have arbitrary delays on fanout branches.

## **Bundled-delay (BD)**

Circuits assume that the maximum delay of each combinational logic island is smaller than that of a reference logic path (usually called matched delay). Matched delays are implemented using the same gate library as the rest of the datapath and they are subject to the same operating conditions (temperature, voltage). This results in consistent tracking of datapath delays by matched delays and allows one to reduce design margins with respect to synchronous design.

#### **Drawbacks of PSO**

The detection overhead in handshake control logic greatly affects the pipeline speed and power consumption. The most serious problem is that detection overhead is growing the width

of datapaths. Even the detection time can be reduced by partitioning wide datapath into several data streams. The dual-rail encoding overhead is caused by dual-rail domino logic that is used for not only implementing logic function but also storing data between pipeline stages. Because there are no explicit storage elements, a lot of buffers have to be added to levelize each stage.

# *B)* Advanced pipeline techniques

# 1. Pre Charge Half Buffer Pipeline (PCHB)

PCHB is a timing-robust pipeline style that uses quasi delay- insensitive control circuits . Two completion detectors in a PCHB stage: one on the input side (Di) and one on the output side (Do). The complete cycle of events for a PCHB stage is quite similar to that of PSO, except that a PCHB stage verifies its input bits. Because of the input completion detector (Di), a PCHB stage does not start evaluation until all input bits are valid. This design absorbs skew across individual bits in the stage: one on the input side (Di) and one on the output side (Do). The complete cycle of events for a PCHB stage is quite similar to that of PS0, except that a PCHB stage verifies its input bits. Because of the input completion detector (Di), a PCHB stage does not start evaluation until all input bits are valid. This design absorbs skew across individual bits in the data paths.

# 2. Structure of pre-charge half buffer



Figure 5. Structure of PCHB

All transistors are minimum length and width, except for the weak inverters, which require multiple times minimum length. The weak inverter transistor lengths are also chosen to be as small as possible, but still achieve correct circuit functionality. An input register and corresponding completion logic are included in the static NCL implementation to make yield a fair comparison. As expected, the dynamic **PCHB** implementation gives the performance and requires the least transistors. However, it is not delay-insensitive and its application is limited, because states will be lost without periodical refreshing and it is vulnerable to charge sharing problems.

# 3)Highperformance dynamic asynchronous pipeline: Lookahead style A new class of asynchronous pipelines is proposed, called lookahead pipelines (LP), which use dynamic logic and are capable of delivering multi-gigahertz throughputs. Since they are asynchronous, these pipelines avoid problems related to high-speed clock distribution, such as clock power, management of clock skew, and inflexibility in handling varied environments. The designs are based on the well-known PS0 style of Williams and Horowitz as a starting point, but achieve significant improvements through novel protocol optimizations: the pipeline communication is structured so that critical events can be detected and exploited earlier. A special focus of this work is to target extremely fine-grain or gatelevel pipelines, where the datapath is sectioned into stages, each consisting of logic that is only a single level deep. Both dual-rail and single-rail pipeline implementations are proposed. All the implementations are characterized by low cost control structures and the avoidance of explicit latches



Figure 6.Block diagram of Lookahead pipeline

While dual-rail datapaths allow variable-speed completion and have been effectively used in a number of applications, the area penalties are sometimes unacceptable in practice and the power overhead may be large single-rail design has much wider applicability in the synchronous world, and several asynchronous groups have recently moved from dual- to single-rail design. Therefore, the focus of this section is on singlerail lookahead pipelines. The target design style is a commonly-used approach called "bundleddata, where synchronous function blocks can be used along with attached matched delays. Two single-rail lookahead pipelines introduced. The first design LP 2/2, essentially adapts the dual-rail LP2/2 design of single-rail datapaths. As in the former design, an early done optimization is used in Synchronous Pipelines: Several novel synchronous pipelines have been proposed for high-throughput applications. In wave pipelining, multiple waves of data are propagated between two latches However, this approach requires much design effort, from the architectural level down to the layout level, for accurate balancing of path delays, and remains highly vulnerable to process, temperature and voltage variations. These all require complex timing constraints which are difficult to verify; also lack elasticity the ability to accommodate environments with dynamically varying speeds and still require high-speed clock distribution. Single-Rail global Asynchronous Pipelines: The classic single-rail asynchronous pipelines, introduced by Sutherland, are called micro pipeline.

# 4) LP3/1 Pipeline Design

In LP3/1 pipeline design style, an early evaluation protocol is used. This protocol receives control information not only from the subsequent stage, but also from its successor stage in pipeline. The key idea of the new protocol is that instead of waiting until N+1 stage has completed pre-charging, N stage can evaluate as soon as N+1 stage has started pre-charging. As a result, LP3/1 pipelines have shorter cycles than Williams' PS0 pipelines design but it has longer critical path delay than Williams' PS0 pipelines. The analytical cycle time of LP3/1 pipeline is given as,

#### TLP3/1=3.TEVAL+TCD+TNAND

LP3/1 pipeline design will be slower than PS0 pipeline due to greater capacitive Increased loading typically causes logarithmic overheads to the power, area, and latency of the completion detectors. This problem is solved by the LP3/1 simply restructuring Pipeline. Enhanced Version of LP3/1 Pipeline: Simplifying the stage interfaces. The same protocol is used in Enhanced version of LP3/1 pipeline, but now there is direct communication between adjacent stages. There are two benefits of the simpler stage interface 1) reduced wiring loads, therefore overall wire length and critical path delay is reduced, which can be a significant benefit in future fabrication technologies 2) Greater and easy interfacing with the environment, produces one acknowledgment the right environment, and one acknowledgment for the left environment.

# 5)LP2/1 Pipeline Design

LP2/1 pipeline design combine both the "early evaluation" optimization of LP3/1 pipeline design and the "early done" optimization of LP2/2 pipeline design. Each stage uses information from two succeeding stages pipeline has the shortest analytical cycle time, and also

employs early completion detection by this handshake overhead problems are reduced.

The analytical cycle time of LP2/1 pipeline is given as.

#### TLP2/1=2.TEVAL+TCD+TNAND



Figure 7. LP2/1 pipeline design

# 6) Drawbacks of Lookahead style Asynchornous Domino Logic Pipeline

A pipeline style uses dual-edge triggered Dflipflops instead of capture-pass latches; while this design is simpler, it suffers from significant delays due to the overhead of the complex flipflops. This approach use four-phase control instead of two-phase control. While the new protocols allow use of simpler latches, the controllers themselves are quite complex. Because no acknowledge signals are used, the pipeline only operates in a "feed forward" mode, without any backwards flow control, as is also true with synchronous wave pipelines data items can be lost if the destination environment is momentarily stalled. Further, the approach still requires much design effort for balancing path delays: dummy transistors and capacitances are often required to be inserted at specific nodes, and the fanin and fanout of gates must be carefully controlled.

# III ASYNCHRONOUS PIPELINE BASED ON CRITICAL DATAPATH

The pipeline is designed based on a stable critical data path that is constructed using special dual-rail logic The critical data path transfers a data signal and an encoded handshake signal.

International Journal of Advanced Research in Biology Engineering Science and Technology (IJARBEST) Vol. 2, Special Issue 10, March 2016

Noncritical data paths, composed of single-rail logic, only transfer data signal. A static NOR gate detects the dual-rail critical data path and generates a total done signal for each pipeline stage. The outputs of NOR gates are connected to the pre-charge ports of their previous stages. The difference is that a total done signal is generated by detecting only the critical data path instead of the entire data paths. Such design method has two merits. First, the completion detector is simplified to a single NOR gate, and the detection overhead is not growing with the data path width. Second, the overhead of function block logic is reduced by applying single-rail logic in noncritical data paths. As a result, APDCP has a small overhead in both handshake control logic and function block logic, which greatly improves the throughput and power consumption. APCDP is more familiar to bundled-data asynchronous pipeline because the critical data path essentially works as a matching delay, which controls the correct data transfer in noncritical data paths. Compared with conventional bundled -data design using a separate matching delay to match the worst case delay in function blocks, the proposed design method reuses the existing function block logic to provide the matching delay. Such design has many advantages. First, the matching delay is accurate. The matching delay in APCDP is exactly the same worst case delay in function blocks. Second, the matching delay is robust for delay variations. Dual-rail critical data path supplies delay-insensitive property, which can be self-adaptive to delay variations in function blocks. Third, the handshake control logic is efficient in area and power. The handshake control logic is implemented by reusing the existing function block logic.

Finding a stable critical data path in function blocks is very important in the proposed design. The problem is that it is difficult to get a stable critical data path using traditional logic gates. Traditional logic gates have the gate-delay datadependence problem the gate delay is dependent on input data patterns. For example, the ripple carry adder .The ripple carry path seems to be the stable critical data path. However, actually, the critical data path varies according to different input data patterns. Because of the gate-delay data-dependence problem, the carry function gate can be triggered early by the input bits  $(a_n \text{ and } b_n)$  regardless of the carry bit. Since the input bit travels faster in the buffer path than the carry bit in the ripple carry path, it cannot guarantee that the critical transition signal always presents on the ripple carry path. Adding delay elements is an intuitive way to construct a stable critical data path. However, this method needs complex timing analysis and would cause huge overhead of delay elements. This method introduces an efficient solution that uses SLGs to construct the critical data path. The SLGs solve the gate-delay data-dependence problem by making sure that SLGs cannot evaluation until all valid data arrive. This feature does not only help to construct a stable critical data path but also enable the adoption of singlerail domino logic in the noncritical data paths. As a result, the proposed design is significantly area and power efficient.

# 1)Asynchronous data encoding

To encode data, the single control req wire is replaced by a data field. Two common approaches, each for four-phase handshaking. Dual-rail codes and delay insensitive encoding In dual-rail encoding, each bit is implemented by a pair of wires, or rails. In the reset phase, both rails are low, forming a spacer, which indicates no valid data. During the evaluate phase, the sender asserts exactly one of the two rails high, thereby indicating the actual value of the bit (0 or 1) as well as data validity. The receiver typically uses a completion detector to identify that a valid codeword has been received. The remainder of the four-phase protocol proceeds as expected: after the dual-rail data is transmitted, ack is asserted high by the receiver,

the dual-rail data is reset to zero, and ack is de asserted low by the receiver. Such an encoding is one instance of a delay-insensitive (DI) code in which each codeword uniquely identifies its own validitity.



Figure 8. asynchronous four phase & two phase communication protocol



Figure 9. asynchronous data encoding dual rail encoding(a) single rail bundled data(b)

# 2)Structure of Asynchronous Pipeline based on Constructed Critical Data Path

The solid arrow represents a constructed critical data path (dual-rail data path), the dotted arrow represents the noncritical data paths (single-rail data paths), and the dashed arrow represents the output of single-rail to dual-rail encoding converter. In each pipeline stage, a static NOR gate is used as 1-bit completion

detector to generate a total done signal for the entire data paths by detecting the constructed critical data path. Driving buffers deliver each total done signal to the pre-charge/evaluation control port of the previous stage. Since the completion detector only detects the constructed critical data path, the noncritical data paths do not have to transfer encoded handshake signal anymore. Therefore, single-rail domino gates are used in the noncritical data path to save logic overhead. Encoding converter is used to bridge the connection between single rail domino logic and dual rail domino logic.



Figure 10. Structure of Asynchronous domino logic pipeline

# 3)Logic Gates

In VLSI circuits, it is difficult to get a stable critical data path using traditional logic gates due to the gate-delay data-dependence problem. The true side of logic is implemented by out\_ $t = a_{\_}t \cdot b_{\_}t$  and the false side by out\_ $f = a_{\_}f + b_{\_}f$ . In traditional dual-rail domino AND gate, there are three transistor paths: these paths have different number of transistors at the sequential position. When they turn ON, respectively,  $[a_{\_}f]$  and  $[b_{\_}f]$  cause less delays than  $[a_{\_}t, b_{\_}t]$ . Moreover, when the data pattern is (0, 1, 0, 1),  $[a_{\_}f]$  and  $[b_{\_}f]$  will be both ON, which leads to a much quicker signal transfer. As a result, the gate delay has a large variation depending on different data patterns

# 4) Synchronizing Logic Gates

SLGs are dual-rail domino gates that have no data-dependence problem. gate-delay principle is that, in the pull-down network, there is exactly one path activated according to one data pattern, and the stack of all possible paths is kept constant at the sequential position. Compared with the traditional design, the false side logic expression is changed to out\_ $f = a_t$  $\cdot b_{-}f + a_{-}f \cdot (b_{-}t + b_{-}f).1) [a_{-}t, b_{-}t]; 2) [a_{-}t, b_{-}t]$  $b_{f}$ ]; 3) [ $a_{f}$ ,  $b_{t}$ ]; and [ $a_{f}$ ,  $b_{f}$ ]. Every path has two transistors at the sequential position, and there is only one path turns ON corresponding to an input data pattern. As a result, the gate delay becomes independent on different data patterns. This kind of gates is named as SLGs because they can synchronize their inputs. The SLGs verify that all data signal transitions have arrived on their inputs before changing their outputs.



Figure 11. Synchronizing Logic Gates

| pc | <u>a_t</u> | <u>a_f</u> | <u>b_t</u> | <u>b_f</u> | out_t | out_f |
|----|------------|------------|------------|------------|-------|-------|
| 0  | -          | -          | -          | -          | 0     | 0     |
| 1  | 0          | 1          | 0          | 1          | 0     | 1     |
| 1  | 0          | 1          | 1          | 0          | 0     | 1     |
| 1  | 1          | 0          | 0          | 1          | 0     | 1     |
| 1  | 1          | 0          | 1          | 0          | 1     | 0     |

Table I Truth table of Dual-rail AND logic 5) Synchronizing Logic Gates With a Latch Function

Based on the characteristics of SLGs, SLGLs are extended synchronizing AND gate with a latch function and the table of latch states. An SLGL has an enable port (en\_t, en\_ f), which controls the opaque and transparent state of the SLGL. The principle is that SLGLs cannot start evaluation without the presence of

the enable signal. Same as the dual-rail AND logic, all traditional dual-rail domino logic can be redesigned to become an SLG or an SLGL. The critical data path in dual-rail asynchronous pipeline can be easily constructed using SLGs and SLGLs.



Figure 12. SLG with Latch function

| pc | En_t | en_f | States      |
|----|------|------|-------------|
| 0  | -    | -    | Opaque      |
| 1  | 0    | 0    | Opaque      |
| 1  | 0    | 1    | Transparent |
| 1  | 1    | 0    | Transparent |
| 1  | 1    | 1    | No used     |

Table II Table of latch states.

# 6) Power Analysis

The energy consumption of VLSI circuits relates to the toggling rate in data paths. In APCDP, the adoption of single-rail domino gates in the noncritical data paths saves not only silicon area by reducing transistor count but also energy consumption by reducing the toggling Each energy consumption is an average value calculated from 100 cycles. The results show that APCDP consumes much less energy than LP2/2-SR. The adoption of single-rail domino gates in APCDP reduces the toggle rate in data paths since single-rail domino logic does not toggle when transferring low-voltage signal. Besides, the toggling rate relates to the injected data patterns Therefore, the energy consumption of APCDP varies a lot according to different data patterns. . On the other hand, the energy consumption of LP2/2-SR remains almost

constant. LP2/2-SR's full dual-rail domino data paths constant toggling rate regardless of the injected data patterns.

# 7) Robustness Analysis

APCDP has pipeline failure in the situation that a pipeline stage does not finish evaluating before its previous stage start precharge. In such situation, the pipeline stage cannot correctly finish evaluating because the precharge of its previous pipeline stage removes the valid data from the inputs. To avoid this pipeline failure, APCDP needs to satisfy an assumption that, in a pipeline stage,none of the other bits across the entire data paths is slower than the detected bit by more than the delay through a static NOR gate and the drive buffer chain following it. The robustness of APCDP is analyzed based on this assumption.

# IV COMPARISON OF DIFFERENT PARAMETERS OF THE PIPELINE STRUCTURES

Fig.13 shows the simulation result which is obtained using Tanner13with 0.65µm CMOS technology. PCHB is the timing robust pipeline design. It works in parallel processing. It will start the evaluation only if all the inputs are valid. It doesn't require any delay assumptions by the designer. The robustness comes at the expense of performance.



Figure 13. PCHB Waveform

Also the simulation results of PSO and APCDP are given in Fig.11 and Fig.15. We have done the analysis of these pipeline structures with different parameters such as Average power consumption, Static power, Power Delay Product and Energy Delay Product. Table II shows the comparison of parameters between PCHB, PSO and APCDP Structures. From this table we can conclude that the proposed pipeline method APCDP consumes less power when compared to the other methods. The analysis also shows that the energy which is required to construct the critical data path for APCDP is minimum.



Figure 14. PSO Waveform



Figure 15. APCDP Waveform

|              |           | -       |         |
|--------------|-----------|---------|---------|
| Parameter    | PCHB      | LP2/2   | APCDP   |
| Average      | 6.8714    | 2.913   | 2.368mw |
| Power        | mw        | mw      |         |
| Static power | 22.60 mw  | 9.41 mw | 7.88mw  |
| Power Delay  | 0.452 μws | 283.165 | 39.447  |
| Product(PDP) | _         | nws     | nws     |
| Power Delay  | 9.041 pws | 8.51195 | 0.1973  |
| Product(EDP) | _         | pws     | pws     |

# TABLE III COMPARISON OF PARAMETERS

#### **V CONCLUSION**

This paper introduced a novel design method of asynchronous domino logic pipeline. The pipeline is realized based on a constructed critical data path. The design method greatly reduces the overhead of handshake control logic as well as function block logic, which not only increases the pipeline throughput but also decreases the power consumption. The evaluation result shows that the proposed design has better performance than a bundled-data asynchronous domino logic pipeline.

#### REFERENCES

- [1] B. H. Calhoun, Y. Cao, X. Li, K. Mai, L. T. Pileggi, and R. A. Rutenbar, "Digital circuit design challenges and opportunities in the era of nanoscale CMOS," Proc. IEEE, vol. 96, no. 2, pp. 343–365,Feb. 2008.
- [2] J. Sparsø and S. Furber, Principles of Asynchronous Circuit Design: A Systems Perspective. Boston, MA, USA: Kluwer, 2001.
- [3] M. Krstic, E. Grass, F. K. Gurkaynak, and P.Vivet, "Globally asynchronous, locally synchronous circuits: Overview and outlook," IEEE Des.Test Comput., vol. 24, no. 5, pp. 430–441, Sep. /Oct. 2007.
- [4] A. J. Martin and M. Nystrom, "Asynchronous techniques for system on-chip design," Proc. IEEE, vol. 94, no. 6, pp. 1089–1120, Jun. 2006.
- [5] J. Teifel and R. Manohar, "An asynchronous dataflow FPGA architecture," IEEE Trans. Comput., vol. 53, no. 11, pp. 1376–1392, Nov. 2004.
- [6] H. S. Low, D. Shang, F. Xia, and A. Yakovlev, "Variation tolerant AFPGA architecture," in Proc. ASYNC, 2011, pp. 77–86.

- [7] M. Hariyama, S. Ishihara, and M. Kameyama, "Evaluation of a field programmable VLSI based on an asynchronous bitserial architecture," IEICE Trans. Electron., vol. E91-C, no. 9, pp. 1419–1426, Sep. 2008.
- [8] T. E. Williams, "Self-timed rings and their application to division," Ph.D. dissertation, Dept. Comput. Sci., Stanford Univ., Stanford, CA, USA, Jun. 1991.
- [9] A. M. Lines, "Pipelined asynchronous circuits," Dept. Comput. Sci., California Inst. Technol., Pasadena, CA, USA, Tech. Rep., 1998.
- [10] S. M. Nowick and M. Singh, "High-performance asynchronous pipelines an overview," IEEE Des. Test Comput. vol. 28, no. 5, pp. 8–22, Sep.

/Oct. 2011.

- [11] M. Singh and S. M. Nowick, "The design of highperformance dynamic asynchronous pipelines: Look ahead style," IEEE Trans. Very Large Scale Integr. (VLSI) Syst., vol. 15, no. 11, pp. 1256–1269, Nov. 2007.
- [12] I. Sutherland, B. Sproull, and D. Harris, Logical Effort: Designing Fast CMOS Circuits. San Mateo, CA, USA: Morgan Kaufmann, 1999. [13] M. Singh, J. A. Tierno, A. Rylyakov, S. Rylov, and S. M. Nowick, "An adaptively pipelined mixed synchronous asynchronous digital FIR filter chip operating at 1.3 gigahertz," IEEE Trans. Very Large Scale Integr.(VLSI) Syst., vol. 18, no. 7, pp. 1043–1056, Jul. 2010.
- [14] Zheng fan Xia, Masanori Hariyama, and Michitaka Kameyama "Asynchronous Domino Logic Pipeline Design Based on Constructed Critical Data Path" IEEE Transactions on very Large Scale Integration (VLSI) Systems 2014, pp. 1063-8210.