# Introduction of Processor Design for AI Applications 

## LO5 - Retimingand Pipelining

Pengju Ren

Institute of Artificial Intelligence and Robotics
Xi'an Jiaotong University
http://gr.xjtu.edu.cn/web/pengjuren

## LTI Systems

■ Linear systems
$\square$ Assume $x 1(n)->y 1(n)$ and $x 2(n)->y 2(n)$, where " $\rightarrow$ " denotes "lead to". If ax1 (n)+bx2 (n) -> ay1 (n)+by2 (n), then the systems is referred to as a "Linear System."
$\square$ Homogenous and additive properties
$\square$ Time-invariant (TI) systems
$\square x(n-n 0)->y(n-n 0)$
■ LTI systems
$\square y(n)=h(n) * x(n)$

- Causal systems
$\square \mathrm{y}\left(\mathrm{n}_{0}\right)$ depends only on $\mathrm{x}(\mathrm{n})$, where $\mathrm{n}<=\mathrm{n}_{0}$


## Pipelining of FIR Digital Filters (1)

Leads to a reduction in the critical path


The critical path (or the minimum time required for processing a new sample) is limited by 1 multiply and 2 add times. Thus, the "sample period" (or the "sample frequency" ) is given by:

$$
\begin{gathered}
T_{\text {sample }} \geq T_{M}+2 T_{A} \\
f_{\text {sample }} \leq \frac{1}{T_{M}+2 T_{A}}
\end{gathered}
$$

## Pipelining of FIR Digital Filters (2)

Pipelining: reduce the effective critical path by introducing latches along the critical data path



| Clock | Input |  | Node1 | Node2 | Node3 |
| :---: | :---: | :---: | :---: | :---: | :---: |
| 0 | $x(n)$ | $a x(n)+\mathrm{b} x(n-1)$ | $a x(n-1)+\mathrm{b} x(n-2)$ | $c x(n-3)$ | $y(n-2)$ |
| 1 | $x(n+1)$ | $a x(n+1)+\mathrm{b} x(n)$ | $a x(n)+\mathrm{b} x(n-1)$ | $c x(n-2)$ | $y(n-1)$ |
| 2 | $x(n+2)$ | $a x(n+2)+\mathrm{b} x(n+1)$ | $a x(n+1)+\mathrm{b} x(n)$ | $c x(n-1)$ | $y(n)$ |
| 3 | $x(n+3)$ | $a x(n+3)+\mathrm{b} x(n+2)$ | $a x(n+2)+\mathrm{b} x(n+1)$ | $c x(n)$ | $y(n+1)$ |

## Pipelining of FIR Digital Filters (3)

Pipelining reduces the critical path, but leads to a penalty in terms of an increased latency and the number of latohes.
This longest path or the "critical path" can be reduced by suitably placing the pipelining latches in the DSP architecture - The pipelining latches can only be placed across any feed-forward cutset of the graph

Cutset: a set of edges of a graph such that if these edges are removed from the graph, the graph becomes disjoint

Feed-forward cutset: data move in the forward direction on all the edge of the cutset

## SFG representation of Cutset


(0)


## Fine-grain Pipelining of FIR Filter



Let $\boldsymbol{T}_{\boldsymbol{M}}=10$ units and $\boldsymbol{T}_{\boldsymbol{A}}=2$ units. If the multiplier is broken into 2 smaller units ( $\boldsymbol{T}_{\boldsymbol{M} 1}$ and $\boldsymbol{T}_{\boldsymbol{M} \mathbf{2}}$ ) with processing times of 5 units, respectively (by placing the latches on the horizontal cutset across the multiplier), then the desired clock period can be achieved as 7 units

## Retiming and Example (1)

A transformation technique used to change the location of delay elements in the circuit without affecting the input/output characteristics, changing 1) the clock period and 2) the number of registers
Node Retiming


Cutset Retiming


## Retiming and Example (2)

(1)


$$
\begin{aligned}
w(n) & =a y(n-1)+b y(n-2) \\
y(n) & =w(n-1)+x(n) \\
& =a y(n-2)+b y(n-3)+x(n)
\end{aligned}
$$

$$
\begin{aligned}
& w_{1}(n)=a y(n-1) \\
& w_{2}(n)=b y(n-2) \\
& \begin{aligned}
y(n) & =w_{1}(n-1)+w_{2}(n-1)+x(n) \\
& =a y(n-2)+b y(n-3)+x(n)
\end{aligned}
\end{aligned}
$$

Critical Path: $T_{M}+T_{A}=3$ u.t
Number of registers: 4
Critical Path: $\mathbf{2 T}_{A}=2$ u.t
Number of registers: 5

## Retiming Formulation (1)

Cutset retiming only affects the weights of edges in the cutset. If the 2 disconnected subgraphs are labeled $G_{1}$ and $G 2$ then cutset retiming consists of adding $k$ delays to each edge from $G_{1}$ to $G_{2}$, and removing $k$ delays to each edge from $G_{2}$ to $G_{1}$.


## Retiming Formulation (2)

Map circuit $G->G_{r}$ using retiming
Retiming can be presented with $r(X), X$ is one of the nodes in the circuit
For an edge $\mathrm{U} \xrightarrow{e} \mathrm{~V}$ ( U is Source, V is the destination)

$$
w^{\prime}(e)=w(e) \oplus r(V)-r(U)
$$

$w(e)$ : weight (delay) of the edge $\mathrm{U} \xrightarrow{e} \mathrm{~V}$ in the origin circuit $w^{\prime}(e)$ : weight of the edge $\mathrm{U} \stackrel{e}{\rightarrow} \mathrm{~V}$ in the retimed circuit

A retiming solution is feasible if : $w^{\prime}(\boldsymbol{e}) \geq \mathbf{0}, \forall \boldsymbol{e} \epsilon \boldsymbol{G}$

$$
w(e)+r(V)-r(U) \geq \mathbf{0} \Rightarrow r(U)-r(V) \leq w(e)
$$

One inequality per edge due to the causality of the system

## Retiming Formulation: cutset contains 1 node



If cutset $=\left\{x_{i}\right\}$ : we set $\mathrm{r}\left(x_{i}\right)=\left\{\begin{array}{ll}\mathrm{k} & \text { if } x=x_{i} \\ 0, & \text { pthers }\end{array}\right.$ x are nodes in $G$
Therefore, for an edge $U \stackrel{e}{\longrightarrow} \operatorname{Vin} G(U$ is Source, $V$ is destination

$$
\begin{aligned}
& w^{\prime}(e)=w(e)+r(V)-r(U)= \begin{cases}k, & \text { if } \mathrm{V}=x_{i} \\
-k, & \text { if } \mathrm{U}=x_{i} \\
0, & \text { others }\end{cases} \\
& \frac{w(e)+k,}{} \text { if } \mathrm{V}=x_{i} \\
& \frac{w(e)-k,}{} \text { if } \mathrm{U}=x_{i}
\end{aligned} \quad \text { In-degree edges of } x_{i} . \text { Out-degree edges of } x_{\boldsymbol{i}}
$$

## Retiming Formulation: cutset contains one node


$w^{\prime}(1 \rightarrow 3)=1+0-0=1$
$r\left(x_{i}\right)=\left\{\begin{array}{l}0, \text { if } x_{i} \in \mathrm{G}_{1} \\ k, \text { if } x_{i} \in \mathrm{G}_{2}\end{array} \Rightarrow \begin{array}{l}r(1)=0 \\ r(2)=k \\ r(3)=0 \\ r(4)=0\end{array} \Rightarrow \begin{array}{l}w^{\prime}(1 \rightarrow 4)=2+0-0=2 \\ w^{\prime}(2 \rightarrow 1)=1+0-k=0 \\ w^{\prime}(3 \rightarrow 2)=0+k-0=1 \\ w^{\prime}(4 \rightarrow 2)=0+k-0=1\end{array} \rightarrow G_{2}>G_{1}:-\mathbf{1}\right.$
Feasible solution of cutset retiming: $-\min \left(w(e), e \in \boldsymbol{G}_{1}>\boldsymbol{G}_{2}\right) \leq \mathrm{k} \leq \min \left(w(e), e \in \boldsymbol{G}_{2}->\boldsymbol{G}_{1}\right)$

## Retiming Formulation: cutset contains many nodes


$G_{2}$

$$
r\left(x_{i}\right)=\left\{\begin{array}{l}
0, \text { if } x_{i} \in \mathrm{G}_{1} \\
k, \text { if } x_{i} \in \mathrm{G}_{2}
\end{array} \Rightarrow \begin{array}{l}
r(1)=0 \\
r(2)=k \\
r(3)=0 \\
r(4)=k
\end{array} \Rightarrow \begin{array}{l}
w^{\prime}(2 \rightarrow 1)=1+0-k=0 \\
w^{\prime}(1 \rightarrow 4)=2+k-0=3 \\
w^{\prime}(3 \rightarrow 2)=0+k-0=1 \\
w^{\prime}(4 \rightarrow 2)=0+k-k=0
\end{array}\right\} \boldsymbol{G}_{1}->\boldsymbol{G}_{2}:+\mathbf{1}
$$

Feasible solution of cutset retiming:

$$
-\min \left(w(e), e \in \boldsymbol{G}_{\boldsymbol{1}^{-}}>\boldsymbol{G}_{2}\right) \leq \mathrm{k} \leq \min \left(w(e), e \in \boldsymbol{G}_{2}->\boldsymbol{G}_{\boldsymbol{1}}\right)
$$

## Properties of Retiming

The weight of the retimed path:

$$
p=\mathrm{V}_{0} \xrightarrow{e_{0}} \mathrm{~V}_{1} \xrightarrow{e_{1}} \ldots \xrightarrow{e_{m-1}} \mathrm{~V}_{m} \text { is given by } w^{\prime}(p)=w(p)+\mu\left(\mathrm{V}_{m}\right)-r\left(\mathrm{~V}_{0}\right)
$$

$$
\text { Proof: } \begin{aligned}
w^{\prime}(p) & =\sum_{i=0}^{m-1} w^{\prime}\left(e_{i}\right) \\
& =\sum_{i=0}^{m-1}\left(w\left(e_{i}\right)+r\left(\mathrm{~V}_{i+1}\right)-r\left(\hat{V}_{i}\right)\right) \\
& =\sum_{i=0}^{m-1} w\left(e_{i}\right)+\left(\sum_{i=0}^{m-1} r\left(\mathrm{~V}_{i+1}\right)-\sum_{i=0}^{m-1} r\left(\mathrm{~V}_{i}\right)\right) \\
& \left.=w(p)+r\left(\mathrm{~V}_{m}\right)-\mathrm{r} \mathrm{~V}_{0}\right)
\end{aligned}
$$

- Retiming does not change the number of delays in a cycle
- Retiming dees not alter the iteration bound in a DFG
- Adding the constant value $j$ to the retiming value of each node does not change the mapping from $G$ to $G_{r}$

$$
\left.w^{\prime}(e)=w(e)+(r(V)+j)-(r(U)+j)\right)=w(e)+r(V)-r(U)
$$

## Solving Systems of Inequalities (1)

- Given a set of M equalities in N variables, use shortest path algorithm to solve the results

Step 1: draw a constraint graph

- Draw the node $x_{i}$ for each of the N variables $x_{i}, \mathrm{i}=1,2, \ldots, \mathrm{~N}$

■ Draw the node $x_{N+1}$

- For each inequality $\mathrm{r}\left(x_{i}\right)-\mathrm{r}\left(x_{j}\right) \leq k$, draw the edge $x_{j} \rightarrow x_{i}$ from the node $x_{j}$ to node $x_{i}$ with length $k$
- For each node $x_{i}, \mathrm{i}=1,2, \ldots \mathrm{n}$, draw the edge $x_{N+1} \rightarrow x_{i}$ from the node $x_{N+1}$ to the node $x_{i}$ with length 0


## Solving Systems of Inequalities (2)

Step 2: Solve using a shortest path algorithm

- The system of inequalities has a solution if and only if the constraint graph contains no negative cycles
- If a solution exists, one solution is where $\mathrm{r}\left(x_{i}\right)$ is the minimum-length path from the node $x_{N+1}$ to the node $x_{i}$


## Example

$$
\begin{gathered}
r_{1}-r_{2} \leq 0 \\
r_{3}-r_{1} \leq 5 \\
r_{4}-r_{1} \leq 4 \\
r_{4}-r_{3} \leq-1 \\
r_{3}-r_{2} \leq 2
\end{gathered}
$$



Bellman-Ford shortest path algorithm:

$$
\begin{aligned}
& R^{(6)}
\end{aligned}=\left[\begin{array}{ccccc}
\infty & \infty & 5 & 4 & \infty \\
0 & \infty & 2 & 1 & \infty \\
\infty & \infty & \infty & -1 & \infty \\
\infty & \infty & \infty & \infty & \infty \\
0 & 0 & 0 & -1 & \infty
\end{array}\right]
$$

$$
r_{1}=0, r_{2}=0, r_{3}=0, r_{4}=-1
$$

## Cutset Retiming and Slow-down (Lattice Filter)



1) Replace each delay with $\mathbf{N}$ delays to create $\mathbf{N}$-slow DFG
2) Perform cutset retiming on the N-slow DFG

## Systolic Transformation



Source: L. D. Van, "A new 2-D systolic digital filter architecture without global broadcast," IEEE Trans. VLSI Systs., vol. 10, pp. 477-486, Aug. 2002

## The Usage of Retiming Techniques

■ Cutset retiming and pipelining

- Retiming for clock period minimization
- Retiming for register minimization


## Retiming for clock period minimization (1)

- Minimum feasible clock period or critical path:

$$
\Phi(G)=\max \{t(\boldsymbol{p}): U \xrightarrow{p} V, \mathrm{w}(\boldsymbol{p})=0\}
$$

Introduce two quantities $W(U, V)$ and $D(U, V)$ to determine if there is a retiming solution that can achieve a desired clock period :
$\square$ Minimum number of registers of $U \xrightarrow{P}(V) W(U, V)=\min \{w(\boldsymbol{p}): U \xrightarrow{P} V\}$
$\square$ Maximum computation time of $U \rightarrow V$ with minimum number of registers:

$$
D(U, V)=\max \{t(\boldsymbol{p}): U \xrightarrow{\boldsymbol{P}} V \text { and } w(\boldsymbol{p})=W(U, V)\}
$$



## Retiming for clock period minimization (2)

Algorithm to compute $W(U, V)$ and $D(U, V)$

1. Let $M=t_{\max } n$, where $t_{\max }$ is the maximum computation time of the nodes in $G$ and $n$ is the \# of nodes in $G$.
2. Form a new graph $G^{\prime}$ which is the same às $G$ except the edge weights are replaced by $w^{\prime}(e)=M W(e)-t(U)$ for all edges $U \rightarrow V$.
3. Solve for all pair shortest path problem on $G^{\prime}$ by using Floyd Warshall algorithm. Let $S_{U V}^{\prime \prime}$ be the shortest path form $U \xrightarrow{P} V$.
4. If $U \neq V$, then $W(U), V)=\left\lceil S_{U V}^{\prime} / M\right\rceil$ and
$D(U, V)=M W(U, V)-S_{U V}^{\prime}+t(V)$. If $U=V$, then $W(U, V)=0$ and $D(U, V)=t(U)$

## Retiming for clock period minimization (3)

(1)

(2)

(2)

Step1: $M=t_{\text {max }} n=2 \times 4=8$
Step2: New G' with $w^{\prime}(e)=M w(e)-t(U)$
Step3: pair shortest path problem on $G^{\prime}$

| $S_{U V}^{\prime}$ | 1 | 2 | 3 | 4 | $W(U, V)$ | 1 | 2 | 3 | 4 | $D(U, V)$ | 1 | 2 | 3 | 4 |
| :---: | :---: | :---: | :---: | :---: | :---: | :---: | :---: | :---: | :---: | :---: | :---: | :---: | :---: | :---: |
| 1 | 12 | 5 | 7 | 15 | 1 | 0 | 1 | 1 | 2 | 1 | 1 | 4 | 3 | 3 |
| 2 | 7 | 12 | 14 | 22 | 2 | 1 | 0 | 2 | 3 | 2 | 2 | 1 | 4 | 4 |
| 3 | 5 | -2 | 12 | 20 | 3 | 1 | 0 | 0 | 3 | 3 | 4 | 3 | 2 | 6 |
| 4 | 5 | -2 | 12 | 20 | 4 | 1 | 0 | 2 | 0 | 4 | 4 | 3 | 6 | 2 |

## Retiming for clock period minimization (4)

- Feasibility constraints
$w^{\prime}(e) \geq 0 \Rightarrow$ Causality of the system $=>w(e) \geq r(U)-r(V)$, $\forall e ~ U \xrightarrow{e} V$

$$
\begin{aligned}
& r(1)-r(3) \leq 1 \\
& r(1)-r(4) \leq 2 \\
& r(2)-r(1) \leq 1 \\
& r(3)-r(2) \leq 0 \\
& r(4)-r(2) \leq 0
\end{aligned}
$$

■ Critical Path constraints
$r(U)-r(V) \leq W(U, V)-1$ for all vertices $U$ and $V$ in $G$ such that

$D(U, V) \lessgtr c$, where $c=$ target clock period (if $c=3$ )

$$
\begin{array}{ll}
r(1)-r(2) \leq 0 & r(3)-r(4) \leq 2 \\
r(2)-r(3) \leq 1 & r(4)-r(1) \leq 0 \\
r(2)-r(4) \leq 2 & r(4)-r(3) \leq 1 \\
r(3)-r(1) \leq 0 &
\end{array}
$$

## Retiming for clock period minimization ( $\mathrm{c}=3$ )

Using $W(U, V)$ and $D(U, V)$ the feasibility and critical path constraints are formulated to give certain inequalities.
The inequalities are solved using constraint graphs and if a feasible solution is obtained then the circuit can be clocked with a period ' $c$ '
$\square$ Feasibility constraints

$$
\begin{aligned}
& r(\mathbf{1})-r(3) \leq \mathbf{1} \\
& r(\mathbf{1})-r(\mathbf{4}) \leq \mathbf{2} \\
& r(\mathbf{2})-r(\mathbf{1}) \leq \mathbf{1} \\
& r(\mathbf{3})-r(\mathbf{2}) \leq \mathbf{0} \\
& r(\mathbf{4})-r(\mathbf{2}) \leq 0
\end{aligned}
$$

- Critical Path constraints ( $c=3$ )

$$
\begin{aligned}
& r(1)-r(2) \leq 0 \\
& r(2)-r(3) \leq 1 \\
& r(2)-r(4) \leq 2 \\
& r(3)-r(1) \leq 0 \\
& r(3)-r(4) \leq 2 \\
& r(4)-r(1) \leq 0 \\
& r(4)-r(3) \leq 1
\end{aligned}
$$



Constraint graph

$$
r(1)=r(2)=r(3)=r(4)=0
$$

## Retiming for clock period minimization ( $\mathrm{c}=2$ )

Using $W(U, V)$ and $D(U, V)$ the feasibility and critical path constraints are formulated to give certain inequalities.
The inequalities are solved using constraint graphs and if a feasible solution is obtained then the circuit can be clocked with a period ' $c$ '
$\square$ Feasibility constraints


## General Approach (Retiming for clock period minimization)

1. Compute $W(U, V)$ and $D(U, V)$
2. Sort the values of $D(U, V)$
3. Perform binary search on these values to find the retiming solution with the minimum clock period.

## Retiming for Register Minimization(1)

If a node has several output edges carrying the same signal, the number of registers required to implement these edges is the maximum number of registers on any one of the edges.


## Retiming for Register Minimization(2)

Minimize $\operatorname{COST}=\sum R_{X}$ subject to

1. (fanout constraint) $R_{X} \geq w_{r}(e)$ for all $X$ and all edges $X \xrightarrow{e}$ ?
2. (feasibility constraint) $r(U)-C(V) \leq w(e)$ for every edge $U \xrightarrow{e} V$
3. (clock period constraint) $r(U)-r(V) \leq W(U, V)-1$ for all vertices $U, V$ such that $D(U, V)>c$.

## Retiming and Pipelining

Pipelining is a special case of cutset retiming where there are no edges in the cutset from G2 to G1, i.e., pipelining applies to graphs without loops
Pipelining is Equivalent to Introducing Many delays at the Input (or output) followed by Retiming


# Next Lecture: ParallelAArchitecture \& 

## Resource Sharing

