muchen 牧辰

Optimization and FPGA Architecture

Updated 2018-04-10


We now have enough Verilog knowledge to design anything. How do we know if it’s optimal?

Some of the things we can optimize are:

FPGA Architecture

Before looking at FPGA architecture and compilation, understand other forms of digital design.

Fully Custom Integrated Circuits

Gates are implemented using MOS transistors. They are connected using metal wires on the chip. Note that FC chips are different from ASIC chips, which is what most industries use.

page 7

The bottom layer are transistors that contain Silicon. Everything above is mostly copper and used for connections. Short range wires will be on the bottom and are thinner, the longer ranged wires are on top.

Full custom chips lay all of this out manually to maximize optimization. Unsurprisingly, this is not very economically sustainable.

Standard Cell Design

FC chips are too tedious. Let us abstract a bit.

There are cells that have different width but the same height laid out in rows. There exists libraries of cells that implement different gates.

page 9

Standard cell design is used only for all high speed, low power applications. It is very expensive and time consuming to design and build. However, it is still more economical than FC chips.

FPGA Compilation

RTL: Register transfer language

After synthesis, a netlist is created which represents the circuit of logic gates. A netlist consists of a bunch of gates and registers and somehow connected via wires or busses.

At this point, we still don’t know the placement and routing of the components. The placement is the act of placing gates onto the FPGA. Hence the placement and routing part of the compilation.

Lastly, we have a timing analysis. This tests if the design fit within the timing constraints. Once we place and routed the signal, we know exactly how long the delays due to routing is.

Optimization involves repetition of placement and routing and minimizing a cost function that returns the timing (ie. gradient descent).

FPGA CAD and Architecture

Inside FPGA

page 18

Inside FPGA, there are a lot of “boxes” called logic blocks. They are used to implement logic and flip-flips. They contain a LUT (kind of like memory).

On the outside, there are I/O blocks that interface off-chip and usually can support many I/O standard. This is because FPGAs are widely used as interface between two hardware that doesn’t interface well.

page 20

If one block wants to communicate with another block, then the router will route a wire (track) between the blocks. Thus, one can see the speed also depends on the distance of interconnection between the logic blocks.


There are two parts to synthesis:

  1. Map behavior to logic equations / gates
  2. Map gates to FPGA logic blocks (this is FPGA-specific)

Part 1: Behavior to Gates

Each process in the design is compiled independently. This makes sense because always blocks is completely parallel.

Part 2: Gates to ALM

page 25

There exists some four-input LUT inside the ALM. This is the “logic part” and can be used to construct any logic function. Note that a four-input LUT can “simulate” any logic function that takes four inputs. Similarly, an eight-input LUT can “simulate” any logic function that takes eight inputs.

There is a carry-in signal from output of a previous block (similar to an adder). The route of LUT can be connected to a flip-flop or connect multiple logic blocks together to form a more complicated module.

There is some control signals like sload allows the input of the register to be directly from the data inputs.

The output is a collection of MUXs that can route signals to neighboring blocks or some other logic blocks farther away. The sel signals to these MUXs (that has some storage pieces) are programmed via compile-time, and don’t change in runtime. This happens when we’re programming the FPGA; the state of these MUXs are sent via the .sof bitstream.

page 26

In a more complicated FPGA, there are adaptive LUT, which allows it to be split into different configurations (such as a bunch of four-input LUT, and some other LUTs. But they still share the same data input). There are more registers and MUXs. There are also full-adders in each logic block.

How do we design the LUT so we can program it later?

Just use memory. The LUT just contains memory.

page 28

There are memory cells exist at the input signals of the MUX. The select signal just output the memory that is being selected. In fact, part of the bitstream goes to the memory inside the LUT. The select signals are basically addresses to the memory inside the LUT.

Fact: A LUT with K inputs can implement any function with up to K inputs

This is because the function is injective and can be proved using the Pigeonhole Principle

Flip Flops

The flip flop can be enabled by changing the multiplexer bit page 34. To use flip flop only, just apply an identity function to the LUT. This is also known as unity combinational logic.

page 38 example: need 7 logic blocks because the LUT only accept 2 inputs but we have a 3-way OR-gate.

Fracturable LUTs are LUTs that can be reconfigured. For instance if we have a signal that is fed into two logic blocks, then it would be more economical to split a large LUT into smaller ones so we don’t need to use a different logic block.

More Blocks

In most modern FPGAs, besides the ALM, there are also some different blocks.


Simulating memory using FPGA ALMs using memory is terribly inefficient. Using dedicated memory cells is not only cheaper, but also can be packed more densely.

The embedded memory blocks can vary in size. For the DE1-SoC, we have about 4MBits.

The memory blocks can be configured to several modes such as 128x36, 128x32, 256,x16, etc. This means that the configuration is split onto # of words x # bits per word

Multiplier Blocks

Dedicated to multiplying things. Used especially for graphics and machine learning applications.

To use, just specify * in Verilog:

assign c = a * b;

Digital Signal Processor (DSP) Blocks


After synthesis, Quartus will place and route the logic onto FPGA. Initially, the initial placement is random, but there is a lot of congestion and not very optimized. The optimized placement places the components that should be closer together closer together to minimize distance.

For a small design, this process is pretty much instant. For larger designs, this can take hours since the complexity of the routing algorithm is polynomial.