muchen 牧辰

Building System-on-Chip

Updated 2018-03-07

Real systems are not built from scratch.

Real systems contain both hardware and software.

Real systems revolve around embedded processor (CPU that coordinates and CPUs are easy to program).

Real systems are designed using system-level design tools.

How do we design these systems?

In SoC, hardware if connected to software through a memory mapped interface or dedicated circuit. The SoC consists ofr a processor and a bunch of embedded cores / accelerators. The cores are connected using some standard bus.

An analogy is a gaming computer that has a graphics card: the CPU is still coordinating the GPU on what to render (vertices, shaders, etc).

page 6

Recall from FPGA architecture that there are soft and hard embedded processors. The soft processors are implemented using LUTs and other FPGA resources. The hard processors are actually a processing embedded in the system.

Off-chip processors can also be coupled to the FPGA

Soft Core

page 14

CPU Interface

CPUs connected to devices using “interconnect”. The simplest connection is “bus”.

There are tristate drivers for each device connected to the bus such that only one device drives it at a time. But as shown before, this is not really possible on FPGA since FPGA doesn’t have tristate drivers.

Modern SoCs uses “interconnect fabric”. But essentially has the same function: access any cores using some address space.

Master & Slaves

Most bus protocols draw a distinction between:

Most peripherals are slaves.

page 19

The CPU provides an address to address the connected peripherals. This means that any connected device will receive the address, but they have to see if the address is for that particular device. If it is, then it will listen.

The address is sent by the master which is received by 1 or more slaves.

page 22

In the interconnect fabric, instead of being a bus, there is just a tree of MUXs. The interconnect fabric in the DE1-SoC uses Altera Avalon. In their design, we can have multiple transactions happen at the same time.

Memory mapped Peripherals

As said before, each peripheral is mapped to a memory address. The peripheral will not respond to any calls to a different address that is not its own.

Interfacing with Hardware Accelerator

1. Create an interface between your hardware and Avalon fabric.

We need to make sure it is assigned to an memory address range that is accessible and not overlapping with other devices.

2. Attach hardware to a previously designed parallel I/O port

We just need to implement our own logic (combinatory logic) as a driver. The driver takes input from a parallel interface (PIO). The output is attached to a attached hardware.

In general, option 2 is much simpler to do. But option 1 is much more flexible.

Lab 5 - SoC Lab

The objective is to observe how SoCs are built in real life and become familiar with the system design tool (QSYS).


page 31-34

How do we program it?

There are two ways to write and debug software:

Super simple sample program:

#define Switches (volatile char *) 0x0002000
#define LEDs (char *) 0x0002010

void main() {
    while (1) {
        *LEDs = *Switches;

The hardware is mapped to the memory space in the processor. To read something, we make a read request to the switches via address 0x0002000. To write something, we make a write request to the LED memory address, which is 0x0002010.

All we’re doing in this sample code is constantly sampling the switches, then assigning on/off to the LEDs.

YAY! We just unlocked the ability to write software! Note that software is much slower and higher power than hardware.

The processor we’re using in this lab is much more customizable and can be tuned to exact needs, but is much slower (only up to 100MHz). Hence we need to hardware accelerator to do certain tasks.

Example 1: Implementing a Slave Device

We want: a circuit to determine if a number is prime

  1. Define hardware / software interface

    page 43

    The software writes the number into location 0, this starts the computation of in the hardware. The computation may take multiple cycles. When the computation is completed, done (another address in the memory space) is set to 1. The prime flag is also asserted in the memory space.

    Note that it is not necessarily always writing to memory, it’s just a piece of data that’s passing via the interconnect fabric.

    The software has to poll for done going high.

  2. Define hardware that makes up the core

    Because the hardware is a slave, the implementation is straight forward.

    page 47

    Note that we are not writing memory, we are pretending to be the memory.

    page 48

  3. Write the software to interact with it

    #define MY_ACCEL_BASE (volatile int *) 0x0002040
    #define num 12973
    void main() {
        // Write the number to location 0 in our memory chip
        // But we're actually not writing to memory
        // We're just sending the request via the bus
    	*(MY_ACCEL_BASE + 0) = num;
        // Keep looping if not done
        while ((*(MY_ACCEL_BASE + 2)) == 0);
        // Read if prime
        prime = 0;
        if (*(MY_ACCEL_BASE + 1))
            prime = 1;

    Variables in C can either end up in registers or memory. Using volatile will ensure the variable is mapped to the register. Otherwise, it might be mapped to a register and thus no transaction on the bus.

    What will happen if the volatile on line 1 is removed?

    The MY_ACCEL_BASE might be put into a register, and thus the while loop might become infinite loop because it would just read the register over and over again.

Example 2: Master and Slave

We want: an accelerator that draws a box in the pixel buffer

The accelerator must be a slave because the processor can write and read values to the control registers

The accelerator must also be a master because the pixel buffer is stored in memory and the accelerator must initiate.

Memory map:

page 52

The slave interface would implement the registers and interface to the Avalon fabric.

Other Way of Hardware Acceleration

page 60

One could create a custom instruction that is connected to some custom logic.

A question to ask is: “is it work accelerate?”. If, without loss of generality, 80% of the time the system is executing 10% of the code such as a loop, then consider implementing one.

What are the limitations?

  1. In most cases, it takes more time to transfer the data to accelerator than using it to execute code
  2. Amdahl’s Law

Amdahl’s Law

Speedup is limited by the amount of code that we cannot speed up. Suppose \(P\) is the fraction of execution you can speed up, and \(S\) is the amount that we CAN speed up. Then the expression of for how much speedup there is is given as:


For example, if 25% of the code executed can run twice as fast, then \(P=0.25\), \(S=2\) and the overall speed up is \(14\%\).