Building System-on-Chip

Updated 2018-03-07

Soft Core
CPU Interface
- Master & Slaves
- Memory mapped Peripherals
Interfacing with Hardware Accelerator
- 1. Create an interface between your hardware and Avalon fabric.
- 2. Attach hardware to a previously designed parallel I/O port
Lab 5 - SoC Lab
Example 1: Implementing a Slave Device
Example 2: Master and Slave
Other Way of Hardware Acceleration
- Amdahl’s Law

Real systems are not built from scratch.

Real systems contain both hardware and software.

Real systems revolve around embedded processor (CPU that coordinates and CPUs are easy to program).

Real systems are designed using system-level design tools.

How do we design these systems?

In SoC, hardware if connected to software through a memory mapped interface or dedicated circuit. The SoC consists ofr a processor and a bunch of embedded cores / accelerators. The cores are connected using some standard bus.

An analogy is a gaming computer that has a graphics card: the CPU is still coordinating the GPU on what to render (vertices, shaders, etc).

page 6

Recall from FPGA architecture that there are soft and hard embedded processors. The soft processors are implemented using LUTs and other FPGA resources. The hard processors are actually a processing embedded in the system.

Off-chip processors can also be coupled to the FPGA

Soft Core

page 14

CPU Interface

CPUs connected to devices using “interconnect”. The simplest connection is “bus”.

There are tristate drivers for each device connected to the bus such that only one device drives it at a time. But as shown before, this is not really possible on FPGA since FPGA doesn’t have tristate drivers.

Modern SoCs uses “interconnect fabric”. But essentially has the same function: access any cores using some address space.

Master & Slaves

Most bus protocols draw a distinction between:

Masters: initiate transaction, specify address, give instructions, etc.
Slaves: respond to requests from masters and can generate data

Most peripherals are slaves.

page 19

The CPU provides an address to address the connected peripherals. This means that any connected device will receive the address, but they have to see if the address is for that particular device. If it is, then it will listen.

The address is sent by the master which is received by 1 or more slaves.

page 22

In the interconnect fabric, instead of being a bus, there is just a tree of MUXs. The interconnect fabric in the DE1-SoC uses Altera Avalon. In their design, we can have multiple transactions happen at the same time.

Memory mapped Peripherals

As said before, each peripheral is mapped to a memory address. The peripheral will not respond to any calls to a different address that is not its own.

Interfacing with Hardware Accelerator

1. Create an interface between your hardware and Avalon fabric.

We need to make sure it is assigned to an memory address range that is accessible and not overlapping with other devices.

2. Attach hardware to a previously designed parallel I/O port

We just need to implement our own logic (combinatory logic) as a driver. The driver takes input from a parallel interface (PIO). The output is attached to a attached hardware.

In general, option 2 is much simpler to do. But option 1 is much more flexible.

Lab 5 - SoC Lab

The objective is to observe how SoCs are built in real life and become familiar with the system design tool (QSYS).

Experience:

Implement and configure processor of FPGA
program processors
Program the interconnect fabric
Interface hardware and software

page 31-34

How do we program it?

There are two ways to write and debug software:

Altera Monitor Program
Eclipse

Super simple sample program:

#define Switches (volatile char *) 0x0002000
#define LEDs (char *) 0x0002010

void main() {
    while (1) {
        *LEDs = *Switches;
    }
}

The hardware is mapped to the memory space in the processor. To read something, we make a read request to the switches via address 0x0002000. To write something, we make a write request to the LED memory address, which is 0x0002010.

All we’re doing in this sample code is constantly sampling the switches, then assigning on/off to the LEDs.

YAY! We just unlocked the ability to write software! Note that software is much slower and higher power than hardware.

The processor we’re using in this lab is much more customizable and can be tuned to exact needs, but is much slower (only up to 100MHz). Hence we need to hardware accelerator to do certain tasks.

Example 1: Implementing a Slave Device

We want: a circuit to determine if a number is prime

Define hardware / software interface

page 43

The software writes the number into location 0, this starts the computation of in the hardware. The computation may take multiple cycles. When the computation is completed, done (another address in the memory space) is set to 1. The prime flag is also asserted in the memory space.

Note that it is not necessarily always writing to memory, it’s just a piece of data that’s passing via the interconnect fabric.

The software has to poll for done going high.
Define hardware that makes up the core

Because the hardware is a slave, the implementation is straight forward.

page 47

Note that we are not writing memory, we are pretending to be the memory.

page 48
Write the software to interact with it
```
#define MY_ACCEL_BASE (volatile int *) 0x0002040
#define num 12973

void main() {

    // Write the number to location 0 in our memory chip
    // But we're actually not writing to memory
    // We're just sending the request via the bus
	*(MY_ACCEL_BASE + 0) = num;

    // Keep looping if not done
    while ((*(MY_ACCEL_BASE + 2)) == 0);

    // Read if prime
    prime = 0;
    if (*(MY_ACCEL_BASE + 1))
        prime = 1;
}
```
Variables in C can either end up in registers or memory. Using volatile will ensure the variable is mapped to the register. Otherwise, it might be mapped to a register and thus no transaction on the bus.

What will happen if the volatile on line 1 is removed?

The MY_ACCEL_BASE might be put into a register, and thus the while loop might become infinite loop because it would just read the register over and over again.

Example 2: Master and Slave

We want: an accelerator that draws a box in the pixel buffer

The accelerator must be a slave because the processor can write and read values to the control registers

The accelerator must also be a master because the pixel buffer is stored in memory and the accelerator must initiate.

Memory map:

page 52

The slave interface would implement the registers and interface to the Avalon fabric.

Other Way of Hardware Acceleration

page 60

One could create a custom instruction that is connected to some custom logic.

A question to ask is: “is it work accelerate?”. If, without loss of generality, 80% of the time the system is executing 10% of the code such as a loop, then consider implementing one.

What are the limitations?

In most cases, it takes more time to transfer the data to accelerator than using it to execute code
Amdahl’s Law

Amdahl’s Law

Speedup is limited by the amount of code that we cannot speed up. Suppose \(P\) is the fraction of execution you can speed up, and \(S\) is the amount that we CAN speed up. Then the expression of for how much speedup there is is given as:

\[\frac{1}{(1-P)+\frac{P}{S}}\]

For example, if 25% of the code executed can run twice as fast, then \(P=0.25\), \(S=2\) and the overall speed up is \(14\%\).