Updated 2018-03-07
Real systems are not built from scratch.
Real systems contain both hardware and software.
Real systems revolve around an embedded processor (a CPU that coordinates the rest of the system, and CPUs are easy to program).
Real systems are designed using system-level design tools.
How do we design these systems?
In an SoC, hardware is connected to software through a memory-mapped interface or a dedicated circuit. The SoC consists of a processor and a bunch of embedded cores / accelerators. The cores are connected using some standard bus.
An analogy is a gaming computer that has a graphics card: the CPU is still coordinating the GPU on what to render (vertices, shaders, etc.).
page 6
Recall from FPGA architecture that there are soft and hard embedded processors. The soft processors are implemented using LUTs and other FPGA resources. The hard processors are actual processors embedded in the chip as dedicated silicon.
Off-chip processors can also be coupled to the FPGA.
page 14
CPUs are connected to devices using an “interconnect”. The simplest interconnect is a “bus”.
On a bus, each connected device has tristate drivers so that only one device drives the bus at a time. But as shown before, this is not really possible inside an FPGA, since the FPGA fabric doesn’t have internal tristate drivers.
Modern SoCs use an “interconnect fabric”, but it essentially has the same function: accessing any core through some address space.
Most bus protocols draw a distinction between masters, which initiate transactions, and slaves, which respond to them.
Most peripherals are slaves.
page 19
The CPU provides an address to select one of the connected peripherals. Every connected device receives the address, and each one has to check whether the address belongs to it. If it does, the device responds.
The address is sent by the master and received by one or more slaves.
page 22
In the interconnect fabric there is no shared bus; instead there is a tree of multiplexers (MUXes). The interconnect fabric in the DE1-SoC uses Altera Avalon. In this design, multiple transactions can happen at the same time.
As said before, each peripheral is mapped to a memory address. The peripheral will not respond to accesses to an address that is not its own.
We need to make sure each peripheral is assigned a memory address range that is accessible and does not overlap with other devices.
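The two rules above (a peripheral responds only to its own addresses, and ranges must not overlap) can be sketched in C. This is only an illustrative model, not how Qsys or the Avalon fabric is implemented; the `addr_range_t` type and the function names are hypothetical, and the 16-byte spans used in the usage note are assumptions.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical description of one peripheral's slice of the address space. */
typedef struct {
    const char *name;
    uint32_t    base;  /* first byte address owned by the peripheral */
    uint32_t    span;  /* number of bytes it occupies (assumed > 0)  */
} addr_range_t;

/* True if addr falls inside the range, i.e. this peripheral should respond. */
bool range_selects(const addr_range_t *r, uint32_t addr)
{
    return addr >= r->base && (addr - r->base) < r->span;
}

/* True if two ranges share at least one address. */
bool ranges_overlap(const addr_range_t *a, const addr_range_t *b)
{
    /* 64-bit sums avoid wrap-around when base + span exceeds 32 bits. */
    uint64_t a_end = (uint64_t)a->base + a->span;
    uint64_t b_end = (uint64_t)b->base + b->span;
    return a->base < b_end && b->base < a_end;
}

/* True if no pair of ranges in the map overlaps. */
bool map_is_valid(const addr_range_t *map, size_t n)
{
    for (size_t i = 0; i < n; i++)
        for (size_t j = i + 1; j < n; j++)
            if (ranges_overlap(&map[i], &map[j]))
                return false;
    return true;
}
```

For example, a map holding `{"switches", 0x2000, 0x10}` and `{"leds", 0x2010, 0x10}` passes `map_is_valid`, while widening the first span to `0x20` makes the two ranges collide.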
We just need to implement our own logic (combinational logic) as a driver. The driver takes its input from a parallel I/O interface (PIO). Its output is connected to the attached hardware.
In general, option 2 is much simpler to do. But option 1 is much more flexible.
The objective is to observe how SoCs are built in real life and to become familiar with the system design tool (Qsys).
Experience:
page 31-34
How do we program it?
There are two ways to write and debug software:
Super simple sample program:
#define Switches ((volatile char *) 0x0002000)
#define LEDs ((volatile char *) 0x0002010)
int main(void) {
    while (1) {
        /* Copy the switch states to the LEDs on every pass */
        *LEDs = *Switches;
    }
}
The hardware is mapped into the memory space of the processor. To read something, we make a read request to the switches via address 0x0002000. To write something, we make a write request to the LED memory address, which is 0x0002010.
All we’re doing in this sample code is constantly sampling the switches, then assigning on/off to the LEDs.
YAY! We just unlocked the ability to write software! Note that software is generally much slower and consumes more power than dedicated hardware doing the same task.
The processor we’re using in this lab is much more customizable and can be tuned to exact needs, but is much slower (only up to 100 MHz). Hence we need a hardware accelerator to do certain tasks.
We want: a circuit to determine if a number is prime
Define hardware / software interface
page 43
The software writes the number into location 0, which starts the computation in the hardware. The computation may take multiple cycles. When the computation is completed, done (another address in the memory space) is set to 1. The prime flag is also exposed in the memory space.
Note that it is not necessarily always writing to memory, it’s just a piece of data that’s passing via the interconnect fabric.
The software has to poll for done going high.
Define hardware that makes up the core
Because the hardware is a slave, the implementation is straightforward.
page 47
Note that we are not writing to memory; we are pretending to be the memory.
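To make "pretending to be the memory" concrete, here is a small behavioural model of the core in C. It is a sketch only: the real core is hardware behind an Avalon slave interface, and the names `prime_core_t`, `core_write`, and `core_read` are hypothetical. The word offsets (0 = number, 1 = prime, 2 = done) follow the interface used by the software in these notes, and trial division is an assumed stand-in for whatever datapath the actual core uses.

```c
#include <stdbool.h>
#include <stdint.h>

/* Word offsets seen by the processor: 0 = number, 1 = prime, 2 = done. */
enum { REG_NUMBER = 0, REG_PRIME = 1, REG_DONE = 2 };

/* State held inside the (modelled) core. */
typedef struct {
    uint32_t number;
    uint32_t prime;
    uint32_t done;
} prime_core_t;

/* Trial division; a stand-in for the multi-cycle hardware computation. */
bool is_prime(uint32_t n)
{
    if (n < 2)
        return false;
    for (uint32_t d = 2; (uint64_t)d * d <= n; d++)
        if (n % d == 0)
            return false;
    return true;
}

/* The fabric delivers a write: only offset 0 is writable, and it starts a run. */
void core_write(prime_core_t *c, uint32_t offset, uint32_t data)
{
    if (offset == REG_NUMBER) {
        c->number = data;
        c->done   = 0;                        /* busy */
        c->prime  = is_prime(data) ? 1u : 0u; /* hardware would take many cycles */
        c->done   = 1;                        /* result is now valid */
    }
}

/* The fabric delivers a read: the core answers as if it were a memory. */
uint32_t core_read(const prime_core_t *c, uint32_t offset)
{
    switch (offset) {
    case REG_NUMBER: return c->number;
    case REG_PRIME:  return c->prime;
    case REG_DONE:   return c->done;
    default:         return 0;
    }
}
```

From the processor's point of view nothing is stored in RAM: a write to offset 0 is a command, and reads of offsets 1 and 2 return status that the core generates on the fly.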
page 48
Write the software to interact with it
#define MY_ACCEL_BASE ((volatile int *) 0x0002040)
#define num 12973
int main(void) {
    int prime = 0;
    // Write the number to location 0 in our memory chip
    // But we're actually not writing to memory
    // We're just sending the request via the bus
    *(MY_ACCEL_BASE + 0) = num;
    // Keep looping if not done (location 2 is the done flag)
    while ((*(MY_ACCEL_BASE + 2)) == 0);
    // Read if prime (location 1 is the prime flag)
    if (*(MY_ACCEL_BASE + 1))
        prime = 1;
    return prime;
}
Variables in C can end up either in registers or in memory. Using volatile tells the compiler that every access through the pointer must actually be performed, so each read or write becomes a transaction on the bus. Otherwise, the value might be kept in a register and thus produce no transaction on the bus.
What will happen if the volatile on line 1 is removed? The value read through MY_ACCEL_BASE might be kept in a register, and thus the while loop might become an infinite loop because it would just read the register over and over again instead of polling the hardware.
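One defensive variation on the polling loop is to bound the number of polls, so that a core that never raises done cannot hang the software forever. The helper below is a hypothetical sketch, not part of the lab code; `poll_until_set` and `max_polls` are made-up names. The `volatile` qualifier on the pointer target is what forces a fresh load on every iteration.

```c
#include <stdbool.h>
#include <stdint.h>

/*
 * Poll a memory-mapped flag until it becomes non-zero or max_polls reads
 * have been made. Returns true if the flag was seen set, false on timeout.
 */
bool poll_until_set(volatile const uint32_t *reg, uint32_t max_polls)
{
    for (uint32_t i = 0; i < max_polls; i++) {
        if (*reg != 0)   /* volatile: a real load is issued every time */
            return true;
    }
    return false;
}
```

On the target this would be called with the address of the done location; on a desktop machine it can be exercised by pointing it at an ordinary variable.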
We want: an accelerator that draws a box in the pixel buffer
The accelerator must be a slave because the processor writes and reads values in its control registers.
The accelerator must also be a master because the pixel buffer is stored in memory and the accelerator must initiate the writes to it.
Memory map:
page 52
The slave interface would implement the registers and interface to the Avalon fabric.
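The split between the two interfaces can be modelled in C as follows. This is a sketch under stated assumptions: the register layout in `box_regs_t` is hypothetical (the actual memory map is on the slides), pixels are assumed to be 16 bits wide, and `draw_box` only imitates the sequence of writes the master side would issue.

```c
#include <stdint.h>

/* Hypothetical control registers, written by the processor via the slave port. */
typedef struct {
    uint32_t x0, y0;   /* top-left corner, inclusive     */
    uint32_t x1, y1;   /* bottom-right corner, inclusive */
    uint32_t colour;   /* pixel value to write           */
} box_regs_t;

/*
 * Models the master side: one memory write per pixel inside the box,
 * clipped to the buffer dimensions. Returns the number of writes issued.
 */
uint32_t draw_box(uint16_t *pixels, uint32_t width, uint32_t height,
                  const box_regs_t *r)
{
    uint32_t writes = 0;
    for (uint32_t y = r->y0; y <= r->y1 && y < height; y++) {
        for (uint32_t x = r->x0; x <= r->x1 && x < width; x++) {
            pixels[y * width + x] = (uint16_t)r->colour;
            writes++;
        }
    }
    return writes;
}
```

The processor only touches `box_regs_t` (the slave side); every store into `pixels` corresponds to a transaction that the accelerator itself starts as a master.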
page 60
One could create a custom instruction that is connected to some custom logic.
A question to ask is: “is it worth accelerating?”. If, for example, the system spends 80% of its time executing 10% of the code, such as a loop, then consider implementing one.
What are the limitations?
Speedup is limited by the amount of code that we cannot speed up. Suppose \(P\) is the fraction of execution that can be sped up, and \(S\) is the factor by which that fraction is sped up. Then the overall speedup is given by:
\[\frac{1}{(1-P)+\frac{P}{S}}\]
For example, if 25% of the code executed can run twice as fast, then \(P=0.25\), \(S=2\) and the overall speedup is \(\frac{1}{0.75+0.125}\approx 1.14\), i.e. about \(14\%\).
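The formula is easy to evaluate for different values of \(P\) and \(S\). The C helpers below are a minimal sketch; the function names are made up for illustration. `amdahl_limit` gives the bound reached as \(S\) grows without limit, which shows how strongly the untouched code caps the gain.

```c
/* Overall speedup when a fraction p of execution is made s times faster. */
double amdahl_speedup(double p, double s)
{
    return 1.0 / ((1.0 - p) + p / s);
}

/* Upper bound on the overall speedup as s grows without limit (requires p < 1). */
double amdahl_limit(double p)
{
    return 1.0 / (1.0 - p);
}
```

With \(P=0.25\) and \(S=2\) this reproduces the roughly \(14\%\) gain above, and even an infinitely fast accelerator would only reach \(1/0.75\approx 1.33\).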