# AM04_CPU

ELEC50010 Instr. Arch + Comp. : CPU Coursework
==============================================

This is the CPU coursework for the 2020-21 year of IA+C.

The submission timings are:

- Mon Nov 23rd : Coursework "officially" starts (it's a 1 month coursework).
- Mon Dec 7th 22:00 : Optional formative feedback point. If you submit your current work in progress,
    then it will get manually examined, and receive oral formative feedback.
- Wed Dec 16th 22:00 : Optional sanity check point. Some simple scripts will be run on current submissions to
    check for things like file-names, whether scripts can be executed, and ability to test a
    CPU that is not your own.
- Mon Dec 21st 22:00 : Final deliverables due.

Revision log
============

- 2020/08/13 : v0 - Initial draft
- 2020/10/20 : v1.0 - Updated with harvard and bus to provide simpler learning curve.
- 2020/11/16 : v1.1 - Minor tweaks based on lab results.
- 2020/11/20 : v1.2 - Added missing environment/standards part.
- 2020/11/25 : v1.3 - Various tweaks and clarifications
    - Added the ability to include a provision script
    - Fixed the typo related to PC on reset.
    - Added gcc-mipsel-linux-gnu as explicitly available package.

Overall goals
=============

Your overall goal is to develop a working synthesisable MIPS-compatible CPU.
This CPU will interface with the world using a memory-mapped bus, which gives
it access to memory and other peripherals.

The goal of this coursework is not to get a single circuit working in a single
piece of hardware. Instead it is to develop a piece of IP which could be
sold and distributed to many clients, allowing them to integrate your CPU
into any number of products.  As a consequence the emphasis is on producing
a production quality CPU with a robust testing process - you should deliver
something that you expect to work on any FPGA or ASIC, rather than something
that just works on a single device.

The emphasis on creating a "real" CPU makes this a more complex task
than implementing a toy CPU with lots of extra debug hooks. In particular,
the emphasis on memory-based input/output is very realistic, but means
you need to be very methodical and analytical in the way you develop
both your CPU *and* your test-bench and test-cases.

Coursework deliverables
=======================

Your coursework deliverables consist of the following:

1.  `rtl/mips_cpu_bus.v` or `rtl/mips_cpu_harvard.v` : An implementation of a MIPS CPU which meets the pre-specified
    template for signal names and interface timings. You may also include other verilog
    modules in files of the form `rtl/mips_cpu/*.v` and/or `rtl/mips_cpu_*.v`.
    If you include both a `bus` and a `harvard` verilog file it will be assumed
    that you want the `bus` version to be assessed. Any files not matching
    these patterns will be ignored.

2.  `test/test_mips_cpu_bus.sh` or `test/test_mips_cpu_harvard.sh` : A test-bench for any CPU meeting the given interface.
    This will act as a test-bench for your own CPU, but should also aim to check
    whether any other CPU works as well. You can include both scripts, but only the
    one corresponding to your submitted CPU (bus or harvard) will be evaluated.

3.  `docs/mips_data_sheet.pdf` : A data-sheet for your CPU, consisting of at most 4 A4 pages. This
    data-sheet should cover:

    - The overall architecture of your CPU.
    - At least one diagram of your CPU's architecture.
    - Design decisions taken when implementing the CPU.
    - The approach taken to testing CPUs.
    - At least one diagram or flow-chart describing your testing flow or approach.
    - Area and timing summary for the "Cyclone IV E ‘Auto’" variant in Quartus (same as used in the EE1 "CPU" project).

4.  Peer feedback : individual submission by each group member to provide peer feedback
    on your team members, submitted via Microsoft Forms.

Assessment
==========

The coursework mark comes from the following components:

-   Functionality (40%) : does the CPU work?

    - This is assessed purely based on whether instructions are functionally correct.
    - The only method used to assess correctness is to look at the changes to RAM that the CPU performs,
        and/or the final value of register `v0`.
    - The same set of instructions are tested for both the bus and harvard interfaces, but if the harvard interface
        is used, then this component is scaled by `0.8`.

-   Testbench (30%) :   can the test-bench detect whether other CPUs work?

    - This is assessed by telling your test-bench to test other CPUs.
    - The variant of your test-bench (bus versus harvard) assessed will match your CPU.
    - You should expect it to be tested on a "perfect" CPU, as well as selectively broken CPUs.
    - Your test-bench should not say the perfect CPU fails (false-negative), nor should it say the broken CPU passes (false-positive).

-   Data-sheet (30%) : is the architectural and testing approach adequately described?

    - Have the required components been covered?
    - Is it a client-oriented document, rather than oriented at the people who developed the CPU?
    - Does it provide useful information specific to your solution?
    - Does it highlight any clever or important features/decisions?

-   Peer-feedback (+-5%) : allocated according to peer feed-back within the group. This
    will affect the individual mark by up to 5% compared to the group mark.

Submission
----------

Submission will be through a `.tar.gz` submitted via blackboard.

It is up to you
to choose/manage source code control through whatever tool or technology you want.
You can get access to github pro through the github education programme, but you
can use any other service your team prefers - if you want to work out of a shared
DropBox then that is up to you.

Note that any git repo should not be public while the assessment is ongoing, in
order to avoid any plagiarism concerns. Once the assessment is finished you can
make the code available publicly.


CPU interface
=============

You have a choice of two different interface styles for your CPU to support:

- **Bus** : A true memory bus based interface, which is directly compatible with industrial
            IP blocks. This requires instructions and data to be fetched over the same
            interface, and also allows memory to have variable latency.

- **Harvard** : A simpler interface which provides separate instruction and data
            memory interfaces. These interfaces also support combinatorial read paths,
            and single-cycle write paths.

You need to choose one of these methods for the final submission, but might find
it useful to start with harvard and then migrate to bus. Most of the internal
control and arithmetic logic can be directly shared between the two approaches,
as long as you are taking a disciplined approach to decomposing your design.
It is also (intentionally) possible to take a Harvard CPU and wrap it in a
module which will transparently adapt it to the Bus interface, which can
be another route to a working bus-based CPU.

If you include both a bus and a harvard variant in your submission, then it
is assumed that you intend the bus version to be the submitted version.

Because the harvard version simplifies away a number of more real-world
constraints, the functionality mark is scaled by 0.8 compared to the
same functionality in a bus CPU. The other components (testing and documentation)
are unaffected by whether harvard or bus is used.

Shared interface aspects
------------------------

Both interfaces share the following common signals:
```
module mips_cpu_...(
    input logic clk,
    input logic reset,
    output logic active,
    output logic[31:0] register_v0,
```

All signals are synchronous to `clk`, including `reset`.

The `reset` signal must be held high for at least 1 cycle to reset the CPU. This
is a level-sensitive reset, which is synchronous to the clock.

The `active` signal should be driven high when `reset` is asserted, and remain
high until the CPU halts. Once the CPU has halted (for any reason) the `active`
signal should be sent low.

If the CPU has completed execution (i.e. it has been reset and then `active` has been
sent low), then `register_v0` should contain the final value of register `$v0` (register index 2) from the
register file. This is purely to make your test-benches easier, and is not
something typically included in a CPU IP core.

The CPU does not have any support for interrupts or other input/output signals. The
only way of communicating is via memory bus transactions, the `active` signal, and
the `register_v0` signal.

Bus based interface
-------------------

The CPU uses a single [Avalon](https://www.intel.com/content/dam/www/programmable/us/en/pdfs/literature/manual/mnl_avalon_spec.pdf)
compatible memory-mapped interface to interact with memory. Your
CPU acts as a bus controller, and issues read and write transactions in order to change
memory contents. However, it is important to remember that your CPU should be completely
independent of the memory itself. The memory may be a genuine hardware RAM implemented
using BRAM or DDR, or it could be a completely virtual memory provided by a test-bench.

The bus-based CPU interface has the following signals:
```
module mips_cpu_bus(
    /* Standard signals */
    input logic clk,
    input logic reset,
    output logic active,
    output logic[31:0] register_v0,

    /* Avalon memory mapped bus controller (master) */
    output logic[31:0] address,
    output logic write,
    output logic read,
    input logic waitrequest,
    output logic[31:0] writedata,
    output logic[3:0] byteenable,
    input logic[31:0] readdata
);
```

Avalon is a clock synchronous protocol, so `readdata` will not become
available until the cycle following the read request. The signal `waitrequest`
is used to indicate a stall
cycle, which means that the read or write request cannot complete in the
current cycle, and so must be continued in the next cycle.
See section 3.5.1 and Figure 7 of the
[Avalon spec](https://www.intel.com/content/dam/www/programmable/us/en/pdfs/literature/manual/mnl_avalon_spec.pdf)
for more info.

Harvard interface
-----------------

Everything is easier if there are two separate instruction and data memory
buses, _and_ the memory interfaces support combinatorial (zero-cycle) reads.
Taken together, these allow you to build the simple single-cycle data-path developed
during the first week of lectures. However, this is also very unrealistic, as
most CPUs (ignoring embedded micro-controllers) only have access to a single memory
bus, and have to deal with variable memory stall cycles. Unfortunately,
such a single memory bus design is complex, and represents a difficult starting
point, as there are two main ways of implementing it - either you need to
effectively implement an instruction and data cache plus appropriate
stall logic, or you need to implement a more complex multi-cycle finite-state
machine to execute the instructions.

The harvard interface here allows you to choose to use the simpler interface,
which removes a lot of that complexity. The interface is as follows:

```
module mips_cpu_harvard(
    /* Standard signals */
    input logic     clk,
    input logic     reset,
    output logic    active,
    output logic [31:0] register_v0,

    /* New clock enable. See below. */
    input logic     clk_enable,

    /* Combinatorial read access to instructions */
    output logic[31:0]  instr_address,
    input logic[31:0]   instr_readdata,

    /* Combinatorial read and single-cycle write access to data */
    output logic[31:0]  data_address,
    output logic        data_write,
    output logic        data_read,
    output logic[31:0]  data_writedata,
    input logic[31:0]  data_readdata
);
```

The signals prefixed `instr_` implement the instruction bus, while those
prefixed `data_` implement the data bus.

The new signal `clk_enable` supplies a clock enable, and should be
used to determine whether to update your flip-flops in a given cycle.
The general pattern for updating registers with a clock
enable is:
```
always_ff @(posedge clk) begin
    if (reset) begin
        /* Do reset logic */
        my_ff <= ... ;
    end
    else if (clk_enable) begin
        /* Perform clock update */
        my_ff <= ... ;
    end
end
```

The interface semantics guarantee that if `clk_enable` is high then the following conditions all hold:

1. `instr_readdata == MEMORY[instr_address]`
2. `data_read==1 -> data_readdata == MEMORY[data_address]`
3. `data_write==1 -> MEMORY[data_address] == data_writedata`

Note that `A -> B` means logical implication, so "if A then B".

You should still combinatorially drive all other output signals (e.g. `data_read`, `data_write`, `instr_address`)
during cycles where `clk_enable==0`, as the `clk_enable` signal is in part derived from
those signals.

The Harvard interface does not provide access to byte enables, which means
that partial store instructions (e.g. `sh`, `sb` and `swl`) are quite complicated.
If you are getting to that level it is probably better to switch to the
bus based interface.

Constraints on the interface are:

- `! (data_read & data_write)` : You cannot read and write in the same cycle.

- `data_write==1 -> instr_address != data_address` : You cannot modify the instruction currently
    being read (note the comment later on self-modifying code).


Reset Behaviour
---------------

During reset (i.e. while the `reset` signal is high), the CPU should not initiate
any memory transactions, as the memory may also be resetting at the same time.

The `reset` signal may be held high for more than one cycle, as other IP
cores or devices could be driven by the same reset and need more than one
cycle to reset.

It is not specified what the CPU should do during reset, but the
_effect_ of reset should be that:

- All ISA-visible MIPS data registers are set to zero.
- The next instruction to be executed post-reset should be at address `0xBFC00000`.

The address `0xBFC00000` is the [reset vector](https://en.wikipedia.org/wiki/Reset_vector)
of the CPU, and is the conventional reset vector for "real" MIPS CPUs. The slightly
odd address is to place it at the start of the 4MB region `[0xBFC00000,0xC0000000)`.
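
As a quick check of the 4MB figure, the size of that half-open region can be
computed with shell arithmetic. This is a throwaway sketch, and the variable
names are purely illustrative:

```shell
# Size of the half-open region [0xBFC00000, 0xC0000000) in bytes and MiB.
region_start=$(( 0xBFC00000 ))
region_end=$(( 0xC0000000 ))
region_bytes=$(( region_end - region_start ))
region_mib=$(( region_bytes / 1024 / 1024 ))
# prints: bytes=0x00400000 MiB=4
printf 'bytes=0x%08X MiB=%d\n' "$region_bytes" "$region_mib"
```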

CPU Halt
--------

Often CPUs do not "finish" in a meaningful way, and the expectation is
that once a CPU powers on there will always be work for it to do. However,
here we want a definitive end point for CPU execution, in order to make
testing more tractable - we need to know when the CPU being tested has
finished, so that we can look at how it has modified memory. To make
things easier when learning, it is also very useful to have visibility
on some internal CPU state, as doing everything via memory assumes
you already have working memory instructions.

To make testing easier we include the `active` flag and the `register_v0`
output. The dual purpose of these signals is:

1. To detect when the CPU has finished executing instructions.
2. To allow a single 32-bit value to be passed from inside the CPU to
   the top-level module, without requiring any memory transactions.

The CPU is considered to halt when it executes the instruction at
address 0. This behaviour is specific to this coursework specification, and
not a general property of the MIPS ISA, ABI, or commercial IP cores.

The reason for this choice is intimately related to the reset conditions
and [MIPS O32 ABI](https://en.wikipedia.org/wiki/MIPS_architecture#Calling_conventions);
in particular, this choice exploits the following existing requirements:

- For the reset, we require that all registers (excluding the PC) are set to 0.
- The MIPS ABI also specifies that integer return values from functions are placed in register $v0, which is defined to be register 2.
- The MIPS ABI also specifies that the return address for a function is stored in register $ra, which is defined to be register 31.

This means that the following function:
```
int f(){
    return 23;
}
```
can be assembled into the following assembly:
```
f:  li $2, 23   # Load 23 into register $2
    jr $31      # Jump to the address in $31 (which will be zero)
    nop
```

Note that a compiler is likely to exploit the delay slot, and so will
probably produce the following shorter code:
```
f:  jr $31      # Jump to the address in $31 (which will be zero)
    li $2, 23   # Load 23 into register $2
```
If this rearranged code looks confusing, then look carefully at what
the ISA says about advancing the PC and branches.

CPU Performance
---------------

The goal of the exercise is to deliver a functionally correct CPU, so
performance is a secondary concern. However, your CPU should not exceed
a worst-case CPI of 36 (ignoring memory stall cycles).

Instruction Set
===============

The target instruction-set is 32-bit little-endian MIPS1, as defined by
the MIPS ISA Specification (Revision 3.2).

The instructions to be tested are:

Code    |   Meaning
--------|---------------------------------------------
ADDIU   |  Add immediate unsigned (no overflow)
ADDU    |  Add unsigned (no overflow)
AND     |  Bitwise and
ANDI    |  Bitwise and immediate
BEQ     |  Branch on equal
BGEZ    |  Branch on greater than or equal to zero
BGEZAL  |  Branch on non-negative (>=0) and link
BGTZ    |  Branch on greater than zero
BLEZ    |  Branch on less than or equal to zero
BLTZ    |  Branch on less than zero
BLTZAL  |  Branch on less than zero and link
BNE     |  Branch on not equal
DIV     |  Divide
DIVU    |  Divide unsigned
J       |  Jump
JALR    |  Jump and link register
JAL     |  Jump and link
JR      |  Jump register
LB      |  Load byte
LBU     |  Load byte unsigned
LH      |  Load half-word
LHU     |  Load half-word unsigned
LUI     |  Load upper immediate
LW      |  Load word
LWL     |  Load word left
LWR     |  Load word right
MTHI    |  Move to HI
MTLO    |  Move to LO
MULT    |  Multiply
MULTU   |  Multiply unsigned
OR      |  Bitwise or
ORI     |  Bitwise or immediate
SB      |  Store byte
SH      |  Store half-word
SLL     |  Shift left logical
SLLV    |  Shift left logical variable
SLT     |  Set on less than (signed)
SLTI    |  Set on less than immediate (signed)
SLTIU   |  Set on less than immediate unsigned
SLTU    |  Set on less than unsigned
SRA     |  Shift right arithmetic
SRAV    |  Shift right arithmetic variable
SRL     |  Shift right logical
SRLV    |  Shift right logical variable
SUBU    |  Subtract unsigned
SW      |  Store word
XOR     |  Bitwise exclusive or
XORI    |  Bitwise exclusive or immediate

It is strongly suggested that you implement the following
instructions first: `JR, ADDIU, LW, SW`. This will match
the instructions considered in the formative assessment.

Memory Map
==========

Your CPU should not make any explicit assumptions about the location
of instructions, data, or peripherals within the address space. It should
simply execute the instructions it is given, and perform reads and writes
at the addresses implied by the instructions.

There are only two special memory locations:

- `0x00000000` : Attempting to execute address 0 causes the CPU to halt.
- `0xBFC00000` : This is the location at which execution should start after reset.

Whether a particular address maps to RAM, ROM, or something else is entirely
down to the top-level circuit outside your CPU. It may be that the top-level
is a test-bench which contains small simulated memories, and simply maps
transactions to reads and writes of a verilog array. Or the test-bench
could emulate only the specific addresses that it expects to be read or written,
without tracking the actual memory contents. Alternatively your CPU may have
been synthesised into an FPGA, in which case the memories may correspond
to a large set of block RAMs, DDR, network adaptors, and anything else
your customer decided to attach the CPU to.

Exceptions
==========

Our memory bus has no mechanism for indicating that a particular
read or write access failed, in order to keep the interface simple.
This means that there is no portable way for you to test how
a given processor responds to invalid addresses. The only thing
you can do is give it test-cases which will result in it accessing
a known sequence or range of addresses, and then check that it does indeed
access those addresses. If a CPU-under-test ever accesses an
address which is outside that set of known addresses, then
you can legitimately claim that it failed the test-case, and
halt the test-bench immediately (if you wish). Similarly,
if the CPU-under-test does not access an address which you know
must be accessed, then it must also have failed.
_You are not required to validate the exact sequence of addresses,_
_this is simply talking about what is valid or not to test._

There is also no defined mechanism to allow CPUs to indicate
that an arithmetic exception has occurred (e.g. overflow). As
a consequence, the various overflow-checking instructions (`add`, `sub`)
etc. are not included in the testable set of instructions. So while
you can implement them in your CPU, you should not attempt to
execute them in your general test-bench. Note that `gcc` will
not generate such instructions by default, so you will not see
them if compiling C code to MIPS.
_This restriction is quite artificial and only for coursework purposes._
_There is a well-defined mechanism based on exception handlers_
_that could have been used, and would require no changes to the_
_Verilog interface._

A CPU is not required to have any specific handling for undefined
or out-of-spec instructions. So a correct CPU can take any
reasonable default behaviour if it is asked to execute an instruction which
is outside the defined set of testable instructions. Note that
"reasonable" does not mean "any" - you shouldn't deliberately
take destructive actions if an invalid instruction is encountered.

Test-bench
==========

Your test-bench is a bash script called `test/test_mips_cpu_bus.sh` or `test/test_mips_cpu_harvard.sh`
that takes a required argument specifying a directory containing an RTL CPU implementation, and
an optional argument specifying which instruction to test:
```
test/test_mips_cpu_(bus|harvard).sh [source_directory] [instruction]?
```
Here `source_directory` is the relative or absolute path of a directory
containing a verilog CPU, and `instruction` is the lower-case name of
a MIPS instruction to test. If no instruction is specified, then all
test-cases should be run. Your test-bench may choose to ignore the
instruction filter, and just produce all outputs.
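
One possible way to handle the optional instruction filter is sketched below.
It assumes a hypothetical layout where each test-case lives in a file named
`<instruction>_<n>.hex` in a directory of your choosing; neither that layout
nor the `select_cases` name is part of the specification.

```shell
# Print the testcase-ids to run, one per line.
#   $1 = directory holding the test-case files (assumed layout, see above)
#   $2 = optional lower-case instruction name; empty or missing means "all"
select_cases() {
    local cases_dir="$1" instruction="${2:-}"
    local f id
    for f in "$cases_dir"/*.hex; do
        # If nothing matches, the glob stays literal; skip it.
        [ -e "$f" ] || continue
        id="$(basename "$f" .hex)"
        # The instruction is everything before the final '_<n>' suffix.
        if [ -z "$instruction" ] || [ "${id%_*}" = "$instruction" ]; then
            echo "$id"
        fi
    done
}
```

Inside the test script this might be called as `select_cases test/cases "$2"`,
where `test/cases` is again an assumed location rather than a required one.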

The test-bench should print one line per test-case to stdout, with
each line containing the following components separated by whitespace:

1.  Testcase-id : A unique name for the test-case, which can contain any of the characters `a-z`, `A-Z`, `0-9`, `_`, or `-`.
2.  Instruction : the instruction being tested, given as the lower-case MIPS instruction name.
3.  Status : Either the string "Pass" or "Fail".
4.  Comments : The remainder of the line is available for free-form comments or descriptions.

If there are no comments then no trailing separator is needed. Examples of
possible output are:
```
addu_1 addu Pass
addu-2 addu Fail   Test return wrong value
MULTZ    mult    Pass    # Multiply by zero
```
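
A small helper can keep output lines consistent with the format above. This is
only a sketch: `report_result` is a hypothetical name, and whether you validate
ids and status strings at all is a design choice, not a requirement.

```shell
# Print one result line: <testcase-id> <instruction> <Pass|Fail> [comment...]
#   $1 = testcase-id, $2 = instruction, $3 = status, $4.. = optional comment
report_result() {
    local id="$1" instr="$2" status="$3"
    shift 3
    # Reject empty ids and ids with characters outside a-z, A-Z, 0-9, '_', '-'.
    case "$id" in
        ''|*[!A-Za-z0-9_-]*) echo "invalid testcase-id: $id" >&2; return 1 ;;
    esac
    case "$status" in
        Pass|Fail) ;;
        *) echo "invalid status: $status" >&2; return 1 ;;
    esac
    # Instruction names are reported in lower case.
    instr="$(printf '%s' "$instr" | tr '[:upper:]' '[:lower:]')"
    if [ "$#" -gt 0 ]; then
        echo "$id $instr $status $*"
    else
        echo "$id $instr $status"
    fi
}

report_result addu_1 addu Pass                      # prints: addu_1 addu Pass
report_result MULTZ MULT Pass "# Multiply by zero"  # prints: MULTZ mult Pass # Multiply by zero
```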

Assuming you are in the root directory of your submission, you could test your
CPU `rtl/mips_cpu_bus.v` as follows:
```
$ test/test_mips_cpu_bus.sh rtl
addu_1 addu Pass
addu_2 addu Pass
subu_1 subu Pass
subu_2 subu Pass
```
Restricting it to use the addu instruction:
```
$ test/test_mips_cpu_bus.sh rtl addu
addu_1 addu Pass
addu_2 addu Pass
```

If you were to replace `bus` with `harvard` then it would
instead test the `harvard` implementation.

Your test-bench does not need to implement the instruction filter argument,
and can choose to just run all test-cases every time it is run. However, you
should be aware that if your test-bench locks up or otherwise aborts on
one instruction, then it will appear as if all following instructions were
never tested.

The total simulation time for your entire test-bench should not exceed
10 minutes on a typical laptop.
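
One way to stop a hung simulation from hiding later results is to bound each
run with `timeout` from GNU coreutils, which exits with status 124 when the
limit is reached. The sketch below assumes a Linux environment such as the
Ubuntu image described later; `run_with_limit` is a hypothetical name and the
limit is yours to choose.

```shell
# Run a command with a time limit and map the outcome to a status string.
#   $1 = limit in seconds, remaining arguments = command to run
run_with_limit() {
    local limit="$1"
    shift
    timeout "$limit" "$@" >/dev/null 2>&1
    local status=$?
    if [ "$status" -eq 0 ]; then
        echo "Pass"
    elif [ "$status" -eq 124 ]; then
        # 124 is how GNU timeout reports that the limit was hit.
        echo "Fail   Timed out after ${limit}s"
    else
        echo "Fail   Exit status $status"
    fi
}
```

The returned string could then be appended to the testcase-id and instruction
name to form one output line per test-case.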

Your test-bench should never modify anything located in the mips source directory.
So it should not create any files in the source directory (e.g. `rtl`), and it
definitely should not modify any of the files.

Working and input directory
---------------------------

To keep things simple, you can assume that your test-script will always be
called from the base directory of your submission. This just means that
your script is always invoked as `test/test_mips_cpu_bus.sh`.

However, you should not assume anything about the directory containing the
source MIPS. This could be a sub-directory of your project, or could be
at some other relative or absolute path. For example, it might be invoked
as:
```
test/test_mips_cpu_bus.sh ../../reference_mips_cpu
```
to get your testbench to execute against a reference CPU. Or it could
be invoked as:
```
test/test_mips_cpu_bus.sh /home/dt10/elec50010/cw/marking/team-23/rtl
```
Either way, your test-bench just needs to compile the verilog files
included in that directory.
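
A minimal sketch of gathering the CPU sources from an arbitrary source
directory and compiling them without writing into it is given below. The file
patterns follow the deliverables section; the test-bench file `test/tb_bus.v`
and its top-level module `tb_bus` are assumptions made for illustration, as are
the function names.

```shell
# Print the verilog files belonging to the CPU in directory $1, one per line.
# Patterns: mips_cpu_*.v and mips_cpu/*.v (this also picks up a harvard
# variant if one is present, which you may want to exclude).
collect_sources() {
    local src="$1" f
    for f in "$src"/mips_cpu_*.v "$src"/mips_cpu/*.v; do
        [ -e "$f" ] && echo "$f"
    done
    return 0
}

# Compile into a scratch directory so nothing is created under $1.
# Prints the path of the compiled simulation on success.
compile_cpu() {
    local src="$1" out_dir
    out_dir="$(mktemp -d)"
    # Word-splitting of the file list is intentional; paths containing
    # spaces would need an array instead.
    iverilog -g 2012 -s tb_bus -o "$out_dir/sim" test/tb_bus.v \
        $(collect_sources "$src") \
        && echo "$out_dir/sim"
}
```

Because the output goes to a `mktemp -d` directory, the same script works
whether the source directory is `rtl` or some read-only path elsewhere.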

Auxiliary files
---------------

Your test-bench can make use of any number of auxiliary files and directories,
for example things like testcase inputs, pre-compiled object files, or whatever
you like. You should aim to keep the submission as small as possible (e.g.
using `.gitignore` files), but there is no penalty for including more than is
needed.

Environment and Standards
=========================

The verilog should be written to adhere to the sub-set of SystemVerilog 2012
supported by Icarus verilog 11.0. CPUs should be written to assume that
verilog files are compiled with `-g 2012`, and test-benches should also
provide that flag when compiling.

The test environment should be assumed to be Ubuntu 18.04. Version 11.0
of Icarus verilog is already compiled and installed. Standard base Ubuntu
packages will be installed, along with the following packages:

- `build-essential` (g++, make)
- `git`
- `gcc-mipsel-linux-gnu` and `gcc-mips-linux-gnu`
- `qemu-system-mips`
- `python3`
- `cmake`
- `verilator`
- `libboost-dev`
- `parallel`

Provisioning
------------

If there is a particular package that you want to use, such as a python
library or standard Ubuntu package, then you can include a script called `provision.sh`
which can install such packages. You can assume that this script will be
run once as root before your test-bench is installed.

Note that this script is completely optional. Most teams probably won't need one.

Exactly two types of package are allowed:

- Ubuntu package installation via `apt install`. This must be a standard Ubuntu package,
    with no use of PPAs or other package sources.
- Python package installation via `pip install` or `pip3 install`. This must be a package
    coming from the standard pip set of packages.


Clarifying notes
================

Self-modifying code
-------------------

No distinction should be made between instruction and data addresses - it is legal
to both read a memory address as data and to execute it. For almost all implementations
this should happen naturally, and is a corner case that only comes into effect
with separate instruction and data caches.

However, we will require that no address that is executed as an instruction
is ever modified. This is because we lack any method to tell CPUs that their
instruction caches (if they exist) may have been invalidated by data accesses.

How to choose between bus and harvard?
---------------------------------------

If you think about it, a large amount can be shared between
the two as long as you split things up logically. In
terms of test-cases for MIPS instructions, they are going to
be the same between the two approaches. It is only the test-bench
which is going to have to implement a different interface for
the CPU, but the instructions it loads can be the same.

Similarly, in the CPU you should find that all the instruction
decode and execute logic is mostly the same. It is only the
parts that deal with instruction timing and memory that are
different. So you can have a single shared execution core
that is used by two variants.