**Imperial College**<br>London

# DSD **COURSEWORK**

# TABLE OF CONTENTS



Requirements to complete this coursework:

Quartus Prime Lite (v.20); Nios II Embedded Design Suite (v.20); Modelsim (Modelsim-Altera or ActiveHDL); Matlab (or Python); Terasic DE1-SoC board; Coursework source files including DE1-SoC documentation; On-line references from Intel.

# Legend:



Important information.



Reference material that you should read before proceeding.



 $\frac{1}{100}$ Debug hints

 $\odot$ Helpful information **Please read this document carefully before starting the coursework as it will guide you to complete the coursework efficiently. There are links to references that explain, and exemplify, procedures you will need to know how to do. You're advised to read and understand them.**

#### <span id="page-3-0"></span>**INTRODUCTION**

In this coursework you will acquire experience in the design of digital systems. As a case study, we will focus on the acceleration of computationally intensive applications and we will see how having customising hardware blocks tuned for a certain application can lead to a better performing system compared to one that does not have these computational units.

In this coursework we will start by running applications on a generic central processing unit (CPU), assess the system's performance and then focus on the design of dedicated hardware processing elements for the acceleration of the same applications.

The system design will target a Field-Programmable Gate Array (FPGA) from Altera. They are generic digital devices that support the implementation of any digital hardware design after their fabrication, including the implementation of soft-processors. The FPGA you're going to use is on a development kit (DE1-SoC) from Terasic, which is available in the lab. The target CPU is the soft-processor NIOS II, also provided by Intel. More information about NIOS II can be found here:

<https://www.intel.co.uk/content/www/uk/en/products/programmable/processor/nios-ii.html>

The main application for configuration of the FPGA is Quartus from Intel. It allows to design and program digital systems based on FPGAs. The Platform Designer tool allows the implementation of a complex digital system including a CPU and all the required components to make it work. You can find more about it here:

[https://www.intel.com/content/www/us/en/programmable/products/design-software/fpga](https://www.intel.com/content/www/us/en/programmable/products/design-software/fpga-design/quartus-prime/features/qts-platform-designer.html)[design/quartus-prime/features/qts-platform-designer.html](https://www.intel.com/content/www/us/en/programmable/products/design-software/fpga-design/quartus-prime/features/qts-platform-designer.html)

You will program the NIOS II processor by writing C code using EDS. How to do this is described in the tutorial pointed out in the hand-out.

In the coursework you are about to do, you'll start by describing your system in Platform Designer (which you can access from Quartus), synthesize in Quartus, and then configure the FPGA. At this point you don't need to write any Verilog/VHDL. Verilog is a hardware description language. Later you can use verilog /block diagrams/VHDL to describe your customized hardware blocks.

#### <span id="page-3-1"></span>**Objective: Acceleration of Arithmetic Functions**

In this coursework you are going to see how dedicated hardware can be applied to a general-purpose digital system in order to accelerate certain parts of the calculations. The more specialised the hardware blocks are, the faster the computation becomes, but that's reflected in FPGA resources or circuit area.

The main purpose of this coursework is to create a digital system that evaluates the following function:

#### $f(x) = \sum_{i=1}^{N} 0.5 * x_i + x_i^2 \cos((x_i - 128)/128)$  $\sum_{i=1}^{N} 0.5 * x_i + x_i^2 \cos((x_i - 128)/128)$  (1)

under **single-precision number representation**, for **any** given input vector **x**, where each element of the **vector is a single-precision floating-point number taking values between 0 and 255**.

In order to test and assess the performance of your system, you should use the following test cases (Matlab notation):

- 1. X=0:5:255
- 2. X =0:1/8:255

#### 3.  $X = 0:1/1024:255$

where x is a single-precision floating-point number. For example, the first case describes an input vector X with the following elements: [0, 5, 10, 15, ..., 255].<sup>1</sup> Please note that these are specific test examples. Your **system should not be optimised for these instances.**

Your design should be assessed under the following metrics (when they are applicable).

**FPGA resources used by your system (Logic Elements, Embedded Multipliers, Memory Bits). As the** target device is fixed, the proposed metric for FPGA resources is the weighted sum of the utilised resources. For example, if your design uses *LE* Logic Elements, *EM* Embedded Multipliers, and *MB* Memory Bits, then the FPGA resources are:

FPGA resources =  $\frac{1}{2}$  $\frac{1}{3} \Big( \frac{LE}{Total \, LE \, in \, the \, device} + \frac{EM}{Total \, EM \, in \, the \, device} + \frac{MB}{Total \, MB \, in \, the \, device} \Big)$ 

- $\blacksquare$ Latency or Execution time – This is the time that is needed for your system to compute (1) for an input vector X. This includes only the time needed for the evaluation of (1) for the whole vector, without taking into account any initialisation time and so on.
- $\blacksquare$ Application size (program).

Thus, you should be able to plot the performance of your design in a performance vs FPGA resources plot and compare different solutions.

# <span id="page-4-0"></span>Task 1: NIOS II Setup

In this task you will get familiar with the design of a NIOS II processor and the design choices that impact the system's performance.

Altera, the manufacturer of your target FPGA, offers an embedded processor (also known as soft-processor), named NIOS II, which can be instantiated in any design, and even more than once. The software programming for NIOS II is done in EDS (Eclipse) using C code.

Learn how to setup a system with a NIOS II processor, and start building your project. Download the project template (system\_template\_de1\_soc) available on the DSD coursework page. Place the files to a directory of your choice.

Do not use spaces in the directory path name.

Do not download and install the design files from Altera.

 $1$  Hint: When you generate the input vector in Matlab for comparing your system's result, you need to generate the elements in the same way otherwise the Matlab input and your C code input will differ (think about it). You need to make Matlab to match the generated vectors from your C code.

Your main file is the hello word.qpf. Open the file to see what has been given to you. You will see just some input/output pins only. These pins are connected to specific pins on the FPGA of your board in order to allow your design to communicate with the outside world. In the Project Navigator you will see that the Cyclone V device has been selected as the target device. This is the device that comes with your DE1-SoC board.

The rest of the section is a summary extracted by the NIOS II tutorial provided by Intel ( http://www.altera.com/literature/tt/tt\_nios2\_hardware\_tutorial.pdf https://www.intel.com/content/dam/www/programmable/us/en/pdfs/literature/an/an717.pdf)

#### **Create a New Platform Designer System**

The design of the processor is a complex process, but Altera provides a tool (Platform Designer) to help you with this task. Through Platform Designer, you can generate a NIOS II processor system, add the desired components, and configure how they connect together.

Click Platform Designer on the Tools menu in the Quartus II software. A new window will open and your system will have on the clock signal.

#### **Define the System in Platform Designer**

You use Platform Designer to define the hardware characteristics of the Nios II system, such as which Nios II core to use, and what components to include in the system. Platform Designer does not define software behaviour, such as where in memory to store instructions or where to send the stderr character stream.

In this section, you perform the following steps:

- 1. Specify target FPGA and clock settings.
- 2. Add the Nios II core, on-chip memory, and other components.
- 3. Specify base addresses and interrupt request (IRQ) priorities.
- 4. Generate the system.

The Platform Designer design process does not need to be linear. The design steps in this tutorial are presented in the most straightforward order for a new user to understand. However, you can perform Platform Designer design steps in a different order.

#### **Specify Target FPGA and Clock Settings**

The **Clock Settings** and the **Project Settings** tabs specify the system's relationship to other devices in the system.

Perform the following steps:

1. On the Platform designer view, select the **Device Family** tab in the bottom left (or go to View -> Device Family) and ensure the device matches the Altera FPGA you are targeting. If a warning appears stating the selected device family does not match the Quartus project settings, ignore the warning. You specify the device in the Quartus project settings later in this tutorial. Since you have run Platform Designer from your project, where the device has been already specified, these fields should be already populated.

2. In the documentation for your board look up the clock frequency of the oscillator that drives the FPGA.

3. In the Platform designer view there should already be a Clock Source (clk\_0) instantiated. Double click the name to edit the clock frequency. The frequency you specify for clk 0 must match the oscillator that drives the FPGA. Your board has a 50MHz oscillator. Set this value, if it has not been already populated.

Next, you begin to add hardware components to the Platform Designer system. As you add each component, you configure it appropriately to match the design specifications.

#### **Add the On-Chip Memory**

Processor systems require at least one memory for data and instructions. As a starting point, let's use one 20 KB on-chip memory for both data and instructions. To add the memory, perform the following steps:

- 1. On the **IP Catalog** tab (to the left of the **System Contents** tab), expand **Basic Functions**, expand **On Chip Memory**, and then click **On-Chip Memory (RAM or ROM).**
- 2. Click **Add**. The On-Chip Memory (RAM or ROM) parameter editor appears.
- 3. In the **Type** list, select **RAM (Writeable)**.
- 4. In the **Block type** list, select **Auto**.
- 5. In the **Total memory size** box, type 20480 to specify a memory size of 20 KB.

Do not change any of the other default settings

- 6. Click **Finish**. You return to Platform Designer.
- 7. Click the **System Contents** tab. An instance of the on-chip memory appears in the system contents table.
- 8. In the **Name** column of the system contents table, right-click the on-chip memory and click **Rename**.
- 9. Type onchip mem and press Enter.

In the System Contents you can see that the memory has been added and possible connections to existing signals have been highlighted. We will come back to these connections later.

#### **Add the Nios II Processor Core**

In this section you add the Nios II/f core and configure it to use 2 KB of on-chip instruction cache memory, and without any hardware support for multiplication or division operations.

Perform the following steps to add a Nios II/f core to the system:

- 1. On the **IP Catalog** tab, expand **Processors and Peripherals -> Embedded Processors**, and then click **Nios II Processor**.
- 2. Click **Add**. The Nios II Processor parameter editor appears, displaying the **Main** tab. A new GUI pops-up.
- 3. Under **Select a Nios II core**, select **Nios II/f**.
- 4. Switch to the **Arithmetic Instructions** tab for **Multiply/Shift/Rotate Hardware** select **Manual Selection**.
- 5. Set **Multiply Implementation** and **Multiply extended Implementation** to **None.**
- 6. Set **Divide Hardware** to **None**.
- 7. Click **Finish**. You return to the **Platform Designer System Contents** tab, and an instance of the Nios II core appears in the system contents table. Ignore the exception and reset vector error messages. You resolve these errors in steps 13 and 14.
- 8. In the **Name** column, right-click the Nios II processor and click **Rename**.
- 9. Type cpu and press Enter.
- 10. In the **Connections** column, connect the clk port of the **clk\_0** clock source to both the clk1 port of the onchip memory and the clk port of the Nios II processor by clicking the hollow dots on the connection line. The dots become solid indicating the ports are connected.
- 11. Connect the clk reset port of the **clk\_0** clock source to both the reset1 port of the on-chip memory and the reset port of the Nios II processor.
- 12. Connect the s1 port of the on-chip memory to both the data\_master port and instruction\_master port of the Nios II processor.

This last step connects the NIOS processor with the on-chip memory that you have instantiated through the Avalon bus. The NIOS processor serves as the master in this connection (as it initiates transactions) where the memory has a slave interface. The jtag signals are used for debugging purposes.

- 13. Double-click the Nios II processor row of the system contents table to reopen the Nios II Processor parameter editor.
- 14. Switch to the **Vectors** tab. Under **Reset Vector**, select **onchip\_mem.s1** in the **Reset vector memory** list and type 0x0 in the **Reset vector offset** box. This tells to the processor where to set its Program Counter (PC) in case of a reset.
- 15. Under **Exception Vector**, select **onchip\_mem.s1** in the **Exception vector memory** list and type 0x20 in the **Exception vector offset** box.
- 16. Click the **Caches and Memory Interfaces** tab.
- 17. In the **Instruction cache** list, select **2 Kbytes**.
- 18. In the **Add burstcount signal to instruction\_master** list, select **Disable**.
- 19. In the **Number of tightly coupled instruction master port(s)** list, select **None**.

Do not change any settings on the **Advanced Features, MMU and MPU Settings, JTAG Debug Module,** or **Custom Instruction** tabs.

#### **Add the JTAG UART**

The JTAG UART provides a convenient way to communicate character data with the Nios II processor through the USB-Blaster download cable. Perform the following steps to add the JTAG UART:

- 1. On the **IP Catalog** tab, expand **Interface Protocols,** expand **Serial**, and then click **JTAG UART Intel FPGA IP**.
- 2. Click Add. The JTAG UART parameter editor appears. Do not make any changes.
- 3. Click **Finish**. You return to the Platform Designer **System Contents** tab, and an instance of the JTAG UART appears in the system contents table.
- 4. In the **Name** column, right-click the JTAG UART and click **Rename**.
- 5. Type jtag uart and press Enter.
- 6. Connect the clk port of the **clk\_0** clock source to the clk port of the JTAG UART.
- 7. Connect the clk\_reset port of the clk\_0 clock source to the reset port of the JTAG UART.
- 8. Connect the data master port of the Nios II processor to the avalan jtag slave port of the JTAG UART.

Please note that the instruction\_master port of the Nios II processor does not connect to the JTAG UART because the JTAG UART is not a memory device and cannot send instructions to the Nios II processor.

#### **Add the Interval Timer**

Most control systems use a timer component to enable precise calculation of time. To provide a periodic system clock tick, the Nios II HAL requires a timer. Perform the following steps to add the timer:

- 1. On the **IP Catalog** tab, expand **Processors and Peripherals -> Peripherals** and then click **Interval Timer Intel FPGA IP**.
- 2. Click **Add**. The Interval Timer parameter editor appears.
- 3. In the **Presets** tab (on the right), select **Full-featured** and press **Apply**. Please make a note of the Timeout period setting.
- 4. Click **Finish**. You return to the Platform Designer **System Contents** tab, and an instance of the interval timer appears in the system contents table.
- 5. In the **Name** column, right-click the interval timer and click **Rename**.
- 6. Type sys clk timer and press Enter.
- 7. Connect the clk port of the **clk\_0** clock source to the clk port of the interval timer.
- 8. Connect the clk\_reset port of the **clk\_0** clock source to the reset port of the interval timer.
- 9. Connect the data master port of the Nios II processor to the s1 port of the interval timer.

#### **Add the System ID Peripheral**

The system ID peripheral safeguards against accidentally downloading software compiled for a different Nios II system. If the system includes the system ID peripheral, the Nios II SBT for Eclipse can prevent you from downloading programs compiled for a different system.

Perform the following steps to add the system ID peripheral:

- 1. On the **IP Catalog** tab, expand **Basic Functions -> Simulation; Debug and Verification -> Debug and Perfomance**, and then click **System ID Peripheral Intel FPGA IP**.
- 2. Click **Add**. The System ID Peripheral parameter editor appears.
- 3. Click **Finish**. You return to the Platform Designer **System Contents** tab, and an instance of the system ID peripheral appears in the system contents table.
- 4. In the **Name** column, right-click the system ID peripheral and click **Rename**.
- 5. Type sysid and press Enter.
- 6. Connect the clk port of the **clk\_0** clock source to the clk port of the system ID peripheral.
- 7. Connect the clk\_reset port of the **clk\_0** clock source to the reset port of the system ID peripheral.
- 8. Connect the data master port of the Nios II processor to the control slave port of the system ID peripheral.

#### **Add the PIO**

PIO signals provide an easy method for Nios II processor systems to receive input stimuli and drive output signals. Complex control applications might use hundreds of PIO signals which the Nios II processor can monitor. This design example uses eight PIO signals to drive LEDs on the board.

Perform the following steps to add the PIO.

- 1. On the **IP Catalog** tab, expand **Processors and Peripherals -> Peripherals**, and then click **PIO (Parallel I/O) Intel FPGA IP**.
- 2. Click **Add**. The PIO (Parallel I/O) parameter editor appears.
- 3. Do not change the default settings. Click Finish and return to the Platform Designer **System Contents** tab. In the name column right-click the PIO and click rename. Give the name **led\_pio** and press enter.
- 4. Connect the **clk** port to **clk\_0** clock source, connect the **reset** port to **clk\_reset** of clk\_0, and connect the **data\_master** port of the NIOS II processor to the **s1** port of PIO
- 5. In the external connection row, click **Double-click to export** (grayed out) in the **Export** column to export the PIO ports.

#### **Specify Base Addresses and Interrupt Request Priorities**

At this point, you have added all the necessary hardware components to the system. Now you must specify how the components interact to form a system. In this section, you assign base addresses for each slave component, and assign interrupt request (IRQ) priorities for the JTAG UART and the interval timer. Platform Designer provides the **Assign Base Addresses** command which makes assigning component base addresses easy. For many systems, including this design example, **Assign Base Addresses** is adequate. However, you can adjust the base addresses to suit your needs. Below are some guidelines for assigning base addresses:

■ Nios II processor cores can address a 31-bit address span. You must assign base address between 0x00000000 and 0x7FFFFFFF.

■ Nios II programs use symbolic constants to refer to addresses. Do not worry about choosing address values that are easy to remember.

■ Address values that differentiate components with only a one-bit address difference produce more efficient hardware. Do not worry about compacting all base addresses into the smallest possible address range, because this can create less efficient hardware.

■ Platform Designer does not attempt to align separate memory components in a contiguous memory range. For example, if you want an on-chip RAM and an off-chip RAM to be addressable as one contiguous memory range, you must explicitly assign base addresses.

Platform Designer also provides an **Assign Interrupt Numbers** command which connects IRQ signals to produce valid hardware results. However, assigning IRQs effectively requires an understanding of how software responds to them. Because Platform Designer does not know the software behavior, Platform Designer cannot make educated guesses about the best IRQ assignment.

The Nios II HAL interprets low IRQ values as higher priority. The timer component must have the highest IRQ priority to maintain the accuracy of the system clock tick.

To assign appropriate base addresses and IRQs, perform the following steps:

- 1. On the System tab on the toolbar, click **Assign Base Addresses** to make Platform Designer assign functional base addresses to each component in the system. Values in the **Base** and **End** columns might change, reflecting the addresses that Platform Designer reassigned.
- 2. In the **IRQ** column, connect the Nios II processor to the JTAG UART and interval timer.
- 3. Click the IRQ value for the **jtag\_uart** component to select it.
- 4. Type 16 and press Enter to assign a new IRQ value.
- 5. Click the IRQ value for the **sys\_clk\_timer** component to select it.
- 6. Type 1 and press Enter to assign a new IRQ value.

#### **Generate the System**

You are now ready to generate the system. Perform the following steps:

- 1. Click the **Generate HDL** in the bottom right corner.
- 2. Select **None** in the **Create simulation model** list. Because this tutorial does not cover the hardware simulation flow, you can select these settings to shorten generation time.
- 3. Under **Output Directory** for **Path** select an appropriate path**. Click Generate**. The **Save changes?** dialog box appears, prompting you to save your design.
- 4. Type first\_nios2\_system in the **File name** box and click **Save**. The **Generate** dialog box appears and system generation process begins. The generation process can take several minutes. Output messages appear as generation progresses. When generation completes, the final "Info: Finished: Create HDL design files for synthesis" message appears.
- 5. Click **Close** to close the dialog box.
- 6. On the File menu, click **Exit** to close Platform Designer and return to the Quartus Prime Lite software. Press **OK** on the pop up box.
- 7. In Quartus, select **Add/Remove Files in Project** from the **Project** tab in the toolbar. Browse for the **first\_nios2\_system.qsys** file and once it is added, click Apply and OK. This will add the system to the Quartus project.
- 8. In the **Add/Remove Files in Project** window, select the libraries tab on the left. Under **Project Libraries**, select the top-level folder for the project (system\_template\_de1\_soc) which contains the first\_nios2\_system.qys file. Add this folder and click Apply and OK.

Congratulations! You have finished creating the Nios II processor system. You are ready to integrate the system into the Quartus II hardware project and use the Nios II SBT for Eclipse to develop software.

#### **Integrate the Platform Designer System into the Quartus II Project**

To instantiate the system module in the .bdf (i.e. your schematic capture file), perform the following steps (by closing Platform Designer, you should have gone back to Quartus on your hello world system):

- 1. Double-click in the empty space to the right of the input and output wires. The Symbol dialog box appears.
- 2. Under Libraries, expand Project.
- 3. Click first\_nios2\_system. The Symbol dialog box displays the first\_nios2\_system symbol.
- 4. Click OK. You return to the .bdf schematic. The first\_nios2\_system symbol tracks with your mouse pointer.
- 5. Position the symbol so the pins on the symbol align with the wires on the schematic.
- 6. Click to anchor the symbol in place.
	- a) Make the connections:
	- b) connect clk\_clk to iCLK\_50
	- c) connect reset\_reset\_n to Vcc (Symbol dialog box -> /home/e2/intelFPGA\_lite/20.1/quartus/libraries -> primitives -> other)
	- d) connect the leds.
	- e) leave the DRAM pins unconnected.
- 7. To save the completed .bdf, click Save on the File menu.

#### **Compile the Quartus II Project and Verify Timing**

At this point you are ready to compile the Quartus II project and verify that the resulting design meets timing requirements. You must compile the hardware design to create a .sof that you can download to the board. After the compilation completes, you must analyze the timing performance of the FPGA design to verify that the design will work in hardware. In this case, the sdc file (i.e. the Synopsys Design Constraints File) has already been included to your project (by us), so you do not have to worry about this. You just need to know that the target clock frequency for the design is 50MHz.

To compile the Quartus II project, perform the following steps:

- 1. On the Processing menu, click **Start Compilation**. The Tasks window and percentage and time counters in the lower-right corner display progress. The compilation process can take several minutes. When compilation completes, a dialog box displays the message "Full Compilation was successful."
- 2. Click **OK**.
- 3. Expand the **Timing Analyzer** category in the compilation report.
- 4. Click **Multicorner Timing Analysis Summary**.
- 5. Verify that the **Worst-case Slack** values are positive numbers for **Setup**, **Hold**, **Recovery**, and **Removal**. If any of these values are negative, the design might not operate properly in hardware. To meet timing, adjust Quartus assignments to optimize fitting, or reduce the oscillator frequency driving the FPGA. (You should not have to worry for any of the above). You can ignore errors from the unconstrained paths report.

Congratulations! You have finished integrating the Nios II system into the Quartus II project. You are ready to download the **.sof** to the target board.

#### **Download the Hardware Design to the Target FPGA**

In this section you download the **.sof** to the target board. Perform the following steps:

- 1. Connect the board to the host computer with the download cable, and apply power to the board.
- 2. In the VirtualBox toolbar go to Devices->USB and ensure Altera DE\_SoC [0139] is checked. If only Altera [0001] is displayed, check this for now, and once you are looking for hardware, check Devices->USB again and Altera DE SoC [0139] should now be available.
- 3. On the Tools menu in the Quartus Prime Lite software, click **Programmer**. The Quartus Prime Lite Programmer appears and automatically displays the appropriate configuration file (**hello\_world.sof)**
- 4. Click **Hardware Setup** in the upper left corner of the Quartus Programmer to verify your download cable settings. The **Hardware Setup** dialog box appears.
- 5. Select the appropriate download cable in the **Currently selected hardware** list. If the appropriate download cable does not appear in the list, you must first install drivers for the cable.
- 6. Click **Close**.
- 7. Click **Auto Detect**. A window will pop up and you should select the relevant serial code for the device.
- 8. Remove the <none> file for the 5CSEMA device and leave both the **hello world time limited.sof** file and the **SOCVHPS** device. If this removed the **hellow\_world\_time\_limited.sof** file, just add it back in using the **Add File** button.
- 9. In the **hello\_world\_time\_limited.sof** row, turn on **Program/Configure**.

10. Click **Start**. The **Progress** meter sweeps to 100% as the Quartus II software configures the FPGA.

At this point, the Nios II system is configured and running in the FPGA, but it does not yet have a program in memory to execute.

#### **Develop Software Using the Nios II SBT for Eclipse**

In this section, you start the Nios II SBT for Eclipse and compile a simple C language program. This section presents only the most basic software development steps to demonstrate software running on the hardware system you created in previous sections.

#### **Create a New Nios II Application and BSP from Template**

In this section you create new Nios II C/C++ application and BSP projects. Perform the following steps:

- 1. Start the Nios II SBT for Eclipse. If you are running on the provided VM, open a Terminal window and type eclipse. On Windows computers, click **Start**, point to **Programs, Altera, Nios II EDS** *<version>*, and then click **Nios II** *<version>* **Software Build Tools for Eclipse**. On Linux computers (and not running the provided VM), run the executable file *<Nios II EDS install path>***/bin/eclipse-nios2**.
- 2. If the **Workspace Launcher** dialog box appears, click **OK** to accept the default workspace location or choose one if you prefer a non-default location.
- 3. On the Window menu, point to **Perspective->Open Perspective**, and then either click **Nios II**, or click **Other** and then click **Nios II** to ensure you are using the Nios II perspective.
- 4. On the File menu, point to **New**, and then click **Nios II Application and BSP from Template**. The Nios II Application and BSP from Template wizard appears.
- 5. Under **Target hardware information**, next to **SOPC Information File name**, browse to locate the *<design files directory>*.
- 6. Select **first\_nios2\_system.sopcinfo** and click **Open**. You return to the Nios II Application and BSP from Template wizard showing current information for the **SOPC Information File name** and **CPU name** fields.
- 7. In the **Project name** box, type hello\_world.
- 8. In the **Templates** list, select **Hello World**.
- 9. Click **Finish**.

The Nios II SBT for Eclipse creates and displays the following new projects in the Project Explorer view, typically on the left side of the window:

- **hello\_world**—Your C/C++ application project
- **hello\_world\_bsp**—A board support package that encapsulates the details of the Nios II system hardware

#### **Compile the Project**

In this section you compile the project to produce an executable software image. For the tutorial design example, you must first adjust the project settings to minimize the memory footprint of the software, because your Nios II hardware system contains only 20 KB of memory.

Perform the following steps:

1. In the Project Explorer view, right-click **hello\_world\_bsp** and click **Properties**. The **Properties for hello\_world\_bsp** dialog box appears.

- 2. Click the **Nios II BSP Properties** page. The **Nios II BSP Properties** page contains basic software build settings (it may take a while).
- 3. Adjust the following settings to reduce the size of the compiled executable:
	- a. Turn on **Reduced device drivers**.
	- b. Turn off **Support C++**.
	- c. Turn off **GPROF support**.
	- d. Turn on **Small C library**.
	- e. Turn off **ModelSim only, no hardware support**.
- 4. Click **OK** (Do not click **Apply**). The BSP regenerates, the **Properties** dialog box closes, and you return to the Nios II SBT for Eclipse.
- 5. Right click on **hello\_world\_bsp** then go to **NIOS II->Generate BSP**
- 6. In the Project Explorer view, right-click the **hello\_world** project and click **Build Project**.

The **Build Project** dialog box appears, and the Nios II SBT for Eclipse begins compiling the project. When compilation completes, a "hello\_world build complete" message appears in the Console view.

**The console provides info on the size of your program (hello\_world.elf), which is the total size of the code and the initialised data, plus info on how much memory has been left free for stack and heap.**

#### **Run the Program on Target Hardware**

In this section you download the program to target hardware and run it. If you are using a windows machine, you should be able to right click on your project and select **Run as -> Run Nios II Hardware**

If you are using Linux or the provided VM then to download the software executable to the target board, perform the following steps:

- 1. If this is the first time running a program on the NIOSII, open a new terminal window and copy the provided bashrc file and save it at **~/.bashrc.**
- 2. Open a new terminal and run the command **nios2\_shell** followed by **nios2-terminal**. This opens the nios2 terminal of the output of the FPGA. The output of running the program will appear in this terminal window.
- 3. Open a new terminal window, run **nios2\_shell**, and then run **nios2-download –g <path to quartus project>/software/hello\_world/hello\_world.elf**. This will result in an output appearing on the other terminal window. If you create a new project in eclipse, it will always appear at **<path to quartus project>/software/** and contain a .elf file which is what runs on the NIOSII processor.
- 4. If the **Run Configurations** dialog box appears, verify that **Project name** and **ELF file name** contain relevant data, then click **Run**.

In all configurations, you may encounter problems launching your application on the FPGA resulting in:

*[Target Connection]: Connected system ID hash not found on target at expected base address.*

To mitigate this, ensure the most recently compiled .sof file is programmed to your FPGA which should resolve any timestamp and system ID differences. To see the system ID and timestamp values, click the System ID Properties button. If the system ID and/or timestamp report "Not Found", power cycle the board and reprogram. If this does not resolve the "Not Found" issue, restart eclipse and Quartus and recompile and

reprogram the Quartus design. You are advised to run both *binary\_counter* and *hello\_world* examples to familiarise yourselves with the system.

If you get the message "Downloading ELF Process failed", then this may be for a number of reasons. Unfortunately, the tools (even though they are commercial tools, have bugs). Possible steps to solve the problem:

- 1. Make sure that the Programmer window from Quartus is closed.
- 2. You Quartus design may be wrong. Check the connections carefully and also check whether you have connected the reset input to Vcc.
- 3. Restart the board and Eclipse. Try to resynthesize the design and downloaded again to the board.
- 4. If the launch of the program gets suck at "Launching: Flush any cached byte stream chars from input stream", restart Eclipse and try again.

#### If you encounter

#### **nios2-terminal: Unable to connect to JTAG UART because another application**

#### **nios2-terminal: is using it (or is using something else which clashes). Please close**

#### **nios2-terminal: the other application and try again.**

Ensure that there are no other connections to the nios2 terminal, close eclipse, power cycle the FPGA, redownload the .sof file using programmer, and then try again.

When the target hardware starts running the program, the Nios II Console view displays character I/O output.

2. To disconnect form the FPGA, in the terminal hosting the NIOSII terminal, just press ctrl+C and it will disconnect from the FPGA. You can edit the **hello\_world.c** program in the Nios II SBT for Eclipse text editor and repeat these two steps to witness your changes executing on the target board. If you rerun the program, buffered characters from the previous run session might display in the Console view before the program begins executing.

The final generated system has an Arithmetic Logic Unit (ALU) that supports fixed-point arithmetic, where all the arithmetic operators are implemented using LUTs. The FPGA on the DE1-SOC board is the Cyclone V SE5CSEMA5F31C6N FPGA. Make sure you keep track of your NIOS settings throughout the coursework.

In summary, you have designed a system in Quartus, and then moved on to program the system through Eclipse. If you make any changes in the hardware, you need to regenerate the BSP file. Otherwise, you can create new SW programs targeting the same HW without going back to Quartus.

Further details on the use of command line instructions (cli) can be found in the "nios2\_cli.pdf".

 $\mathbf X$  Record the FPGA resources used by your baseline system (i.e. Total logic elements, Total memory bits, Embedded Multiplier 18-bit elements, and Total PLLs).

#### <span id="page-14-0"></span>Task 2: Computing a simple function

Now that you have your system running on the FPGA, you can start using it to evaluate mathematical expressions.

From here on out, it is recommended to create backups of your project. Copying the system template de1 soc directory is enough to create a backup. Be sure that if working in a copied directory to update any file paths that may have changed.

The first task is to write a function that takes as input a vector *x* and its length *N*, and returns *y* where  $y = \sum_{i=1}^N x_i + x_i^2$  . The elements of the input vector are single-precision (floats) and are in the range between 0 and 255.

Starting from the **Hello World** template, modify the hello\_world.c file as follows:

```
#include <stdlib.h>
#include <sys/alt_stdio.h>
#include <sys/alt_alarm.h>
#include <sys/times.h>
#include <alt_types.h>
#include <system.h>
#include <stdio.h>
#include <unistd.h>
// Test case 1
//#define step 5
//#define N 52
// Test case 2
#define step 1/8.0
#define N 2041
//Test case 3
//#define step 1/1024.0
//#define N 261121
// Generates the vector x and stores it in the memory
void generateVector(float x[N])
{
         int i;
         x[0] = 0;for (i=1; i<N; i++)
                   x[i] = x[i-1] + step;}
float sumVector(float x[], int M)
{
         // YOUR CODE GOES HERE
}
int main()
{
   printf("Task 2!\n");
   // Define input vector
   float x[N];
   // Returned result
   float y;
   generateVector(x);
   // The following is used for timing
   char buf[50];
   clock_t exec_t1, exec_t2;
   exec_t1 = times(NULL); // get system time before starting the process
   // The code that you want to time goes here
  y = sumVector(x, N); // till here
   exec_t2 = times(NULL); // get system time after finishing the process
   int i;
   for (i=0; i<10; i++)
         y = y/2.0; return 0;
}
```
Complete the code in sumVector() function and compute the results for the three test cases presented in the beginning of this coursework. Increase the onchip memory as much as you can to accommodate the maximum problem size.

# **NOTE: You should assume that the vector** *x* **is stored in the memory prior calling your function. Not all three test cases need to be in the memory of the system at the same time. This assumption applies to the whole project.**

Provide answers to the following questions:

How many resources does your system occupy on the FPGA?

Is your system able to evaluate the given function for all test cases? Why not?

What is the size of your NIOS II application? How does it correspond to different vector sizes and what is the maximum memory required for each of the test cases?

Compare your results to Python. Do the results agree? If not, why not?

What is the time required to evaluate the sumVector() function for each test case? If you require more accuracy in the calculation of the time required by sumVector() function, you can call the function a number of times and get the average time spend in the function.

In designing HW systems, there is a relationship between the HW resources used by your system and the performance that you get out of it. If we assume that the accuracy is fixed and is not affected by the HW resource allocation, then the performance in your case is the latency of the function sumVector(). The best design is the one that provides the least latency for the least possible resources. Where is your current design in that space?

More detailed information about programming NIOS II in EDS is offered here:

[http://www.altera.com/literature/hb/nios2/n2sw\\_nii52017.pdf](http://www.altera.com/literature/hb/nios2/n2sw_nii52017.pdf)

 $\frac{f(t)}{\sqrt[4]{\sqrt[4]{\pi}}}$  To debug your application running on NIOS II use  $\frac{1}{\sqrt[4]{\pi}}$  instead of  $\Box$  Make use of the perspectives to switch between NIOS II project and debug (**Window Open perspective**). You're encouraged to make use of the available debug tools, e.g. watch and breakpoint, to facilitate the development of the software.

Good to know:

The printf() command in NIOS by default does not support floating point printing on the screen in order to reduce the size of the generated code. As you have only available the on-chip memory to store your program, the result can be produced by printing to the output a scaled down integer. Thus, if result is your floating point result, you can produce the result as follows:

For (int i=0; i<<10; i++)

resultf = resultf/2.0;

printf("Result: %d \n",(int) result);



Good to know:

Since we are interested in accelerating a certain computational process we need to be able to know how long it takes to execute that process. The time required to complete any statement/block of statements can be determined using the following statements in C code:

```
char buf[40]; 
clock t exec t1, exec t2;
exec t1 = times(NULL); // get system time before starting the process
*** your C statement(s) here ***
exec t2 = times(NULL); // get system time after finishing the process
gcvt((exec t2 - exec t1), 10, buf);
\text{alt\_putstr}(" proc time = "); alt_putstr(buf); alt_putstr(" ticks \n");
which requires the following includes:
```

```
#include <stdlib.h>
#include <sys/alt_stdio.h>
#include <sys/alt_alarm.h>
#include <sys/times.h>
#include <alt_types.h>
#include <system.h>
#include <stdio.h>
#include <unistd.h>
```
It also requires the Interval Time block to be in your system and running in the Hardware Abstraction Layer (HAL). Please refer to the tutorial for more details about this. Please note that the above code may require extra memory that your system may not have.

#### $\odot$ Good to know:

See command nios2-elf-size to get the size of your application. Alternatively, you can look at the .map file, or at the output of the building project process.

Read the reference about application profiling at the end of this hand-out.

# <span id="page-18-0"></span>Task 3: Storing the Program and Data on External Memories

The system that you created has its application stored on on-chip memory. Despite the fact being the fastest one available on an FPGA, it is the smallest and the most expensive one available in your system. On the DE0 boards there are different types of memories, including SDRAM memory. The main benefit of using SDRAM is that it has much more capacity and it has lower cost. That's why it is often used in real-life applications. The main drawback of using these memories is their interface, which requires a controller, and a consequent penalty on the system's complexity and performance, when compared to on-chip memory.

In this task, we will assess how the SDRAM memory allows to address problems with large memory requirements but also how its performance impacts the overall performance of the system.

The Altera DE1-SOC board contains an SDRAM chip that can store 64 Mbytes of data. The SDRAM chip requires careful timing control. To provide access to the SDRAM chip, Platform Designer implements an SDRAM Controller circuit. This circuit generates the signals needed to deal with the SDRAM chip. The figure below shows the resulting system by including the SDRAM controller (Please note that the "Switches parallel input interface" block is missing from your current system). You can notice that all the connections to on-chip memory and SDRAM controller are through the Avalon switch fabric.



The signals needed to communicate with this chip are shown in the figure below. All of the signals, except the clock, can be provided by the SDRAM Controller that can be generated by using Platform Designer. The clock signal is provided separately. It has to meet the clock-skew requirements (it will be explained in the lectures). Note that some signals are active low, which is denoted by the suffix N. Due to clock skewness a PLL is needed to generate a shifted version of the clock that drives the SDRAM.



In order to include the SDRAM controller in your existing system, open your current design in Platform Designer and in the IP Catalog select Memory Interfaces and Controllers > SDRAM > SDRAM Controller Intel FPGA IP and click Add in the new window.

Set the Data Width parameter to 16 bits and leave the default values for the rest. The rest of the parameters of the SDRAM controller for the DE1-SOC are given below.

```
Memory Profile: 
bits: 16; chip select: 1; banks: 4; row: 12; col: 8;
Timing:
cas: 3; init: 2; refresh: 15.625; delay: 100; t_rfc: 70; t_rp: 20; t_rcd: 20; 
t ac: 5.5; t wr: 14.0;
```
Click Finish. Now, in your final system, there will be an sdram module added to the design. Rename the module to **sdram,** and make the clk and reset connection as in the other modules. Connect the s1 signal to both cpu.data\_master and cpu.instruction\_master. Export the *wire* signal by double clicking on the export column.

So far, we have instantiated an SDRAM controller in the design, and have made the necessary connections to the rest of the design. We have also told to the tool to export the wire signal, which will be connected to the SDRAM memory directly.

The next stage is to integrate this memory in the memory address space of the system and inform the CPU where our reset\_vector and exception\_vector addresses are. Deselect the check box for onchip memory as this is now no longer necessary to keep active.

Select the command System > Assign Base Addresses. Observe that the SOPC Builder assigned the base address 0x00800000 to the SDRAM. To make use of the SDRAM, we need to configure the reset vector and exception vector of the Nios II processor. Right-click on the cpu and then select Edit. Select sdram to be the memory device for both reset vector and exception vector. Click Finish to return to the main window regenerate the system.

Close Platform Designer and go back to Quartus. Right click on your system and select Update System or Block. This should now reflect the extra wire connections from the SDRAM controller that we selected to export. Connect these to the relevant DRAM output pins already existing in your design.

#### **Using the Clock Signals IP Core**

The clock skew depends on physical characteristics of the DE1-SOC board. For proper operation of the SDRAM chip, it is necessary that its clock signal, DRAM\_CLK, leads the Nios II system clock, iCLK\_50, by 3 nanoseconds.

This can be accomplished by using a phase-locked loop (PLL) circuit which can be manually created using the MegaWizard plug-in.

To add the PLL IP core, in the IP Catalog tab search for the PLL Intel FPGA IP and click the Add button. Select either Verilog or VHDL (either is fine) and then name the unit pll.v or pll.vhdl respectively. The Megawizard plugin will open.

There are many parameters that you can set in the block, but what you need is to provide an input clock of 50MHz and provide two output clocks at the same frequency but one of them with some phase difference enough to compensate for the clock skew.

Provide the 50MHz reference clock frequency. In the Output Clocks section, set Number of Clocks to the required value. outclk0 clock should be kept the same as the input clock (i.e. same freq and zero phase diff). outclk1 should be set to have a 50MHz freq, but the requested phase diff to be -2.55ns (You may have to finetune the PLL time shift to a different value for your particular DE1-SOC board, but the value should be between -2.62ns and -2.50ns (a phase shift of about -46 degrees for 50MHz clock). No other clock is needed, so click on Finish. (There is no need to have a reset and lock signal in your PLL). One indicator of such problem is the absence of output in the console terminal, when you run the application or the fact that ELF failed to download/verify.

 $\mathbf{\hat{x}}$  Place the new block in the design and make the necessary connections to the clocks. Which clock output from the PLL should drive which module?

You also can place the PLL inside Platform Designer. The only difference is if you have to change the parameters of the PLL, you will have to re-generate your Platform Designer system, whereas if you have the PLL in your top design, you'll only need to re-synthesize the design.

**((ii)** Please note that SDRAM is very sensitive to clock variations. It is recommended that you use an external PLL in your design.

You can read more about this here[: http://www.altera.com/literature/ug/ug\\_embedded\\_ip.pdf](http://www.altera.com/literature/ug/ug_embedded_ip.pdf)

To assess the correct functioning of your system with SDRAM, change your application so that it prints a different statement in the console output. Below there's the example of the **count binary** template application. Please note that you can use the whole support for the C++ language as you have enough memory to store your programs (i.e. printf to print the floating point number).



 $\mathbf{\hat{x}}$  Evaluate the performance of your system using the same test cases as in the introduction of this handout and discuss the obtained results. Answer the following questions:

How long does it take to complete the evaluation of the function for each of the test cases?

What is the size of your NIOS II application for the different vector sizes?

How many resources does the final system occupy on the FPGA?

Investigate the impact of increasing the cache memory size of the NIOS processor to the required resources and the achieved latency of the system.

#### **Hints:**

- If you experience problems running your application on the updated system, try to create a new Board Support Package (BSP) project and copy only the C source files from your previous project**. It is advised to create a new project with BSP**. Avoid using existent BSPs after updating your system.
- If you copy the folder of your previous project, pay attention to the paths on files: **create-this-bsp** and **settings.bsp**. They may not be relative to the new project folder!
- Stop the execution of your application before trying to re-program the FPGA, as you may lose connection to it.
- As you have now added off-chip memory to your system, you can use the traditional way to output information on the screen and make use of the available libraries.

#### <span id="page-22-0"></span>Task 4: Evaluate a More Complex Mathematical Expression

By completing Task 3, you should have a good understanding on how an embedded digital system is constructed and how certain design decisions affect its performance. Your basic embedded system is based on the NIOS processor and has a large amount of memory (64MB). From now on, we will focus on the initial objective of the project, which was to evaluate the following function and design accelerators to improve the design's performance.

```
f(x) = \Sigma(0.5*x + x^2 \cos((x - 128)/128))
```
As now your system includes access to enough memory, the math.h library can be utilised to help with the evaluation of the cosine function.

 $\mathbf{\hat{X}}$  Make use of the math.h library and implement the new mathematical expression in software.

Run the application for all three test cases and report the performance of your design regarding: HW resources, latency to evaluate the function in seconds, and accuracy of the final result (compared to a MATLAB double precision implementation).

**Note:** For the final assessment, an extra test case will be used which generates N random floats to populate x based on a seed provided during the assessment. This will require the following function and #defines to be present in your source code.

```
// Test Case 4
#define N 2323
#define RANDSEED 334
void generateRandomVector(float x[N])
{
     int i;
     srand(RANDSEED);
    for (i=0; i\le N; i++) {
        x[i] = ((float) rand() / (float) RAND MAX) * MAXVAL; }
}
```
#### <span id="page-22-1"></span>Task 5: Add Multiplier Support

When you first created your NIOS II it had no multiplication support, which means that the floating-point multiplications have been emulated in software at the expense of fixed-point addition operations.

Now you're going to add hardware support for multiplication. NIOS II provides three possibilities: Logic Elements, 3 16-bit multipliers and 1 32-bit multiplier .



LUT-based multipliers are often used on FPGAs due to the flexibility of their word-length configuration, allowing faster multiplications at the expense of computing fewer bits than generic fixed word-length hardmultipliers. In the case of floating-point arithmetic the significand (also known as mantissa) has 23 bits, which is long enough to benefit from the performance of dedicated embedded multipliers.

So far you have been using floating-point arithmetic with non-floating-point units on your system. The software has been emulating these functions relying on a basic ALU implemented with generic LUTs (FPGA's basic logic elements). The FPGA that you are using, Cyclone V has dedicated hardware multiplier blocks that can be used to do fixed-point multiplications faster. Please note that you just add support for integer multiplication and the floating-point arithmetic has to be emulated again (signs, exponents, rounding and packing).

 $\mathbf{\hat{X}}$  In order to improve the performance of your system, we are going to add HW support for fixed-point multiplications.

Go back to the Platform Designer, and edit the **cpu**. Select the **3 16-bit multipliers** from the drop-down menu in Platform Designer, update the system and compare the new performance results with the previous version of your system for all three test cases. Report the execution time (i.e. latency) of the function that computes F(x), resource utilisation, code size, and error between the result of your system and the reference double precision Matlab implementation for each test case.

Assuming that maximum performance is required by the application, which design choice should be selected? Justify your choice.

<span id="page-23-0"></span>Task 6: Add Hardware Floating-Point Units

FLOATING-POINT (BACKGOURND)

In many applications there is a requirement to process data with large dynamic range and approximate data that takes values from the real number set. A possible way to represent the above numbers in a physical system is to use the "universally adopted" IEEE-754 floating-point standard.

The intention of the IEEE Standard for Binary Floating-Point Arithmetic, also known as IEEE-754, is to specify floating-point formats, arithmetic, conversions and exceptions. Floating-point numbers have three components: sign, significand and exponent, generically given by the following equation:

# x = ±significand x 2 exponent

The sign is represented with one bit indicating whether the number is positive (0) or negative (1). The significand represents the fractional part of the number, in a fixed-point format. The exponent indicates the magnitude of the number, and has an embedded biased value. The number of bits of the significand determines its precision. In the IEEE standard, the significand is a real number in the interval [1; 2[. The integer component, value 1 is not included, so the significand contains only the fractional part of the floatingpoint value. The standard defines two formats: single precision (32-bit) and double precision (64-bit). In this coursework only the single precision format is considered. This format uses 8 bits to represent the exponent and 23 bits + 1 hidden bit for the significand, as shown in the figure below.



Some values encoded using single precision floating-point representation:



Example:  $\pi$  = 3.1415926535897932384626433832795

To be between 1 and 2 we need to divide it by two, which means the exponent needs to be 127 + 1.

With 23 bits left we represent digits from the left to the right of 0.57079632679489661923132169163975



So far there's no support for floating-point arithmetic in NIOS II. All your floating point operations are emulated through SW. Now you are going to add floating-point adder, subtraction and multiplier hardware, so that the elementary arithmetic operations used by the function can be performed faster, thereby reducing the execution time, in the expense of dedicating more FPGA resources.

In order to implement such functionality in dedicated hardware you'll need to design a hardware block that performs this functionality, attach it to a bus in the system so the processor can access it, and then create a custom instruction so you can access it from your software.

To design the hardware block in Quartus II you can use schematic capture<sup>2</sup>, Verilog or VHDL. Before you attach your HW block to the NIOS system, you are advised to make a functional simulation of your hardware block to verify that it works correctly. You can do simulations using Modelsim<sup>3</sup> or Active-HDL. After that, you will add it to the system using Platform Designer (**IP Catalog panel Project New Component**).



In Platform Designer, you can add **Floating-Point Hardware** from MegaWizard to your system, and then connect it to the **Custom Instruction Master** port on your processor. In this case the floating-point hardware will automatically be mapped to a custom instruction and all floating-point arithmetic in software will be using this hardware block. The main limitation of using this is that you don't control latency or any other parameters or functionality made available through the MegaWizard interface. Also, you won't be able to reuse these units inside your hardware blocks bypassing the CPU. Also, the focus is to learn how to design a new custom instruction in order to be able to design future acceleration units. **As such, the Floating-Point Hardware in Platform Designer is not permitted as a design choice in this project.**

NIOS II CUSTOM INSTRUCTIONS

NIOS II supports different types of custom instructions depending on the type of hardware block connected to it. The following diagram shows how the custom logic is interfaces with the rest of the system.

 $2$  Recommended (w/ help available), but you can use any alternative.

<sup>&</sup>lt;sup>3</sup> Again, recommended (w/ help available), but you can use another. There's a Modelsim tutorial at the end of this hand-out.



The supported custom instructions/logic blocks are:

- Combinatorial: single clock cycle logic block; In this case, your HW block should be able to produce a valid results within one clock cycle.
- Multi-cycle: fixed or variable number of clock cycles to complete execution;
- Extended: block capable of performing multiple operations (e.g. ALU); In this case, your block has an extra input that specifies the operation to be performed.
- Internal register file: logic blocks that access internal register files for I/O (e.g. Multiply-Accumulate unit);
- External interface: logic blocks that interface to logic outside of the Nios II processor's data-path.

A high level view of the steps that you need to follow in order to create your own custom instruction component is shown below:

- Do a block diagram / Verilog / VHDL with your circuit;
- If you're doing block diagram, generate Verilog from it;
- Simulate your block using Modelsim or one of the other tools. If it performs as intended move on, otherwise revise your circuit design;
- Add the new block to your system as a new component;
- Synthesise your system;
- Get FPGA resources, latency and throughput;
- Add the "software interface" for your new block.

#### **Add custom instruction tutorial**

In order to familiarise yourself with the process of adding a custom instruction to your existing system, the following section provides a tutorial on how to map a simple one-cycle fixed point multiplier to your system as a custom instruction. This tutorial should be followed in parallel with the ug nios2 cusotm instruction.pdf document [\(http://www.altera.com/literature/ug/ug\\_nios2\\_custom\\_instruction.pdf\)](http://www.altera.com/literature/ug/ug_nios2_custom_instruction.pdf) where more detailed information on the process is provided.

Firstly we need to identify the type of the custom instruction. As we need to implement a single one-cycle integer multiplier, the *Combinatorial* type is appropriate.



Details on the available types can be found in Table 1-1 of [http://www.altera.com/literature/ug/ug\\_nios2\\_custom\\_instruction.pdf](http://www.altera.com/literature/ug/ug_nios2_custom_instruction.pdf)

#### *Design the module*

Create a new project in Quartus, and insert a new symbol from the IP Catalog tab. You should select the LPM\_MULT (this is an integer multiplier). Configure your multiplier to take two 32 bit inputs. Select your multiplier to be unsigned and without any pipeline stages (no clock). Your final multiplier should produce a 32 bit output, which are the 32 LSBs from the LPM MULT component. This will produce a verilog module.

The Megawizard plugin has been known to cause some performance issues. To circumvent this, please setup ssh on the device following this link: [https://linuxize.com/post/how-to-enable-ssh-on-ubuntu-18-04/](https://linuxize.com/post/how-to-enable-ssh-on-ubuntu-18-04/?fbclid=IwAR2IBoT76cbHuG5MXUxxlkxQO2YEqL6Un86wlUqsO8ge1ADJkh56aijTiSI)

Once the ssh has been setup, as sudo edit "/etc/ssh/sshd\_config" and uncomment and edit the lines to produce:

**X11Forwarding yes X11UseLocalhost yes** Then access quartus by running **ssh –X localhost home/e2/intelFPGA\_lite/20.1/quartus/bin/quartus.** The performance of quartus should be fixed now.

#### *Verify its functionality*

After having instantiated your IP code, add input and output ports and simulate it to verify its functionality. The simulation must be done through ModelSim. Please note that you have kept only part of the result (32 LSBs).

Modelsim is the application where you can run a simulation of your design before deploying into the FPGA, saving development time.

- 1. Download from blackboard the file testbench\_1.v which contains a module tb.
- 2. Press **Assignments->Settings**, then go to **EDA Tool Settings**
- 3. For **Tool Type->Simulation** select **ModelSim-Altera** under **Tool Name.**
- 4. Then go to **EDA Tool Settings->Simulation** and set **Tool name** to **ModelSim Altera.**
- 5. Under **NativeLink settings**, select **Compile test bench:** and then press **Test Benches**
- 6. In the window that pops up, press **New**
- 7. Pick a name for test bench name, and write **tb** for the top level module in test bench.
- 8. For **File Name** navigate to the provided testbench\_1.v (downloaded from blackboard), press **Add** then **OK** and then **OK** again and then **OK** again.
- 9. In an editor, open testbench 1.v and modify it following the provided comments in the file to work as your testbench.
- 10. Select **Project->Add/Remove Files in Project...** and add testbench\_1.v and press **OK**
- 11. Run **Analysis and Synthesis** (there is no need to run place and route)
- 12. Select **Tools->Run Simulation Tool->RTL Simulation.** A ModelSim window will open, and select wave tab to see the waveform output.

The testbench 1.v file is a template that can be used to develop more complex testbenches for later designs. It works exactly like a Verilog file that can be used to specify inputs, outputs and clocks for the module being test. [\(https://verilogguide.readthedocs.io/en/latest/verilog/testbench.html\)](https://verilogguide.readthedocs.io/en/latest/verilog/testbench.html)

#### *Add the module to the existing system*

Switch back to the project containing the NIOS II. Open your system in Platform Designer and click on *New Component* in the IP Catalog window. Provide the name of your component, and in the Files tab add the vhdl or Verilog files that you have created for your component (i.e. these are the top-level vhdl or Verilog file and the vhdl or Verilog file of the lpm\_mult). Click Top Level Module and select the name of the top-level module of your custom instruction logic. To set up the custom instruction interfaces, follow these steps:

- 1. Click on the Signals & Interfaces tab.
- 2. Change the Type field for this interface to Custom Instruction Slave.
- 3. Ensure that a single interface remains. For the design example, maintain the interface name nios custom instruction slave.
- 4. For Clock Cycles, you need to type 0 for a variable multicycle type custom instruction, and otherwise type the number of clock cycles your custom instruction logic requires. In our case we need to type 0, but this has already been picked up by the tool.
- 5. In the general case, for Clock Cycle Type, type Variable for a variable multicycle type custom instruction, Multicycle for a fixed multicycle type custom instruction, or Combinatorial for a combinational type custom instruction.
- 6. For Operands, type the number of operands for your custom instruction.
- 7. For each port, change the Signal Type to the appropriate signal for that port.

If your custom instruction logic requires additional interfaces, either to the Avalon-MM fabric or outside the Platform Designer system, you can specify the additional interfaces in the Interfaces tab. Our example does not require additional interfaces.

Click Finish and Save. In the Platform Designer IP Catalog Library, under Project, select your custom module and click add to add the new instruction to the Platform Designer system. In the Connections panel, connect the new component component's nios custom instruction slave interface to the cpu component's custom\_instruction\_master interface.

Reassign the base addresses, generate the new NIOS system, return to Quartus, add the custom instruction's directory through the **Projects -> Add/Remove Files in Project**, and then recompile the whole design in Quartus.

Check whether the new design meets timing.

#### **Accessing the Custom Instruction from Software**

Adding a custom instruction to a Nios II processor results in a significant change to the Qsys system. In this section, you create and build a new software project using the Nios II software build flow, and run the software that accesses the custom instruction.

Create a new software project using the Hello World template for your system and name it *hello\_world\_custom\_instr*. After the file generation has been finished, click on *hello\_world\_custom\_instr\_bsp* and open the *system.h* file. Verify that a custom instruction macro has been generated as below and copy these lines into the top of your C source file after the #include of system.h and rename the custom instructions for use in your program. This is done as any changes made to the custom instructions will be lost if you regenerate the BSP.

#define ALT\_CI\_MUL\_0(A,B) \_\_ builtin\_custom\_inii(ALT\_CI\_MUL\_0\_N,(A),(B))

#define ALT\_CI\_MUL\_0\_N 0x0

The second line provides the opcode for your custom instruction, and you can verify that this is the opcode assigned to your instruction by Platform Designer. The first line describes the custom instruction (input arguments and return values) and the associated custom instruction built-in function (Section 3.2 of ug\_nios2\_custom\_instruction.pdf).

Without making any changed to the template, build and run the application on the actual hardware to verify its functionality.

Change the code of hello world.c file as follows and verify its functionality.

```
#include <system.h>
#include <stdio.h>
#define ALT_CI_CUSTOM_MUL_0_N 0x0
#define ALT_CI_CUSTOM_MUL_0(A,B) 
__builtin_custom_inii(ALT_CI_CUSTOM_MUL_0_N,(A),(B))
int main()
{
   printf("Hello from Nios II!\n");
   int a, b, c, d;
  a = 2;
  b = 4;c = a * b;
   printf("Multiplication result: %d\n", c);
  d = ALT CI CUSTOM MUL \theta(a,b);
  printf("Multiplication result from custom instr: %d", d);
   return 0;
}
```
Please note that you need to include *the system.h* file in order to access the custom instruction.

These concludes the tutorial on how you can instantiate a new hardware block in your system and create a new custom instruction to allow the NIOS processor to access it. More detailed information can be found in:



[http://www.altera.com/literature/ug/ug\\_nios2\\_custom\\_instruction.pdf](http://www.altera.com/literature/ug/ug_nios2_custom_instruction.pdf)

(Chapters 1, 2 and 4).

You can find more information on Intel's support on floating point IP cores and the necessary conversions on the following documents: [http://www.altera.com/literature/ug/ug\\_altfp\\_mfug.pdf,](http://www.altera.com/literature/ug/ug_altfp_mfug.pdf,) [https://www.intel.com/content/www/us/en/programmable/documentation/cru1439932898327.html,](https://www.intel.com/content/www/us/en/programmable/documentation/cru1439932898327.html)  [https://www.intel.com/content/www/us/en/programmable/quartushelp/17.0/hdl/mega/mega\\_list\\_mega\\_lp](https://www.intel.com/content/www/us/en/programmable/quartushelp/17.0/hdl/mega/mega_list_mega_lpm_d1353e88.htm) [m\\_d1353e88.htm](https://www.intel.com/content/www/us/en/programmable/quartushelp/17.0/hdl/mega/mega_list_mega_lpm_d1353e88.htm)

#### **Add custom instruction to your system**

Having gone through the previous section, you should be familiar with the process of adding custom instructions to a NIOS system. The evaluation of the target function involves heavy use of floating-point addition, subtraction and multiplication, which at the moment are emulated through SW. As the target is to accelerate the evaluation of the function, the aim of this section is to map this functionality to HW.

Implement the custom logic blocks for floating-point adder, subtracter and multiplier operations in order to support hardware acceleration of the above operations. Design the HW block, verify it, and then integrate it to the rest of the system. Make sure that you select a suitable

Which type of custom instruction did you choose (i.e. Combinatorial, Multi-cycle,...)? Justify your choice.

What is the FPGA resources allocated for the new block? What is the throughput and latency of the new HW block? How the latency of your HW block impacts the overall performance of the system?

Compare the new performance results with the best performance achieved from the previous version of your system for all three test cases. Report the execution time (i.e. latency) of the function that computes  $F(x)$ , resource utilisation, code size, and error between the result of your system and the reference double precision Matlab implementation for each test case.

In the case where you have pipelined your design, please comment on how well the pipeline is utilised? Justify your answers.

VERIFICATION: GENERATING A SYSTEM, SYNTHESISING AND TESTING IT (A.K.A. TRIAL-AND-ERROR) IS TIME CONSUMING. YOU SHOULD VERIFY YOUR DESIGN THROUGH SIMULATION BEFORE TRYING RUNNING IT ON THE FPGA. THIS IS HOW REAL-LIFE SYSTEMS ARE DESIGNED, AND YOU SHOULD FOLLOW THIS PRACTICE THROUGHOUT THE COURSEWORK.

Hints:

- Organize your hardware blocks in a subfolder named "./IP" in your project folder;
- Create a new block diagram file and place your floating-point units there along with the necessary control logic. Remember to use labels in your connections, this will help to identify them during simulation;
- Pay attention to the interface signals of your block. They'll be different for different types of custom instructions;
- The floating-point arithmetic units are added to your block diagram via the **IP Catalog tab**
- Use MegaWizard to create your other blocks. Add them to your design, as Platform Designer will only add the top HDL file to your design. DO NOT add your top block of your design to the project, only the sub-blocks;
- If you're not familiar with either Verilog or VHDL, do your design using block diagram and then convert it to Verilog using **File->Create/Update->Create HDL File From Current File**. While creating a new component in Platform Designer it will ask for a VHDL or Verilog file. Remember to create a new one, and update your component, every time you update the block diagram. Platform Designer will create a copy of your HDL file (top of your block) when you create the component to a different location. Updating your HDL file won't update your block in the system. You have to manually copy it and replace it or regenerate the component;
- Keep in mind that Intel's CRC example is different than your floating-point add/sub/mult. You can't *just* follow the steps as they are in the guide; you need to think about what needs to be changed;

- After generating the BSP, verify that you have the macros for your custom instruction in the file **system.h**. **Check if the arguments and return value of \_\_builtin\_custom\_?n?? agree (see manual Section 3.2**). If not, create a new macro in your source file. **DO NOT EDIT system.h!**
- Replace all floating-point arithmetic operands in your application with calls to your custom application. E.g. (float)x = (float)a \* (float)b;  $\rightarrow x$  = cust fp mult(a, b); .

 $\frac{(\mathbf{0})}{\langle \langle \rangle \rangle}$  Debug:

- In case you don't read the expected value from your custom instruction, place a constant at the output of your hardware unit and verify if that value is read by the instruction.
- Whenever you introduce changes in your design, use different constant inputs so you can verify that your design is being updated. Platform Designer stores a copy of your top HDL file in a different location than your ./ip folder. It is a good idea to have the connections correct in that design before adding it to the Platform Designer.
- Use the simulation to adjust the correct number of clock cycles required to complete the operation.
- If you fail to have any output from your instruction at all, try to implement a simple one (combinatorial) to verify that you are calling the custom instruction correctly from your software. E.g. load a constant value from it, negate one of the outputs, OR/AND/XOR your inputs, etc.

# <span id="page-32-0"></span>Task 7: Add Dedicated Hardware Block to compute the inner part of the arithmetic expression

You've seen that you can improve the performance of your system by implementing some of the most computationally demanding operations in hardware. In the present scenario you're going to extend your previous implementation and map the whole arithmetic expression as a custom instruction in your system.

 $\mathbf{\hat{X}}$  Add the following custom instruction to your system:

$$
y = 0.5 \times x + x^{2} \cos((x - 128) / 128)
$$
 [2]

by mapping all operators to hardware including the implementation of a CORDIC block to implement  $\cos(.)$  instead of using the software implementation from  $\text{math.h}$ . Do not forget that we are aiming for the best performing design (fastest execution time) but with the smallest resource utilisation. Your CORDIC block implementation should provide a mean square error *e* less or equal to 1x10-10 with confidence level of 95%, assuming an input x that follows a uniform distribution in the range [-1, 1] (i.e. the possible range of input given our problem).

At this point, we will decouple the development of the custom instruction from the NIOS system you have built so far.

**The aim is to design and build an IP that takes as in input a single-precision floating point number x, and computes y through (2) using single-precision number format, except for the cos() evaluation where fixedpoint will be used.**

You should design your block in Quartus, use Modelsim to verify its functionality and performance (cycle number), and check whether it meets timing after place and route. In your investigation, you need to consider two architectures for the CORDIC block, **both using fixed-point format**. One that is optimised for throughput and one that is optimised for latency given the target clock frequency. When you are happy with the performance of your module, then you will need to interface it with the NIOS system.

For verification reasons, you need to implement the above custom instruction in two steps:

#### *Step 1*

Design two modules, each based on the previously mentioned CORDIC architectures. Check the accuracy that is achieved when compared to a double precision SW version (you can use Matlab for this test).

(**Hint**: you should use the Monte Carlo simulation technique to determine the mean square error of your design. In Matlab, you can use the *Fixed-point numeric object*, "fi", to represent the fixed-point variables.)

#### **Please note that a look-up table-based implementation is not allowed for the implementation of the**

#### **cos(.) function, or the use of an existing IP block that performs this function.**

Comment on the two architectures of your CORDIC HW block. How many CORDIC stages do they have? How did you select the number of stages? How do they impact the performance of the function evaluation (i.e. accuracy and latency) for the 3 test cases?

#### *Step 2*

In the second step, you need to implement the intended custom instruction, and integrate your CORDIC IP with the rest of the blocks. Describe how data is passed in your new HW block. Comment on their performance vs resource utilisation when the two modules (architectures) are integrated in your NIOS system.

For each of the two architectures, report the execution time (i.e. latency) of the function that computes  $F(x)$ , resource utilisation, and error between the result of your system and the reference double precision Matlab implementation for each test case. Do you get the performance that you were expecting from each version? Which one is a better system?

You can find more information on Altera's support on floating point IP cores and the necessary conversions on the following document[: http://www.altera.com/literature/ug/ug\\_altfp\\_mfug.pdf](http://www.altera.com/literature/ug/ug_altfp_mfug.pdf)

#### <span id="page-33-0"></span>Task 8: Add Dedicated Hardware Block to compute the arithmetic expression

Throughout this coursework you have been accelerating part of your software by desiging dedicated hardware blocks in the hardware that accelerate part of the required computations. (Hopefully) your application has been computing the targeted arithmetic function faster through all these hardware support.

 $\mathbf{\hat{x}}$  The aim of this final task is to map the whole function in hardware and reduce the execution time as much as you can. This is an open-ended task that you are free to use any possible architecture with the only restriction beeing that you have to use the economy version of the NIOS II processor (NIOSII/e).

Some design Hints:

- You need to add a floating-point adder to perform the overall sum in hardware. In this way you can speed up part of the computations.
- Think whether you have taken full use of the pipeline capabilities of your hardware blocks. How often you can push a new value of the x vector to your system? How can you increase its throughput? Do not forget that your NIOS processor blocks after the execution of a custom instrunction and waits for the assertion of the done signal to retrieve the result and being able to move to the next instruction. State any assumptions that you may have in your implementation.

Comment on the architecture of your final system and on the control and data flow of your design. Describe how and what data is passed in your new HW block. Comment on its pipeline's utilisation.

Compare the new performance results with the best performance achieved from the previous version of your system for all three test cases. Report the execution time (i.e. latency) of the function that computes F(x), resource utilisation, code size, and error between the result of your system and the reference double precision Matlab implementation for each test case.

**Hall** Debug hint:

If by any chance your system fails to synthesize with an error similar to this:

Error (12006): Node instance "func\_accel" instantiates undefined entity "first nios2 system func accel"

Double click on the error, select the first line of that unit instantiation, and replace:

"first nios2 system func accel" with "func accel"

This happens because somewhere in the script to generate the NIOS II system, the instantiation of your hardware block got its name concatenated with the name of your system.

 $\mathbf{I}$ 

# <span id="page-34-0"></span>**MARKING SCHEME**

The distribution of marks to the tasks is shown below. Please note that the marking will take into consideration your design approach, your achieved results, as well as the quality of the submitted report.



# <span id="page-34-1"></span>**COURSEWORK SUBMISSION**

# <span id="page-34-2"></span>Proposed work schedule and Intermediate deliverables deadlines

You should register in a logbook all that you have accomplished including results. The logbook doesn't replace the report.

### **Schedule**





**Only Reports 2 and 3 are assessed. Report 1 should be submitted for feedback. If you are late in submitting your report, the usual College penalty on late submission is applied. In the case of delaying Report 1, a 10% penalty will be applied on the final overall mark.**

#### <span id="page-35-0"></span>Reports - Marking

Each group needs to submit three reports on their project according to the schedule above.

The reports should contain the following:

- a) Answers to any questions raised in the project hand-out, e.g. performance, FPGA resources, etc.;
- b) Any related design in block diagram;
- c) A detailed justification of the design choices that you have made;
- d) Any limitations of your design;

e) An elaboration of what you would have done differently if you were doing the project from the beginning (for example you may have found that you took a wrong design decision but it is too late to go back and fix it. It is important to elaborate on this (why you believe it is a wrong decision and what would be a better one).

Please note that special attention will be given on the justification of your design choices and how you have approached the problem. So please provide enough details around these issues.

Also note that the report should not contain any background info like literature review on NIOS. However, the report should provide information on why you have chosen specific algorithms for the calculation of the arithmetic expression.

Below I have listed some general points and good practices on the report writing. Please note that the list below is not exhaustive.

#### **General points**

1. Describe clearly the aim of the report and its task

2. Provide discussion on every result that you include. Does it make sense? What conclusions you can draw from the reported results?

3. Be precise when you try to justify the results. Do not just use phrases like "… good enough..", "..very slow…"

4. Avoid using screenshots that do not have any useful info.

5. Do not provide a description of the tutorial that you followed. If there is a specific step that gave you trouble or you think is important to mention, then do just that.

6. Have a conclusion section (task level/report). What did you see? (do not summarize what you did)

7. Pay attention to the structure of the report.

8. The report should be in pdf and you should submit your code only for the final report.

10. Provide names and group id on the first page of your report.

11. Think about adding graphs to make your points more clear.

12. Resources: Your system in the FPGA, and your Program (memory occupied)

Report 1 – Good points to include

1. Think about the cache. How does its size affects the performance of the system?

2. Think about the size of your program. Do you use dynamic memory allocation? When does the process fail? (never during execution). Perform an offline analysis on the memory requirement of the various test cases to see whether you can fit the data in your system.

3. Comment on the impact of finite precision. Float vs Double

4. Present your results using graphs and tables. It is better to argue your point in this way.

5. Always include a performance vs recourses graph in order to compare your various designs (present and past). Given that the aim is the same across the various tasks, it is interesting to see how the design of your system (moving functionality to dedicated HW) affects the performance and resource utilization of the system.

#### 6. **page limit: 5 pages**

Report 2 - Good points to include

1. Discuss the implementation of cos function. Consider how it has been implemented in math.h library in C. Consider other versions of cos like cosf. Consider the Taylor expansion for the cos evaluation.

2. Discuss the impact of cache size

3. Look into other code optimization approaches. Consider the use of constants rather using variables, as well as expressing the code in a different way.

4. **page limit: 8 pages**

Report 3 - Good points to include

1. Comparison to previous implementations

2. Elaboration on the selection of the latency, achieved performance, area resources, achieved precision

3. Elaboration on the CORDIC architectures and justification on the final selected architecture. Elaboration on the design decisions behind the CORDIC architecture for achieving the required accuracy level.

4. Elaboration on the final adder to complete the whole design (function evaluation in HW)

5. **page limit: 10**

Please note that any pages beyond the page limit will not be taken into account. The page limit only refers to the body of the report (excludes the first page and references)

#### <span id="page-36-0"></span>Code

Each group needs to upload the source files of their project **for the most advanced design**, at the end of the project.

The source files should be organised with subdirectories according to the coursework tasks. **You only need to submit the most advanced working design (task)**. Please include the following extensions:

.qpf - quartus project .qsf - quartus settings .sdc - timing constraints .sopcinfo - system info .qsys - system specification .vhd - vhdl source files .v - verilog source files .qip - megawizard component



.tcl - script .bdf - block diagrams .bsf - symbols .do - modelsim macro .mpf - modelsim project .c - C source files .h - C include files .bsp - bsp settings .map .hex .bin .cproject/.project - eclipse project defs

You can follow the instructions below to make a .ZIP file with the deliverables:

1. Make a copy of your work containing only the designs/software you want to submit.

2. start command line "cmd";

3. change drive and directory to be your DSD folder e.g. "e:" "cd /dsd14/";

4. execute the following command. It will copy your sources into a zip file;

```
zip dsd -rq . -i "*.h" "*.c" "*.bsp" "*.map" "*.qsf" "*.qpf" "*.sdc" "*.sopcinfo" 
"*.qsys" "*.vhd" "*.v" "*.qip" "*.tcl" "*.bdf" "*.bsf" "*.do" "*.mpf" "*.*project" 
"*.bin" "*.hex"
```
5. verify the zip file contains all the source files you want to submit for evaluation; 6. upload the file.

# <span id="page-38-0"></span>**APPENDIX**

# <span id="page-38-1"></span>Troubleshooting

Below there's a list of common problems and their solutions:





#### <span id="page-39-0"></span>Quartus II Tutorial

Quartus is the application that maps your design to a particular FPGA device. More information on Quartus can be found in

https://www.intel.com/content/dam/www/programmable/us/en/pdfs/literature/manual/intro\_to\_quartus2. pdf

<span id="page-39-1"></span>

### <span id="page-39-2"></span>Application Profiling

In the quest to accelerate complex applications, it may not be obvious which instructions take most of the processor's time. To measure the performance of parts of your application you can make use of an external profiler or measure the time require. Altera provides an application note specific on this subject. You can find it here:



#### <span id="page-39-3"></span>Alternatives to NIOS II/Altera

Embedded systems with soft/hard processors and reconfigurable logic are common. Similar to Altera (NIOS II) there are alternatives, namely Zynq from Xilinx. The Zynq system has a hard processor occupying a region of the device, which cannot be used for reconfigurable logic, but on the other hand the processor has enough computational power to run Ubuntu on it.

Here are screenshots of the development environments for hardware (XPS) and software (SDK). You'll find them similar to the tools you're using in this coursework.

#### **Imperial College** London

# DSD - 2022





# <span id="page-41-0"></span>Remote Compilation

Design compilation can require quite a few resources from your computer, and if you have a slow computer or are using a VM, you may want to compile your project remotely (for reference, both the server and Level 5 computers take roughly 5 minutes to compile designs). The teaching server, **ee-mill1**, has Quartus 18.1 installed. To access this server, follow the instructions in the link below:

<https://wiki.imperial.ac.uk/pages/viewpage.action?pageId=250302869>

Once you have setup a connection, you need to do the following steps to run Quartus:

- 1. In the SSH shell command line, run: source /usr/local/altera/18.1/settings64.sh
- 2. Then to open Quartus, run: quartus

#### *(You can add the first command to your .bashrc file)*

This will open Quartus, which you can use to compile projects. If you plan to solely use the servers for compilation, it is recommended to keep all the project files here and only copy over the **.sof** and **.sopcinfo** files for programming the board and software development respectively. If you wish to move a whole project to the server, you are recommended to zip the project using the command in the *Code* appendix section.

To move files, if you are using MobaXterm, you can use the file explorer on the left, as shown below:

# **Imperial College**<br>London



For mac and linux users, you can use the scp command, like so:

*To server:* scp <local-file> <username>@ee-mill1.ee.ic.ac.uk:<remote-destination>

*From server:* scp <username>@ee-mill1.ee.ic.ac.uk:<remote-file> <local-destination>

\*\*\* end of document \*\*\*