RISC-V and Embedded Systems¶
RISC-V is the open-source instruction set architecture reshaping the chip industry. This file covers the RISC-V philosophy, the V vector extension, embedded ML inference, TinyML on microcontrollers, RISC-V in AI accelerators, and edge deployment constraints.
- Every chip architecture we have covered so far (x86, ARM) requires a licence. Intel and AMD pay for x86. Apple, Qualcomm, and every smartphone vendor pay ARM billions per year. RISC-V is different: it is an open standard. Anyone can design, manufacture, and sell RISC-V chips without paying royalties to anyone. This is changing the economics of chip design, especially for AI.
The RISC-V Philosophy¶
- RISC-V (pronounced "risk five") was created at UC Berkeley in 2010 as a clean, modern RISC instruction set. The key principles:
- Open standard: the ISA specification is freely available. You can build a RISC-V CPU without licensing fees, NDAs, or legal agreements. RISC-V is to chip design what Linux is to operating systems — anyone can use, modify, and build on it.
- Modular design: the base ISA (RV32I or RV64I) is minimal — just 47 instructions. Everything else is optional extensions: M (multiply/divide), A (atomic operations), F/D (floating point), C (compressed instructions), V (vector processing). You pick only what you need, keeping the chip small and efficient.
- No legacy baggage: x86 carries 45 years of backward compatibility. ARM carries 35. RISC-V starts clean, incorporating lessons learned from both: no obscure instructions that exist only for compatibility with 1980s software.
- Who uses RISC-V: SiFive (general-purpose cores), Alibaba (XuanTie server cores), Western Digital (storage controllers, billions shipped), Espressif (ESP32-C3, a popular IoT chip), and dozens of AI accelerator startups that use RISC-V for the control processor managing their custom compute units.
RISC-V Base Architecture¶
- The base integer ISA (RV64I for 64-bit) has:
- 32 general-purpose registers (x0-x31, each 64-bit). x0 is hardwired to zero (useful for implementing common patterns without special instructions).
- Fixed 32-bit instruction width (with the C extension adding 16-bit compressed instructions for code density).
- Load-store architecture: like ARM, arithmetic operates on registers only. Memory access is through explicit load/store instructions.
```asm
# RISC-V assembly (for flavour — you will use C/C++)
add  x3, x1, x2      # x3 = x1 + x2
lw   x4, 0(x5)       # load word from address in x5
sw   x4, 8(x5)       # store word to address x5 + 8
beq  x1, x2, label   # branch if x1 == x2
```
- The simplicity of the ISA makes RISC-V cores small and power-efficient. A minimal RV32I core can be implemented in ~10,000 gates (an ARM Cortex-M0 is ~12,000). This matters for embedded systems where every milliwatt and every square millimetre of silicon counts.
The V Extension: RISC-V Vector Processing¶
- The V extension (RVV) adds scalable vector processing to RISC-V, similar to ARM SVE. Vector registers have a configurable length (VLEN), specified by the hardware (128 to 65,536 bits). Code is written to be vector-length agnostic: it works on any VLEN without recompilation.
```cpp
#include <riscv_vector.h>

// Vector addition using RVV intrinsics
void vadd_rvv(const float* a, const float* b, float* c, int n) {
    while (n > 0) {
        // vsetvl: set vector length — processes min(n, hardware_max) elements
        size_t vl = __riscv_vsetvl_e32m1(n);
        // Load vl elements
        vfloat32m1_t va = __riscv_vle32_v_f32m1(a, vl);
        vfloat32m1_t vb = __riscv_vle32_v_f32m1(b, vl);
        // Add
        vfloat32m1_t vc = __riscv_vfadd_vv_f32m1(va, vb, vl);
        // Store
        __riscv_vse32_v_f32m1(c, vc, vl);
        // Advance pointers
        a += vl; b += vl; c += vl; n -= vl;
    }
}
```
- `vsetvl` is the key instruction. It tells the hardware "I want to process this many elements" and the hardware responds "I can process this many" (capped by VLEN). The loop automatically adapts to any vector width, with no scalar cleanup loop needed (the last iteration simply processes fewer elements).
- LMUL (length multiplier): RVV can group multiple vector registers together (m1, m2, m4, m8) to process more elements per instruction at the cost of fewer available registers. `m1` uses one register per vector operand; `m8` uses eight, processing 8x more elements but leaving only 4 register groups available.
- Compared to x86 AVX (fixed 256/512-bit) and ARM NEON (fixed 128-bit), RVV's scalability is a major advantage for diverse hardware: the same code runs on a tiny embedded core (VLEN=128) and a high-performance server core (VLEN=1024+).
Embedded ML: TinyML¶
- TinyML is machine learning on microcontrollers — devices with kilobytes of RAM, megahertz-class CPUs, and milliwatt power budgets. Think: a sensor that detects keywords ("Hey Siri"), an accelerometer that classifies gestures, or a camera that counts people, all running on a chip costing $0.50 with no internet connection.
- The constraints are extreme:
| Resource | Server GPU | Smartphone | Microcontroller |
|---|---|---|---|
| RAM | 80 GB | 6 GB | 256 KB |
| Storage | TB | 128 GB | 1 MB |
| Compute | 1000 TFLOPS | 10 TFLOPS | 0.001 TFLOPS |
| Power | 700 W | 5 W | 0.001 W |
| Cost | $30,000 | $500 | $1 |
- A model that fits on a server GPU (\(O(10^{10})\) parameters) will not fit on a microcontroller. TinyML models have \(O(10^4)\)–\(O(10^6)\) parameters and use INT8 or even INT4 quantisation.
TensorFlow Lite Micro (TFLM)¶
- TFLM is Google's inference framework for microcontrollers. It runs quantised TensorFlow Lite models without dynamic memory allocation, without an OS, and with a ~20 KB binary footprint.
```cpp
// TinyML inference on a microcontroller (simplified)
#include "tensorflow/lite/micro/micro_interpreter.h"
#include "tensorflow/lite/micro/micro_mutable_op_resolver.h"

// Model is compiled into a C array (const unsigned char model_data[])
const tflite::Model* model = tflite::GetModel(model_data);

// Register only the ops the model actually uses (keeps the binary small)
tflite::MicroMutableOpResolver<4> resolver;

// Allocate a fixed memory arena (no malloc!)
constexpr int kArenaSize = 10 * 1024;  // 10 KB
uint8_t tensor_arena[kArenaSize];

// Set up interpreter
tflite::MicroInterpreter interpreter(model, resolver, tensor_arena, kArenaSize);
interpreter.AllocateTensors();

// Set input
float* input = interpreter.input(0)->data.f;
input[0] = sensor_reading;

// Run inference
interpreter.Invoke();

// Read output
float* output = interpreter.output(0)->data.f;
if (output[0] > 0.8f) {
    trigger_alert();
}
```
- Key constraints in this code:
- `tensor_arena` is statically allocated — no `malloc`, no heap. Embedded systems often have no dynamic memory allocator.
- The model is a `const` byte array, stored in flash memory (ROM), not loaded from a file system.
- The entire framework + model + runtime fits in a few tens of KB.
Model Optimisation for Edge¶
- Getting a model to run on a microcontroller requires aggressive optimisation:
- Quantisation (chapter 18): convert float32 weights to INT8 (4x smaller, 2-4x faster on integer-only hardware). Post-training quantisation is simple; quantisation-aware training preserves more accuracy.
- Pruning: remove weights close to zero. Structured pruning (removing entire channels/heads) is more hardware-friendly than unstructured pruning (random zeros) because it reduces actual computation, not just storage.
- Knowledge distillation (chapter 6): train a small "student" model to mimic a large "teacher" model. The student achieves higher accuracy than training from scratch because it learns from the teacher's soft predictions.
- Neural Architecture Search (NAS): automatically search for efficient architectures that fit within a hardware budget (latency, memory, power). MicroNets and MCUNet find architectures optimised for specific microcontrollers.
- Operator fusion: combine conv + batch norm + ReLU into a single fused operation, eliminating intermediate memory writes (the same principle as GPU kernel fusion, but even more critical when you have 256 KB of RAM).
RISC-V in AI Accelerators¶
- Many AI accelerator startups use RISC-V not for running ML models directly, but as the control processor that manages the custom compute units:
```
┌─────────────────────────────────────────┐
│              AI Accelerator             │
│                                         │
│  ┌──────────┐      ┌──────────────────┐ │
│  │  RISC-V  │───→  │  Custom Matrix   │ │
│  │  Control │      │  Multiply Unit   │ │
│  │  Core    │      │  (systolic array,│ │
│  │          │      │  custom dataflow)│ │
│  └──────────┘      └──────────────────┘ │
│       │                     │           │
│       ▼                     ▼           │
│  ┌──────────┐      ┌──────────────────┐ │
│  │  Memory  │      │  On-chip SRAM    │ │
│  │  Control │      │  (activation     │ │
│  │          │      │   buffer)        │ │
│  └──────────┘      └──────────────────┘ │
└─────────────────────────────────────────┘
```
- The RISC-V core handles loading model weights from external memory, scheduling layer execution, managing data flow between compute units, and communicating with the host (via PCIe, USB, or SPI). The heavy computation (matrix multiplies, convolutions) is done by the custom hardware, not the RISC-V core.
- Why RISC-V for control: no licensing cost (critical for startups), customisability (you can add domain-specific instructions), small footprint (a control core does not need x86's complexity), and an open ecosystem that enables rapid prototyping.
- Examples: Esperanto Technologies (1,000+ RISC-V cores for ML), Tenstorrent (RISC-V control + custom Tensix cores), SiFive (RISC-V cores with vector extensions for edge ML).
Edge Deployment Constraints¶
- Deploying ML at the edge (on-device, not in the cloud) introduces constraints that cloud deployment does not face:
- Power: a battery-powered device might have a total power budget of 100 mW. Running a model that consumes 50 mW leaves only 50 mW for the rest of the system (sensors, radio, display). Power-aware inference schedules computation to avoid thermal throttling and extend battery life.
- Latency: edge inference must often be real-time. A wake-word detector ("Hey Siri") must respond within ~200 ms. An autonomous driving perception system (chapter 11) must process frames within ~30 ms. A network round trip to the cloud (50-200 ms) is too slow for these use cases.
- Privacy: processing data on-device means sensitive data (medical images, voice recordings, personal photos) never leaves the device. This is a legal requirement in some jurisdictions (GDPR) and a user-trust requirement everywhere.
- Connectivity: edge devices may have intermittent or no internet connection. A model running on a Mars rover (chapter 11), a submarine, or a rural farm sensor must work entirely offline.
- Cost at scale: deploying ML to a billion smartphones costs $0 per device (the hardware already exists). Deploying to a billion IoT sensors means each sensor's ML hardware budget is pennies. RISC-V's zero licensing cost matters enormously at this scale.
Coding Tasks (compile with g++ or riscv64-gcc cross-compiler)¶
- Write a C++ program that simulates a TinyML inference pipeline: statically allocate a model buffer, run a mock forward pass, and measure resource usage. This exercise teaches the embedded constraints (no malloc, fixed memory arena).
```cpp
// task1_tinyml_sim.cpp
// Compile: g++ -O2 -o task1 task1_tinyml_sim.cpp
#include <iostream>
#include <chrono>
#include <cstring>
#include <cstdint>
#include <algorithm>

// Simulate a microcontroller: fixed memory arena, no dynamic allocation
static constexpr int ARENA_SIZE = 32 * 1024;  // 32 KB total RAM budget
static uint8_t arena[ARENA_SIZE];

// Simple 2-layer MLP: 784 -> 64 -> 10 (MNIST-like, INT8 weights)
struct TinyModel {
    int8_t w1[784 * 64];  // layer 1 weights: 50,176 bytes
    int8_t b1[64];        // layer 1 biases
    int8_t w2[64 * 10];   // layer 2 weights: 640 bytes
    int8_t b2[10];        // layer 2 biases
    // Total: ~51 KB -> must go in flash (ROM), not RAM
};

// Check if model fits in flash
void check_model_fit(int flash_kb) {
    int model_bytes = sizeof(TinyModel);
    std::cout << "Model size: " << model_bytes << " bytes ("
              << model_bytes / 1024 << " KB)\n";
    std::cout << "Flash: " << flash_kb << " KB -> "
              << (model_bytes <= flash_kb * 1024 ? "FITS" : "TOO LARGE") << "\n";
}

// Mock inference using the fixed arena for activations
void mock_inference(const int8_t* input, int8_t* output) {
    // Activations go in the arena (RAM), not allocated dynamically
    int8_t* act1 = (int8_t*)arena;         // 64 bytes for layer 1 output
    int8_t* act2 = (int8_t*)(arena + 64);  // 10 bytes for layer 2 output

    // Layer 1: simplified matmul (not real quantised matmul, just structure demo)
    for (int j = 0; j < 64; j++) {
        int32_t sum = 0;  // accumulate in int32 to avoid overflow
        for (int i = 0; i < 784; i++) {
            sum += (int32_t)input[i] * 1;  // mock: weight = 1
        }
        act1[j] = (int8_t)std::max(-128, std::min(127, sum / 784));  // quantise back
        act1[j] = act1[j] > 0 ? act1[j] : 0;  // ReLU
    }

    // Layer 2
    for (int j = 0; j < 10; j++) {
        int32_t sum = 0;
        for (int i = 0; i < 64; i++) {
            sum += (int32_t)act1[i] * 1;
        }
        act2[j] = (int8_t)std::max(-128, std::min(127, sum / 64));
    }
    std::memcpy(output, act2, 10);
}

int main() {
    std::cout << "=== TinyML Resource Budget ===\n";
    std::cout << "Arena (RAM): " << ARENA_SIZE << " bytes ("
              << ARENA_SIZE / 1024 << " KB)\n";
    check_model_fit(256);  // typical MCU flash

    // Activation memory used
    int activation_bytes = 64 + 10;  // layer 1 + layer 2 outputs
    std::cout << "Activation memory: " << activation_bytes << " bytes / "
              << ARENA_SIZE << " available\n\n";

    // Benchmark inference
    int8_t input[784];
    int8_t output[10];
    std::memset(input, 1, 784);

    auto start = std::chrono::high_resolution_clock::now();
    for (int i = 0; i < 10000; i++) {
        mock_inference(input, output);
    }
    auto end = std::chrono::high_resolution_clock::now();
    double us = std::chrono::duration<double, std::micro>(end - start).count() / 10000;

    std::cout << "Inference latency: " << us << " us\n";
    std::cout << "At 160 MHz MCU (~6.25 ns/cycle): ~" << (int)(us * 160) << " cycles\n";
    std::cout << "Output logits: ";
    for (int i = 0; i < 10; i++) std::cout << (int)output[i] << " ";
    std::cout << "\n";
    return 0;
}
```
- Write a C++ program that quantises float32 weights to INT8 and measures the compression ratio and quantisation error.
```cpp
// task2_quantise.cpp
// Compile: g++ -O3 -o task2 task2_quantise.cpp
#include <iostream>
#include <vector>
#include <cmath>
#include <cstdint>
#include <algorithm>

// Symmetric quantisation: map float range [-max, +max] to [-127, +127]
void quantise_symmetric(const float* input, int8_t* output, int n, float& scale) {
    float max_val = 0.0f;
    for (int i = 0; i < n; i++) {
        max_val = std::max(max_val, std::abs(input[i]));
    }
    scale = max_val / 127.0f;
    for (int i = 0; i < n; i++) {
        float scaled = input[i] / scale;
        output[i] = (int8_t)std::max(-127.0f, std::min(127.0f, std::round(scaled)));
    }
}

// Dequantise: INT8 back to float
void dequantise(const int8_t* input, float* output, int n, float scale) {
    for (int i = 0; i < n; i++) {
        output[i] = (float)input[i] * scale;
    }
}

int main() {
    const int N = 100000;

    // Simulate random weights (roughly normal distribution)
    std::vector<float> weights(N);
    for (int i = 0; i < N; i++) {
        // Simple pseudo-random normal-ish values (Box-Muller on crude uniforms)
        float u1 = (float)(i * 7 % 997 + 1) / 998.0f;
        float u2 = (float)(i * 13 % 991 + 1) / 992.0f;
        weights[i] = std::sqrt(-2.0f * std::log(u1)) * std::cos(6.2832f * u2) * 0.1f;
    }

    // Quantise
    std::vector<int8_t> quantised(N);
    float scale;
    quantise_symmetric(weights.data(), quantised.data(), N, scale);

    // Dequantise and measure error
    std::vector<float> reconstructed(N);
    dequantise(quantised.data(), reconstructed.data(), N, scale);

    float max_error = 0.0f, total_error = 0.0f;
    for (int i = 0; i < N; i++) {
        float err = std::abs(weights[i] - reconstructed[i]);
        max_error = std::max(max_error, err);
        total_error += err;
    }

    std::cout << "=== Quantisation Results ===\n";
    std::cout << "Original:  " << N * 4 << " bytes (float32)\n";
    std::cout << "Quantised: " << N * 1 << " bytes (int8) + 4 bytes (scale)\n";
    std::cout << "Compression: 4x\n";
    std::cout << "Scale factor: " << scale << "\n";
    std::cout << "Mean abs error: " << total_error / N << "\n";
    std::cout << "Max abs error:  " << max_error << "\n";
    std::cout << "Max abs error / scale: " << max_error / scale
              << " (should be <= 0.5 quantisation levels)\n";
    return 0;
}
```
- Write a C++ program that performs INT8 matrix multiplication with INT32 accumulation — the actual computation that runs on embedded ML accelerators.
```cpp
// task3_int8_matmul.cpp
// Compile: g++ -O3 -o task3 task3_int8_matmul.cpp
#include <iostream>
#include <chrono>
#include <vector>
#include <cstdint>

// INT8 matmul with INT32 accumulation (what Tensor Cores and MCU accelerators do)
void matmul_int8(const int8_t* A, const int8_t* B, int32_t* C, int M, int N, int K) {
    for (int i = 0; i < M; i++) {
        for (int j = 0; j < N; j++) {
            int32_t sum = 0;
            for (int k = 0; k < K; k++) {
                sum += (int32_t)A[i * K + k] * (int32_t)B[k * N + j];
            }
            C[i * N + j] = sum;
        }
    }
}

// Float32 matmul for comparison
void matmul_f32(const float* A, const float* B, float* C, int M, int N, int K) {
    for (int i = 0; i < M; i++) {
        for (int j = 0; j < N; j++) {
            float sum = 0.0f;
            for (int k = 0; k < K; k++) {
                sum += A[i * K + k] * B[k * N + j];
            }
            C[i * N + j] = sum;
        }
    }
}

int main() {
    const int M = 128, N = 128, K = 128;
    std::vector<int8_t> A_i8(M * K, 1), B_i8(K * N, 1);
    std::vector<int32_t> C_i32(M * N);
    std::vector<float> A_f32(M * K, 1.0f), B_f32(K * N, 1.0f);
    std::vector<float> C_f32(M * N);

    // Benchmark INT8
    auto start = std::chrono::high_resolution_clock::now();
    for (int t = 0; t < 100; t++) {
        matmul_int8(A_i8.data(), B_i8.data(), C_i32.data(), M, N, K);
    }
    auto end = std::chrono::high_resolution_clock::now();
    double i8_ms = std::chrono::duration<double, std::milli>(end - start).count() / 100;

    // Benchmark FP32
    start = std::chrono::high_resolution_clock::now();
    for (int t = 0; t < 100; t++) {
        matmul_f32(A_f32.data(), B_f32.data(), C_f32.data(), M, N, K);
    }
    end = std::chrono::high_resolution_clock::now();
    double f32_ms = std::chrono::duration<double, std::milli>(end - start).count() / 100;

    double gops_i8 = 2.0 * M * N * K / i8_ms / 1e6;
    double gflops_f32 = 2.0 * M * N * K / f32_ms / 1e6;
    std::cout << "INT8 matmul: " << i8_ms << " ms (" << gops_i8 << " GOPS)\n";
    std::cout << "FP32 matmul: " << f32_ms << " ms (" << gflops_f32 << " GFLOPS)\n";
    std::cout << "INT8 speedup: " << f32_ms / i8_ms << "x\n";
    std::cout << "Memory: INT8 = " << M * K + K * N << " bytes vs FP32 = "
              << (M * K + K * N) * 4 << " bytes (4x less)\n";
    return 0;
}
```