
Concurrency and Parallelism

Concurrency and parallelism are how programs do multiple things at once. This file covers the distinction between concurrency and parallelism, synchronisation primitives, classic concurrency problems, deadlock, lock-free data structures, parallel programming models, async programming, and scaling laws: the concepts that underpin multi-threaded servers, distributed training, and every modern application.

  • A single CPU core executes one instruction at a time. But modern systems have 8, 64, or even thousands of cores (GPUs). And even on a single core, we want to handle multiple tasks: download a file while rendering a UI while processing user input. Concurrency and parallelism are the two strategies for managing multiple activities.

Concurrency vs Parallelism

Concurrency interleaves tasks on one core; parallelism runs tasks simultaneously on multiple cores

  • Concurrency is about managing multiple tasks. Tasks make progress by interleaving: task A runs for a bit, then task B, then back to A. On a single core, concurrency creates the illusion of simultaneous execution. The tasks are not truly simultaneous; they take turns.

  • Parallelism is about executing multiple tasks simultaneously. With \(n\) cores, \(n\) tasks can genuinely run at the same time. Parallelism requires multiple hardware execution units.

  • The analogy: concurrency is one chef alternating between chopping vegetables and stirring the pot. Parallelism is two chefs, each doing one task simultaneously. A system can be concurrent but not parallel (one core, interleaved tasks), parallel but not concurrent (multiple cores running independent programs with no interaction), or both (multiple cores running interleaved tasks that interact).

  • In ML, concurrency appears in data loading (overlap data preprocessing with GPU computation), while parallelism appears in distributed training (multiple GPUs computing gradients simultaneously, chapter 6).

Synchronisation Primitives

  • When multiple threads share data, synchronisation prevents race conditions. A race condition occurs when the outcome depends on the unpredictable order of thread execution.

  • Consider two threads both incrementing a shared counter: counter += 1. This is actually three operations: (1) read counter, (2) add 1, (3) write counter. If both threads read the same value (say 5), both add 1, and both write 6, the counter ends up at 6 instead of the correct 7. One increment is lost.

  • A mutex (mutual exclusion lock) ensures only one thread accesses a critical section at a time. A thread acquires the lock before entering the critical section and releases it after. Any other thread trying to acquire a held lock blocks until it is released.

lock.acquire()
counter += 1      # only one thread at a time here
lock.release()
  • Mutexes are correct but introduce contention: if many threads compete for the same lock, they spend time waiting instead of computing. This limits scalability. The extreme case, where all threads want the same lock, serialises the entire program.

  • A semaphore generalises a mutex. A counting semaphore maintains a counter: wait() decrements the counter (blocking if it would go negative), and signal() increments it. A semaphore initialised to 1 behaves like a mutex. A semaphore initialised to \(n\) allows up to \(n\) threads into the critical section simultaneously (useful for resource pools like database connections).
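The resource-pool use case can be sketched with Python's threading.Semaphore. This is an illustrative example, not taken from the text: a semaphore initialised to 3 plays the role of a pool of 3 connections, and we track the peak number of simultaneous holders to confirm it never exceeds the limit.

```python
import threading
import time

pool = threading.BoundedSemaphore(3)   # initialised to 3: a pool of 3 "connections"
state_lock = threading.Lock()
active = 0
peak = 0

def use_connection():
    global active, peak
    with pool:                          # wait(): blocks while all 3 slots are taken
        with state_lock:
            active += 1
            peak = max(peak, active)
        time.sleep(0.01)                # pretend to use the connection
        with state_lock:
            active -= 1
    # leaving the with-block is signal(): the slot is free again

threads = [threading.Thread(target=use_connection) for _ in range(10)]
for t in threads: t.start()
for t in threads: t.join()
print(f"peak holders: {peak}")          # never exceeds 3
```

Ten threads contend for three slots; the semaphore admits at most three at a time, and the rest block in `wait()` until a slot is signalled free.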

  • A condition variable lets a thread wait until a specific condition is met. The thread releases a lock, waits on the condition variable, and is woken up when another thread signals the condition. This avoids busy-waiting (repeatedly checking a condition in a loop, wasting CPU).
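A minimal sketch of this wait/signal pattern with Python's threading.Condition (the names `producer`/`consumer` here are just for illustration). Note the `while` loop around `wait()`: the condition is re-checked after waking, guarding against spurious wakeups.

```python
import threading

items = []
cond = threading.Condition()

def consumer(results):
    with cond:
        while not items:          # re-check the condition after every wakeup
            cond.wait()           # releases the lock while waiting, no busy-wait
        results.append(items.pop())

def producer():
    with cond:
        items.append("job")
        cond.notify()             # wake one thread waiting on the condition

results = []
c = threading.Thread(target=consumer, args=(results,))
c.start()
producer()
c.join()
print(results)  # ['job']
```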

  • A monitor bundles a mutex with condition variables and shared data into a single abstraction. Java's synchronized keyword and Python's threading.Condition implement monitor-like semantics.

  • Read-write locks distinguish between readers (who can share access, since reading does not modify data) and writers (who need exclusive access). Multiple readers can hold the lock simultaneously, but a writer blocks all readers and other writers. This is optimal when reads far outnumber writes (e.g., a cached model serving predictions).
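Python's standard library has no read-write lock, so here is a minimal sketch built from a mutex and a condition variable, assuming reader-preference semantics (writers can starve under a constant stream of readers, the fairness issue discussed below):

```python
import threading

class ReadWriteLock:
    """Minimal reader-preference RW lock: many readers OR one writer."""
    def __init__(self):
        self._cond = threading.Condition()
        self._readers = 0
        self._writer = False

    def acquire_read(self):
        with self._cond:
            while self._writer:            # readers only wait for an active writer
                self._cond.wait()
            self._readers += 1

    def release_read(self):
        with self._cond:
            self._readers -= 1
            if self._readers == 0:
                self._cond.notify_all()    # last reader out lets a writer in

    def acquire_write(self):
        with self._cond:
            while self._writer or self._readers:   # writers need exclusive access
                self._cond.wait()
            self._writer = True

    def release_write(self):
        with self._cond:
            self._writer = False
            self._cond.notify_all()

rw = ReadWriteLock()
rw.acquire_read(); rw.acquire_read()       # two readers share the lock
rw.release_read(); rw.release_read()
rw.acquire_write()                         # writer waits for zero readers
rw.release_write()
print("ok")
```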

Classic Concurrency Problems

  • Producer-Consumer (bounded buffer): producers generate items and place them in a fixed-size buffer; consumers remove items. The challenges: producers must wait when the buffer is full, consumers must wait when it is empty, and both must avoid corrupting the buffer.

  • The solution uses two semaphores (one counting empty slots, one counting full slots) plus a mutex for the buffer itself. This is the pattern behind most message queues, logging systems, and data pipelines.
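The two-semaphores-plus-mutex pattern, sketched in Python with one producer and one consumer (buffer size and item counts are arbitrary choices for the demo):

```python
import threading

BUF_SIZE = 4
buffer = []
empty = threading.Semaphore(BUF_SIZE)  # counts free slots; producer waits at 0
full = threading.Semaphore(0)          # counts filled slots; consumer waits at 0
mutex = threading.Lock()               # protects the buffer itself

def producer(n):
    for i in range(n):
        empty.acquire()        # wait for a free slot (blocks when buffer is full)
        with mutex:
            buffer.append(i)
        full.release()         # announce a filled slot

def consumer(n, out):
    for _ in range(n):
        full.acquire()         # wait for an item (blocks when buffer is empty)
        with mutex:
            out.append(buffer.pop(0))
        empty.release()        # announce a free slot

out = []
p = threading.Thread(target=producer, args=(20,))
c = threading.Thread(target=consumer, args=(20, out))
p.start(); c.start(); p.join(); c.join()
print(out == list(range(20)))  # True: every item arrives exactly once, in order
```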

  • Readers-Writers: multiple readers can read simultaneously, but writers need exclusive access. The challenge is fairness: if readers keep arriving, the writer may starve (never get access). Solutions prioritise either readers or writers, or alternate fairly.

  • Dining Philosophers: five philosophers sit around a table with five forks between them. Each needs two forks to eat. If all five pick up their left fork simultaneously, nobody can pick up their right fork and everyone starves (deadlock). Solutions include: pick up both forks atomically, introduce asymmetry (one philosopher picks up the right fork first), or use a waiter (semaphore limiting diners to 4).
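The waiter solution can be sketched directly: a semaphore initialised to 4 guarantees at least one philosopher can always acquire both forks, so the all-left-forks deadlock cannot form (the round count is an arbitrary choice for the demo).

```python
import threading

N = 5
forks = [threading.Lock() for _ in range(N)]
waiter = threading.Semaphore(N - 1)    # at most 4 philosophers reach for forks at once
meals = [0] * N

def philosopher(i, rounds):
    left, right = forks[i], forks[(i + 1) % N]
    for _ in range(rounds):
        with waiter:                   # the waiter breaks the deadlock cycle
            with left:
                with right:
                    meals[i] += 1      # eat

threads = [threading.Thread(target=philosopher, args=(i, 100)) for i in range(N)]
for t in threads: t.start()
for t in threads: t.join()
print(meals)  # [100, 100, 100, 100, 100]: everyone eats, nobody deadlocks
```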

Deadlock

  • Deadlock occurs when a set of threads are each waiting for a resource held by another thread in the set, creating a cycle of dependencies. Nobody can proceed.

Deadlock: Thread A holds Lock 1 and wants Lock 2, Thread B holds Lock 2 and wants Lock 1 — circular wait

  • Four necessary conditions for deadlock (all must hold simultaneously):

    1. Mutual exclusion: the resource can only be held by one thread.
    2. Hold and wait: a thread holds one resource while waiting for another.
    3. No preemption: resources cannot be forcibly taken from a thread.
    4. Circular wait: a cycle exists in the wait-for graph.
  • Deadlock prevention breaks one of the four conditions:

    • Eliminate circular wait: impose a total ordering on resources. All threads acquire resources in the same order. If every thread always acquires lock A before lock B, a cycle is impossible.
    • Eliminate hold and wait: require threads to request all resources at once (atomically).
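The resource-ordering idea can be sketched in a few lines: sort the locks by a stable key (here `id()`, an illustrative choice) so every thread acquires any pair in the same global order, making circular wait impossible.

```python
import threading

lock_a = threading.Lock()
lock_b = threading.Lock()
done = []

def acquire_in_order(*locks):
    # A total order on locks breaks circular wait: every thread
    # acquires any subset of locks in the same global order.
    ordered = sorted(locks, key=id)
    for lock in ordered:
        lock.acquire()
    return ordered

def worker():
    held = acquire_in_order(lock_a, lock_b)
    try:
        done.append(threading.get_ident())   # critical section using both resources
    finally:
        for lock in reversed(held):
            lock.release()

threads = [threading.Thread(target=worker) for _ in range(2)]
for t in threads: t.start()
for t in threads: t.join()
print(len(done))  # 2: both threads completed, no deadlock possible
```

Without the ordering, one thread taking A-then-B while the other takes B-then-A is exactly the circular wait in the diagram above.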
  • Deadlock avoidance dynamically decides whether granting a resource request could lead to deadlock. The Banker's algorithm maintains the maximum possible demand of each thread and only grants requests that leave the system in a "safe state" (a state where all threads can eventually complete). The algorithm is \(O(n^2 m)\) per request (\(n\) threads, \(m\) resource types), which is too expensive for most real systems.

  • Deadlock detection lets deadlocks happen, then detects them (by finding cycles in the wait-for graph) and recovers (by killing a thread or rolling back a transaction).

  • In practice, most systems use prevention (resource ordering) for common cases and detection for rare cases. Database systems are the classic example: they detect deadlocks among transactions and abort one to break the cycle.

Lock-Free and Wait-Free Data Structures

  • Locks introduce contention, priority inversion, and the risk of deadlock. Lock-free data structures avoid locks entirely, using atomic operations provided by the hardware.

  • The key atomic operation is Compare-And-Swap (CAS): atomically check if a memory location has an expected value, and if so, replace it with a new value. In pseudocode:

CAS(address, expected, new_value):
    if *address == expected:
        *address = new_value
        return true
    else:
        return false
  • CAS is implemented as a single hardware instruction, so it is atomic even without locks. Lock-free algorithms use CAS in a retry loop: read the current value, compute the new value, attempt to CAS. If another thread modified the value in the meantime, the CAS fails and the thread retries.
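The retry-loop pattern can be sketched in Python. CPython exposes no user-level CAS instruction, so this `AtomicInt` class simulates one with an internal lock purely to stand in for hardware atomicity; the point of the sketch is the read/compute/CAS/retry loop itself.

```python
import threading

class AtomicInt:
    """Illustrative CAS: the lock stands in for a single hardware instruction."""
    def __init__(self, value=0):
        self._value = value
        self._lock = threading.Lock()

    def load(self):
        return self._value

    def compare_and_swap(self, expected, new_value):
        with self._lock:               # hardware CAS is atomic without any lock
            if self._value == expected:
                self._value = new_value
                return True
            return False

counter = AtomicInt(0)

def increment(n):
    for _ in range(n):
        while True:                    # the lock-free retry loop
            old = counter.load()       # 1. read the current value
            if counter.compare_and_swap(old, old + 1):
                break                  # 2. CAS succeeded: increment is published
            # 3. another thread changed the value first; re-read and retry

threads = [threading.Thread(target=increment, args=(10000,)) for _ in range(4)]
for t in threads: t.start()
for t in threads: t.join()
print(counter.load())  # 40000: no lost updates
```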

  • Lock-free: at least one thread makes progress in a finite number of steps (no deadlock possible, but individual threads might retry indefinitely under contention).

  • Wait-free: every thread makes progress in a bounded number of steps (strongest guarantee, but hardest to achieve).

  • Lock-free stacks, queues, and hash maps are widely used in high-performance systems. Java's ConcurrentHashMap and Go's atomic operations are built on CAS.

Parallel Programming Models

  • Shared memory parallelism: all threads access the same memory space. Synchronisation is the programmer's responsibility. OpenMP provides compiler directives to parallelise loops:
#pragma omp parallel for
for (int i = 0; i < n; i++) {
    result[i] = compute(data[i]);
}
  • The compiler splits the loop iterations across available cores. OpenMP is effective for data-parallel workloads (same operation on many data points) and is widely used in scientific computing.

  • Message passing parallelism: each process has its own memory. Communication happens by sending and receiving messages. MPI (Message Passing Interface) is the standard for distributed computing across nodes:

MPI_Send(data, count, MPI_FLOAT, dest, tag, MPI_COMM_WORLD);
MPI_Recv(data, count, MPI_FLOAT, src, tag, MPI_COMM_WORLD, &status);
  • MPI scales to thousands of nodes because there is no shared state to synchronise. Distributed deep learning (chapter 6) uses collective operations like MPI_Allreduce (often implemented as a ring all-reduce) to synchronise gradients across GPUs.

  • GPU parallelism follows the SIMT (Single Instruction, Multiple Threads) model: thousands of threads execute the same instruction on different data. This is perfect for matrix operations (chapter 2), where the same multiply-add is applied to every element. We will cover GPU programming in detail in a later chapter.

Async and Event-Driven Programming

  • Not all concurrency needs threads. Asynchronous programming handles many I/O-bound tasks with a single thread using an event loop.

  • The event loop maintains a queue of tasks. When a task needs to wait for I/O (network response, file read), it registers a callback and yields control. The event loop picks up the next ready task. When the I/O completes, the callback is queued and eventually executed. No thread is blocked during the wait.

  • Coroutines are functions that can suspend and resume. async/await syntax (Python, JavaScript, Rust) makes coroutines look like regular sequential code:

async def fetch_data(url):
    response = await http_get(url)  # suspends here, event loop runs other tasks
    return process(response)         # resumes when response arrives
  • The await keyword suspends the coroutine and returns control to the event loop. When the awaited operation completes, the coroutine resumes where it left off. This is cooperative multitasking: coroutines voluntarily yield, unlike preemptive multitasking where the OS forcibly switches threads.

  • Async is ideal for I/O-bound workloads with many concurrent connections (web servers handling thousands of clients). It is not suitable for CPU-bound work (the single-threaded event loop cannot utilise multiple cores). For CPU-bound work, use threads or processes.
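The I/O-concurrency payoff is easy to demonstrate with a runnable sketch, using asyncio.sleep as a stand-in for a network wait (the `fetch` coroutine is invented for the demo): three 0.1-second "requests" complete in roughly 0.1 seconds total, not 0.3, on a single thread.

```python
import asyncio
import time

async def fetch(name, delay):
    await asyncio.sleep(delay)   # suspends here; the event loop runs other tasks
    return name

async def main():
    start = time.perf_counter()
    # All three coroutines wait concurrently on one thread:
    results = await asyncio.gather(fetch("a", 0.1), fetch("b", 0.1), fetch("c", 0.1))
    elapsed = time.perf_counter() - start
    print(results, f"{elapsed:.2f}s")   # ~0.1 s total, not 0.3 s
    return results, elapsed

results, elapsed = asyncio.run(main())
```

Replace asyncio.sleep with real CPU work and the benefit disappears: the event loop can only make progress while coroutines are suspended, which is exactly the I/O-bound vs CPU-bound distinction above.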

  • Python's Global Interpreter Lock (GIL) prevents true parallelism with threads: only one thread can execute Python bytecode at a time. This is why Python uses multiprocessing (separate processes, each with its own interpreter) for CPU parallelism, and async for I/O concurrency. Python 3.13 introduces an optional free-threaded build (PEP 703) that removes the GIL, enabling true multi-threaded parallelism.

Scaling Laws

  • Amdahl's law describes the theoretical speedup from parallelising a program. If a fraction \(p\) of the program is parallelisable and the rest \(1 - p\) is serial:
\[\text{Speedup}(n) = \frac{1}{(1-p) + \frac{p}{n}}\]

Amdahl's law: the serial fraction limits maximum speedup — 10% serial means 10x max, no matter how many cores

  • where \(n\) is the number of processors. As \(n \to \infty\), the maximum speedup approaches \(\frac{1}{1-p}\). If 95% of the program is parallel, the maximum speedup is \(\frac{1}{0.05} = 20\times\), regardless of how many cores you add. The serial portion is the bottleneck.
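Plugging numbers into the formula makes the diminishing returns concrete (a quick sketch; the 95%-parallel figures follow directly from the equation above):

```python
def amdahl_speedup(p, n):
    """Speedup with parallel fraction p on n processors."""
    return 1 / ((1 - p) + p / n)

# 95% parallel: even 1024 cores get nowhere near 1024x
for n in (8, 64, 1024):
    print(n, round(amdahl_speedup(0.95, n), 1))
# prints 8 5.9, 64 15.4, 1024 19.6 -- creeping toward the 1/(1-p) = 20 limit
```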

  • This has profound implications for ML: if data loading takes 10% of training time and is serial, adding more GPUs can at most speed up training by 10x. The 10% serial bottleneck limits everything (this is why efficient data pipelines and overlapping compute with I/O matter, chapter 6).

  • Gustafson's law offers a more optimistic perspective. Instead of fixing the problem size and adding processors, it fixes the total time and asks how much more work can be done. If the parallel portion scales with the problem size:

\[\text{Speedup}(n) = 1 - p + p \cdot n\]
  • This is linear in \(n\). The argument: with more processors, we solve bigger problems, not the same problem faster. In ML, this corresponds to increasing the batch size with more GPUs (weak scaling) rather than keeping the batch size fixed (strong scaling).

Coding Tasks (use Colab or a notebook)

  1. Demonstrate a race condition. Two threads increment a shared counter without synchronisation and observe the lost updates.

    import threading
    
    counter = 0
    
    def increment(n):
        global counter
        for _ in range(n):
            counter += 1  # NOT atomic: read, add, write
    
    threads = [threading.Thread(target=increment, args=(100000,)) for _ in range(4)]
    for t in threads: t.start()
    for t in threads: t.join()
    
    print(f"Expected: {4 * 100000}")
    print(f"Actual:   {counter}")
    print(f"Lost updates: {4 * 100000 - counter}")
    

  2. Fix the race condition with a lock and measure the overhead.

    import threading
    import time
    
    lock = threading.Lock()
    counter = 0
    
    def increment_locked(n):
        global counter
        for _ in range(n):
            with lock:
                counter += 1
    
    start = time.time()
    threads = [threading.Thread(target=increment_locked, args=(100000,)) for _ in range(4)]
    for t in threads: t.start()
    for t in threads: t.join()
    elapsed = time.time() - start
    
    print(f"Counter: {counter} (correct: {4 * 100000})")
    print(f"Time with lock: {elapsed:.3f}s")
    

  3. Visualise Amdahl's law. Plot speedup vs number of processors for different parallel fractions.

    import jax.numpy as jnp
    import matplotlib.pyplot as plt
    
    n_procs = jnp.arange(1, 65)
    
    for p, color in [(0.5, "#e74c3c"), (0.9, "#f39c12"), (0.95, "#27ae60"), (0.99, "#3498db")]:
        speedup = 1 / ((1 - p) + p / n_procs)
        plt.plot(n_procs, speedup, color=color, linewidth=2, label=f"p={p}")
        # Max speedup line
        plt.axhline(1 / (1 - p), color=color, linestyle="--", alpha=0.3)
    
    plt.xlabel("Number of processors")
    plt.ylabel("Speedup")
    plt.title("Amdahl's Law: Serial Fraction Limits Speedup")
    plt.legend()
    plt.grid(True)
    plt.show()