Vulkan Compute and Cross-Platform GPU

Vulkan is the only GPU compute API that runs on every major platform: NVIDIA, AMD, and Intel GPUs, Apple devices (via MoltenVK), Android, and even the browser (via WebGPU). This file covers the Vulkan architecture, the compute pipeline, writing compute shaders in GLSL, the full C++ setup for a GPU compute program, shared memory and synchronisation, WebGPU for the browser, and practical ML inference examples.

  • CUDA dominates ML training on NVIDIA hardware. But not every deployment target has an NVIDIA GPU. A mobile app runs on Qualcomm Adreno or ARM Mali GPUs. A web application runs in the browser. A game engine needs to support AMD, Intel, and NVIDIA simultaneously. For all of these, Vulkan is the answer.

  • Vulkan is verbose — a "hello world" compute program is ~300 lines of C++. But this verbosity is the price of explicit control: you manage every GPU resource (memory, pipelines, command buffers) yourself. This control enables maximum performance and portability, at the cost of development speed.

Vulkan Architecture Overview

  • Vulkan is a low-level GPU API created by the Khronos Group (the same organisation behind OpenGL). Unlike CUDA (which hides GPU resource management), Vulkan requires you to explicitly manage:

    • Instance and device: create a Vulkan instance, enumerate available GPUs, and select one.
    • Memory: allocate GPU memory explicitly, specifying the memory type (device-local for speed, host-visible for CPU access).
    • Buffers: create buffer objects that reference the allocated memory.
    • Descriptor sets: bind buffers to shader inputs (like function arguments for the compute shader).
    • Compute pipeline: compile the shader and create a pipeline object.
    • Command buffer: record a sequence of GPU commands (bind pipeline, bind descriptors, dispatch compute).
    • Queue submission: submit the command buffer to the GPU for execution.
    • Synchronisation: fences and barriers to ensure correct ordering.
  • This is radically different from CUDA's cudaMalloc + kernel launch model. In CUDA, the driver handles most of this behind the scenes. In Vulkan, you do it yourself.

Why So Verbose?

  • Vulkan's explicitness exists for two reasons:

    1. Driver simplicity: OpenGL drivers were enormously complex (they had to guess what the application intended and optimise accordingly). Vulkan moves that responsibility to the application, making drivers thinner, more predictable, and easier to implement correctly across vendors.

    2. Performance: explicit control over memory layout, synchronisation, and command batching lets the application make optimal decisions. In CUDA, the driver might insert unnecessary synchronisation. In Vulkan, you synchronise only when you need to.

Compute Shaders in GLSL

  • A compute shader is a program that runs on the GPU, similar to a CUDA kernel. It is written in GLSL (OpenGL Shading Language) and compiled to SPIR-V bytecode (a portable binary format).

Vector Addition

// add.comp — compile with: glslangValidator -V add.comp -o add.spv
#version 450

// Workgroup size: 256 invocations per workgroup (= threads per block in CUDA)
layout(local_size_x = 256) in;

// Buffer bindings (like kernel arguments)
layout(set = 0, binding = 0) buffer InputA { float a[]; };
layout(set = 0, binding = 1) buffer InputB { float b[]; };
layout(set = 0, binding = 2) buffer Output { float c[]; };

// Push constant: small uniform data (like a kernel parameter)
layout(push_constant) uniform PushConstants {
    uint n;  // number of elements
};

void main() {
    uint idx = gl_GlobalInvocationID.x;  // global thread index
    if (idx < n) {
        c[idx] = a[idx] + b[idx];
    }
}
  • Mapping to CUDA concepts:

| Vulkan | CUDA | Meaning |
| --- | --- | --- |
| Workgroup | Block | Group of threads that can share memory |
| Invocation | Thread | Single execution unit |
| gl_GlobalInvocationID | blockIdx * blockDim + threadIdx | Global thread index |
| gl_LocalInvocationID | threadIdx | Thread index within workgroup |
| gl_WorkGroupID | blockIdx | Workgroup index |
| local_size_x | blockDim.x | Threads per workgroup |
| Storage buffer | Global memory | Read/write GPU memory |
| Shared memory (shared) | __shared__ | Per-workgroup fast memory |
| Push constant | Kernel argument | Small uniform data |

ReLU with Shared Memory

// relu_shared.comp
#version 450

layout(local_size_x = 256) in;

layout(set = 0, binding = 0) buffer Input  { float input_data[]; };
layout(set = 0, binding = 1) buffer Output { float output_data[]; };

layout(push_constant) uniform PushConstants { uint n; };

// Shared memory (equivalent to CUDA __shared__)
shared float tile[256];

void main() {
    uint gid = gl_GlobalInvocationID.x;
    uint lid = gl_LocalInvocationID.x;

    // Load into shared memory
    if (gid < n) {
        tile[lid] = input_data[gid];
    }

    // Barrier: wait for all invocations in workgroup to finish loading
    barrier();  // equivalent to CUDA __syncthreads()

    // Compute ReLU
    if (gid < n) {
        output_data[gid] = max(tile[lid], 0.0);
    }
}
  • For ReLU, shared memory is not strictly necessary (the operation is element-wise). But this demonstrates the pattern: load to shared memory → barrier → compute → store. For operations that need data from neighbouring threads (convolution, reduction, softmax), shared memory is essential.

Parallel Reduction (Sum)

// reduce_sum.comp
#version 450

layout(local_size_x = 256) in;

layout(set = 0, binding = 0) buffer Input  { float input_data[]; };
layout(set = 0, binding = 1) buffer Output { float partial_sums[]; };

layout(push_constant) uniform PushConstants { uint n; };

shared float sdata[256];

void main() {
    uint gid = gl_GlobalInvocationID.x;
    uint lid = gl_LocalInvocationID.x;
    uint wgid = gl_WorkGroupID.x;

    // Load into shared memory
    sdata[lid] = (gid < n) ? input_data[gid] : 0.0;
    barrier();

    // Tree reduction within the workgroup
    for (uint stride = 128; stride > 0; stride >>= 1) {
        if (lid < stride) {
            sdata[lid] += sdata[lid + stride];
        }
        barrier();
    }

    // Thread 0 writes the workgroup's partial sum
    if (lid == 0) {
        partial_sums[wgid] = sdata[0];
    }
}
  • This is the classic parallel reduction pattern (same as CUDA). Each workgroup produces one partial sum. A second dispatch reduces the partial sums to a final result. The tree reduction halves the active threads at each step: 256 → 128 → 64 → ... → 1.

Matrix Multiply with Tiling

// matmul_tiled.comp
#version 450

#define TILE_SIZE 16

layout(local_size_x = TILE_SIZE, local_size_y = TILE_SIZE) in;

layout(set = 0, binding = 0) buffer MatA { float A[]; };
layout(set = 0, binding = 1) buffer MatB { float B[]; };
layout(set = 0, binding = 2) buffer MatC { float C[]; };

layout(push_constant) uniform PushConstants {
    uint M, N, K;
};

shared float tileA[TILE_SIZE][TILE_SIZE];
shared float tileB[TILE_SIZE][TILE_SIZE];

void main() {
    uint row = gl_GlobalInvocationID.y;
    uint col = gl_GlobalInvocationID.x;
    uint lr = gl_LocalInvocationID.y;
    uint lc = gl_LocalInvocationID.x;

    float sum = 0.0;

    for (uint t = 0; t < (K + TILE_SIZE - 1) / TILE_SIZE; t++) {
        // Load tile of A and B into shared memory
        uint aCol = t * TILE_SIZE + lc;
        uint bRow = t * TILE_SIZE + lr;

        tileA[lr][lc] = (row < M && aCol < K) ? A[row * K + aCol] : 0.0;
        tileB[lr][lc] = (bRow < K && col < N) ? B[bRow * N + col] : 0.0;

        barrier();

        // Compute partial dot product
        for (uint k = 0; k < TILE_SIZE; k++) {
            sum += tileA[lr][k] * tileB[k][lc];
        }

        barrier();
    }

    if (row < M && col < N) {
        C[row * N + col] = sum;
    }
}
  • This is the same tiling algorithm as the CUDA version (file 04), just in GLSL syntax. The concepts are identical: load tiles into shared memory, barrier, compute, barrier, repeat.

The C++ Vulkan Setup

  • The compute shader is the easy part. The hard part is the C++ boilerplate that creates the Vulkan instance, allocates memory, binds buffers, and submits commands. Here is a condensed version of the full pipeline:
// vulkan_compute.cpp — a minimal but complete Vulkan compute example
// Compile: g++ -O3 -o vulkan_compute vulkan_compute.cpp -lvulkan
// Requires: Vulkan SDK installed, add.spv compiled from add.comp

#include <vulkan/vulkan.h>
#include <iostream>
#include <vector>
#include <fstream>
#include <cassert>

// Helper: read SPIR-V file
std::vector<uint32_t> readSPIRV(const std::string& filename) {
    std::ifstream file(filename, std::ios::ate | std::ios::binary);
    assert(file.is_open() && "failed to open SPIR-V file");
    size_t fileSize = file.tellg();
    std::vector<uint32_t> buffer(fileSize / sizeof(uint32_t));
    file.seekg(0);
    file.read(reinterpret_cast<char*>(buffer.data()), fileSize);
    return buffer;
}

int main() {
    const uint32_t N = 1024;
    const size_t bufferSize = N * sizeof(float);

    // ========== 1. Create Vulkan Instance ==========
    VkApplicationInfo appInfo{};
    appInfo.sType = VK_STRUCTURE_TYPE_APPLICATION_INFO;
    appInfo.apiVersion = VK_API_VERSION_1_2;

    VkInstanceCreateInfo instanceInfo{};
    instanceInfo.sType = VK_STRUCTURE_TYPE_INSTANCE_CREATE_INFO;
    instanceInfo.pApplicationInfo = &appInfo;

    VkInstance instance;
    vkCreateInstance(&instanceInfo, nullptr, &instance);

    // ========== 2. Select Physical Device (GPU) ==========
    uint32_t deviceCount = 0;
    vkEnumeratePhysicalDevices(instance, &deviceCount, nullptr);
    std::vector<VkPhysicalDevice> devices(deviceCount);
    vkEnumeratePhysicalDevices(instance, &deviceCount, devices.data());
    VkPhysicalDevice physicalDevice = devices[0];  // use first GPU

    // Print GPU name
    VkPhysicalDeviceProperties props;
    vkGetPhysicalDeviceProperties(physicalDevice, &props);
    std::cout << "Using GPU: " << props.deviceName << "\n";

    // ========== 3. Find Compute Queue Family ==========
    uint32_t queueFamilyCount = 0;
    vkGetPhysicalDeviceQueueFamilyProperties(physicalDevice, &queueFamilyCount, nullptr);
    std::vector<VkQueueFamilyProperties> queueFamilies(queueFamilyCount);
    vkGetPhysicalDeviceQueueFamilyProperties(physicalDevice, &queueFamilyCount, queueFamilies.data());

    uint32_t computeFamily = 0;
    for (uint32_t i = 0; i < queueFamilyCount; i++) {
        if (queueFamilies[i].queueFlags & VK_QUEUE_COMPUTE_BIT) {
            computeFamily = i;
            break;
        }
    }

    // ========== 4. Create Logical Device and Queue ==========
    float queuePriority = 1.0f;
    VkDeviceQueueCreateInfo queueInfo{};
    queueInfo.sType = VK_STRUCTURE_TYPE_DEVICE_QUEUE_CREATE_INFO;
    queueInfo.queueFamilyIndex = computeFamily;
    queueInfo.queueCount = 1;
    queueInfo.pQueuePriorities = &queuePriority;

    VkDeviceCreateInfo deviceInfo{};
    deviceInfo.sType = VK_STRUCTURE_TYPE_DEVICE_CREATE_INFO;
    deviceInfo.queueCreateInfoCount = 1;
    deviceInfo.pQueueCreateInfos = &queueInfo;

    VkDevice device;
    vkCreateDevice(physicalDevice, &deviceInfo, nullptr, &device);

    VkQueue computeQueue;
    vkGetDeviceQueue(device, computeFamily, 0, &computeQueue);

    // ========== 5. Allocate Buffers (A, B, C) ==========
    // For brevity, this uses host-visible memory (slower but simpler)
    auto createBuffer = [&](VkBuffer& buffer, VkDeviceMemory& memory) {
        VkBufferCreateInfo bufInfo{};
        bufInfo.sType = VK_STRUCTURE_TYPE_BUFFER_CREATE_INFO;
        bufInfo.size = bufferSize;
        bufInfo.usage = VK_BUFFER_USAGE_STORAGE_BUFFER_BIT;
        vkCreateBuffer(device, &bufInfo, nullptr, &buffer);

        VkMemoryRequirements memReqs;
        vkGetBufferMemoryRequirements(device, buffer, &memReqs);

        // Find host-visible memory type
        VkPhysicalDeviceMemoryProperties memProps;
        vkGetPhysicalDeviceMemoryProperties(physicalDevice, &memProps);
        uint32_t memType = 0;
        for (uint32_t i = 0; i < memProps.memoryTypeCount; i++) {
            if ((memReqs.memoryTypeBits & (1 << i)) &&
                (memProps.memoryTypes[i].propertyFlags &
                 (VK_MEMORY_PROPERTY_HOST_VISIBLE_BIT | VK_MEMORY_PROPERTY_HOST_COHERENT_BIT))) {
                memType = i;
                break;
            }
        }

        VkMemoryAllocateInfo allocInfo{};
        allocInfo.sType = VK_STRUCTURE_TYPE_MEMORY_ALLOCATE_INFO;
        allocInfo.allocationSize = memReqs.size;
        allocInfo.memoryTypeIndex = memType;
        vkAllocateMemory(device, &allocInfo, nullptr, &memory);
        vkBindBufferMemory(device, buffer, memory, 0);
    };

    VkBuffer bufA, bufB, bufC;
    VkDeviceMemory memA, memB, memC;
    createBuffer(bufA, memA);
    createBuffer(bufB, memB);
    createBuffer(bufC, memC);

    // ========== 6. Fill Input Buffers ==========
    float* ptrA;
    vkMapMemory(device, memA, 0, bufferSize, 0, (void**)&ptrA);
    for (uint32_t i = 0; i < N; i++) ptrA[i] = 1.0f;
    vkUnmapMemory(device, memA);

    float* ptrB;
    vkMapMemory(device, memB, 0, bufferSize, 0, (void**)&ptrB);
    for (uint32_t i = 0; i < N; i++) ptrB[i] = 2.0f;
    vkUnmapMemory(device, memB);

    // ========== 7. Create Compute Pipeline ==========
    auto spirvCode = readSPIRV("add.spv");
    VkShaderModuleCreateInfo shaderInfo{};
    shaderInfo.sType = VK_STRUCTURE_TYPE_SHADER_MODULE_CREATE_INFO;
    shaderInfo.codeSize = spirvCode.size() * sizeof(uint32_t);
    shaderInfo.pCode = spirvCode.data();
    VkShaderModule shaderModule;
    vkCreateShaderModule(device, &shaderInfo, nullptr, &shaderModule);

    // Descriptor set layout (tells Vulkan about the buffer bindings)
    VkDescriptorSetLayoutBinding bindings[3] = {};
    for (int i = 0; i < 3; i++) {
        bindings[i].binding = i;
        bindings[i].descriptorType = VK_DESCRIPTOR_TYPE_STORAGE_BUFFER;
        bindings[i].descriptorCount = 1;
        bindings[i].stageFlags = VK_SHADER_STAGE_COMPUTE_BIT;
    }

    VkDescriptorSetLayoutCreateInfo layoutInfo{};
    layoutInfo.sType = VK_STRUCTURE_TYPE_DESCRIPTOR_SET_LAYOUT_CREATE_INFO;
    layoutInfo.bindingCount = 3;
    layoutInfo.pBindings = bindings;
    VkDescriptorSetLayout descLayout;
    vkCreateDescriptorSetLayout(device, &layoutInfo, nullptr, &descLayout);

    // Push constant range
    VkPushConstantRange pushRange{};
    pushRange.stageFlags = VK_SHADER_STAGE_COMPUTE_BIT;
    pushRange.offset = 0;
    pushRange.size = sizeof(uint32_t);

    // Pipeline layout
    VkPipelineLayoutCreateInfo pipeLayoutInfo{};
    pipeLayoutInfo.sType = VK_STRUCTURE_TYPE_PIPELINE_LAYOUT_CREATE_INFO;
    pipeLayoutInfo.setLayoutCount = 1;
    pipeLayoutInfo.pSetLayouts = &descLayout;
    pipeLayoutInfo.pushConstantRangeCount = 1;
    pipeLayoutInfo.pPushConstantRanges = &pushRange;
    VkPipelineLayout pipelineLayout;
    vkCreatePipelineLayout(device, &pipeLayoutInfo, nullptr, &pipelineLayout);

    // Compute pipeline
    VkComputePipelineCreateInfo pipeInfo{};
    pipeInfo.sType = VK_STRUCTURE_TYPE_COMPUTE_PIPELINE_CREATE_INFO;
    pipeInfo.stage.sType = VK_STRUCTURE_TYPE_PIPELINE_SHADER_STAGE_CREATE_INFO;
    pipeInfo.stage.stage = VK_SHADER_STAGE_COMPUTE_BIT;
    pipeInfo.stage.module = shaderModule;
    pipeInfo.stage.pName = "main";
    pipeInfo.layout = pipelineLayout;
    VkPipeline pipeline;
    vkCreateComputePipelines(device, VK_NULL_HANDLE, 1, &pipeInfo, nullptr, &pipeline);

    // ========== 8. Descriptor Set (bind buffers to shader) ==========
    VkDescriptorPoolSize poolSize{};
    poolSize.type = VK_DESCRIPTOR_TYPE_STORAGE_BUFFER;
    poolSize.descriptorCount = 3;

    VkDescriptorPoolCreateInfo poolInfo{};
    poolInfo.sType = VK_STRUCTURE_TYPE_DESCRIPTOR_POOL_CREATE_INFO;
    poolInfo.maxSets = 1;
    poolInfo.poolSizeCount = 1;
    poolInfo.pPoolSizes = &poolSize;
    VkDescriptorPool descPool;
    vkCreateDescriptorPool(device, &poolInfo, nullptr, &descPool);

    VkDescriptorSetAllocateInfo descAllocInfo{};
    descAllocInfo.sType = VK_STRUCTURE_TYPE_DESCRIPTOR_SET_ALLOCATE_INFO;
    descAllocInfo.descriptorPool = descPool;
    descAllocInfo.descriptorSetCount = 1;
    descAllocInfo.pSetLayouts = &descLayout;
    VkDescriptorSet descSet;
    vkAllocateDescriptorSets(device, &descAllocInfo, &descSet);

    // Write buffer references into the descriptor set
    VkDescriptorBufferInfo bufInfos[3] = {
        {bufA, 0, bufferSize}, {bufB, 0, bufferSize}, {bufC, 0, bufferSize}
    };
    VkWriteDescriptorSet writes[3] = {};
    for (int i = 0; i < 3; i++) {
        writes[i].sType = VK_STRUCTURE_TYPE_WRITE_DESCRIPTOR_SET;
        writes[i].dstSet = descSet;
        writes[i].dstBinding = i;
        writes[i].descriptorCount = 1;
        writes[i].descriptorType = VK_DESCRIPTOR_TYPE_STORAGE_BUFFER;
        writes[i].pBufferInfo = &bufInfos[i];
    }
    vkUpdateDescriptorSets(device, 3, writes, 0, nullptr);

    // ========== 9. Record and Submit Command Buffer ==========
    VkCommandPoolCreateInfo cmdPoolInfo{};
    cmdPoolInfo.sType = VK_STRUCTURE_TYPE_COMMAND_POOL_CREATE_INFO;
    cmdPoolInfo.queueFamilyIndex = computeFamily;
    VkCommandPool cmdPool;
    vkCreateCommandPool(device, &cmdPoolInfo, nullptr, &cmdPool);

    VkCommandBufferAllocateInfo cmdAllocInfo{};
    cmdAllocInfo.sType = VK_STRUCTURE_TYPE_COMMAND_BUFFER_ALLOCATE_INFO;
    cmdAllocInfo.commandPool = cmdPool;
    cmdAllocInfo.level = VK_COMMAND_BUFFER_LEVEL_PRIMARY;
    cmdAllocInfo.commandBufferCount = 1;
    VkCommandBuffer cmdBuf;
    vkAllocateCommandBuffers(device, &cmdAllocInfo, &cmdBuf);

    VkCommandBufferBeginInfo beginInfo{};
    beginInfo.sType = VK_STRUCTURE_TYPE_COMMAND_BUFFER_BEGIN_INFO;
    vkBeginCommandBuffer(cmdBuf, &beginInfo);

    vkCmdBindPipeline(cmdBuf, VK_PIPELINE_BIND_POINT_COMPUTE, pipeline);
    vkCmdBindDescriptorSets(cmdBuf, VK_PIPELINE_BIND_POINT_COMPUTE,
                            pipelineLayout, 0, 1, &descSet, 0, nullptr);
    vkCmdPushConstants(cmdBuf, pipelineLayout, VK_SHADER_STAGE_COMPUTE_BIT,
                       0, sizeof(uint32_t), &N);
    vkCmdDispatch(cmdBuf, (N + 255) / 256, 1, 1);  // launch workgroups

    vkEndCommandBuffer(cmdBuf);

    // Submit
    VkFenceCreateInfo fenceInfo{};
    fenceInfo.sType = VK_STRUCTURE_TYPE_FENCE_CREATE_INFO;
    VkFence fence;
    vkCreateFence(device, &fenceInfo, nullptr, &fence);

    VkSubmitInfo submitInfo{};
    submitInfo.sType = VK_STRUCTURE_TYPE_SUBMIT_INFO;
    submitInfo.commandBufferCount = 1;
    submitInfo.pCommandBuffers = &cmdBuf;
    vkQueueSubmit(computeQueue, 1, &submitInfo, fence);
    vkWaitForFences(device, 1, &fence, VK_TRUE, UINT64_MAX);

    // ========== 10. Read Results ==========
    float* ptrC;
    vkMapMemory(device, memC, 0, bufferSize, 0, (void**)&ptrC);
    std::cout << "Results: c[0]=" << ptrC[0] << " c[1]=" << ptrC[1]
              << " (expected 3.0)\n";
    bool correct = true;
    for (uint32_t i = 0; i < N; i++) {
        if (ptrC[i] != 3.0f) { correct = false; break; }
    }
    std::cout << (correct ? "ALL CORRECT" : "ERRORS FOUND") << "\n";
    vkUnmapMemory(device, memC);

    // ========== Cleanup (abbreviated) ==========
    vkDestroyFence(device, fence, nullptr);
    vkDestroyCommandPool(device, cmdPool, nullptr);
    vkDestroyPipeline(device, pipeline, nullptr);
    vkDestroyPipelineLayout(device, pipelineLayout, nullptr);
    vkDestroyDescriptorPool(device, descPool, nullptr);
    vkDestroyDescriptorSetLayout(device, descLayout, nullptr);
    vkDestroyShaderModule(device, shaderModule, nullptr);
    vkDestroyBuffer(device, bufA, nullptr); vkFreeMemory(device, memA, nullptr);
    vkDestroyBuffer(device, bufB, nullptr); vkFreeMemory(device, memB, nullptr);
    vkDestroyBuffer(device, bufC, nullptr); vkFreeMemory(device, memC, nullptr);
    vkDestroyDevice(device, nullptr);
    vkDestroyInstance(instance, nullptr);

    return 0;
}
  • Yes, this is ~200 lines for vector addition. Compare to CUDA's ~30 lines. This is the cost of explicitness. But notice: every line has a purpose. There are no hidden driver decisions, no implicit synchronisation, no surprise allocations. You control everything.

  • In practice, you would wrap this boilerplate in a helper library (or use an existing one like vk-bootstrap, VMA for memory allocation, or kompute for ML-focused Vulkan compute).

Kompute: Simplified Vulkan for ML

  • Kompute is an open-source C++ library that wraps Vulkan's boilerplate for GPU compute. The same vector addition becomes:
#include <kompute/Kompute.hpp>

int main() {
    kp::Manager mgr;

    auto tensorA = mgr.tensor({1, 1, 1, 1, 1});
    auto tensorB = mgr.tensor({2, 2, 2, 2, 2});
    auto tensorC = mgr.tensor({0, 0, 0, 0, 0});

    std::string shader = R"(
        #version 450
        layout(local_size_x = 1) in;
        layout(set=0, binding=0) buffer A { float a[]; };
        layout(set=0, binding=1) buffer B { float b[]; };
        layout(set=0, binding=2) buffer C { float c[]; };
        void main() {
            uint i = gl_GlobalInvocationID.x;
            c[i] = a[i] + b[i];
        }
    )";

    // compileSource is a small helper (as in the Kompute examples) that runs
    // glslang to turn the GLSL source into SPIR-V (a std::vector<uint32_t>).
    auto algorithm = mgr.algorithm({tensorA, tensorB, tensorC},
                                   compileSource(shader));

    mgr.sequence()
        ->record<kp::OpTensorSyncDevice>({tensorA, tensorB, tensorC})
        ->record<kp::OpAlgoDispatch>(algorithm)
        ->record<kp::OpTensorSyncLocal>({tensorC})
        ->eval();

    // tensorC now contains [3, 3, 3, 3, 3]
}
  • Much more readable. Kompute handles instance creation, device selection, memory allocation, descriptor sets, and command buffer management. You focus on the shader and the data.

WebGPU: GPU Compute in the Browser

  • WebGPU is the successor to WebGL, providing modern GPU access from JavaScript. It is built on Vulkan (Linux/Android), Metal (macOS/iOS), and DirectX 12 (Windows), abstracting the platform differences.

  • WebGPU uses WGSL (WebGPU Shading Language) instead of GLSL:

// add.wgsl — WebGPU compute shader
@group(0) @binding(0) var<storage, read> a: array<f32>;
@group(0) @binding(1) var<storage, read> b: array<f32>;
@group(0) @binding(2) var<storage, read_write> c: array<f32>;

@compute @workgroup_size(256)
fn main(@builtin(global_invocation_id) id: vec3<u32>) {
    let i = id.x;
    c[i] = a[i] + b[i];
}
  • JavaScript setup (condensed):
const adapter = await navigator.gpu.requestAdapter();
const device = await adapter.requestDevice();

// Create buffers
const bufferA = device.createBuffer({ size: N * 4, usage: GPUBufferUsage.STORAGE, mappedAtCreation: true });
new Float32Array(bufferA.getMappedRange()).fill(1.0);
bufferA.unmap();

// ... (similar for B and C)

// Create pipeline from WGSL shader
const pipeline = device.createComputePipeline({
    layout: 'auto',
    compute: { module: device.createShaderModule({ code: wgslSource }), entryPoint: 'main' }
});

// Dispatch
const encoder = device.createCommandEncoder();
const pass = encoder.beginComputePass();
pass.setPipeline(pipeline);
pass.setBindGroup(0, bindGroup);
pass.dispatchWorkgroups(Math.ceil(N / 256));
pass.end();
device.queue.submit([encoder.finish()]);

// Reading c back takes one more step: create bufferC with GPUBufferUsage.COPY_SRC,
// copy it into a staging buffer (GPUBufferUsage.MAP_READ | GPUBufferUsage.COPY_DST),
// then await staging.mapAsync(GPUMapMode.READ).
  • Why WebGPU matters for ML: running inference in the browser means no server cost, no network latency, and user data never leaves the device. Libraries like ONNX Runtime Web and Transformers.js use WebGPU to run models (including small LLMs) entirely client-side.

When to Use Vulkan

| Scenario | Use Vulkan? | Why / Alternative |
| --- | --- | --- |
| ML training | No | CUDA/Triton is simpler and faster on NVIDIA |
| Inference on NVIDIA GPUs | No | TensorRT or CUDA is better |
| Inference on AMD/Intel GPUs | Yes | Only cross-vendor GPU compute option |
| Mobile inference (Android) | Yes | Vulkan is the standard GPU API on Android |
| Mobile inference (iOS) | No | Use Metal directly (MoltenVK adds overhead) |
| Browser inference | WebGPU | Built on Vulkan/Metal/DX12 |
| Game engine + ML | Yes | Engine already uses Vulkan for rendering |
| Cross-platform library | Yes | One codebase for all GPU vendors |
| Learning GPU programming | Maybe | CUDA is easier to start with; Vulkan teaches more |

Coding Tasks (compile with g++ -lvulkan, requires Vulkan SDK)

  1. Compile and run the vector addition example above. Modify the shader to compute c[i] = a[i] * b[i] + a[i] (fused multiply-add) and verify the results.

  2. Write a compute shader that applies softmax to a row of data using shared memory for the reduction steps (max and sum). Test with known values.

// softmax.comp — compile with: glslangValidator -V softmax.comp -o softmax.spv
// Handles one row of up to WG_SIZE (256) elements in a single workgroup.
#version 450

#define WG_SIZE 256

layout(local_size_x = WG_SIZE) in;

layout(set = 0, binding = 0) buffer Input  { float input_data[]; };
layout(set = 0, binding = 1) buffer Output { float output_data[]; };

layout(push_constant) uniform PC { uint n; };

shared float sdata[WG_SIZE];

void main() {
    uint gid = gl_GlobalInvocationID.x;
    uint lid = gl_LocalInvocationID.x;

    // Step 1: find max (for numerical stability)
    sdata[lid] = (gid < n) ? input_data[gid] : -1e30;
    barrier();
    for (uint s = WG_SIZE / 2; s > 0; s >>= 1) {
        if (lid < s) sdata[lid] = max(sdata[lid], sdata[lid + s]);
        barrier();
    }
    float maxVal = sdata[0];
    barrier();

    // Step 2: compute exp(x - max)
    float expVal = (gid < n) ? exp(input_data[gid] - maxVal) : 0.0;
    sdata[lid] = expVal;
    barrier();

    // Step 3: sum of exp values
    for (uint s = WG_SIZE / 2; s > 0; s >>= 1) {
        if (lid < s) sdata[lid] += sdata[lid + s];
        barrier();
    }
    float sumExp = sdata[0];

    // Step 4: normalise
    if (gid < n) {
        output_data[gid] = expVal / sumExp;
    }
}
  3. Modify the C++ host code to benchmark the compute shader: time the dispatch (excluding setup) using Vulkan timestamp queries or CPU-side fences, and compute the achieved bandwidth in GB/s.