WGPU Backend
The WGPU backend provides cross-platform GPU compute acceleration using the WebGPU standard. It is layered on Metal, Vulkan, DirectX 12, and OpenGL, making it a strong choice for portable high-performance computing.
Features
- Cross-Platform: Works on Windows, macOS, Linux, iOS, Android, and Web
- Multiple APIs: Supports Vulkan, Metal, DX12, DX11, OpenGL ES, and WebGL
- Compute Shaders: Uses WGSL (WebGPU Shading Language) for parallel operations
- Memory Efficient: GPU buffer management with automatic cleanup
- Future-Proof: Based on the emerging WebGPU standard
Installation
Enable the WGPU backend with the feature flag:
[dependencies]
tensor_frame = { version = "0.0.3-alpha", features = ["wgpu"] }
Additional Dependencies:
- No platform-specific GPU drivers required
- Uses system graphics drivers (Metal, Vulkan, DirectX, OpenGL)
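With the feature enabled, tensor construction is unchanged and data is moved to the GPU explicitly. A minimal sketch, assuming the crate exports Tensor and BackendType at the top level as used throughout this page:
use tensor_frame::{BackendType, Tensor};
// Build on the default backend, then move the data to the GPU.
let t = Tensor::ones(vec![1024, 1024])?;
let gpu_t = t.to_backend(BackendType::Wgpu)?;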
System Requirements
Minimum Requirements
- GPU: Any GPU with compute shader support
- Driver: Up-to-date graphics drivers
- Memory: Sufficient GPU memory for tensor data
Supported Platforms
| Platform | Graphics API | Status |
|---|---|---|
| Windows | DirectX 12, Vulkan | ✅ Full support |
| Windows | DirectX 11 | ✅ Fallback support |
| macOS | Metal | ✅ Full support |
| Linux | Vulkan | ✅ Full support |
| Linux | OpenGL ES | ⚠️ Limited support |
| iOS | Metal | ✅ Full support |
| Android | Vulkan, OpenGL ES | ✅ Full support |
| Web | WebGPU, WebGL2 | ⚠️ Experimental |
Implementation Details
Storage
WGPU tensors use GPU buffers for data storage:
pub struct WgpuStorage {
    pub buffer: Arc<wgpu::Buffer>, // GPU buffer handle
}
Buffer Properties:
- Location: GPU memory (VRAM)
- Layout: Contiguous, row-major layout
- Usage: Storage buffers with compute shader access
- Synchronization: Automatic via command queue
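For reference, a buffer with these properties corresponds roughly to the following raw wgpu allocation (an illustrative sketch, not the backend's actual code; device is an already-created wgpu::Device and len the element count):
// Storage buffer for `len` f32 elements, copyable in both directions.
let buffer = device.create_buffer(&wgpu::BufferDescriptor {
    label: Some("tensor_storage"),
    size: (len * std::mem::size_of::<f32>()) as u64,
    usage: wgpu::BufferUsages::STORAGE
        | wgpu::BufferUsages::COPY_SRC
        | wgpu::BufferUsages::COPY_DST,
    mapped_at_creation: false,
});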
Compute Shaders
Operations are implemented as WGSL compute shaders loaded from external files in src/shaders/:
- add.wgsl - Element-wise addition
- sub.wgsl - Element-wise subtraction
- mul.wgsl - Element-wise multiplication
- div.wgsl - Element-wise division with IEEE 754 compliance
// Example: Element-wise addition shader (add.wgsl)
@group(0) @binding(0) var<storage, read> input_a: array<f32>;
@group(0) @binding(1) var<storage, read> input_b: array<f32>;
@group(0) @binding(2) var<storage, read_write> output: array<f32>;
@compute @workgroup_size(64)
fn main(@builtin(global_invocation_id) global_id: vec3<u32>) {
    let index = global_id.x;
    if (index >= arrayLength(&input_a)) {
        return;
    }
    output[index] = input_a[index] + input_b[index];
}
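Compiling such a shader and wrapping it in a compute pipeline looks roughly like this on the Rust side (a sketch; exact ComputePipelineDescriptor fields vary slightly between wgpu versions):
// Compile the WGSL source into a shader module.
let module = device.create_shader_module(wgpu::ShaderModuleDescriptor {
    label: Some("add"),
    source: wgpu::ShaderSource::Wgsl(include_str!("shaders/add.wgsl").into()),
});
// Build a compute pipeline around the shader's entry point.
let pipeline = device.create_compute_pipeline(&wgpu::ComputePipelineDescriptor {
    label: Some("add_pipeline"),
    layout: None, // infer the bind group layout from the shader
    module: &module,
    entry_point: Some("main"),
    compilation_options: Default::default(),
    cache: None,
});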
Parallelization
- Workgroups: Operations dispatched in parallel workgroups
- Thread Count: Automatically sized based on tensor dimensions
- GPU Utilization: Optimized for high occupancy on modern GPUs
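Concretely, the thread count is one invocation per element, rounded up to whole workgroups of 64. A sketch of the dispatch, assuming pipeline, bind_group, and a command encoder are already set up:
// One invocation per element, rounded up to whole workgroups of 64.
let workgroups = (n_elements as u32).div_ceil(64);
let mut pass = encoder.begin_compute_pass(&wgpu::ComputePassDescriptor::default());
pass.set_pipeline(&pipeline);
pass.set_bind_group(0, &bind_group, &[]);
pass.dispatch_workgroups(workgroups, 1, 1);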
Performance Characteristics
Strengths
- Massive Parallelism: Thousands of parallel threads
- High Throughput: Excellent for large tensor operations
- Memory Bandwidth: High GPU memory bandwidth utilization
- Compute Density: Specialized compute units for arithmetic operations
Limitations
- Latency: GPU command submission and synchronization overhead
- Memory Transfer: CPU-GPU data transfers can be expensive
- Limited Precision: Currently only supports f32 operations
- Shader Compilation: First-use compilation overhead
Performance Guidelines
Optimal Use Cases
// Large tensor operations (> 10K elements)
let large_a = Tensor::zeros(vec![2048, 2048])?;
let large_b = Tensor::zeros(vec![2048, 2048])?;
let large_c = Tensor::zeros(vec![2048, 2048])?;
let result = (large_a * large_b) + large_c;

// Repeated operations on same-sized tensors
for batch in batches {
    let output = model.forward(batch)?; // Shader programs cached
}

// Element-wise operations with complex expressions
let result = ((a * b) + c).sqrt(); // Single GPU kernel
Suboptimal Use Cases
// Very small tensors
let small = Tensor::ones(vec![10, 10])?; // GPU overhead dominates
// Frequent CPU-GPU transfers
let gpu_tensor = cpu_tensor.to_backend(BackendType::Wgpu)?;
let back_to_cpu = gpu_tensor.to_vec()?; // Expensive transfers
// Scalar operations
let sum = tensor.sum(None)?; // Result copied back to CPU
Memory Management
GPU Memory Allocation
WGPU automatically manages GPU memory:
let tensor = Tensor::zeros(vec![2048, 2048])?; // Allocates ~16MB GPU memory
- Memory Pool: WGPU uses internal memory pools for efficient allocation
- Garbage Collection: Buffers are automatically freed when the last reference is dropped
- Fragmentation: Large allocations may fail even with sufficient total memory
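The ~16MB figure above is simply element count times element size, which is worth estimating before large allocations:
// f32 tensors: bytes = rows * cols * 4
let bytes = 2048usize * 2048 * std::mem::size_of::<f32>();
assert_eq!(bytes, 16 * 1024 * 1024); // exactly 16 MiB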
Memory Transfer Patterns
// Efficient: Create on GPU
let gpu_tensor = Tensor::zeros(vec![1000, 1000])?
    .to_backend(BackendType::Wgpu)?;

// Inefficient: Frequent transfers
let result = cpu_data.to_backend(BackendType::Wgpu)?
    .sum(None)?
    .to_backend(BackendType::Cpu)?; // Multiple transfers
Memory Debugging
Monitor GPU memory usage:
// Check GPU memory limits
let limits = device.limits();
println!("Max buffer size: {} MB", limits.max_buffer_size / (1024*1024));
// Handle out-of-memory errors
match Tensor::zeros(vec![16384, 16384]) {
    Ok(tensor) => println!("Allocated 1GB GPU tensor"),
    Err(TensorError::BackendError(msg)) if msg.contains("memory") => {
        eprintln!("GPU out of memory, trying smaller size");
    }
    Err(e) => eprintln!("Other error: {}", e),
}
Debugging and Profiling
Shader Debugging
WGPU provides validation and debugging features:
// Enable validation (debug builds)
let instance = wgpu::Instance::new(&wgpu::InstanceDescriptor {
    backends: wgpu::Backends::all(),
    flags: wgpu::InstanceFlags::DEBUG | wgpu::InstanceFlags::VALIDATION,
    ..Default::default()
});
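wgpu reports validation and error messages through the standard log crate, so installing any logger makes them visible. For example, with env_logger added as a dependency:
// Route wgpu warnings and validation errors to stderr.
// Run with RUST_LOG=wgpu=info for more detail.
env_logger::init();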
Performance Profiling
Use GPU profiling tools:
Windows (DirectX):
- PIX for Windows
- RenderDoc
- Visual Studio Graphics Diagnostics
macOS (Metal):
- Xcode Instruments (GPU Timeline)
- Metal System Trace
Linux (Vulkan):
- RenderDoc
- Vulkan Tools
Custom Timing
use std::time::Instant;
let start = Instant::now();
let result = gpu_tensor_a + gpu_tensor_b;
// Note: GPU operations are asynchronous!
let _data = result.to_vec()?; // Synchronization point
println!("GPU operation took: {:?}", start.elapsed());
Error Handling
WGPU backend errors can occur at multiple levels:
Device Creation Errors
match WgpuBackend::new() {
    Ok(backend) => println!("WGPU backend ready"),
    Err(TensorError::BackendError(msg)) => {
        eprintln!("WGPU initialization failed: {}", msg);
        // Fall back to the CPU backend here
    }
    Err(e) => eprintln!("Unexpected error: {}", e),
}
Runtime Errors
// Out of GPU memory
let result = Tensor::zeros(vec![100000, 100000]); // May fail
// Shader compilation errors (rare)
let result = custom_operation(tensor); // May fail for invalid shaders
// Device lost (driver reset, etc.)
let result = tensor.sum(None); // May fail if device is lost
Common Error Scenarios:
- Device Not Found: No compatible GPU available
- Out of Memory: GPU memory exhausted
- Driver Issues: Outdated or buggy graphics drivers
- Unsupported Operations: Feature not implemented in WGPU backend
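A common way to handle all of these is to try the GPU and fall back to the CPU backend. A sketch with a hypothetical helper, assuming tensors are cheap to clone (storage is reference-counted):
// Hypothetical helper: prefer the GPU, silently keep the CPU tensor on failure.
fn to_gpu_or_cpu(t: Tensor) -> Tensor {
    match t.clone().to_backend(BackendType::Wgpu) {
        Ok(gpu) => gpu,
        Err(_) => t, // no device, out of memory, etc.
    }
}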
Platform-Specific Notes
Windows
- DirectX 12: Best performance and feature support
- Vulkan: Good alternative if DX12 not available
- DirectX 11: Fallback with limited compute support
macOS
- Metal: Excellent native support and performance
- MoltenVK: Vulkan compatibility layer (not recommended for production)
Linux
- Vulkan: Primary choice with best performance
- OpenGL: Fallback with limited compute features
- Graphics Drivers: Ensure latest Mesa/NVIDIA/AMD drivers
Mobile (iOS/Android)
- iOS: Metal provides excellent mobile GPU performance
- Android: Vulkan on newer devices, OpenGL ES fallback
- Power Management: Be aware of thermal throttling
Web (Experimental)
- WebGPU: Emerging standard with excellent performance potential
- WebGL2: Fallback with compute shader emulation
- Browser Support: Chrome/Edge (flag), Firefox (experimental)
Optimization Tips
Workgroup Size Tuning
// Optimal workgroup sizes depend on GPU architecture
// Current default: 64 threads per workgroup
// NVIDIA: 32 (warp size) or 64
// AMD: 64 (wavefront size)
// Intel: 32 or 64
// Mobile: 16 or 32
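Whatever size is chosen, the dispatch count is the same ceiling division; a small helper for illustration:
// Workgroups needed to cover `elements` invocations at a given workgroup size.
fn workgroup_count(elements: u32, wg_size: u32) -> u32 {
    elements.div_ceil(wg_size)
}

// 2048 x 2048 elements at the default workgroup size of 64:
assert_eq!(workgroup_count(2048 * 2048, 64), 65_536);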
Batch Operations
// Efficient: Batch similar operations
let results: Vec<Tensor> = inputs
    .iter()
    .map(|input| model.forward(input))
    .collect::<Result<Vec<_>, _>>()?;

// Inefficient: Individual operations
for input in inputs {
    let result = model.forward(input)?;
    let cpu_result = result.to_vec()?; // Forces synchronization
}
Memory Layout Optimization
// Ensure tensor shapes are GPU-friendly
let aligned_size = (size + 63) & !63; // Round up to the next multiple of 64
let tensor = Tensor::zeros(vec![aligned_size, aligned_size])?;
Future Developments
The WGPU backend is actively developed with planned improvements:
- Reduction Operations: Sum, mean, and other reductions on GPU
- Advanced Operations: GPU-optimized tensor operations
- Mixed Precision: f16 and bf16 data type support
- Async Operations: Fully asynchronous GPU command queues
- WebGPU Stability: Production-ready web deployment