# Backends Overview
Tensor Frame's backend system provides a pluggable architecture for running tensor operations on different computational devices. This allows the same high-level tensor API to transparently utilize CPU cores, integrated GPUs, discrete GPUs, and specialized accelerators.
## Available Backends
| Backend | Feature Flag | Availability | Best Use Cases |
|---------|--------------|--------------|----------------|
| CPU | `cpu` (default) | Always | Small tensors, development, fallback |
| WGPU | `wgpu` | Cross-platform GPU | Large tensors, cross-platform deployment |
| CUDA | `cuda` | NVIDIA GPUs | High-performance production workloads |
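Each backend is gated behind a Cargo feature. A minimal `Cargo.toml` sketch, assuming the package name matches the crate name used in the examples below; the version number is illustrative, so check crates.io for the current release:

```toml
[dependencies]
# The CPU backend ships by default; enable GPU backends explicitly.
tensor_frame = { version = "0.1", features = ["wgpu", "cuda"] }
```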
## Backend Selection Strategy

### Automatic Selection (Recommended)
By default, Tensor Frame automatically selects the best available backend using this priority order:
1. **CUDA** - Highest performance on NVIDIA hardware
2. **WGPU** - Cross-platform GPU acceleration
3. **CPU** - Universal fallback
```rust
use tensor_frame::Tensor;

// Automatically uses best available backend
let tensor = Tensor::zeros(vec![1000, 1000])?;
println!("Using backend: {:?}", tensor.backend_type());
```
### Manual Backend Control
For specific requirements, you can control backend selection:
```rust
use tensor_frame::backend::{set_backend_priority, BackendType};

// Force CPU-only execution
let backend = set_backend_priority(vec![BackendType::Cpu]);

// Prefer WGPU over CUDA
let backend = set_backend_priority(vec![
    BackendType::Wgpu,
    BackendType::Cuda,
    BackendType::Cpu,
]);
```
### Per-Tensor Backend Conversion
Convert individual tensors between backends:
```rust
use tensor_frame::backend::BackendType;
use tensor_frame::Tensor;

let cpu_tensor = Tensor::ones(vec![100, 100])?;

// Move to GPU
let gpu_tensor = cpu_tensor.to_backend(BackendType::Wgpu)?;

// Move back to CPU
let back_to_cpu = gpu_tensor.to_backend(BackendType::Cpu)?;
```
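A common pattern is to convert only when a tensor is large enough to amortize the transfer cost. A hedged sketch of that pattern: the element count is passed in explicitly, since this page doesn't show a size accessor, and `Result<Tensor, TensorError>` is assumed as the return type of `to_backend`:

```rust
use tensor_frame::backend::BackendType;
use tensor_frame::{Tensor, TensorError};

/// Move a tensor to the WGPU backend only when it is large enough
/// for the GPU to pay off; small tensors stay on the CPU.
fn maybe_to_gpu(t: Tensor, element_count: usize) -> Result<Tensor, TensorError> {
    const GPU_THRESHOLD: usize = 100_000; // tune for your workload
    if element_count >= GPU_THRESHOLD {
        t.to_backend(BackendType::Wgpu)
    } else {
        Ok(t)
    }
}
```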
## Performance Characteristics

### CPU Backend
- **Latency**: Lowest for small operations (< 1ms)
- **Throughput**: Limited by CPU cores and memory bandwidth
- **Memory**: System RAM (typically abundant)
- **Parallelism**: Thread-level via Rayon
- **Overhead**: Minimal function call overhead
### WGPU Backend
- **Latency**: Higher initialization cost (~1-10ms)
- **Throughput**: High for large, parallel operations
- **Memory**: GPU memory (limited but fast)
- **Parallelism**: Massive thread-level parallelism via compute shaders
- **Overhead**: GPU command submission and synchronization
### CUDA Backend
- **Latency**: Moderate initialization cost (~0.1-1ms)
- **Throughput**: Highest for supported operations
- **Memory**: GPU memory with CUDA optimizations
- **Parallelism**: Optimal GPU utilization via cuBLAS/cuDNN
- **Overhead**: Minimal with mature driver stack
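These numbers vary by hardware, so measure on your own machine. A minimal timing sketch using only the API shown on this page plus `std::time`; note that GPU timings may reflect command submission rather than completion unless the result is read back:

```rust
use std::time::Instant;
use tensor_frame::backend::BackendType;
use tensor_frame::{Tensor, TensorError};

fn time_add(label: &str, backend: BackendType) -> Result<(), TensorError> {
    // Allocate both operands on the backend under test.
    let a = Tensor::ones(vec![2048, 2048])?.to_backend(backend)?;
    let b = Tensor::ones(vec![2048, 2048])?.to_backend(backend)?;

    let start = Instant::now();
    let _sum = a + b; // element-wise add on the chosen backend
    println!("{label}: {:?}", start.elapsed());
    Ok(())
}

fn main() -> Result<(), TensorError> {
    time_add("cpu", BackendType::Cpu)?;
    time_add("wgpu", BackendType::Wgpu)?;
    Ok(())
}
```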
## When to Use Each Backend

### CPU Backend
```rust
// Good for:
let small_tensor = Tensor::ones(vec![10, 10])?;  // Small tensors
let dev_tensor = Tensor::zeros(vec![100])?;      // Development/testing
let scalar_ops = tensor.sum(None)?;              // Scalar results

// Avoid for:
// - Large matrix multiplications (> 1000x1000)
// - Batch operations on many tensors
// - Compute-intensive element-wise operations
```
### WGPU Backend
```rust
// Good for:
let large_tensor = Tensor::zeros(vec![2048, 2048])?;  // Large tensors
let batch_ops = tensors.iter().map(|t| t * 2.0);      // Batch operations
let element_wise = (a * b) + c;                       // Complex element-wise

// Consider for:
// - Cross-platform deployment
// - When CUDA is not available
// - Mixed CPU/GPU workloads
```
### CUDA Backend
```rust
// Excellent for:
let huge_tensor = Tensor::zeros(vec![4096, 4096])?;  // Very large tensors
let matrix_mul = a.matmul(&b)?;                      // Matrix operations
let ml_workload = model.forward(input)?;             // ML training/inference

// Best when:
// - NVIDIA GPU available
// - Performance is critical
// - Using alongside other CUDA libraries
```
## Cross-Backend Operations
Operations between tensors on different backends automatically handle conversion:
```rust
let cpu_a = Tensor::ones(vec![1000])?;
let gpu_b = Tensor::zeros(vec![1000])?.to_backend(BackendType::Wgpu)?;

// Automatically converts to common backend
let result = cpu_a + gpu_b; // Runs on CPU backend
```
**Conversion Rules:**
- If both operands are on the same backend, the operation runs on that backend
- If the backends differ, both operands are converted to the most compatible backend before the operation runs
- Compatibility order: CPU > WGPU > CUDA. CPU is the most compatible, so mixed CPU/GPU operations fall back to the CPU backend, as in the example above
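The implicit fallback is convenient, but it can silently move work onto the CPU. If you want a mixed-backend operation to run on the GPU instead, convert the CPU operand explicitly first:

```rust
let cpu_a = Tensor::ones(vec![1000])?;
let gpu_b = Tensor::zeros(vec![1000])?.to_backend(BackendType::Wgpu)?;

// Move the CPU operand over so both sides live on the GPU.
let gpu_a = cpu_a.to_backend(BackendType::Wgpu)?;
let result = gpu_a + gpu_b; // Runs on the WGPU backend
```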
## Memory Management

### Reference Counting
All backends use reference counting for efficient memory management:
```rust
let tensor1 = Tensor::ones(vec![1000, 1000])?;
let tensor2 = tensor1.clone(); // O(1) - just increments reference count

// Memory freed automatically when last reference dropped
```
### Cross-Backend Memory
Converting between backends allocates new memory:
```rust
let cpu_tensor = Tensor::ones(vec![1000])?;                  // 4KB CPU memory
let gpu_tensor = cpu_tensor.to_backend(BackendType::Wgpu)?;  // +4KB GPU memory

// Both tensors exist independently until dropped
```
### Memory Usage Guidelines
- **Development**: Use the CPU backend to avoid GPU memory pressure
- **Production**: Convert to GPU early and minimize cross-backend copies
- **Mixed workloads**: Keep frequently-accessed tensors on the CPU
- **Large datasets**: Stream data through GPU backends one chunk at a time (see the sketch below)
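A hedged sketch of the streaming pattern: the chunk iterator stands in for whatever loading logic produces each CPU-side tensor, and scalar multiplication is applied by reference as in the WGPU example above:

```rust
use tensor_frame::backend::BackendType;
use tensor_frame::{Tensor, TensorError};

/// Process CPU-resident chunks through the GPU one at a time, so only
/// a single chunk occupies GPU memory at any moment.
fn stream_through_gpu(
    chunks: impl Iterator<Item = Tensor>,
) -> Result<Vec<Tensor>, TensorError> {
    let mut results = Vec::new();
    for chunk in chunks {
        let gpu_chunk = chunk.to_backend(BackendType::Wgpu)?;   // upload
        let processed = &gpu_chunk * 2.0;                       // compute on GPU
        results.push(processed.to_backend(BackendType::Cpu)?);  // download
        // gpu_chunk drops here, releasing its GPU memory
    }
    Ok(results)
}
```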
## Error Handling
Backend operations can fail for various reasons:
```rust
use tensor_frame::{Tensor, TensorError};

match Tensor::zeros(vec![100000, 100000]) {
    Ok(tensor) => println!("Created tensor on {:?}", tensor.backend_type()),
    Err(TensorError::BackendError(msg)) => {
        eprintln!("Backend error: {}", msg);
        // Fall back to a smaller size or a different backend
    }
    Err(e) => eprintln!("Other error: {}", e),
}
```
**Common Error Scenarios:**
- **GPU Out of Memory**: Try smaller tensors or the CPU backend
- **Backend Unavailable**: Fall back to the CPU backend (see the sketch below)
- **Feature Not Implemented**: Some operations are only available on certain backends
- **Cross-Backend Type Mismatch**: Ensure compatible data types
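A common way to handle the first two scenarios is a try-GPU-then-fall-back helper. A sketch built only from the calls and the error variant shown above:

```rust
use tensor_frame::backend::BackendType;
use tensor_frame::{Tensor, TensorError};

/// Create a zeroed tensor on the GPU, falling back to the CPU backend
/// if the GPU is unavailable or out of memory.
fn zeros_with_fallback(shape: Vec<usize>) -> Result<Tensor, TensorError> {
    let gpu_attempt = Tensor::zeros(shape.clone())
        .and_then(|t| t.to_backend(BackendType::Wgpu));
    match gpu_attempt {
        Ok(t) => Ok(t),
        Err(TensorError::BackendError(msg)) => {
            eprintln!("GPU path failed ({msg}); falling back to CPU");
            Tensor::zeros(shape)?.to_backend(BackendType::Cpu)
        }
        Err(e) => Err(e),
    }
}
```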
## Backend Implementation Status
| Operation | CPU | WGPU | CUDA |
|-----------|-----|------|------|
| Basic arithmetic (+, -, *, /) | ✅ | ✅ | ✅ |
| Reductions (sum, mean) | ✅ | ❌ | ✅ |
| Reshape, transpose | ✅ | ✅ | ✅ |
| Broadcasting | ✅ | ✅ | ✅ |

✅ = Fully implemented
❌ = Not yet implemented
⚠️ = Partially implemented