# CUDA Backend

The CUDA backend provides high-performance tensor operations on NVIDIA GPUs using the CUDA toolkit. It offers the highest performance of the available backends for supported operations and integrates well with the broader CUDA ecosystem.
## Features

- Peak Performance: The fastest backend for large tensor workloads on NVIDIA hardware
- Optimized Kernels: Hand-tuned, hardware-accelerated tensor operations
- Memory Optimization: Efficient GPU memory management and pooling
- Mature Ecosystem: Integrates with existing CUDA tools and libraries
- Production Ready: Battle-tested in production environments
## Installation

### Prerequisites

CUDA Toolkit: Install NVIDIA CUDA Toolkit 11.0 or later

- Download it from the NVIDIA Developer site
- Ensure `nvcc` is on your `PATH`
- Verify the installation: `nvcc --version`

Compatible GPU: NVIDIA GPU with compute capability 3.5+

- Check the GPU and driver: `nvidia-smi`
- Verify the compute capability: `deviceQuery` (from the CUDA samples)
### Cargo Configuration

Enable the CUDA backend in `Cargo.toml`:

```toml
[dependencies]
tensor_frame = { version = "0.0.3-alpha", features = ["cuda"] }
```

Build requirements:

- CUDA Toolkit installed
- NVIDIA GPU drivers
- C++ compiler (MSVC on Windows, GCC/Clang on Linux)
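A quick smoke test for the build is to move a small tensor onto the GPU and read it back. The sketch below assumes the `Tensor`, `BackendType`, and `Result` items used throughout this chapter are re-exported from the crate root:

```rust
use tensor_frame::{BackendType, Result, Tensor};

fn main() -> Result<()> {
    // Build and run with: cargo run --release --features cuda
    // Moving a tensor to the CUDA backend exercises the toolkit,
    // the driver, and the feature flag in one step.
    let t = Tensor::ones(vec![16, 16])?;
    let gpu = t.to_backend(BackendType::Cuda)?;
    println!("CUDA backend OK ({} elements on GPU)", gpu.to_vec()?.len());
    Ok(())
}
```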
## System Requirements

### Hardware

- GPU: NVIDIA GPU with compute capability 3.5+
- Memory: Enough GPU memory for your tensor workloads
- PCIe: PCIe 3.0 x16 or better recommended for host-device transfer bandwidth

### Software

- CUDA Toolkit: Version 11.0+ (12.0+ recommended)
- Driver: NVIDIA driver version that supports your CUDA Toolkit
- OS: Linux (preferred), Windows 10+, or WSL2
### Verified Configurations

| GPU Generation    | Compute Capability | CUDA Version | Status             |
|-------------------|--------------------|--------------|--------------------|
| Maxwell (GTX 900) | 5.0, 5.2           | 11.0+        | ✅ Supported       |
| Pascal (GTX 10x0) | 6.0, 6.1           | 11.0+        | ✅ Fully supported |
| Volta (V100)      | 7.0                | 11.0+        | ✅ Optimized       |
| Turing (RTX 20x0) | 7.5                | 11.0+        | ✅ Optimized       |
| Ampere (RTX 30x0) | 8.0, 8.6           | 11.2+        | ✅ Optimal         |
| Ada (RTX 40x0)    | 8.9                | 12.0+        | ✅ Latest features |
## Implementation Details

### Storage

CUDA tensors hold their data behind a raw device pointer:

```rust
pub struct CudaStorage {
    pub ptr: *mut f32, // Raw CUDA device pointer
    pub len: usize,    // Buffer length in elements
}
```

Memory properties:

- Location: GPU global memory (VRAM)
- Layout: Contiguous, row-major
- Alignment: 256-byte aligned for optimal memory coalescing
- Synchronization: Explicit, via CUDA streams
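Because the layout is contiguous and row-major, the flat index of an element follows directly from its coordinates. A tiny illustrative sketch (not part of the backend API):

```rust
/// Flat offset of element (row, col) in a contiguous row-major `rows x cols` buffer.
fn row_major_offset(row: usize, col: usize, cols: usize) -> usize {
    row * cols + col
}

fn main() {
    // Element (2, 3) of a 4x5 tensor lives at index 2 * 5 + 3 = 13.
    assert_eq!(row_major_offset(2, 3, 5), 13);
}
```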
### Kernel Implementation

Operations are implemented as dedicated CUDA kernels. Element-wise addition, for example:

```cuda
// Element-wise addition kernel
__global__ void add_kernel(const float* a, const float* b, float* c, int n) {
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    if (idx < n) {
        c[idx] = a[idx] + b[idx];
    }
}
```
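The host side launches such a kernel with a ceiling-division launch configuration; the `if (idx < n)` guard handles the final, partially filled block. The arithmetic, sketched in Rust for illustration (the backend computes this internally):

```rust
/// 1-D launch configuration for `n` elements: (blocks, threads per block).
fn launch_config(n: usize) -> (usize, usize) {
    let threads_per_block = 256;
    // Round up so every element is covered by some thread.
    let blocks = (n + threads_per_block - 1) / threads_per_block;
    (blocks, threads_per_block)
}

fn main() {
    // 1,000,000 elements -> 3907 blocks of 256 threads (covers 1,000,192 slots).
    assert_eq!(launch_config(1_000_000), (3907, 256));
}
```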
## Performance Characteristics

### Strengths

- Compute Throughput: High FP32/FP16 throughput on NVIDIA hardware
- Memory Bandwidth: Strong utilization of GPU memory bandwidth
- Kernel Optimization: Hand-tuned kernels for each operation
- Library Integration: Designed for future integration with cuDNN and related libraries
### Performance Metrics

Example measurements on an RTX 4090:

| Operation        | Tensor Size  | CPU (32 cores) | CUDA    | Speedup |
|------------------|--------------|----------------|---------|---------|
| Element-wise Add | 1M elements  | 2.1 ms         | 0.18 ms | 12x     |
| Matrix Multiply  | 2048x2048    | 450 ms         | 8.2 ms  | 55x     |
| Reduction Sum    | 16M elements | 15 ms          | 0.52 ms | 29x     |
## Optimization Guidelines

### Optimal Use Cases

```rust
// Large tensor operations
let a = Tensor::zeros(vec![4096, 4096])?;
let b = Tensor::zeros(vec![4096, 4096])?;
let c = (a * b) + 1.0; // Excellent GPU performance

// Batch operations
for batch in large_dataset {
    let result = model.forward(batch)?; // Amortizes GPU launch overhead
}

// Memory-bound operations
let result = ((a * b) + c) / d; // Saturates GPU memory bandwidth
```

### Suboptimal Use Cases

```rust
// Very small tensors
let tiny = Tensor::ones(vec![8, 8])?; // Kernel launch overhead dominates

// Frequent host-device transfers
let gpu_result = cpu_tensor.to_backend(BackendType::Cuda)?;
let back_to_cpu = gpu_result.to_vec()?; // PCIe bandwidth becomes the bottleneck

// Scalar reductions with immediate use
let sum = tensor.sum(None)?.to_vec()?; // Forces synchronization
```
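When a scalar result really is needed, the synchronization cost can at least be paid only once by keeping every intermediate step on the device. A sketch using the same API as above (the helper name is hypothetical):

```rust
use tensor_frame::{BackendType, Result, Tensor};

// Hypothetical helper: multiply on the GPU, reduce on the GPU,
// and transfer/synchronize exactly once at the end.
fn gpu_sum_of_products(a: Tensor, b: Tensor) -> Result<Vec<f32>> {
    let a_gpu = a.to_backend(BackendType::Cuda)?;
    let b_gpu = b.to_backend(BackendType::Cuda)?;

    let product = a_gpu * b_gpu;    // stays on the device
    let total = product.sum(None)?; // still on the device

    total.to_vec() // single host transfer / synchronization point
}
```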
## Memory Management

### Device Memory Allocation

CUDA tensors allocate GPU memory directly:

```rust
// Allocates 64 MB of GPU memory
let large_tensor = Tensor::zeros(vec![4096, 4096])?
    .to_backend(BackendType::Cuda)?;
```
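The 64 MB figure is just the element count times the size of `f32`:

```rust
fn main() {
    // 4096 * 4096 f32 elements, 4 bytes each.
    let bytes = 4096usize * 4096 * std::mem::size_of::<f32>();
    assert_eq!(bytes, 64 * 1024 * 1024); // exactly 64 MiB
    println!("{} bytes = {} MiB", bytes, bytes / (1024 * 1024));
}
```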
### Memory Pool Management

The backend uses a memory pool to reduce allocation overhead:

```rust
// The pool reduces per-allocation overhead
let tensors: Vec<Tensor> = (0..100)
    .map(|_| Tensor::zeros(vec![1024, 1024]))
    .collect::<Result<Vec<_>>>()?;
```
### Memory Transfer Optimization

```rust
// Efficient: batch the transfers
let gpu_tensors = cpu_tensors
    .into_iter()
    .map(|t| t.to_backend(BackendType::Cuda))
    .collect::<Result<Vec<_>>>()?;

// Inefficient: individual transfers interleaved with work
for cpu_tensor in cpu_tensors {
    let gpu_tensor = cpu_tensor.to_backend(BackendType::Cuda)?;
    process(gpu_tensor)?;
}
```
### Memory Debugging

Monitor GPU memory usage from the command line:

```bash
# Check GPU memory
nvidia-smi

# Continuous monitoring
watch -n 1 nvidia-smi
```

Or from within the application:

```rust
// Check available memory
let (free, total) = cuda::memory_info()?;
println!("GPU memory: {}/{} MB", free / 1024 / 1024, total / 1024 / 1024);

// Handle out-of-memory errors
match Tensor::zeros(vec![16384, 16384]).and_then(|t| t.to_backend(BackendType::Cuda)) {
    Ok(_tensor) => println!("Allocated a 1 GB GPU tensor"),
    Err(TensorError::BackendError(msg)) if msg.contains("memory") => {
        eprintln!("GPU OOM, retrying with a smaller allocation");
    }
    Err(e) => eprintln!("CUDA error: {}", e),
}
```
## Error Handling

CUDA operations can fail for a variety of hardware and software reasons.

### Runtime Errors

```rust
use tensor_frame::{Tensor, TensorError};

match tensor_operation() {
    Ok(result) => process(result),
    Err(TensorError::BackendError(msg)) => {
        if msg.contains("out of memory") {
            // GPU memory exhausted
            fallback_to_cpu()?;
        } else if msg.contains("invalid device") {
            // GPU not available or driver issue
            retry_with_cpu_backend()?;
        } else {
            // Other CUDA error
            eprintln!("CUDA error: {}", msg);
        }
    }
    Err(e) => return Err(e), // Non-backend errors propagate unchanged
}
```
### Common Error Scenarios

- GPU Out of Memory: Tensor too large for available GPU memory
- Invalid Device: GPU not found or not compatible
- Driver Mismatch: Installed NVIDIA driver is incompatible with the CUDA Toolkit version
- Kernel Launch Failed: Invalid kernel parameters or a GPU fault
- Memory Access Violation: Invalid GPU memory access
### Error Recovery

```rust
// Graceful fallback strategy
fn robust_tensor_operation(tensor: Tensor) -> Result<Tensor> {
    // Try CUDA first
    if let Ok(cuda_tensor) = tensor.to_backend(BackendType::Cuda) {
        match cuda_operation(cuda_tensor) {
            Ok(result) => return Ok(result),
            Err(TensorError::BackendError(_)) => {
                // CUDA failed, fall back to CPU
                eprintln!("CUDA operation failed, falling back to CPU");
            }
            Err(e) => return Err(e), // Non-backend errors are not retried
        }
    }

    // CPU fallback
    cpu_operation(tensor.to_backend(BackendType::Cpu)?)
}
```
## Debugging and Profiling

### CUDA Debugging Tools

NVIDIA Nsight Systems for system-wide performance analysis:

```bash
nsys profile --stats=true ./your_app
```

NVIDIA Nsight Compute for kernel-level profiling:

```bash
ncu --metrics sm__throughput.avg.pct_of_peak_sustained_elapsed ./your_app
```

compute-sanitizer (the successor to cuda-memcheck, which was removed in CUDA 12) for memory error detection:

```bash
compute-sanitizer ./your_app
```
### Performance Analysis

```rust
// Wall-clock timing of a GPU operation
use std::time::Instant;

let start = Instant::now();
let result = gpu_tensor_a.matmul(&gpu_tensor_b)?;

// Note: matmul is asynchronous!
let _sync = result.to_vec()?; // Force synchronization before stopping the clock
let elapsed = start.elapsed();

println!("Matrix multiplication took: {:?}", elapsed);
```
### Memory Leak Detection

```rust
// Monitor for memory leaks in long-running applications
fn check_memory_usage() -> Result<()> {
    let (free_before, _total) = cuda::memory_info()?;

    // Perform operations inside a scope
    {
        let tensor = Tensor::zeros(vec![1000, 1000])?.to_backend(BackendType::Cuda)?;
        let _result = expensive_operation(tensor)?;
    } // The tensor's GPU memory should be released here

    let (free_after, _) = cuda::memory_info()?;
    if free_after < free_before {
        eprintln!("Potential memory leak detected!");
        eprintln!("Memory delta: {} MB", (free_before - free_after) / 1024 / 1024);
    }
    Ok(())
}
```
## Production Deployment

### Docker Configuration

```dockerfile
# Build stage: NVIDIA CUDA development image
FROM nvidia/cuda:12.0-devel-ubuntu20.04

# Install Rust
RUN curl --proto '=https' --tlsv1.2 -sSf https://sh.rustup.rs | sh -s -- -y
ENV PATH="/root/.cargo/bin:${PATH}"

# Copy and build your application
COPY . /app
WORKDIR /app
RUN cargo build --release --features cuda

# Runtime stage: slimmer image with the CUDA runtime only
FROM nvidia/cuda:12.0-runtime-ubuntu20.04
COPY --from=0 /app/target/release/your_app /usr/local/bin/
CMD ["your_app"]
```
Kubernetes Deployment
apiVersion: v1
kind: Pod
spec:
containers:
- name: tensor-app
image: your-app:latest
resources:
limits:
nvidia.com/gpu: 1
env:
- name: CUDA_VISIBLE_DEVICES
value: "0"
### Environment Variables

```bash
# Limit GPU memory growth
export CUDA_MEMORY_POOL_TYPE=pool

# Make kernel launches synchronous (simplifies host-side timing and debugging)
export CUDA_LAUNCH_BLOCKING=1

# Select a specific GPU
export CUDA_VISIBLE_DEVICES=0
```
## Optimization Best Practices

### Memory Access Patterns

```rust
// Coalesced memory access (efficient)
let result = tensor_a + tensor_b; // Sequential element access

// Strided access (less efficient)
let transposed = tensor.transpose()?; // May require rearranging memory
```
### Kernel Fusion

```rust
// Fused operations (single kernel launch)
let result = ((a * b) + c).relu(); // Ideally fused into one kernel

// Separate operations (multiple kernel launches)
let temp1 = a * b;
let temp2 = temp1 + c;
let result = temp2.relu(); // Three separate kernels
```
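The payoff from fusion is mostly reduced memory traffic: the separate version writes and then re-reads two temporaries that the fused kernel never materializes. A rough, illustrative byte count:

```rust
fn main() {
    let n: usize = 1 << 20;                // 1M elements
    let elem = std::mem::size_of::<f32>(); // 4 bytes

    // Fused ((a * b) + c).relu(): read a, b, c once; write the result once.
    let fused_bytes = 4 * n * elem;

    // Separate kernels: a*b (read 2, write 1), +c (read 2, write 1), relu (read 1, write 1).
    let separate_bytes = (3 + 3 + 2) * n * elem;

    println!(
        "fused: {} MiB, separate: {} MiB ({}x the traffic)",
        fused_bytes / (1024 * 1024),
        separate_bytes / (1024 * 1024),
        separate_bytes / fused_bytes
    );
}
```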
### Stream Management

```rust
// Future: async operations with CUDA streams.
// Execution is currently synchronous; the stream API below is planned, not yet available.
let stream_a = cuda::create_stream()?;
let stream_b = cuda::create_stream()?;

// Parallel execution on different streams
let result_a = tensor_a.sum(None).execute_on(stream_a)?;
let result_b = tensor_b.mean(None).execute_on(stream_b)?;
```
## Integration with the CUDA Ecosystem

### cuDNN (Future)

Planned integration for neural network operations:

```rust
// Future: convolution operations
let output = input.conv2d(&kernel, stride, padding)?;
```

### NCCL (Future)

Planned multi-GPU communication for distributed computing:

```rust
// Future: multi-GPU operations
let distributed_result = tensor.all_reduce_sum()?;
```