Tensor Frame
Tensor Frame is a high-performance, PyTorch-like tensor library for Rust that supports multiple computational backends including CPU (with Rayon), WGPU (for GPU compute), and CUDA.
Features
- Multiple Backends: Automatic backend selection with fallback support
- CPU backend with Rayon for parallel processing
- WGPU backend for cross-platform GPU computing
- CUDA backend for NVIDIA GPU acceleration
- PyTorch-like API: Familiar tensor operations and broadcasting
- Dynamic Tensors: Runtime shape and type flexibility
- Full Broadcasting Support: NumPy-style automatic shape broadcasting for all arithmetic operations (+, -, *, /)
- Zero-Copy Operations: Efficient memory management where possible
- Feature Flags: Optional backends via Cargo features
Quick Example
use tensor_frame::Tensor;
// Create tensors (automatically uses the best available backend)
let a = Tensor::from_vec(vec![1.0, 2.0, 3.0, 4.0], vec![2, 2])?;
let b = Tensor::from_vec(vec![10.0, 20.0], vec![2, 1])?;
// Perform operations with automatic broadcasting
let c = (a + b)?; // Broadcasting: [2,2] + [2,1] -> [2,2]
println!("Result: {:?}", c.to_vec()?); // [11.0, 12.0, 23.0, 24.0]
// All operations support broadcasting
let scalar = Tensor::from_vec(vec![2.0], vec![])?;
let scaled = (c / scalar)?; // Divide by scalar
let sum = scaled.sum(None)?; // Sum all elements
println!("Sum: {:?}", sum.to_vec()?);
Backend Priority
By default, Tensor Frame will attempt to use backends in this order:
- CUDA (if available and feature enabled)
- WGPU (if available and feature enabled)
- CPU (always available)
You can also explicitly specify a backend or create custom backend implementations.
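For example, a minimal sketch of forcing CPU-only execution via the set_backend_priority helper described in the Backend System chapter (whether the priority applies to tensors created afterwards is assumed here):
use tensor_frame::backend::{set_backend_priority, BackendType};
use tensor_frame::Tensor;

// Restrict the priority list to the CPU backend only
let backend = set_backend_priority(vec![BackendType::Cpu]);

// Tensors created afterwards should report the CPU backend
let t = Tensor::ones(vec![4, 4])?;
println!("Backend: {:?}", t.backend_type());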
Getting Started
Installation
Add Tensor Frame to your Cargo.toml:
[dependencies]
tensor_frame = "0.0.3-alpha"
Feature Flags
Tensor Frame supports optional backends via feature flags:
[dependencies]
# CPU only (default)
tensor_frame = "0.0.3-alpha"
# With WGPU support
tensor_frame = { version = "0.0.3-alpha", features = ["wgpu"] }
# With CUDA support
tensor_frame = { version = "0.0.3-alpha", features = ["cuda"] }
# All backends
tensor_frame = { version = "0.0.3-alpha", features = ["wgpu", "cuda"] }
Basic Usage
Creating Tensors
use tensor_frame::{Tensor, Result};
fn main() -> Result<()> {
// Create tensors with different initialization
let zeros = Tensor::zeros(vec![2, 3])?;
let ones = Tensor::ones(vec![2, 3])?;
let from_data = Tensor::from_vec(
vec![1.0, 2.0, 3.0, 4.0],
vec![2, 2]
)?;
// Inspect tensor properties
println!("Shape: {:?}", zeros.shape().dims());
println!("Number of elements: {}", zeros.numel());
println!("Number of dimensions: {}", zeros.ndim());
Ok(())
}
Basic Operations
use tensor_frame::{Tensor, Result};
fn main() -> Result<()> {
let a = Tensor::from_vec(vec![1.0, 2.0, 3.0, 4.0], vec![2, 2])?;
let b = Tensor::from_vec(vec![5.0, 6.0, 7.0, 8.0], vec![2, 2])?;
// Element-wise operations
let sum = (a.clone() + b.clone())?;
let diff = (a.clone() - b.clone())?;
let product = (a.clone() * b.clone())?;
let quotient = (a / b)?;
// Reduction operations
let total = sum.sum(None)?;
let average = product.mean(None)?;
println!("Sum result: {:?}", total.to_vec()?);
Ok(())
}
Broadcasting
Tensor Frame supports automatic broadcasting similar to NumPy and PyTorch:
use tensor_frame::{Tensor, Result};
fn main() -> Result<()> {
let a = Tensor::ones(vec![2, 1])?; // Shape: [2, 1]
let b = Tensor::ones(vec![1, 3])?; // Shape: [1, 3]
// Broadcasting: [2, 1] + [1, 3] -> [2, 3]
let c = (a + b)?;
println!("Result shape: {:?}", c.shape().dims());
Ok(())
}
Tensor Manipulation
use tensor_frame::{Tensor, Result, TensorOps};
fn main() -> Result<()> {
let tensor = Tensor::from_vec(
vec![1.0, 2.0, 3.0, 4.0, 5.0, 6.0],
vec![2, 3]
)?;
// Reshape
let reshaped = tensor.reshape(vec![3, 2])?;
// Transpose (2D only for now)
let transposed = reshaped.transpose()?;
// Squeeze and unsqueeze
let squeezed = tensor.squeeze(None)?;
let unsqueezed = squeezed.unsqueeze(0)?;
Ok(())
}
API Reference
This section provides detailed documentation for all public APIs in Tensor Frame.
Core Types
- Tensor - The main tensor type with all operations
- Backends - Backend trait and implementation details
- Operations - Detailed operation specifications
Key Traits and Enums
TensorOps Trait
The TensorOps trait defines all tensor manipulation and computation operations:
pub trait TensorOps {
fn reshape(&self, new_shape: Vec<usize>) -> Result<Tensor>;
fn transpose(&self) -> Result<Tensor>;
fn squeeze(&self, dim: Option<usize>) -> Result<Tensor>;
fn unsqueeze(&self, dim: usize) -> Result<Tensor>;
// ... more methods
}
DType Enum
Supported data types:
pub enum DType {
F32, // 32-bit floating point (default)
F64, // 64-bit floating point
I32, // 32-bit signed integer
U32, // 32-bit unsigned integer
}
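For instance, a small sketch: from_vec stores f32 data, so dtype() (used later in the basic operations example) should report F32:
use tensor_frame::Tensor;

let t = Tensor::from_vec(vec![1.0, 2.0], vec![2])?;
println!("dtype: {:?}", t.dtype()); // F32 for tensors built from Vec<f32>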
BackendType Enum
Available computational backends:
pub enum BackendType {
Cpu, // CPU backend with Rayon
Wgpu, // Cross-platform GPU backend
Cuda, // NVIDIA CUDA backend
}
Error Handling
All operations return Result<T> with TensorError for comprehensive error handling:
pub enum TensorError {
ShapeMismatch { expected: Vec<usize>, got: Vec<usize> },
BackendError(String),
InvalidOperation(String),
DimensionError(String),
}
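A short, hedged sketch of matching on these variants; which variant a particular failure maps to is an assumption here:
use tensor_frame::{Tensor, TensorError};

// 3 data elements cannot fill a 2x2 tensor
match Tensor::from_vec(vec![1.0, 2.0, 3.0], vec![2, 2]) {
    Ok(t) => println!("created tensor with {} elements", t.numel()),
    Err(TensorError::ShapeMismatch { expected, got }) => {
        eprintln!("shape mismatch: expected {:?}, got {:?}", expected, got);
    }
    Err(e) => eprintln!("other error: {}", e),
}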
Memory Management
Tensor Frame uses smart pointers and reference counting for efficient memory management:
- Tensors are cheaply clonable (reference counted)
- Backend storage is automatically managed
- Cross-backend tensor conversion is supported
- Zero-copy operations where possible
Tensor API
The Tensor struct is the core data structure in Tensor Frame, representing multi-dimensional arrays with automatic backend selection.
Constructor Methods
Basic Constructors
// Create tensor filled with zeros
pub fn zeros(shape: Vec<usize>) -> Result<Tensor>
// Create tensor filled with ones
pub fn ones(shape: Vec<usize>) -> Result<Tensor>
// Create tensor from Vec data
pub fn from_vec(data: Vec<f32>, shape: Vec<usize>) -> Result<Tensor>
Examples
use tensor_frame::Tensor;
// 2x3 matrix of zeros
let zeros = Tensor::zeros(vec![2, 3])?;
// 1D vector of ones
let ones = Tensor::ones(vec![5])?;
// Create from existing data
let data = vec![1.0, 2.0, 3.0, 4.0];
let tensor = Tensor::from_vec(data, vec![2, 2])?;
Properties
Shape Information
// Get tensor shape
pub fn shape(&self) -> &Shape
// Get number of elements
pub fn numel(&self) -> usize
// Get number of dimensions
pub fn ndim(&self) -> usize
Data Access
// Convert tensor to Vec<f32>
pub fn to_vec(&self) -> Result<Vec<f32>>
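A short usage sketch combining these accessors:
use tensor_frame::Tensor;

let t = Tensor::from_vec(vec![1.0, 2.0, 3.0, 4.0, 5.0, 6.0], vec![2, 3])?;
println!("shape: {:?}", t.shape().dims()); // [2, 3]
println!("numel: {}", t.numel());          // 6
println!("ndim:  {}", t.ndim());           // 2
let data: Vec<f32> = t.to_vec()?;          // copies the data back out
println!("data:  {:?}", data);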
Arithmetic Operations
Tensor Frame supports standard arithmetic operations through operator overloading:
Binary Operations
// Addition (element-wise)
let c = a + b;
let c = &a + &b; // Avoid cloning
// Subtraction (element-wise)
let c = a - b;
// Multiplication (element-wise)
let c = a * b;
// Division (element-wise)
let c = a / b;
Broadcasting Rules
All arithmetic operations automatically broadcast tensors following NumPy/PyTorch rules:
- Dimensions are aligned from the right
- Missing dimensions are treated as size 1
- Dimensions of size 1 are expanded to match
let a = Tensor::ones(vec![2, 1, 3])?; // Shape: [2, 1, 3]
let b = Tensor::ones(vec![1, 4, 1])?; // Shape: [1, 4, 1]
let c = a + b; // Result: [2, 4, 3]
Tensor Manipulation
Reshaping
impl TensorOps for Tensor {
// Change tensor shape (must preserve total elements)
fn reshape(&self, new_shape: Vec<usize>) -> Result<Tensor>;
}
let tensor = Tensor::from_vec(vec![1.0, 2.0, 3.0, 4.0, 5.0, 6.0], vec![2, 3])?;
let reshaped = tensor.reshape(vec![3, 2])?; // 2x3 -> 3x2
Transposition
// Transpose 2D tensor (swap dimensions)
fn transpose(&self) -> Result<Tensor>;
let matrix = Tensor::from_vec(vec![1.0, 2.0, 3.0, 4.0], vec![2, 2])?;
let transposed = matrix.transpose()?; // [[1,2],[3,4]] -> [[1,3],[2,4]]
Dimension Manipulation
// Remove dimensions of size 1
fn squeeze(&self, dim: Option<usize>) -> Result<Tensor>;
// Add dimension of size 1
fn unsqueeze(&self, dim: usize) -> Result<Tensor>;
let tensor = Tensor::ones(vec![1, 3, 1])?; // Shape: [1, 3, 1]
let squeezed = tensor.squeeze(None)?; // Shape: [3]
let unsqueezed = squeezed.unsqueeze(0)?; // Shape: [1, 3]
Reduction Operations
Full Reductions
// Sum all elements
fn sum(&self, axis: Option<usize>) -> Result<Tensor>;
// Mean of all elements
fn mean(&self, axis: Option<usize>) -> Result<Tensor>;
let tensor = Tensor::from_vec(vec![1.0, 2.0, 3.0, 4.0], vec![2, 2])?;
// Sum all elements -> scalar tensor
let total = tensor.sum(None)?; // Result: 10.0
// Mean of all elements -> scalar tensor
let average = tensor.mean(None)?; // Result: 2.5
Axis-specific Reductions
Axis-specific reductions are supported by passing Some(axis) to sum or mean. The CPU backend computes them natively, while the GPU backends currently fall back to the CPU implementation for axis-specific reductions (see the Operations Reference for details).
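A short example, mirroring the Operations Reference:
let t = Tensor::from_vec(vec![1.0, 2.0, 3.0, 4.0], vec![2, 2])?;
let col_sums = t.sum(Some(0))?;   // shape [2], values [4.0, 6.0]
let row_means = t.mean(Some(1))?; // shape [2], values [1.5, 3.5]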
Display and Debug
Tensors implement comprehensive display formatting:
let tensor = Tensor::from_vec(vec![1.0, 2.0, 3.0, 4.0], vec![2, 2])?;
println!("{}", tensor);
// Output:
// Tensor([[1.0000, 2.0000],
// [3.0000, 4.0000]], dtype=f32)
Type Conversions
// Convert to Vec for external use
let data: Vec<f32> = tensor.to_vec()?;
// Clone (cheap - reference counted)
let cloned = tensor.clone();
Performance Notes
- Cloning: Tensors use reference counting, so cloning is O(1)
- Backend Selection: Operations stay on the same backend when possible
- Memory Layout: Tensors use row-major (C-style) memory layout
- Broadcasting: Zero-copy when possible, falls back to explicit expansion
Backend System
Tensor Frame uses a pluggable backend system that allows tensors to run on different computational devices. This page documents the backend architecture and API.
Backend Trait
All backends implement the Backend trait:
pub trait Backend: Debug + Send + Sync {
fn backend_type(&self) -> BackendType;
fn is_available(&self) -> bool;
// Tensor creation
fn zeros(&self, shape: &Shape, dtype: DType) -> Result<Storage>;
fn ones(&self, shape: &Shape, dtype: DType) -> Result<Storage>;
fn from_slice(&self, data: &[f32], shape: &Shape) -> Result<Storage>;
// Arithmetic operations
fn add(&self, lhs: &Storage, rhs: &Storage) -> Result<Storage>;
fn sub(&self, lhs: &Storage, rhs: &Storage) -> Result<Storage>;
fn mul(&self, lhs: &Storage, rhs: &Storage) -> Result<Storage>;
fn div(&self, lhs: &Storage, rhs: &Storage) -> Result<Storage>;
// Reduction operations
fn sum(&self, storage: &Storage, axis: Option<usize>) -> Result<Storage>;
fn mean(&self, storage: &Storage, axis: Option<usize>) -> Result<Storage>;
// Data access
fn to_vec_f32(&self, storage: &Storage) -> Result<Vec<f32>>;
}
Storage Types
Each backend uses a different storage mechanism:
pub enum Storage {
Cpu(Vec<f32>), // CPU: simple Vec
Wgpu(WgpuStorage), // WGPU: GPU buffer
Cuda(CudaStorage), // CUDA: device pointer
}
pub struct WgpuStorage {
pub buffer: Arc<wgpu::Buffer>, // WGPU buffer handle
}
pub struct CudaStorage {
pub ptr: *mut f32, // Raw CUDA device pointer
pub len: usize, // Buffer length
}
Backend Selection
Automatic Selection
By default, Tensor Frame automatically selects the best available backend:
- CUDA (if available and feature enabled)
- WGPU (if available and feature enabled)
- CPU (always available)
// Uses automatic backend selection
let tensor = Tensor::zeros(vec![1000, 1000])?;
println!("Selected backend: {:?}", tensor.backend_type());
Manual Selection
You can also explicitly specify backend priority:
use tensor_frame::backend::{set_backend_priority, BackendType};
// Force CPU backend
let cpu_backend = set_backend_priority(vec![BackendType::Cpu]);
// Prefer WGPU over CUDA
let gpu_backend = set_backend_priority(vec![
BackendType::Wgpu,
BackendType::Cuda,
BackendType::Cpu
]);
Backend Conversion
Convert tensors between backends:
let cpu_tensor = Tensor::ones(vec![100, 100])?;
// Convert to GPU backend (if available)
let gpu_tensor = cpu_tensor.to_backend(BackendType::Wgpu)?;
// Convert back to CPU
let back_to_cpu = gpu_tensor.to_backend(BackendType::Cpu)?;
Performance Characteristics
CPU Backend
- Pros: Always available, good for small tensors, excellent for development
- Cons: Limited parallelism, slower for large operations
- Best for: Tensors < 10K elements, prototyping, fallback option
- Implementation: Uses Rayon for parallel CPU operations
WGPU Backend
- Pros: Cross-platform GPU support, works on Metal/Vulkan/DX12/OpenGL
- Cons: Compute shader overhead, limited by GPU memory
- Best for: Large tensor operations, cross-platform deployment
- Implementation: Compute shaders with buffer storage
CUDA Backend
- Pros: Highest performance on NVIDIA GPUs, mature ecosystem
- Cons: NVIDIA-only, requires CUDA toolkit installation
- Best for: Production workloads on NVIDIA hardware
- Implementation: cuBLAS and custom CUDA kernels
Backend Availability
Check backend availability at runtime:
use tensor_frame::backend::{cpu, wgpu, cuda};
// CPU backend is always available
println!("CPU available: {}", cpu::CpuBackend::new().is_available());
// Check GPU backends
#[cfg(feature = "wgpu")]
if let Ok(wgpu_backend) = wgpu::WgpuBackend::new() {
println!("WGPU available: {}", wgpu_backend.is_available());
}
#[cfg(feature = "cuda")]
println!("CUDA available: {}", cuda::is_available());
Cross-Backend Operations
Operations between tensors on different backends automatically handle conversion:
let cpu_tensor = Tensor::ones(vec![100])?;
let gpu_tensor = Tensor::zeros(vec![100])?.to_backend(BackendType::Wgpu)?;
// Automatically converts gpu_tensor to CPU backend for the operation
let result = cpu_tensor + gpu_tensor;
Custom Backends
You can implement custom backends by implementing the Backend trait:
#[derive(Debug)]
struct MyCustomBackend;
impl Backend for MyCustomBackend {
fn backend_type(&self) -> BackendType {
// Would need to extend BackendType enum
BackendType::Custom
}
fn is_available(&self) -> bool {
true // Your availability logic
}
// Implement all required methods...
fn zeros(&self, shape: &Shape, dtype: DType) -> Result<Storage> {
// Your implementation
}
// ... more methods
}
Memory Management
Reference Counting
- Tensors use Arc<dyn Backend> for backend sharing
- Storage is reference counted within each backend
- Automatic cleanup when the last reference is dropped
Cross-Backend Memory
- Converting between backends allocates new memory
- Original data remains valid until all references dropped
- No automatic synchronization between backends
GPU Memory Management
- WGPU backend uses WGPU's automatic memory management
- CUDA backend manually manages device memory with proper cleanup
- Out-of-memory errors are propagated as TensorError::BackendError
Operations Reference
This page provides detailed specifications for all tensor operations in Tensor Frame.
Arithmetic Operations
Element-wise Binary Operations
All arithmetic operations support automatic NumPy-style broadcasting, allowing operations between tensors of different but compatible shapes.
Addition (+)
fn add(lhs: &Tensor, rhs: &Tensor) -> Result<Tensor>
Computes element-wise addition: output[i] = lhs[i] + rhs[i]
Broadcasting: Yes
Supported shapes: Any compatible shapes
Error conditions: Shape incompatibility
let a = Tensor::ones(vec![2, 3])?;
let b = Tensor::ones(vec![2, 3])?;
let c = a + b; // All elements = 2.0
// Broadcasting example
let x = Tensor::from_vec(vec![1.0, 2.0], vec![2, 1])?;
let y = Tensor::from_vec(vec![10.0, 20.0, 30.0], vec![1, 3])?;
let z = x + y; // Shape: [2, 3]
Subtraction (-)
fn sub(lhs: &Tensor, rhs: &Tensor) -> Result<Tensor>
Computes element-wise subtraction: output[i] = lhs[i] - rhs[i]
Broadcasting: Yes
Supported shapes: Any compatible shapes
Error conditions: Shape incompatibility
// Same shapes
let a = Tensor::from_vec(vec![5.0, 6.0, 7.0, 8.0], vec![2, 2])?;
let b = Tensor::from_vec(vec![1.0, 2.0, 3.0, 4.0], vec![2, 2])?;
let c = a - b; // [4.0, 4.0, 4.0, 4.0]
// With broadcasting
let x = Tensor::from_vec(vec![10.0, 20.0], vec![2, 1])?;
let y = Tensor::from_vec(vec![1.0, 2.0, 3.0], vec![1, 3])?;
let z = x - y; // Shape: [2, 3], values: [[9, 8, 7], [19, 18, 17]]
Multiplication (*)
fn mul(lhs: &Tensor, rhs: &Tensor) -> Result<Tensor>
Computes element-wise multiplication: output[i] = lhs[i] * rhs[i]
Note: This is element-wise multiplication (Hadamard product), not matrix multiplication.
Broadcasting: Yes
Supported shapes: Any compatible shapes
Error conditions: Shape incompatibility
// Broadcasting with a row vector
let matrix = Tensor::from_vec(vec![1.0, 2.0, 3.0, 4.0, 5.0, 6.0], vec![2, 3])?;
let row = Tensor::from_vec(vec![10.0, 20.0, 30.0], vec![3])?;
let scaled = matrix * row; // Each row multiplied by [10, 20, 30]
Division (/)
fn div(lhs: &Tensor, rhs: &Tensor) -> Result<Tensor>
Computes element-wise division: output[i] = lhs[i] / rhs[i]
Broadcasting: Yes
Supported shapes: Any compatible shapes
Error conditions: Shape incompatibility
Special handling: Division by zero follows IEEE 754 standards:
- x / 0.0 where x > 0 → +∞
- x / 0.0 where x < 0 → -∞
- 0.0 / 0.0 → NaN
// Divide by scalar (broadcast)
let tensor = Tensor::from_vec(vec![2.0, 4.0, 6.0, 8.0], vec![2, 2])?;
let scalar = Tensor::from_vec(vec![2.0], vec![])?; // Scalar tensor
let result = tensor / scalar; // [1.0, 2.0, 3.0, 4.0]
// Broadcasting example
let x = Tensor::from_vec(vec![100.0, 200.0], vec![2, 1])?;
let y = Tensor::from_vec(vec![1.0, 2.0, 4.0], vec![1, 3])?;
let z = x / y; // Shape: [2, 3], values: [[100, 50, 25], [200, 100, 50]]
Reduction Operations
Sum
fn sum(&self, axis: Option<usize>) -> Result<Tensor>
Computes sum along specified axis or all elements.
Parameters:
- axis: None - sum all elements, returning a scalar tensor
- axis: Some(i) - sum along axis i, reducing that dimension
Supported shapes: Any
Backend support:
- CPU: Full native support for axis-specific reductions
- WGPU: Supported; axis-specific reductions currently fall back to the CPU implementation
- CUDA: Supported; axis-specific reductions currently fall back to the CPU implementation
let tensor = Tensor::from_vec(vec![1.0, 2.0, 3.0, 4.0], vec![2, 2])?;
// Sum all elements (all backends)
let total = tensor.sum(None)?; // Result: [10.0] (scalar)
// Axis-specific sums (all backends)
let col_sums = tensor.sum(Some(0))?; // Result: [4.0, 6.0] (shape: [2])
let row_sums = tensor.sum(Some(1))?; // Result: [3.0, 7.0] (shape: [2])
Mean
fn mean(&self, axis: Option<usize>) -> Result<Tensor>
Computes arithmetic mean along specified axis or all elements.
Parameters:
- axis: None - mean of all elements, returning a scalar tensor
- axis: Some(i) - mean along axis i, reducing that dimension
Supported shapes: Any
Backend support:
- CPU: Full native support for axis-specific reductions
- WGPU: Supported; axis-specific reductions currently fall back to the CPU implementation
- CUDA: Supported; axis-specific reductions currently fall back to the CPU implementation
let tensor = Tensor::from_vec(vec![1.0, 2.0, 3.0, 4.0], vec![2, 2])?;
// Mean of all elements (all backends)
let average = tensor.mean(None)?; // Result: [2.5] (scalar)
// Axis-specific means (all backends)
let col_means = tensor.mean(Some(0))?; // Result: [2.0, 3.0] (shape: [2])
let row_means = tensor.mean(Some(1))?; // Result: [1.5, 3.5] (shape: [2])
Shape Manipulation
Reshape
fn reshape(&self, new_shape: Vec<usize>) -> Result<Tensor>
Changes tensor shape while preserving total number of elements.
Requirements:
- Product of new_shape must equal self.numel()
- New shape cannot have zero dimensions
Error conditions:
- Incompatible total elements
- Invalid shape dimensions
let tensor = Tensor::from_vec(vec![1.0, 2.0, 3.0, 4.0, 5.0, 6.0], vec![2, 3])?;
let reshaped = tensor.reshape(vec![3, 2])?; // 2×3 -> 3×2
let flattened = tensor.reshape(vec![6])?; // 2×3 -> [6] (flattened to 1D)
Transpose
fn transpose(&self) -> Result<Tensor>
Transposes a 2D tensor (swaps dimensions).
Requirements: Tensor must be exactly 2D
Error conditions: Non-2D tensor
let matrix = Tensor::from_vec(vec![1.0, 2.0, 3.0, 4.0], vec![2, 2])?;
let transposed = matrix.transpose()?;
// [[1,2],[3,4]] -> [[1,3],[2,4]]
Squeeze
fn squeeze(&self, dim: Option<usize>) -> Result<Tensor>
Removes dimensions of size 1.
Parameters:
- dim: None - remove all dimensions of size 1
- dim: Some(i) - remove dimension i only if it has size 1
Error conditions:
- Invalid dimension index
- Trying to squeeze dimension with size > 1
let tensor = Tensor::ones(vec![1, 3, 1, 2])?; // Shape: [1, 3, 1, 2]
let squeezed = tensor.squeeze(None)?; // Shape: [3, 2]
let partial = tensor.squeeze(Some(0))?; // Shape: [3, 1, 2]
Unsqueeze
fn unsqueeze(&self, dim: usize) -> Result<Tensor>
Adds a dimension of size 1 at the specified position.
Parameters:
- dim - position at which to insert the new dimension (0 to ndim, inclusive)
Error conditions: Invalid dimension index (> ndim)
let tensor = Tensor::ones(vec![3, 2])?; // Shape: [3, 2]
let unsqueezed = tensor.unsqueeze(0)?; // Shape: [1, 3, 2]
let middle = tensor.unsqueeze(1)?; // Shape: [3, 1, 2]
let end = tensor.unsqueeze(2)?; // Shape: [3, 2, 1]
Broadcasting Rules
Tensor Frame follows NumPy/PyTorch broadcasting conventions:
Alignment
Shapes are aligned from the rightmost dimension:
Tensor A: [3, 1, 4]
Tensor B: [2, 4]
Result: [3, 2, 4]
Size 1 Expansion
Dimensions of size 1 are expanded to match:
Tensor A: [3, 1, 4]
Tensor B: [3, 2, 1]
Result: [3, 2, 4]
Missing Dimensions
Missing leading dimensions are treated as size 1:
Tensor A: [5, 3, 2]
Tensor B: [3, 2]
Result: [5, 3, 2]
Incompatible Shapes
These shapes cannot be broadcast:
Tensor A: [3, 4]
Tensor B: [2, 4] # Error: 3 and 2 cannot be broadcast
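Attempting such an operation returns an error rather than panicking; a minimal sketch:
let a = Tensor::ones(vec![3, 4])?;
let b = Tensor::ones(vec![2, 4])?;
match a + b {
    Ok(_) => println!("unexpected: shapes were broadcast"),
    Err(e) => eprintln!("broadcast failed as expected: {}", e),
}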
Performance Notes
Operation Fusion
- Operations on the same backend avoid intermediate allocations when possible
- Sequential reductions can be fused into single kernel calls
Memory Layout
- All tensors use row-major (C-style) memory layout
- Reshape operations are zero-copy when layout permits
- Transpose creates new memory layout
Backend-Specific Optimizations
- CPU: Uses Rayon for parallel element-wise operations
- WGPU: Utilizes compute shaders for parallel GPU execution
- CUDA: Uses custom kernels for all operations
Broadcasting Performance
- Zero-copy broadcasting when one tensor has size-1 dimensions
- Explicit memory expansion fallback for complex broadcasting patterns
- GPU backends optimize broadcasting in compute shaders
Backends Overview
Tensor Frame's backend system provides a pluggable architecture for running tensor operations on different computational devices. This allows the same high-level tensor API to transparently utilize CPU cores, integrated GPUs, discrete GPUs, and specialized accelerators.
Available Backends
Backend | Feature Flag | Availability | Best Use Cases |
---|---|---|---|
CPU | cpu (default) | Always | Small tensors, development, fallback |
WGPU | wgpu | Cross-platform GPU | Large tensors, cross-platform deployment |
CUDA | cuda | NVIDIA GPUs | High-performance production workloads |
Backend Selection Strategy
Automatic Selection (Recommended)
By default, Tensor Frame automatically selects the best available backend using this priority order:
- CUDA - Highest performance on NVIDIA hardware
- WGPU - Cross-platform GPU acceleration
- CPU - Universal fallback
use tensor_frame::Tensor;
// Automatically uses best available backend
let tensor = Tensor::zeros(vec![1000, 1000])?;
println!("Using backend: {:?}", tensor.backend_type());
Manual Backend Control
For specific requirements, you can control backend selection:
use tensor_frame::backend::{set_backend_priority, BackendType};
// Force CPU-only execution
let backend = set_backend_priority(vec![BackendType::Cpu]);
// Prefer WGPU over CUDA
let backend = set_backend_priority(vec![
BackendType::Wgpu,
BackendType::Cuda,
BackendType::Cpu
]);
Per-Tensor Backend Conversion
Convert individual tensors between backends:
let cpu_tensor = Tensor::ones(vec![100, 100])?;
// Move to GPU
let gpu_tensor = cpu_tensor.to_backend(BackendType::Wgpu)?;
// Move back to CPU
let back_to_cpu = gpu_tensor.to_backend(BackendType::Cpu)?;
Performance Characteristics
CPU Backend
- Latency: Lowest for small operations (< 1ms)
- Throughput: Limited by CPU cores and memory bandwidth
- Memory: System RAM (typically abundant)
- Parallelism: Thread-level via Rayon
- Overhead: Minimal function call overhead
WGPU Backend
- Latency: Higher initialization cost (~1-10ms)
- Throughput: High for large, parallel operations
- Memory: GPU memory (limited but fast)
- Parallelism: Massive thread-level via compute shaders
- Overhead: GPU command submission and synchronization
CUDA Backend
- Latency: Moderate initialization cost (~0.1-1ms)
- Throughput: Highest for supported operations
- Memory: GPU memory with CUDA optimizations
- Parallelism: Optimal GPU utilization via cuBLAS/cuDNN
- Overhead: Minimal with mature driver stack
When to Use Each Backend
CPU Backend
// Good for:
let small_tensor = Tensor::ones(vec![10, 10])?; // Small tensors
let dev_tensor = Tensor::zeros(vec![100])?; // Development/testing
let scalar_ops = tensor.sum(None)?; // Scalar results
// Avoid for:
// - Large matrix multiplications (> 1000x1000)
// - Batch operations on many tensors
// - Compute-intensive element-wise operations
WGPU Backend
// Good for:
let large_tensor = Tensor::zeros(vec![2048, 2048])?; // Large tensors
let batch_ops = tensors.iter().map(|t| t * 2.0); // Batch operations
let element_wise = (a * b) + c; // Complex element-wise
// Consider for:
// - Cross-platform deployment
// - When CUDA is not available
// - Mixed CPU/GPU workloads
CUDA Backend
// Excellent for:
let huge_tensor = Tensor::zeros(vec![4096, 4096])?; // Very large tensors
let matrix_mul = a.matmul(&b)?; // Matrix operations
let ml_workload = model.forward(input)?; // ML training/inference
// Best when:
// - NVIDIA GPU available
// - Performance is critical
// - Using alongside other CUDA libraries
Cross-Backend Operations
Operations between tensors on different backends automatically handle conversion:
let cpu_a = Tensor::ones(vec![1000])?;
let gpu_b = Tensor::zeros(vec![1000])?.to_backend(BackendType::Wgpu)?;
// Automatically converts to common backend
let result = cpu_a + gpu_b; // Runs on CPU backend
Conversion Rules:
- If backends match, operation runs on that backend
- If backends differ, the operands are converted to the most widely compatible backend
- In practice this means falling back toward the CPU backend (CPU is the most compatible, then WGPU, then CUDA)
Memory Management
Reference Counting
All backends use reference counting for efficient memory management:
let tensor1 = Tensor::ones(vec![1000, 1000])?;
let tensor2 = tensor1.clone(); // O(1) - just increments reference count
// Memory freed automatically when last reference dropped
Cross-Backend Memory
Converting between backends allocates new memory:
let cpu_tensor = Tensor::ones(vec![1000])?; // 4KB CPU memory
let gpu_tensor = cpu_tensor.to_backend(BackendType::Wgpu)?; // +4KB GPU memory
// Both tensors exist independently until dropped
Memory Usage Guidelines
- Development: Use CPU backend to avoid GPU memory pressure
- Production: Convert to GPU early, minimize cross-backend copies
- Mixed workloads: Keep frequently-accessed tensors on CPU
- Large datasets: Stream data through GPU backends
Error Handling
Backend operations can fail for various reasons:
match Tensor::zeros(vec![100000, 100000]) {
Ok(tensor) => println!("Created tensor on {:?}", tensor.backend_type()),
Err(TensorError::BackendError(msg)) => {
eprintln!("Backend error: {}", msg);
// Fallback to smaller size or different backend
}
Err(e) => eprintln!("Other error: {}", e),
}
Common Error Scenarios:
- GPU Out of Memory: Try smaller tensors or CPU backend
- Backend Unavailable: Fallback to CPU backend
- Feature Not Implemented: Some operations only available on certain backends
- Cross-Backend Type Mismatch: Ensure compatible data types
Backend Implementation Status
Operation | CPU | WGPU | CUDA |
---|---|---|---|
Basic arithmetic (+, -, *, /) | ✅ | ✅ | ✅ |
Reductions (sum, mean) | ✅ | ⚠️ | ✅ |
Reshape, transpose | ✅ | ✅ | ✅ |
Broadcasting | ✅ | ✅ | ✅ |
✅ = Fully implemented
❌ = Not yet implemented
⚠️ = Partially implemented
CPU Backend
The CPU backend is the default and most mature backend in Tensor Frame. It provides reliable tensor operations using system memory and CPU cores, with parallelization via the Rayon library.
Features
- Always Available: No additional dependencies required
- Parallel Processing: Multi-threaded operations via Rayon
- Full API Support: All tensor operations implemented
- Memory Efficient: Direct Vec<f32> storage without additional overhead
- Debugging Friendly: Easy inspection with standard debugging tools
Configuration
The CPU backend is enabled by default:
[dependencies]
tensor_frame = "0.0.3-alpha" # CPU backend included
Or explicitly:
[dependencies]
tensor_frame = { version = "0.0.3-alpha", features = ["cpu"] }
Implementation Details
Storage
CPU tensors use standard Rust Vec<f32> for data storage:
pub enum Storage {
Cpu(Vec<f32>), // Direct vector storage
// ...
}
This provides:
- Memory Layout: Contiguous, row-major (C-style) layout
- Access: Direct memory access without marshaling overhead
- Debugging: Easy inspection with standard Rust tools
Parallelization
The CPU backend uses Rayon for data-parallel operations:
// Element-wise operations are parallelized
a.par_iter()
.zip(b.par_iter())
.map(|(a, b)| a + b)
.collect()
Thread Pool: Rayon automatically manages a global thread pool sized to the number of CPU cores.
Granularity: Operations are automatically chunked for optimal parallel efficiency.
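For illustration, here is a self-contained sketch of the same pattern written directly against Rayon; it is not Tensor Frame's actual implementation, just the idea spelled out:
use rayon::prelude::*;

/// Element-wise addition of two equally sized slices, parallelized with Rayon.
fn par_add(a: &[f32], b: &[f32]) -> Vec<f32> {
    assert_eq!(a.len(), b.len());
    a.par_iter()
        .zip(b.par_iter())
        .map(|(x, y)| x + y)
        .collect()
}

fn main() {
    let a = vec![1.0_f32; 1_000_000];
    let b = vec![2.0_f32; 1_000_000];
    let c = par_add(&a, &b);
    assert!(c.iter().all(|&v| v == 3.0));
}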
Performance Characteristics
Strengths
- Low Latency: Minimal overhead for small operations
- Predictable: Performance scales linearly with data size and core count
- Memory Bandwidth: Efficiently utilizes system memory bandwidth
- Cache Friendly: Good locality for sequential operations
Limitations
- Compute Bound: Limited by CPU ALU throughput
- Memory Bound: Large operations limited by RAM bandwidth
- Thread Overhead: Parallel overhead dominates for small tensors
Performance Guidelines
Optimal Use Cases
// Small to medium tensors (< 10K elements)
let small = Tensor::ones(vec![100, 100])?;
// Scalar reductions
let sum = large_tensor.sum(None)?;
// Development and prototyping
let test_tensor = Tensor::from_vec(test_data, shape)?;
Suboptimal Use Cases
// Very large tensor operations
let huge_op = a + b; // Consider GPU for very large tensors
// Repeated large element-wise operations
for _ in 0..1000 {
result = (a.clone() * b.clone())?; // GPU would be faster
}
Memory Management
Allocation
CPU tensors allocate memory directly from the system heap:
let tensor = Tensor::zeros(vec![1000, 1000])?; // Allocates 4MB
Reference Counting
Tensors use Arc<Vec<f32>> internally for efficient cloning:
let tensor1 = Tensor::ones(vec![1000])?;
let tensor2 = tensor1.clone(); // O(1) reference count increment
// Memory shared until one tensor is modified (copy-on-write semantics)
Memory Usage
Monitor memory usage with standard system tools:
# Linux
cat /proc/meminfo
# macOS
vm_stat
# Windows
wmic OS get TotalVisibleMemorySize,FreePhysicalMemory
Debugging and Profiling
Tensor Inspection
CPU tensors are easy to inspect:
let tensor = Tensor::from_vec(vec![1.0, 2.0, 3.0, 4.0], vec![2, 2])?;
// Direct access to underlying data
let data = tensor.to_vec()?;
println!("Raw data: {:?}", data);
// Shape information
println!("Shape: {:?}", tensor.shape().dims());
println!("Elements: {}", tensor.numel());
Performance Profiling
Use standard Rust profiling tools:
// Add timing
use std::time::Instant;
let start = Instant::now();
let result = large_tensor.sum(None)?;
println!("CPU operation took: {:?}", start.elapsed());
For detailed profiling:
# Install flamegraph
cargo install flamegraph
# Profile your application
cargo flamegraph --bin your_app
Thread Analysis
Monitor Rayon thread usage:
// Check thread pool size
println!("Rayon threads: {}", rayon::current_num_threads());
// Custom thread pool
let pool = rayon::ThreadPoolBuilder::new()
.num_threads(4)
.build()?;
pool.install(|| {
// Operations here use 4 threads max
let result = tensor1 + tensor2;
});
Error Handling
CPU backend errors are typically related to memory allocation:
use tensor_frame::{Tensor, TensorError};
match Tensor::zeros(vec![100000, 100000]) {
Ok(tensor) => {
// Success - 40GB allocated
}
Err(TensorError::BackendError(msg)) => {
// Likely out of memory
eprintln!("CPU backend error: {}", msg);
}
Err(e) => {
eprintln!("Other error: {}", e);
}
}
Common Error Conditions:
- Out of Memory: Requesting more memory than available
- Integer Overflow: Tensor dimensions too large for address space
- Thread Panic: Rayon worker thread panics (rare)
Optimization Tips
Memory Layout Optimization
// Prefer contiguous operations
let result = (a + b) * c; // Better than separate operations
// Avoid unnecessary allocations
let result = a.clone() + b; // Creates temporary clone
let result = &a + &b; // Better - uses references
Parallel Operation Tuning
// For very small tensors, disable parallelism
let small_result = small_a + small_b; // Rayon decides automatically
// For custom control
rayon::ThreadPoolBuilder::new()
.num_threads(1) // Force single-threaded
.build_global()?;
Cache Optimization
// Process data in blocks for better cache usage
for chunk in tensor.chunks(cache_friendly_size) {
// Process chunk
}
// Transpose cache-friendly
let transposed = matrix.transpose()?; // May benefit from blocking
Integration with Other Libraries
NumPy Compatibility
// Convert to/from Vec for NumPy interop
let tensor = Tensor::from_vec(numpy_data, shape)?;
let back_to_numpy = tensor.to_vec()?;
ndarray Integration
use ndarray::Array2;
// Convert from ndarray
let nd_array = Array2::from_shape_vec((2, 2), vec![1.0, 2.0, 3.0, 4.0])?;
let tensor = Tensor::from_vec(nd_array.into_raw_vec(), vec![2, 2])?;
// Convert to ndarray
let data = tensor.to_vec()?;
let shape = tensor.shape().dims();
let nd_array = Array2::from_shape_vec((shape[0], shape[1]), data)?;
BLAS Integration
For maximum performance, consider linking with optimized BLAS:
[dependencies]
tensor_frame = "0.0.3-alpha"
blas-src = { version = "0.8", features = ["openblas"] }
This can significantly speed up matrix operations on the CPU backend.
WGPU Backend
The WGPU backend provides cross-platform GPU compute acceleration using the WebGPU standard. It supports Metal, Vulkan, DirectX 12, and OpenGL backends, making it an excellent choice for portable high-performance computing.
Features
- Cross-Platform: Works on Windows, macOS, Linux, iOS, Android, and Web
- Multiple APIs: Supports Vulkan, Metal, DX12, DX11, OpenGL ES, and WebGL
- Compute Shaders: Uses WGSL (WebGPU Shading Language) for parallel operations
- Memory Efficient: GPU buffer management with automatic cleanup
- Future-Proof: Based on the emerging WebGPU standard
Installation
Enable the WGPU backend with the feature flag:
[dependencies]
tensor_frame = { version = "0.0.3-alpha", features = ["wgpu"] }
Additional Dependencies:
- No platform-specific GPU drivers required
- Uses system graphics drivers (Metal, Vulkan, DirectX, OpenGL)
System Requirements
Minimum Requirements
- GPU: Any GPU with compute shader support
- Driver: Up-to-date graphics drivers
- Memory: Sufficient GPU memory for tensor data
Supported Platforms
Platform | Graphics API | Status |
---|---|---|
Windows | DirectX 12, Vulkan | ✅ Full support |
Windows | DirectX 11 | ✅ Fallback support |
macOS | Metal | ✅ Full support |
Linux | Vulkan | ✅ Full support |
Linux | OpenGL ES | ⚠️ Limited support |
iOS | Metal | ✅ Full support |
Android | Vulkan, OpenGL ES | ✅ Full support |
Web | WebGPU, WebGL2 | ⚠️ Experimental |
Implementation Details
Storage
WGPU tensors use GPU buffers for data storage:
pub struct WgpuStorage {
pub buffer: Arc<wgpu::Buffer>, // GPU buffer handle
}
Buffer Properties:
- Location: GPU memory (VRAM)
- Layout: Contiguous, row-major layout
- Usage: Storage buffers with compute shader access
- Synchronization: Automatic via command queue
Compute Shaders
Operations are implemented as WGSL compute shaders loaded from external files in src/shaders/:
- add.wgsl - element-wise addition
- sub.wgsl - element-wise subtraction
- mul.wgsl - element-wise multiplication
- div.wgsl - element-wise division with IEEE 754 compliance
// Example: Element-wise addition shader (add.wgsl)
@group(0) @binding(0) var<storage, read> input_a: array<f32>;
@group(0) @binding(1) var<storage, read> input_b: array<f32>;
@group(0) @binding(2) var<storage, read_write> output: array<f32>;
@compute @workgroup_size(64)
fn main(@builtin(global_invocation_id) global_id: vec3<u32>) {
let index = global_id.x;
if (index >= arrayLength(&input_a)) {
return;
}
output[index] = input_a[index] + input_b[index];
}
Parallelization
- Workgroups: Operations dispatched in parallel workgroups
- Thread Count: Automatically sized based on tensor dimensions
- GPU Utilization: Optimized for high occupancy on modern GPUs
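A rough sketch of the dispatch arithmetic, assuming the 64-thread workgroups used by the shaders above (not Tensor Frame's exact code):
/// Number of workgroups needed to cover `n` elements with 64 threads per workgroup.
fn workgroup_count(n: u64) -> u32 {
    const WORKGROUP_SIZE: u64 = 64;
    ((n + WORKGROUP_SIZE - 1) / WORKGROUP_SIZE) as u32
}

fn main() {
    // A 2048x2048 tensor (4_194_304 elements) needs 65_536 workgroups.
    assert_eq!(workgroup_count(2048 * 2048), 65_536);
}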
Performance Characteristics
Strengths
- Massive Parallelism: Thousands of parallel threads
- High Throughput: Excellent for large tensor operations
- Memory Bandwidth: High GPU memory bandwidth utilization
- Compute Density: Specialized compute units for arithmetic operations
Limitations
- Latency: GPU command submission and synchronization overhead
- Memory Transfer: CPU-GPU data transfers can be expensive
- Limited Precision: Currently only supports f32 operations
- Shader Compilation: First-use compilation overhead
Performance Guidelines
Optimal Use Cases
// Large tensor operations (> 10K elements)
let large = Tensor::zeros(vec![2048, 2048])?;
let result = (large_a * large_b) + large_c;
// Repeated operations on same-sized tensors
for batch in batches {
let output = model.forward(batch)?; // Shader programs cached
}
// Element-wise operations with complex expressions
let result = ((a * b) + c).sqrt(); // Single GPU kernel
Suboptimal Use Cases
// Very small tensors
let small = Tensor::ones(vec![10, 10])?; // GPU overhead dominates
// Frequent CPU-GPU transfers
let gpu_tensor = cpu_tensor.to_backend(BackendType::Wgpu)?;
let back_to_cpu = gpu_tensor.to_vec()?; // Expensive transfers
// Scalar operations
let sum = tensor.sum(None)?; // Result copied back to CPU
Memory Management
GPU Memory Allocation
WGPU automatically manages GPU memory:
let tensor = Tensor::zeros(vec![2048, 2048])?; // Allocates ~16MB GPU memory
Memory Pool: WGPU uses internal memory pools for efficient allocation
Garbage Collection: Buffers automatically freed when last reference dropped
Fragmentation: Large allocations may fail even with sufficient total memory
Memory Transfer Patterns
// Efficient: Create on GPU
let gpu_tensor = Tensor::zeros(vec![1000, 1000])?
.to_backend(BackendType::Wgpu)?;
// Inefficient: Frequent transfers
let result = cpu_data.to_backend(BackendType::Wgpu)?
.sum(None)?
.to_backend(BackendType::Cpu)?; // Multiple transfers
Memory Debugging
Monitor GPU memory usage:
// Check GPU memory limits
let limits = device.limits();
println!("Max buffer size: {} MB", limits.max_buffer_size / (1024*1024));
// Handle out-of-memory errors
match Tensor::zeros(vec![16384, 16384]) {
Ok(tensor) => println!("Allocated 1GB GPU tensor"),
Err(TensorError::BackendError(msg)) if msg.contains("memory") => {
eprintln!("GPU out of memory, trying smaller size");
}
Err(e) => eprintln!("Other error: {}", e),
}
Debugging and Profiling
Shader Debugging
WGPU provides validation and debugging features:
// Enable validation (debug builds)
let instance = wgpu::Instance::new(&wgpu::InstanceDescriptor {
backends: wgpu::Backends::all(),
flags: wgpu::InstanceFlags::DEBUG | wgpu::InstanceFlags::VALIDATION,
..Default::default()
});
Performance Profiling
Use GPU profiling tools:
Windows (DirectX):
- PIX for Windows
- RenderDoc
- Visual Studio Graphics Diagnostics
macOS (Metal):
- Xcode Instruments (GPU Timeline)
- Metal System Trace
Linux (Vulkan):
- RenderDoc
- Vulkan Tools
Custom Timing
use std::time::Instant;
let start = Instant::now();
let result = gpu_tensor_a + gpu_tensor_b;
// Note: GPU operations are asynchronous!
let _data = result.to_vec()?; // Synchronization point
println!("GPU operation took: {:?}", start.elapsed());
Error Handling
WGPU backend errors can occur at multiple levels:
Device Creation Errors
match WgpuBackend::new() {
Ok(backend) => println!("WGPU backend ready"),
Err(TensorError::BackendError(msg)) => {
eprintln!("WGPU initialization failed: {}", msg);
// Fallback to CPU backend
}
Err(e) => eprintln!("Other error: {}", e),
}
Runtime Errors
// Out of GPU memory
let result = Tensor::zeros(vec![100000, 100000]); // May fail
// Shader compilation errors (rare)
let result = custom_operation(tensor); // May fail for invalid shaders
// Device lost (driver reset, etc.)
let result = tensor.sum(None); // May fail if device is lost
Common Error Scenarios:
- Device Not Found: No compatible GPU available
- Out of Memory: GPU memory exhausted
- Driver Issues: Outdated or buggy graphics drivers
- Unsupported Operations: Feature not implemented in WGPU backend
Platform-Specific Notes
Windows
- DirectX 12: Best performance and feature support
- Vulkan: Good alternative if DX12 not available
- DirectX 11: Fallback with limited compute support
macOS
- Metal: Excellent native support and performance
- MoltenVK: Vulkan compatibility layer (not recommended for production)
Linux
- Vulkan: Primary choice with best performance
- OpenGL: Fallback with limited compute features
- Graphics Drivers: Ensure latest Mesa/NVIDIA/AMD drivers
Mobile (iOS/Android)
- iOS: Metal provides excellent mobile GPU performance
- Android: Vulkan on newer devices, OpenGL ES fallback
- Power Management: Be aware of thermal throttling
Web (Experimental)
- WebGPU: Emerging standard with excellent performance potential
- WebGL2: Fallback with compute shader emulation
- Browser Support: Chrome/Edge (flag), Firefox (experimental)
Optimization Tips
Workgroup Size Tuning
// Optimal workgroup sizes depend on GPU architecture
// Current default: 64 threads per workgroup
// Nvidia: 32 (warp size) or 64
// AMD: 64 (wavefront size)
// Intel: 32 or 64
// Mobile: 16 or 32
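A small illustrative helper encoding these heuristics; the vendor matching and the chosen sizes are assumptions, not part of Tensor Frame's API:
/// Pick a workgroup size from a GPU vendor string (illustrative heuristic only).
fn preferred_workgroup_size(vendor: &str) -> u32 {
    let v = vendor.to_ascii_lowercase();
    if v.contains("nvidia") {
        32 // warp size; 64 also works well
    } else if v.contains("amd") {
        64 // wavefront size
    } else if v.contains("intel") {
        32
    } else {
        16 // conservative default for mobile or unknown GPUs
    }
}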
Batch Operations
// Efficient: Batch similar operations
let results: Vec<Tensor> = inputs
.iter()
.map(|input| model.forward(input))
.collect::<Result<Vec<_>>>()?;
// Inefficient: Individual operations
for input in inputs {
let result = model.forward(input)?;
let cpu_result = result.to_vec()?; // Forces synchronization
}
Memory Layout Optimization
// Ensure tensor shapes are GPU-friendly
let aligned_size = (size + 63) & !63; // Align to 64-element boundaries
let tensor = Tensor::zeros(vec![aligned_size, aligned_size])?;
Future Developments
The WGPU backend is actively developed with planned improvements:
- Reduction Operations: Sum, mean, and other reductions on GPU
- Advanced Operations: GPU-optimized tensor operations
- Mixed Precision: f16 and bf16 data type support
- Async Operations: Fully asynchronous GPU command queues
- WebGPU Stability: Production-ready web deployment
CUDA Backend
The CUDA backend provides high-performance tensor operations on NVIDIA GPUs using the CUDA toolkit. It offers the highest performance for supported operations and integrates well with the broader CUDA ecosystem.
Features
- Peak Performance: Optimized kernels for maximum NVIDIA GPU utilization
- Optimized Kernels: Hardware-accelerated tensor operations
- Memory Optimization: Efficient GPU memory management
- Mature Ecosystem: Integration with existing CUDA libraries
- Production Ready: Battle-tested in production environments
Installation
Prerequisites
CUDA Toolkit: Install NVIDIA CUDA Toolkit 11.0 or later
- Download from NVIDIA Developer
- Ensure nvcc is in your PATH
- Verify installation: nvcc --version
Compatible GPU: NVIDIA GPU with compute capability 3.5+
- Check compatibility: nvidia-smi
- Verify compute capability: deviceQuery (CUDA samples)
Cargo Configuration
Enable the CUDA backend:
[dependencies]
tensor_frame = { version = "0.0.3-alpha", features = ["cuda"] }
Build Requirements:
- CUDA Toolkit installed
- NVIDIA GPU drivers
- C++ compiler (MSVC on Windows, GCC/Clang on Linux)
System Requirements
Hardware
- GPU: NVIDIA GPU with compute capability 3.5+
- Memory: Sufficient GPU memory for tensor operations
- PCIe: PCIe 3.0 x16 recommended for optimal memory bandwidth
Software
- CUDA Toolkit: Version 11.0+ (12.0+ recommended)
- Driver: NVIDIA driver supporting your CUDA version
- OS: Linux (preferred), Windows 10+, WSL2
Verified Configurations
GPU Generation | Compute Capability | CUDA Version | Status |
---|---|---|---|
Maxwell (GTX 900) | 5.0, 5.2 | 11.0+ | ✅ Supported |
Pascal (GTX 10x0) | 6.0, 6.1 | 11.0+ | ✅ Fully supported |
Volta (V100) | 7.0 | 11.0+ | ✅ Optimized |
Turing (RTX 20x0) | 7.5 | 11.0+ | ✅ Optimized |
Ampere (RTX 30x0) | 8.0, 8.6 | 11.2+ | ✅ Optimal |
Ada (RTX 40x0) | 8.9 | 12.0+ | ✅ Latest features |
Implementation Details
Storage
CUDA tensors use device memory pointers:
pub struct CudaStorage {
pub ptr: *mut f32, // Raw CUDA device pointer
pub len: usize, // Buffer length in elements
}
Memory Properties:
- Location: GPU global memory (VRAM)
- Layout: Contiguous, row-major layout
- Alignment: 256-byte aligned for optimal coalescing
- Synchronization: Explicit via CUDA streams
Kernel Implementation
Operations use optimized CUDA kernels:
// Element-wise addition kernel
__global__ void add_kernel(const float* a, const float* b, float* c, int n) {
int idx = blockIdx.x * blockDim.x + threadIdx.x;
if (idx < n) {
c[idx] = a[idx] + b[idx];
}
}
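On the host side, such a kernel is launched with one thread per element; a sketch of the usual launch-configuration arithmetic (illustrative, not Tensor Frame's exact code):
/// Grid/block sizing for a 1D element-wise kernel: one thread per element.
fn launch_config(n: usize) -> (u32, u32) {
    const BLOCK_SIZE: usize = 256;                  // threads per block
    let blocks = (n + BLOCK_SIZE - 1) / BLOCK_SIZE; // ceil(n / BLOCK_SIZE)
    (blocks as u32, BLOCK_SIZE as u32)
}

fn main() {
    // 1M elements -> 3907 blocks of 256 threads; the kernel's `if (idx < n)`
    // bounds check handles the final, partially filled block.
    assert_eq!(launch_config(1_000_000), (3907, 256));
}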
Performance Characteristics
Strengths
- Compute Throughput: Maximum FP32/FP16 throughput on NVIDIA hardware
- Memory Bandwidth: Optimal utilization of GPU memory bandwidth
- Kernel Optimization: Hand-tuned kernels for each operation
- Library Integration: Designed for future integration with cuDNN, etc.
Performance Metrics
Example performance on RTX 4090:
Operation | Tensor Size | CPU (32 cores) | CUDA | Speedup |
---|---|---|---|---|
Element-wise Add | 1M elements | 2.1 ms | 0.18 ms | 12x |
Matrix Multiply | 2048x2048 | 450 ms | 8.2 ms | 55x |
Reduction Sum | 16M elements | 15 ms | 0.52 ms | 29x |
Optimization Guidelines
Optimal Use Cases
// Large tensor operations
let a = Tensor::zeros(vec![4096, 4096])?;
let b = Tensor::zeros(vec![4096, 4096])?;
let c = (a * b) + 1.0; // Excellent GPU performance
// Batch operations
for batch in large_dataset {
let result = model.forward(batch)?; // Amortizes GPU overhead
}
// Memory-bound operations
let result = ((a * b) + c) / d; // GPU memory bandwidth utilized
Suboptimal Use Cases
// Very small tensors
let tiny = Tensor::ones(vec![8, 8])?; // Kernel launch overhead dominates
// Frequent host-device transfers
let gpu_result = cpu_tensor.to_backend(BackendType::Cuda)?;
let back_to_cpu = gpu_result.to_vec()?; // PCIe bandwidth bottleneck
// Scalar reductions with immediate use
let sum = tensor.sum(None)?.to_vec()?; // Forces synchronization
Memory Management
Device Memory Allocation
CUDA tensors allocate GPU memory directly:
// Allocates 64MB of GPU memory
let large_tensor = Tensor::zeros(vec![4096, 4096])?
.to_backend(BackendType::Cuda)?;
Memory Pool Management
The backend uses a memory pool for efficient allocation:
// Pool reduces allocation overhead
let tensors: Vec<Tensor> = (0..100)
.map(|_| Tensor::zeros(vec![1024, 1024]))
.collect::<Result<Vec<_>>>()?;
Memory Transfer Optimization
// Efficient: Batch transfers
let gpu_tensors = cpu_tensors
.into_iter()
.map(|t| t.to_backend(BackendType::Cuda))
.collect::<Result<Vec<_>>>()?;
// Inefficient: Individual transfers
for cpu_tensor in cpu_tensors {
let gpu_tensor = cpu_tensor.to_backend(BackendType::Cuda)?;
process(gpu_tensor)?;
}
Memory Debugging
Monitor GPU memory usage:
# Check GPU memory
nvidia-smi
# Continuous monitoring
watch -n 1 nvidia-smi
// Check available memory
let (free, total) = cuda::memory_info()?;
println!("GPU memory: {}/{} MB", free / 1024 / 1024, total / 1024 / 1024);
// Handle out-of-memory
match Tensor::zeros(vec![16384, 16384]).and_then(|t| t.to_backend(BackendType::Cuda)) {
Ok(tensor) => println!("Allocated 1GB GPU tensor"),
Err(TensorError::BackendError(msg)) if msg.contains("memory") => {
eprintln!("GPU OOM, trying smaller allocation");
}
Err(e) => eprintln!("CUDA error: {}", e),
}
Error Handling
CUDA operations can fail for various hardware and software reasons:
Runtime Errors
use tensor_frame::{Tensor, TensorError};
match tensor_operation() {
Ok(result) => process(result),
Err(TensorError::BackendError(msg)) => {
if msg.contains("out of memory") {
// GPU memory exhausted
fallback_to_cpu()?;
} else if msg.contains("invalid device") {
// GPU not available or driver issue
retry_with_cpu_backend()?;
} else {
// Other CUDA error
eprintln!("CUDA error: {}", msg);
}
}
Err(e) => eprintln!("Unexpected error: {}", e),
}
Common Error Scenarios
- GPU Out of Memory: Tensor too large for available GPU memory
- Invalid Device: GPU not found or not compatible
- Driver Mismatch: CUDA driver version incompatible
- Kernel Launch Failed: Invalid kernel parameters or GPU fault
- Memory Access Violation: Invalid GPU memory access
Error Recovery
// Graceful fallback strategy
fn robust_tensor_operation(tensor: Tensor) -> Result<Tensor> {
// Try CUDA first
if let Ok(cuda_tensor) = tensor.to_backend(BackendType::Cuda) {
match cuda_operation(cuda_tensor) {
Ok(result) => return Ok(result),
Err(_) => {
// CUDA failed, fall back to CPU
eprintln!("CUDA operation failed, falling back to CPU");
}
}
}
// CPU fallback
cpu_operation(tensor.to_backend(BackendType::Cpu)?)
}
Debugging and Profiling
CUDA Debugging Tools
NVIDIA Nsight Systems: System-wide performance analysis
nsys profile --stats=true ./your_app
NVIDIA Nsight Compute: Kernel-level profiling
ncu --metrics sm__throughput.avg.pct_of_peak_sustained_elapsed ./your_app
cuda-memcheck: Memory error detection
cuda-memcheck ./your_app
Performance Analysis
// Wall-clock timing of GPU operations
use std::time::Instant;
let start = Instant::now();
let result = gpu_tensor_a.matmul(&gpu_tensor_b)?;
// Note: matmul is asynchronous!
let _sync = result.to_vec()?; // Force synchronization
let elapsed = start.elapsed();
println!("Matrix multiplication took: {:?}", elapsed);
Memory Leak Detection
// Monitor for memory leaks in long-running applications
fn check_memory_usage() -> Result<()> {
let (free_before, total) = cuda::memory_info()?;
// Perform operations
{
let tensor = Tensor::zeros(vec![1000, 1000])?.to_backend(BackendType::Cuda)?;
let result = expensive_operation(tensor)?;
} // tensor should be freed here
let (free_after, _) = cuda::memory_info()?;
if free_after < free_before {
eprintln!("Potential memory leak detected!");
eprintln!("Memory delta: {} MB", (free_before - free_after) / 1024 / 1024);
}
Ok(())
}
Production Deployment
Docker Configuration
# Use NVIDIA CUDA base image
FROM nvidia/cuda:12.0-devel-ubuntu20.04
# Install Rust
RUN curl --proto '=https' --tlsv1.2 -sSf https://sh.rustup.rs | sh -s -- -y
ENV PATH="/root/.cargo/bin:${PATH}"
# Copy and build your application
COPY . /app
WORKDIR /app
RUN cargo build --release --features cuda
# Runtime with CUDA
FROM nvidia/cuda:12.0-runtime-ubuntu20.04
COPY --from=0 /app/target/release/your_app /usr/local/bin/
CMD ["your_app"]
Kubernetes Deployment
apiVersion: v1
kind: Pod
spec:
containers:
- name: tensor-app
image: your-app:latest
resources:
limits:
nvidia.com/gpu: 1
env:
- name: CUDA_VISIBLE_DEVICES
value: "0"
Environment Variables
# Limit GPU memory growth
export CUDA_MEMORY_POOL_TYPE=pool
# Enable GPU timing
export CUDA_LAUNCH_BLOCKING=1
# Select specific GPU
export CUDA_VISIBLE_DEVICES=0
Optimization Best Practices
Memory Access Patterns
// Coalesced memory access (efficient)
let result = tensor_a + tensor_b; // Sequential element access
// Strided access (less efficient)
let transposed = tensor.transpose()?; // May require memory reshape
Kernel Fusion
// Fused operations (single kernel launch)
let result = ((a * b) + c).relu(); // Ideally fused into one kernel
// Separate operations (multiple kernel launches)
let temp1 = a * b;
let temp2 = temp1 + c;
let result = temp2.relu(); // Three separate kernels
Stream Management
// Future: Async operations with CUDA streams
// Currently synchronous, but optimizations planned
let stream_a = cuda::create_stream()?;
let stream_b = cuda::create_stream()?;
// Parallel execution on different streams
let result_a = tensor_a.sum(None).execute_on(stream_a)?;
let result_b = tensor_b.mean(None).execute_on(stream_b)?;
Integration with CUDA Ecosystem
cuDNN (Future)
Planned integration for neural network operations:
// Future: Convolution operations
let output = input.conv2d(&kernel, stride, padding)?;
NCCL (Future)
Multi-GPU communication for distributed computing:
// Future: Multi-GPU operations
let distributed_result = tensor.all_reduce_sum()?;
Examples and Tutorials
This section provides practical examples and tutorials for using Tensor Frame effectively. Each example is designed to demonstrate specific features and common usage patterns.
Getting Started Examples
Perfect for newcomers to Tensor Frame:
- Basic Operations - Tensor creation, arithmetic, and basic manipulation
- Broadcasting - Understanding automatic shape broadcasting
- Custom Backends - Working with different computational backends
Example Categories
Fundamental Operations
Learn the core tensor operations that form the foundation of all computational work:
// Tensor creation
let zeros = Tensor::zeros(vec![3, 4])?;
let ones = Tensor::ones(vec![2, 2])?;
let data = Tensor::from_vec(vec![1.0, 2.0, 3.0, 4.0], vec![2, 2])?;
// Basic arithmetic
let sum = a + b;
let product = a * b;
let result = (a * 2.0) + b;
Shape Manipulation
Master tensor reshaping and dimension manipulation:
// Reshaping and transposition
let reshaped = tensor.reshape(vec![4, 3])?;
let transposed = matrix.transpose()?;
// Dimension manipulation
let squeezed = tensor.squeeze(None)?;
let unsqueezed = squeezed.unsqueeze(1)?;
Backend Optimization
Learn when and how to use different computational backends:
// Automatic backend selection
let tensor = Tensor::zeros(vec![1000, 1000])?;
// Manual backend control
let gpu_tensor = tensor.to_backend(BackendType::Wgpu)?;
let cuda_tensor = tensor.to_backend(BackendType::Cuda)?;
Running Examples
All examples are located in the examples/ directory of the repository:
# Run basic operations example
cargo run --example basic_operations
# Run with specific backend
cargo run --example basic_operations --features wgpu
cargo run --example basic_operations --features cuda
# Run with all features
cargo run --example basic_operations --features "wgpu,cuda"
Example Structure
Each example follows a consistent structure:
- Setup: Import necessary modules and create test data
- Demonstration: Show the specific feature in action
- Explanation: Detailed comments explaining what's happening
- Performance Notes: Tips for optimal usage
- Error Handling: Proper error handling patterns
Performance Benchmarking
Many examples include performance comparisons:
use std::time::Instant;
// CPU benchmark
let start = Instant::now();
let cpu_result = &cpu_tensor + &cpu_other;
let cpu_time = start.elapsed();
// GPU benchmark
let start = Instant::now();
let gpu_result = &gpu_tensor + &gpu_other;
let _sync = gpu_result.to_vec()?; // Force synchronization
let gpu_time = start.elapsed();
println!("CPU: {:?}, GPU: {:?}, Speedup: {:.1}x",
cpu_time, gpu_time, cpu_time.as_secs_f64() / gpu_time.as_secs_f64());
Interactive Examples
Some examples are designed for interactive exploration:
# Interactive tensor exploration
cargo run --example interactive
# Performance testing with different sizes
cargo run --example benchmark -- --size 1000
cargo run --example benchmark -- --size 2000 --backend cuda
Common Patterns
Error Handling Pattern
use tensor_frame::{Tensor, Result, TensorError};
fn robust_operation() -> Result<Tensor> {
let tensor = Tensor::zeros(vec![1000, 1000])?;
// Try GPU backend first
match tensor.to_backend(BackendType::Wgpu) {
Ok(gpu_tensor) => {
// GPU operations here
Ok(expensive_gpu_operation(gpu_tensor)?)
}
Err(TensorError::BackendError(_)) => {
// Fallback to CPU
println!("GPU not available, using CPU");
Ok(cpu_operation(tensor)?)
}
Err(e) => Err(e),
}
}
Memory Management Pattern
fn memory_efficient_batch_processing(batches: Vec<Vec<f32>>) -> Result<Vec<Tensor>> {
let backend = BackendType::Wgpu; // Choose once
batches
.into_iter()
.map(|batch| {
let len = batch.len(); // capture the length before `batch` is moved
let tensor = Tensor::from_vec(batch, vec![len])?;
tensor.to_backend(backend) // Convert once per batch
})
.collect()
}
Broadcasting Pattern
fn demonstrate_broadcasting() -> Result<()> {
// Scalar broadcast
let tensor = Tensor::ones(vec![3, 4])?;
let scaled = tensor * 2.0; // Scalar broadcasts to all elements
// Vector broadcast
let matrix = Tensor::ones(vec![3, 4])?;
let vector = Tensor::ones(vec![4])?; // Shape: [4]
let result = matrix + vector; // Broadcasts to [3, 4]
// Matrix broadcast
let a = Tensor::ones(vec![3, 1])?; // Shape: [3, 1]
let b = Tensor::ones(vec![1, 4])?; // Shape: [1, 4]
let result = a + b; // Result: [3, 4]
Ok(())
}
Advanced Examples
For users comfortable with the basics:
Custom Backend Selection
fn adaptive_backend_selection(tensor_size: usize) -> BackendType {
match tensor_size {
0..=1000 => BackendType::Cpu, // Small: CPU overhead minimal
1001..=100000 => BackendType::Wgpu, // Medium: GPU beneficial
_ => BackendType::Cuda, // Large: Maximum performance
}
}
Batched Operations
fn process_batch_efficiently(inputs: Vec<Tensor>) -> Result<Vec<Tensor>> {
// Convert all inputs to same backend
let backend = BackendType::Wgpu;
let gpu_inputs: Result<Vec<_>> = inputs
.into_iter()
.map(|t| t.to_backend(backend))
.collect();
// Process on GPU
let gpu_outputs: Result<Vec<_>> = gpu_inputs?
.into_iter()
.map(|input| expensive_operation(input))
.collect();
gpu_outputs
}
Troubleshooting Common Issues
Performance Problems
// Problem: Slow operations on small tensors
let small = Tensor::ones(vec![10, 10])?;
let slow_result = small.to_backend(BackendType::Wgpu)?; // GPU overhead
// Solution: Use CPU for small tensors
let fast_result = small; // Stay on CPU backend
Memory Issues
// Problem: GPU out of memory
match Tensor::zeros(vec![10000, 10000]) {
Err(TensorError::BackendError(msg)) if msg.contains("memory") => {
// Solution: Use smaller chunks or CPU backend
let chunks = create_smaller_chunks()?;
process_chunks_individually(chunks)?;
}
Ok(tensor) => process_large_tensor(tensor)?,
Err(e) => return Err(e),
}
Backend Compatibility
// Problem: Operation not supported on backend
let result = match tensor.backend_type() {
BackendType::Wgpu => {
// Some operations not yet implemented on WGPU
tensor.to_backend(BackendType::Cpu)?.complex_operation()?
}
_ => tensor.complex_operation()?,
};
Contributing Examples
We welcome contributions of new examples! Please follow these guidelines:
- Clear Purpose: Each example should demonstrate a specific concept
- Complete Code: Include all necessary imports and error handling
- Documentation: Add detailed comments explaining the concepts
- Performance Notes: Include timing and backend recommendations
- Error Handling: Show proper error handling patterns
See the Contributing Guide for more details on submitting examples.
Basic Operations
This example demonstrates the fundamental tensor operations in Tensor Frame. It covers tensor creation, basic arithmetic, shape manipulation, and data access patterns.
Complete Example
use tensor_frame::{Tensor, Result, TensorOps};
use std::time::Instant;
fn main() -> Result<()> {
println!("=== Tensor Frame Basic Operations ===\n");
// 1. Tensor Creation
tensor_creation_examples()?;
// 2. Basic Arithmetic
arithmetic_examples()?;
// 3. Shape Manipulation
shape_manipulation_examples()?;
// 4. Data Access
data_access_examples()?;
// 5. Performance Comparison
performance_comparison()?;
Ok(())
}
/// Demonstrates various ways to create tensors
fn tensor_creation_examples() -> Result<()> {
println!("=== Tensor Creation ===");
// Create tensor filled with zeros
let zeros = Tensor::zeros(vec![2, 3])?;
println!("Zeros tensor (2x3):\n{}\n", zeros);
// Create tensor filled with ones
let ones = Tensor::ones(vec![3, 2])?;
println!("Ones tensor (3x2):\n{}\n", ones);
// Create tensor from existing data
let data = vec![1.0, 2.0, 3.0, 4.0, 5.0, 6.0];
let from_data = Tensor::from_vec(data, vec![2, 3])?;
println!("From data (2x3):\n{}\n", from_data);
// Check tensor properties
println!("Tensor properties:");
println!(" Shape: {:?}", from_data.shape().dims());
println!(" Number of elements: {}", from_data.numel());
println!(" Data type: {:?}", from_data.dtype());
println!(" Backend: {:?}\n", from_data.backend_type());
Ok(())
}
/// Demonstrates basic arithmetic operations
fn arithmetic_examples() -> Result<()> {
println!("=== Arithmetic Operations ===");
// Create test tensors
let a = Tensor::from_vec(vec![1.0, 2.0, 3.0, 4.0], vec![2, 2])?;
let b = Tensor::from_vec(vec![5.0, 6.0, 7.0, 8.0], vec![2, 2])?;
println!("Tensor A:\n{}\n", a);
println!("Tensor B:\n{}\n", b);
// Element-wise addition
let sum = &a + &b; // Use references to avoid moving tensors
println!("A + B:\n{}\n", sum);
// Element-wise subtraction
let diff = &a - &b;
println!("A - B:\n{}\n", diff);
// Element-wise multiplication
let product = &a * &b;
println!("A * B (element-wise):\n{}\n", product);
// Element-wise division
let quotient = &a / &b;
println!("A / B:\n{}\n", quotient);
// Chained operations
let complex = ((&a * 2.0) + &b) / 3.0;
println!("(A * 2 + B) / 3:\n{}\n", complex);
Ok(())
}
/// Demonstrates shape manipulation operations
fn shape_manipulation_examples() -> Result<()> {
println!("=== Shape Manipulation ===");
// Create a tensor to manipulate
let tensor = Tensor::from_vec(
vec![1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0],
vec![2, 4]
)?;
println!("Original tensor (2x4):\n{}\n", tensor);
// Reshape to different dimensions
let reshaped = tensor.reshape(vec![4, 2])?;
println!("Reshaped to (4x2):\n{}\n", reshaped);
// Reshape to 1D
let flattened = tensor.reshape(vec![8])?;
println!("Flattened to (8,):\n{}\n", flattened);
// Transpose (2D only)
let matrix = Tensor::from_vec(vec![1.0, 2.0, 3.0, 4.0], vec![2, 2])?;
let transposed = matrix.transpose()?;
println!("Original matrix:\n{}\n", matrix);
println!("Transposed matrix:\n{}\n", transposed);
// Squeeze and unsqueeze
let with_ones = Tensor::ones(vec![1, 3, 1])?;
println!("Tensor with size-1 dimensions (1x3x1):\n{}\n", with_ones);
let squeezed = with_ones.squeeze(None)?;
println!("Squeezed (removes all size-1 dims):\n{}\n", squeezed);
let unsqueezed = squeezed.unsqueeze(0)?;
println!("Unsqueezed at dimension 0:\n{}\n", unsqueezed);
Ok(())
}
/// Demonstrates data access patterns
fn data_access_examples() -> Result<()> {
println!("=== Data Access ===");
let tensor = Tensor::from_vec(vec![1.0, 2.0, 3.0, 4.0], vec![2, 2])?;
println!("Tensor:\n{}\n", tensor);
// Convert to Vec for external use
let data = tensor.to_vec()?;
println!("As Vec<f32>: {:?}\n", data);
// Reduction operations
let sum_all = tensor.sum(None)?;
println!("Sum of all elements: {}\n", sum_all);
let mean_all = tensor.mean(None)?;
println!("Mean of all elements: {}\n", mean_all);
// Axis-specific reductions
let row_sums = tensor.sum(Some(1))?; // Sum along columns (axis 1)
println!("Row sums (sum along axis 1): {}\n", row_sums);
let col_sums = tensor.sum(Some(0))?; // Sum along rows (axis 0)
println!("Column sums (sum along axis 0): {}\n", col_sums);
Ok(())
}
/// Demonstrates performance characteristics
fn performance_comparison() -> Result<()> {
println!("=== Performance Comparison ===");
// Small tensor operations (CPU should be faster)
let small_a = Tensor::ones(vec![100, 100])?;
let small_b = Tensor::ones(vec![100, 100])?;
let start = Instant::now();
let result = &small_a + &small_b;
let small_time = start.elapsed();
println!("Small tensor (100x100) addition: {:?}", small_time);
// Large tensor operations (GPU might be faster if available)
let large_a = Tensor::ones(vec![1000, 1000])?;
let large_b = Tensor::ones(vec![1000, 1000])?;
let start = Instant::now();
let result = &large_a + &large_b;
let large_time = start.elapsed();
println!("Large tensor (1000x1000) addition: {:?}", large_time);
// Show current backend
println!("Current backend: {:?}", result.backend_type());
// Demonstrate backend conversion (if other backends available)
#[cfg(feature = "wgpu")]
{
println!("\n--- WGPU Backend Comparison ---");
let start = Instant::now();
let wgpu_a = large_a.to_backend(tensor_frame::BackendType::Wgpu)?;
let wgpu_b = large_b.to_backend(tensor_frame::BackendType::Wgpu)?;
let conversion_time = start.elapsed();
let start = Instant::now();
let wgpu_result = &wgpu_a + &wgpu_b;
let _sync = wgpu_result.to_vec()?; // Force synchronization
let wgpu_time = start.elapsed();
println!("WGPU conversion time: {:?}", conversion_time);
println!("WGPU computation time: {:?}", wgpu_time);
println!("Total WGPU time: {:?}", conversion_time + wgpu_time);
}
Ok(())
}
/// Advanced patterns demonstration
fn advanced_patterns() -> Result<()> {
println!("=== Advanced Patterns ===");
// Broadcasting example
let matrix = Tensor::ones(vec![3, 4])?; // Shape: [3, 4]
let vector = Tensor::ones(vec![4])?; // Shape: [4]
let broadcasted = &matrix + &vector; // Result: [3, 4]
println!("Matrix (3x4):\n{}\n", matrix);
println!("Vector (4,):\n{}\n", vector);
println!("Matrix + Vector (broadcasted):\n{}\n", broadcasted);
// Complex broadcasting
let a = Tensor::ones(vec![2, 1, 3])?; // Shape: [2, 1, 3]
let b = Tensor::ones(vec![1, 4, 1])?; // Shape: [1, 4, 1]
let complex_broadcast = &a + &b; // Result: [2, 4, 3]
println!("Complex broadcasting:");
println!("A shape: {:?}", a.shape().dims());
println!("B shape: {:?}", b.shape().dims());
println!("Result shape: {:?}", complex_broadcast.shape().dims());
// Method chaining
let result = Tensor::ones(vec![2, 3])?
.reshape(vec![3, 2])?
.transpose()?;
println!("Method chaining result:\n{}\n", result);
Ok(())
}
/// Error handling examples
fn error_handling_examples() -> Result<()> {
println!("=== Error Handling ===");
// Shape mismatch error
let a = Tensor::ones(vec![2, 3])?;
let b = Tensor::ones(vec![3, 2])?;
match &a + &b {
Ok(result) => println!("Addition succeeded: {}", result),
Err(e) => println!("Expected error - shape mismatch: {}", e),
}
// Invalid reshape error
let tensor = Tensor::ones(vec![2, 3])?; // 6 elements
match tensor.reshape(vec![2, 2]) { // 4 elements - invalid!
Ok(result) => println!("Reshape succeeded: {}", result),
Err(e) => println!("Expected error - invalid reshape: {}", e),
}
// Out of bounds dimension error
match tensor.squeeze(Some(5)) { // Dimension 5 doesn't exist
Ok(result) => println!("Squeeze succeeded: {}", result),
Err(e) => println!("Expected error - invalid dimension: {}", e),
}
Ok(())
}
Key Concepts Demonstrated
1. Tensor Creation
Three primary ways to create tensors:
- Tensor::zeros(shape) - Creates a tensor filled with zeros
- Tensor::ones(shape) - Creates a tensor filled with ones
- Tensor::from_vec(data, shape) - Creates a tensor from existing data
2. Reference vs. Owned Operations
// Moves tensors (can only use once)
let result = a + b;
// Uses references (can reuse tensors)
let result = &a + &b;
3. Shape Broadcasting
Tensor Frame automatically broadcasts compatible shapes:
let matrix = Tensor::ones(vec![3, 4])?; // [3, 4]
let vector = Tensor::ones(vec![4])?; // [4] broadcasts to [1, 4]
let result = matrix + vector; // Result: [3, 4]
4. Method Chaining
Operations can be chained for concise code:
let result = tensor
.reshape(vec![4, 2])?
.transpose()?
.squeeze(None)?;
5. Error Handling
All operations return Result<T> for proper error handling:
match risky_operation() {
Ok(tensor) => process_tensor(tensor),
Err(TensorError::ShapeMismatch { expected, got }) => {
eprintln!("Shape error: expected {:?}, got {:?}", expected, got);
}
Err(e) => eprintln!("Other error: {}", e),
}
Performance Tips
- Use References: Use &a + &b instead of a + b to avoid unnecessary clones
- Batch Operations: Combine operations when possible: (a * 2.0) + b vs separate operations
- Choose Right Backend: CPU for small tensors, GPU for large operations
- Avoid Frequent Conversions: Stay on one backend when possible
Common Pitfalls
- Shape Mismatches: Ensure compatible shapes for operations
- Invalid Reshapes: The new shape must have the same total number of elements (see the sketch after this list)
- Backend Overhead: GPU operations have overhead for small tensors
- Memory Usage: Large tensors consume significant memory
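For the reshape pitfall in particular, a small sketch can verify the element count up front. checked_reshape is a hypothetical helper, not part of the library; it only uses the Tensor API shown above:
use tensor_frame::{Tensor, Result};
// Hypothetical helper: warn early when the requested shape cannot hold the data.
fn checked_reshape(tensor: &Tensor, new_shape: Vec<usize>) -> Result<Tensor> {
    let new_numel: usize = new_shape.iter().product();
    if new_numel != tensor.numel() {
        eprintln!(
            "reshape will fail: {} elements cannot form shape {:?}",
            tensor.numel(), new_shape
        );
    }
    // The library still performs its own validation and returns an error on mismatch.
    tensor.reshape(new_shape)
}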
Next Steps
After mastering basic operations, explore:
- Broadcasting Examples - Advanced broadcasting patterns
- Backend Selection - Optimizing backend usage
- Performance Guide - Advanced performance optimization
Broadcasting Examples
Broadcasting is one of the most powerful features in Tensor Frame, allowing operations between tensors of different shapes. This guide provides comprehensive examples of broadcasting patterns and best practices.
Broadcasting Rules
Tensor Frame follows NumPy/PyTorch broadcasting rules:
- Alignment: Shapes are compared element-wise from the trailing dimension
- Size 1 Expansion: Dimensions of size 1 are expanded to match
- Missing Dimensions: Missing leading dimensions are treated as size 1
- Compatibility: Dimensions must be either equal, or one must be 1
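To make these rules concrete, here is a small standalone sketch in plain Rust (not part of the Tensor Frame API) that computes the broadcast result shape, or None when the shapes are incompatible:
// Compute the broadcast shape of two dimension lists following the rules above.
fn broadcast_shape(a: &[usize], b: &[usize]) -> Option<Vec<usize>> {
    let len = a.len().max(b.len());
    let mut out = vec![0; len];
    for i in 0..len {
        // Align from the trailing dimension; missing leading dimensions count as 1.
        let ad = if i < a.len() { a[a.len() - 1 - i] } else { 1 };
        let bd = if i < b.len() { b[b.len() - 1 - i] } else { 1 };
        out[len - 1 - i] = if ad == bd || bd == 1 {
            ad
        } else if ad == 1 {
            bd
        } else {
            return None; // neither equal nor 1: incompatible
        };
    }
    Some(out)
}
// Example: [2, 1, 3] and [4, 1] broadcast to [2, 4, 3]:
// assert_eq!(broadcast_shape(&[2, 1, 3], &[4, 1]), Some(vec![2, 4, 3]));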
Basic Broadcasting Examples
Scalar Broadcasting
use tensor_frame::{Tensor, Result};
fn scalar_broadcasting() -> Result<()> {
// Create a base tensor
let tensor = Tensor::from_vec(vec![2.0, 4.0, 6.0, 8.0], vec![2, 2])?;
println!("Original tensor:\n{}\n", tensor);
// Scalar tensor for broadcasting
let scalar = Tensor::from_vec(vec![2.0], vec![])?;
// All operations support broadcasting
let add_result = (tensor.clone() + scalar.clone())?;
println!("Tensor + 2.0:\n{}\n", add_result);
let sub_result = (tensor.clone() - scalar.clone())?;
println!("Tensor - 2.0:\n{}\n", sub_result);
let mul_result = (tensor.clone() * scalar.clone())?;
println!("Tensor * 2.0:\n{}\n", mul_result);
let div_result = (tensor.clone() / scalar.clone())?;
println!("Tensor / 2.0:\n{}\n", div_result);
Ok(())
}
Vector Broadcasting
fn vector_broadcasting() -> Result<()> {
// Matrix-vector operations
let matrix = Tensor::from_vec(
vec![1.0, 2.0, 3.0, 4.0, 5.0, 6.0],
vec![2, 3]
)?;
let vector = Tensor::from_vec(vec![10.0, 20.0, 30.0], vec![3])?;
println!("Matrix (2x3):\n{}\n", matrix);
println!("Vector (3,):\n{}\n", vector);
// All arithmetic operations support broadcasting
let add_result = (matrix.clone() + vector.clone())?;
println!("Matrix + Vector:\n{}\n", add_result);
let mul_result = (matrix.clone() * vector.clone())?;
println!("Matrix * Vector (element-wise):\n{}\n", mul_result);
// Column vector broadcasting
let col_vector = Tensor::from_vec(vec![100.0, 200.0], vec![2, 1])?;
let col_add = (matrix.clone() + col_vector.clone())?;
let col_sub = (matrix.clone() - col_vector.clone())?;
println!("Matrix + Column Vector (2x1):\n{}\n", col_add);
println!("Matrix - Column Vector (2x1):\n{}\n", col_sub);
// Complex broadcasting example
let a = Tensor::from_vec(vec![10.0, 20.0], vec![2, 1])?;
let b = Tensor::from_vec(vec![1.0, 2.0, 3.0], vec![1, 3])?;
let complex_result = (a / b)?; // Broadcasting: [2,1] / [1,3] -> [2,3]
println!("Complex broadcasting [2,1] / [1,3]:\n{}\n", complex_result);
Ok(())
}
Advanced Broadcasting Patterns
Multi-dimensional Broadcasting
fn multidimensional_broadcasting() -> Result<()> {
// 3D tensor broadcasting
let tensor_3d = Tensor::ones(vec![2, 3, 4])?; // Shape: [2, 3, 4]
let tensor_2d = Tensor::ones(vec![3, 4])?; // Shape: [3, 4]
let tensor_1d = Tensor::ones(vec![4])?; // Shape: [4]
println!("3D tensor shape: {:?}", tensor_3d.shape().dims());
println!("2D tensor shape: {:?}", tensor_2d.shape().dims());
println!("1D tensor shape: {:?}", tensor_1d.shape().dims());
// 3D + 2D broadcasting: [2,3,4] + [3,4] -> [2,3,4]
let result_3d_2d = &tensor_3d + &tensor_2d;
println!("3D + 2D result shape: {:?}", result_3d_2d.shape().dims());
// 3D + 1D broadcasting: [2,3,4] + [4] -> [2,3,4]
let result_3d_1d = &tensor_3d + &tensor_1d;
println!("3D + 1D result shape: {:?}", result_3d_1d.shape().dims());
// Complex multi-dimensional broadcasting
let a = Tensor::ones(vec![1, 3, 1])?; // Shape: [1, 3, 1]
let b = Tensor::ones(vec![2, 1, 4])?; // Shape: [2, 1, 4]
let complex_result = &a + &b; // Result: [2, 3, 4]
println!("Complex broadcasting:");
println!(" A shape: {:?}", a.shape().dims());
println!(" B shape: {:?}", b.shape().dims());
println!(" Result shape: {:?}", complex_result.shape().dims());
Ok(())
}
Broadcasting with Size-1 Dimensions
fn size_one_broadcasting() -> Result<()> {
// Different ways to create broadcastable tensors
let base = Tensor::from_vec(
vec![1.0, 2.0, 3.0, 4.0, 5.0, 6.0],
vec![2, 3]
)?;
// Row broadcasting (1 x N)
let row_broadcast = Tensor::from_vec(vec![10.0, 20.0, 30.0], vec![1, 3])?;
let row_result = &base + &row_broadcast;
println!("Row broadcasting [2,3] + [1,3]:\n{}\n", row_result);
// Column broadcasting (N x 1)
let col_broadcast = Tensor::from_vec(vec![100.0, 200.0], vec![2, 1])?;
let col_result = &base + &col_broadcast;
println!("Column broadcasting [2,3] + [2,1]:\n{}\n", col_result);
// Both dimensions broadcast (1 x 1)
let scalar_as_tensor = Tensor::from_vec(vec![1000.0], vec![1, 1])?;
let scalar_result = &base + &scalar_as_tensor;
println!("Scalar broadcasting [2,3] + [1,1]:\n{}\n", scalar_result);
Ok(())
}
Broadcasting in Practice
Machine Learning Patterns
fn ml_broadcasting_patterns() -> Result<()> {
// Batch normalization pattern
let batch_data = Tensor::ones(vec![32, 128])?; // 32 samples, 128 features
let mean = Tensor::zeros(vec![128])?; // Feature means
let std = Tensor::ones(vec![128])?; // Feature standard deviations
// Normalize: (x - mean) / std
let normalized = (&batch_data - &mean) / &std;
println!("Batch normalization result shape: {:?}", normalized.shape().dims());
// Bias addition pattern
let linear_output = Tensor::ones(vec![32, 10])?; // Batch size 32, 10 classes
let bias = Tensor::from_vec(
vec![0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9, 1.0],
vec![10]
)?;
let biased_output = &linear_output + &bias;
println!("Bias addition result shape: {:?}", biased_output.shape().dims());
// Attention score broadcasting
let queries = Tensor::ones(vec![32, 8, 64])?; // [batch, heads, dim]
let attention_weights = Tensor::ones(vec![32, 8, 1])?; // [batch, heads, 1]
let weighted_queries = &queries * &attention_weights;
println!("Attention weighting result shape: {:?}", weighted_queries.shape().dims());
Ok(())
}
Image Processing Patterns
fn image_broadcasting_patterns() -> Result<()> {
// Image batch processing
let images = Tensor::ones(vec![4, 3, 224, 224])?; // [batch, channels, height, width]
// Channel-wise normalization
let channel_mean = Tensor::from_vec(
vec![0.485, 0.456, 0.406], // ImageNet means
vec![1, 3, 1, 1]
)?;
let channel_std = Tensor::from_vec(
vec![0.229, 0.224, 0.225], // ImageNet stds
vec![1, 3, 1, 1]
)?;
let normalized_images = (&images - &channel_mean) / &channel_std;
println!("Image normalization result shape: {:?}", normalized_images.shape().dims());
// Pixel-wise operations
let brightness_adjustment = Tensor::from_vec(vec![0.1], vec![1, 1, 1, 1])?;
let brightened = &images + &brightness_adjustment;
println!("Brightness adjustment result shape: {:?}", brightened.shape().dims());
Ok(())
}
Performance Considerations
Efficient Broadcasting
use std::time::Instant;
fn broadcasting_performance() -> Result<()> {
// Efficient: Broadcasting avoids large intermediate tensors
let large_matrix = Tensor::ones(vec![1000, 1000])?;
let small_vector = Tensor::ones(vec![1000])?;
let start = Instant::now();
let efficient_result = &large_matrix + &small_vector; // Broadcasting
let efficient_time = start.elapsed();
println!("Efficient broadcasting: {:?}", efficient_time);
// Less efficient: Explicit expansion (don't do this!)
let start = Instant::now();
let expanded_vector = small_vector.reshape(vec![1, 1000])?;
// Note: This would need manual tiling which isn't implemented
// let manual_result = &large_matrix + &expanded_vector;
let manual_time = start.elapsed();
println!("Manual expansion overhead: {:?}", manual_time);
Ok(())
}
Memory-Efficient Patterns
fn memory_efficient_broadcasting() -> Result<()> {
// Good: Broadcasting reuses memory
let data = Tensor::ones(vec![1000, 500])?;
let scale_factor = Tensor::from_vec(vec![2.0], vec![1])?;
let scaled = &data * &scale_factor; // Memory efficient
// Avoid: Creating large intermediate tensors
// let large_scale = scale_factor.broadcast_to(vec![1000, 500])?; // Wasteful
// let scaled = &data * &large_scale;
println!("Memory-efficient scaling completed");
Ok(())
}
Common Broadcasting Errors
Shape Incompatibility
fn broadcasting_errors() -> Result<()> {
// These will fail - incompatible shapes
let a = Tensor::ones(vec![3, 4])?;
let b = Tensor::ones(vec![2, 4])?; // Different first dimension, not 1
match &a + &b {
Ok(_) => println!("Unexpected success"),
Err(e) => println!("Expected error - incompatible shapes: {}", e),
}
// These will work - compatible shapes
let c = Tensor::ones(vec![1, 4])?; // First dimension is 1
let success = &a + &c;
println!("Compatible shapes work: {:?}", success.shape().dims());
Ok(())
}
Broadcasting Visualization
Understanding Shape Alignment
fn visualize_broadcasting() -> Result<()> {
println!("Broadcasting visualization:");
println!();
// Example 1: [2, 3] + [3]
println!("Example 1: [2, 3] + [3]");
println!(" A: [2, 3]");
println!(" B: [3] -> [1, 3] (implicit leading 1)");
println!(" Result: [2, 3]");
println!();
// Example 2: [4, 1, 5] + [3, 5]
println!("Example 2: [4, 1, 5] + [3, 5]");
println!(" A: [4, 1, 5]");
println!(" B: [3, 5] -> [1, 3, 5] (implicit leading 1)");
println!(" Result: [4, 3, 5] (1 broadcasts to 3, 4)");
println!();
// Example 3: Incompatible
println!("Example 3: [3, 4] + [2, 4] - INCOMPATIBLE");
println!(" A: [3, 4]");
println!(" B: [2, 4]");
println!(" Error: 3 and 2 cannot broadcast (neither is 1)");
println!();
Ok(())
}
Best Practices
1. Design for Broadcasting
// Good: Design tensors with broadcasting in mind
let batch_size = 32;
let features = 128;
let data = Tensor::ones(vec![batch_size, features])?;
let weights = Tensor::ones(vec![features])?; // Broadcastable
let bias = Tensor::ones(vec![features])?; // Broadcastable
let output = (&data * &weights) + &bias; // Clean broadcasting
2. Use Explicit Shapes
// Better: Be explicit about intended broadcasting
let matrix = Tensor::ones(vec![10, 20])?;
let row_vector = Tensor::ones(vec![1, 20])?; // Explicit [1, 20]
let col_vector = Tensor::ones(vec![10, 1])?; // Explicit [10, 1]
let row_broadcast = &matrix + &row_vector;
let col_broadcast = &matrix + &col_vector;
3. Document Broadcasting Intent
/// Applies per-channel normalization to image batch
///
/// # Arguments
/// * `images` - Shape [batch, channels, height, width]
/// * `channel_stats` - Shape [1, channels, 1, 1] for broadcasting
fn normalize_images(images: &Tensor, channel_stats: &Tensor) -> Result<Tensor> {
// Broadcasting: [B,C,H,W] - [1,C,1,1] -> [B,C,H,W]
images - channel_stats
}
4. Validate Shapes Early
fn safe_broadcast_operation(a: &Tensor, b: &Tensor) -> Result<Tensor> {
// Check compatibility before expensive operations
let a_shape = a.shape().dims();
let b_shape = b.shape().dims();
// Custom validation logic here
if !shapes_are_broadcastable(a_shape, b_shape) {
return Err(TensorError::ShapeMismatch {
expected: a_shape.to_vec(),
got: b_shape.to_vec(),
});
}
// Proceed with operation
a + b
}
fn shapes_are_broadcastable(a: &[usize], b: &[usize]) -> bool {
let max_len = a.len().max(b.len());
for i in 0..max_len {
// Align from the trailing dimension; missing leading dimensions count as 1.
let a_dim = if i < a.len() { a[a.len() - 1 - i] } else { 1 };
let b_dim = if i < b.len() { b[b.len() - 1 - i] } else { 1 };
if a_dim != b_dim && a_dim != 1 && b_dim != 1 {
return false;
}
}
true
}
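A small test sketch exercising this helper with the shapes from the visualization examples above:
#[test]
fn broadcastable_shape_checks() {
    assert!(shapes_are_broadcastable(&[2, 3], &[3]));       // [3] is treated as [1, 3]
    assert!(shapes_are_broadcastable(&[4, 1, 5], &[3, 5])); // broadcasts to [4, 3, 5]
    assert!(!shapes_are_broadcastable(&[3, 4], &[2, 4]));   // 3 vs 2: incompatible
}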
Next Steps
After mastering broadcasting:
- Custom Backends - Optimize broadcasting for different backends
- Performance Guide - Advanced broadcasting optimization
- API Reference - Detailed operation specifications
Custom Backend Examples
This guide demonstrates how to effectively use different computational backends in Tensor Frame, including when to switch backends, performance optimization strategies, and mixed backend workflows.
Backend Selection Strategies
Automatic vs Manual Selection
use tensor_frame::{Tensor, BackendType, Result};
use std::time::Instant;
fn backend_selection_demo() -> Result<()> {
println!("=== Backend Selection Strategies ===\n");
// Automatic selection (recommended for most cases)
let auto_tensor = Tensor::zeros(vec![1000, 1000])?;
println!("Automatic backend selected: {:?}", auto_tensor.backend_type());
// Manual backend specification
let cpu_tensor = auto_tensor.to_backend(BackendType::Cpu)?;
println!("Forced CPU backend: {:?}", cpu_tensor.backend_type());
#[cfg(feature = "wgpu")]
{
match auto_tensor.to_backend(BackendType::Wgpu) {
Ok(wgpu_tensor) => {
println!("WGPU backend available: {:?}", wgpu_tensor.backend_type());
}
Err(e) => {
println!("WGPU backend not available: {}", e);
}
}
}
#[cfg(feature = "cuda")]
{
match auto_tensor.to_backend(BackendType::Cuda) {
Ok(cuda_tensor) => {
println!("CUDA backend available: {:?}", cuda_tensor.backend_type());
}
Err(e) => {
println!("CUDA backend not available: {}", e);
}
}
}
Ok(())
}
Size-Based Backend Selection
fn adaptive_backend_selection() -> Result<()> {
println!("=== Adaptive Backend Selection ===\n");
let sizes = vec![
(vec![10, 10], "tiny"),
(vec![100, 100], "small"),
(vec![1000, 1000], "medium"),
(vec![3000, 3000], "large"),
];
for (shape, description) in sizes {
let elements = shape.iter().product::<usize>();
// Choose backend based on tensor size
let backend = if elements < 1000 {
BackendType::Cpu // CPU overhead minimal for small tensors
} else if elements < 1_000_000 {
// Try WGPU first, fallback to CPU
#[cfg(feature = "wgpu")]
{ BackendType::Wgpu }
#[cfg(not(feature = "wgpu"))]
{ BackendType::Cpu }
} else {
// Large tensors: prefer CUDA > WGPU > CPU
#[cfg(feature = "cuda")]
{ BackendType::Cuda }
#[cfg(all(feature = "wgpu", not(feature = "cuda")))]
{ BackendType::Wgpu }
#[cfg(all(not(feature = "wgpu"), not(feature = "cuda")))]
{ BackendType::Cpu }
};
let tensor = Tensor::zeros(shape.clone())?;
let optimized_tensor = tensor.to_backend(backend)?;
println!("{} tensor {:?}: {} elements -> {:?} backend",
description, shape, elements, optimized_tensor.backend_type());
}
Ok(())
}
Performance Benchmarking
Backend Performance Comparison
fn benchmark_backends() -> Result<()> {
println!("=== Backend Performance Comparison ===\n");
let sizes = vec![
vec![100, 100],
vec![500, 500],
vec![1000, 1000],
vec![2000, 2000],
];
for size in sizes {
println!("Benchmarking {}x{} matrix addition:", size[0], size[1]);
// Create test tensors
let a = Tensor::ones(size.clone())?;
let b = Tensor::ones(size.clone())?;
// CPU benchmark
let cpu_a = a.to_backend(BackendType::Cpu)?;
let cpu_b = b.to_backend(BackendType::Cpu)?;
let start = Instant::now();
let cpu_result = &cpu_a + &cpu_b;
let cpu_time = start.elapsed();
println!(" CPU: {:?}", cpu_time);
// WGPU benchmark (if available)
#[cfg(feature = "wgpu")]
{
match (a.to_backend(BackendType::Wgpu), b.to_backend(BackendType::Wgpu)) {
(Ok(wgpu_a), Ok(wgpu_b)) => {
let start = Instant::now();
let wgpu_result = &wgpu_a + &wgpu_b;
// Force synchronization by converting back
let _sync = wgpu_result.to_vec()?;
let wgpu_time = start.elapsed();
let speedup = cpu_time.as_nanos() as f64 / wgpu_time.as_nanos() as f64;
println!(" WGPU: {:?} ({}x speedup)", wgpu_time, speedup);
}
_ => println!(" WGPU: Not available"),
}
}
// CUDA benchmark (if available)
#[cfg(feature = "cuda")]
{
match (a.to_backend(BackendType::Cuda), b.to_backend(BackendType::Cuda)) {
(Ok(cuda_a), Ok(cuda_b)) => {
let start = Instant::now();
let cuda_result = &cuda_a + &cuda_b;
let _sync = cuda_result.to_vec()?;
let cuda_time = start.elapsed();
let speedup = cpu_time.as_nanos() as f64 / cuda_time.as_nanos() as f64;
println!(" CUDA: {:?} ({}x speedup)", cuda_time, speedup);
}
_ => println!(" CUDA: Not available"),
}
}
println!();
}
Ok(())
}
Operation-Specific Benchmarks
fn operation_benchmarks() -> Result<()> {
println!("=== Operation-Specific Benchmarks ===\n");
let size = vec![1000, 1000];
let a = Tensor::ones(size.clone())?;
let b = Tensor::ones(size.clone())?;
// Test different operations
// Box the closures so the differently-typed closures share one element type in the Vec
let operations: Vec<(&str, Box<dyn Fn(&Tensor, &Tensor) -> Result<Tensor>>)> = vec![
    ("Addition", Box::new(|a: &Tensor, b: &Tensor| a + b)),
    ("Multiplication", Box::new(|a: &Tensor, b: &Tensor| a * b)),
    ("Complex", Box::new(|a: &Tensor, b: &Tensor| (a * 2.0) + b)),
];
for (op_name, operation) in operations {
println!("Operation: {}", op_name);
// CPU timing
let cpu_a = a.to_backend(BackendType::Cpu)?;
let cpu_b = b.to_backend(BackendType::Cpu)?;
let start = Instant::now();
let _cpu_result = operation(&cpu_a, &cpu_b)?;
let cpu_time = start.elapsed();
println!(" CPU: {:?}", cpu_time);
// GPU timing (if available)
#[cfg(feature = "wgpu")]
{
if let (Ok(gpu_a), Ok(gpu_b)) = (
a.to_backend(BackendType::Wgpu),
b.to_backend(BackendType::Wgpu)
) {
let start = Instant::now();
let gpu_result = operation(&gpu_a, &gpu_b)?;
let _sync = gpu_result.to_vec()?; // Force sync
let gpu_time = start.elapsed();
let speedup = cpu_time.as_nanos() as f64 / gpu_time.as_nanos() as f64;
println!(" GPU: {:?} ({}x speedup)", gpu_time, speedup);
}
}
println!();
}
Ok(())
}
Mixed Backend Workflows
Pipeline with Backend Transitions
fn mixed_backend_pipeline() -> Result<()> {
println!("=== Mixed Backend Pipeline ===\n");
// Stage 1: Data preparation on CPU (I/O intensive)
println!("Stage 1: Data preparation on CPU");
let raw_data = vec![1.0; 1_000_000]; // Simulate data loading
let cpu_tensor = Tensor::from_vec(raw_data, vec![1000, 1000])?;
println!(" Created tensor on CPU: {:?}", cpu_tensor.backend_type());
// Stage 2: Heavy computation on GPU
#[cfg(feature = "wgpu")]
{
println!("Stage 2: Moving to GPU for computation");
let gpu_tensor = cpu_tensor.to_backend(BackendType::Wgpu)?;
println!(" Moved to GPU: {:?}", gpu_tensor.backend_type());
// Perform heavy computations on GPU
let processed = (&gpu_tensor * 2.0) + 1.0;
let normalized = &processed / processed.sum(None)?;
println!(" Completed GPU computations");
// Stage 3: Results back to CPU for output
println!("Stage 3: Moving results back to CPU");
let final_result = normalized.to_backend(BackendType::Cpu)?;
println!(" Final result on CPU: {:?}", final_result.backend_type());
// Stage 4: Extract specific values (CPU efficient)
let summary = final_result.sum(None)?;
println!(" Summary value: {}", summary.to_vec()?[0]);
}
#[cfg(not(feature = "wgpu"))]
{
println!("Stage 2-4: Processing on CPU (GPU not available)");
let processed = (&cpu_tensor * 2.0) + 1.0;
let summary = processed.sum(None)?;
println!(" Summary value: {}", summary.to_vec()?[0]);
}
Ok(())
}
Batch Processing Strategy
fn batch_processing_strategy() -> Result<()> {
println!("=== Batch Processing Strategy ===\n");
// Simulate multiple data batches
let batch_sizes = vec![100, 500, 1000, 2000];
for batch_size in batch_sizes {
println!("Processing batch size: {}", batch_size);
// Create multiple tensors (simulating data batches)
let batches: Result<Vec<_>> = (0..5)
.map(|i| {
let data = vec![i as f32; batch_size * batch_size];
Tensor::from_vec(data, vec![batch_size, batch_size])
})
.collect();
let batches = batches?;
// Choose optimal backend based on batch size
let backend = if batch_size < 500 {
BackendType::Cpu
} else {
#[cfg(feature = "wgpu")]
{ BackendType::Wgpu }
#[cfg(not(feature = "wgpu"))]
{ BackendType::Cpu }
};
let start = Instant::now();
// Convert all batches to optimal backend
let gpu_batches: Result<Vec<_>> = batches
.into_iter()
.map(|batch| batch.to_backend(backend))
.collect();
let gpu_batches = gpu_batches?;
// Process all batches
let results: Result<Vec<_>> = gpu_batches
.iter()
.map(|batch| batch.sum(None))
.collect();
let results = results?;
let processing_time = start.elapsed();
println!(" Backend: {:?}", backend);
println!(" Processing time: {:?}", processing_time);
println!(" Results count: {}", results.len());
println!();
}
Ok(())
}
Error Handling and Fallback Strategies
Robust Backend Selection
fn robust_backend_selection(tensor: Tensor) -> Result<Tensor> {
// Try backends in order of preference
let backends_to_try = vec![
#[cfg(feature = "cuda")]
BackendType::Cuda,
#[cfg(feature = "wgpu")]
BackendType::Wgpu,
BackendType::Cpu,
];
for backend in backends_to_try {
match tensor.to_backend(backend) {
Ok(converted_tensor) => {
println!("Successfully using backend: {:?}", backend);
return Ok(converted_tensor);
}
Err(e) => {
println!("Backend {:?} failed: {}", backend, e);
continue;
}
}
}
// This should never happen since CPU should always work
Err(tensor_frame::TensorError::BackendError(
"No backend available".to_string()
))
}
fn robust_operation_with_fallback() -> Result<()> {
println!("=== Robust Operation with Fallback ===\n");
let large_tensor = Tensor::ones(vec![2000, 2000])?;
// Try GPU operation first
let result = match large_tensor.to_backend(BackendType::Wgpu) {
Ok(gpu_tensor) => {
match gpu_tensor.sum(None) {
Ok(result) => {
println!("GPU operation successful");
result
}
Err(e) => {
println!("GPU operation failed: {}, falling back to CPU", e);
large_tensor.to_backend(BackendType::Cpu)?.sum(None)?
}
}
}
Err(e) => {
println!("GPU conversion failed: {}, using CPU", e);
large_tensor.sum(None)?
}
};
println!("Final result: {}", result.to_vec()?[0]);
Ok(())
}
Memory Management Across Backends
fn memory_management_demo() -> Result<()> {
println!("=== Memory Management Across Backends ===\n");
// Monitor memory usage pattern
let tensor_size = vec![1000, 1000]; // 4MB tensor
// Start with CPU
let cpu_tensor = Tensor::ones(tensor_size.clone())?;
println!("Created tensor on CPU");
// Convert to GPU (allocates GPU memory)
#[cfg(feature = "wgpu")]
{
let gpu_tensor = cpu_tensor.to_backend(BackendType::Wgpu)?;
println!("Converted to GPU (both CPU and GPU memory used)");
// Process on GPU
let gpu_result = (&gpu_tensor * 2.0) + 1.0;
println!("Processed on GPU");
// Convert back to CPU (allocates new CPU memory)
let final_result = gpu_result.to_backend(BackendType::Cpu)?;
println!("Converted back to CPU");
// At this point: original CPU tensor, GPU tensor, and final CPU tensor exist
// Memory is automatically freed when variables go out of scope
let summary = final_result.sum(None)?;
println!("Final summary: {}", summary.to_vec()?[0]);
}
println!("Memory automatically freed when variables go out of scope");
Ok(())
}
Production Patterns
Configuration-Driven Backend Selection
use std::env;
#[derive(Debug)]
struct TensorConfig {
preferred_backend: BackendType,
fallback_backends: Vec<BackendType>,
small_tensor_threshold: usize,
}
impl TensorConfig {
fn from_env() -> Self {
let preferred = env::var("TENSOR_BACKEND")
.unwrap_or_else(|_| "auto".to_string());
let preferred_backend = match preferred.as_str() {
"cpu" => BackendType::Cpu,
#[cfg(feature = "wgpu")]
"wgpu" => BackendType::Wgpu,
#[cfg(feature = "cuda")]
"cuda" => BackendType::Cuda,
_ => {
// Auto-select best available
#[cfg(feature = "cuda")]
{ BackendType::Cuda }
#[cfg(all(feature = "wgpu", not(feature = "cuda")))]
{ BackendType::Wgpu }
#[cfg(all(not(feature = "wgpu"), not(feature = "cuda")))]
{ BackendType::Cpu }
}
};
let threshold = env::var("SMALL_TENSOR_THRESHOLD")
.unwrap_or_else(|_| "10000".to_string())
.parse()
.unwrap_or(10000);
TensorConfig {
preferred_backend,
fallback_backends: vec![BackendType::Cpu], // Always fallback to CPU
small_tensor_threshold: threshold,
}
}
fn select_backend(&self, tensor_size: usize) -> BackendType {
if tensor_size < self.small_tensor_threshold {
BackendType::Cpu // Always use CPU for small tensors
} else {
self.preferred_backend
}
}
}
fn production_backend_usage() -> Result<()> {
println!("=== Production Backend Usage ===\n");
let config = TensorConfig::from_env();
println!("Configuration: {:?}", config);
// Use configuration for tensor operations
let sizes = vec![100, 1000, 10000, 100000];
for size in sizes {
let tensor = Tensor::ones(vec![size])?;
let elements = tensor.numel();
let backend = config.select_backend(elements);
let optimized_tensor = tensor.to_backend(backend)?;
println!("Tensor size {}: using {:?} backend",
elements, optimized_tensor.backend_type());
}
Ok(())
}
Application-Level Backend Strategy
struct TensorApplication {
config: TensorConfig,
}
impl TensorApplication {
fn new() -> Self {
Self {
config: TensorConfig::from_env(),
}
}
fn process_data(&self, data: Vec<f32>, shape: Vec<usize>) -> Result<Tensor> {
// Create tensor
let tensor = Tensor::from_vec(data, shape)?;
// Select optimal backend
let backend = self.config.select_backend(tensor.numel());
let optimized_tensor = tensor.to_backend(backend)?;
// Perform operations
let processed = (&optimized_tensor * 2.0) + 1.0;
let normalized = &processed / processed.sum(None)?;
Ok(normalized)
}
fn batch_process(&self, batches: Vec<Vec<f32>>, shape: Vec<usize>) -> Result<Vec<Tensor>> {
batches
.into_iter()
.map(|batch| self.process_data(batch, shape.clone()))
.collect()
}
}
Best Practices Summary
1. Size-Based Selection
- Small tensors (< 10K elements): Use CPU backend
- Medium tensors (10K - 1M elements): Consider WGPU
- Large tensors (> 1M elements): Prefer CUDA > WGPU > CPU
2. Operation-Based Selection
- I/O operations: Use CPU backend
- Element-wise operations: Use GPU backends for large tensors
- Reductions: GPU effective for very large tensors
- Large reductions: CUDA > CPU > WGPU (until WGPU reductions implemented)
3. Memory Management
- Convert to target backend early in pipeline
- Avoid frequent backend conversions
- Use batch processing when possible
- Monitor memory usage in production
4. Error Handling
- Always provide CPU fallback
- Handle backend-specific errors gracefully
- Use configuration for backend preferences
- Test with all available backends
5. Performance Optimization
- Benchmark with your specific workload
- Consider warmup time for GPU backends (see the sketch after this list)
- Profile memory transfer overhead
- Use appropriate tensor sizes for each backend
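As one concrete instance of the warmup advice, here is a sketch (reusing the synchronization pattern from the benchmarks above) that runs a throwaway GPU operation before timing the real workload, so one-time backend setup is not attributed to the measured operation:
use std::time::Instant;
fn warmed_up_benchmark() -> Result<()> {
    // Warmup: a tiny operation triggers backend initialization and shader compilation once.
    let warmup = Tensor::ones(vec![64, 64])?.to_backend(BackendType::Wgpu)?;
    let _ = warmup.sum(None)?.to_vec()?;
    // Timed run: setup cost is excluded from the measurement.
    let a = Tensor::ones(vec![2000, 2000])?.to_backend(BackendType::Wgpu)?;
    let b = Tensor::ones(vec![2000, 2000])?.to_backend(BackendType::Wgpu)?;
    let start = Instant::now();
    let result = &a + &b;
    let _sync = result.to_vec()?; // Force synchronization
    println!("Post-warmup time: {:?}", start.elapsed());
    Ok(())
}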
Next Steps
- Performance Guide - Advanced optimization techniques
- API Reference - Detailed backend API documentation
- Backend-Specific Guides - Deep dives into each backend
Performance Guide
This guide provides detailed information on optimizing Tensor Frame performance across different backends and use cases.
Performance Overview
Tensor Frame's performance characteristics vary significantly based on:
- Tensor size: Small vs large tensors have different optimal backends
- Operation type: Element-wise vs reductions vs matrix operations
- Backend selection: CPU vs WGPU vs CUDA performance profiles
- Memory patterns: Data locality and transfer overhead
Backend Performance Characteristics
CPU Backend
- Best for: Small tensors (< 10K elements), development, guaranteed availability
- Strengths: Low latency, no setup overhead, excellent debugging
- Limitations: Limited parallelism, memory bandwidth bound for large operations
use tensor_frame::Tensor;
// CPU optimal: Small tensors and scalar operations
let small = Tensor::ones(vec![100, 100])?;
let result = small.sum(None)?; // ~0.1ms on modern CPU
WGPU Backend
- Best for: Large element-wise operations (> 100K elements), cross-platform deployment
- Strengths: Massive parallelism, good memory bandwidth, portable
- Limitations: GPU setup overhead (~1-10ms), limited operation support
use tensor_frame::{Tensor, BackendType};
// WGPU optimal: Large parallel element-wise operations
let large_a = Tensor::ones(vec![2048, 2048])?.to_backend(BackendType::Wgpu)?;
let large_b = Tensor::ones(vec![2048, 2048])?.to_backend(BackendType::Wgpu)?;
let large_c = Tensor::ones(vec![2048, 2048])?.to_backend(BackendType::Wgpu)?;
let result = ((large_a * large_b)? + large_c)?; // ~2ms on modern GPU
CUDA Backend
- Best for: Very large operations (> 1M elements), production workloads
- Strengths: Peak performance, mature optimizations, cuBLAS integration
- Limitations: NVIDIA-only, CUDA toolkit requirement
use tensor_frame::{Tensor, BackendType};
// CUDA optimal: Matrix operations and very large tensors
let matrix_a = Tensor::ones(vec![4096, 4096])?.to_backend(BackendType::Cuda)?;
let matrix_b = Tensor::ones(vec![4096, 4096])?.to_backend(BackendType::Cuda)?;
let result = matrix_a.matmul(&matrix_b)?; // ~15ms with cuBLAS
Operation-Specific Performance
Element-wise Operations
Performance Scaling:
- CPU: O(n) with thread-level parallelism (8-32 threads)
- WGPU: O(n) with massive parallelism (1000+ threads)
- CUDA: O(n) with optimal parallelism (10000+ threads)
use std::time::Instant;
fn benchmark_element_wise() -> Result<()> {
let sizes = vec![1000, 5000, 10000, 50000];
for size in sizes {
let a = Tensor::ones(vec![size, size])?;
let b = Tensor::ones(vec![size, size])?;
// CPU timing
let start = Instant::now();
let cpu_result = &a + &b;
let cpu_time = start.elapsed();
// GPU timing (if available)
#[cfg(feature = "wgpu")]
{
let gpu_a = a.to_backend(BackendType::Wgpu)?;
let gpu_b = b.to_backend(BackendType::Wgpu)?;
let start = Instant::now();
let gpu_result = &gpu_a + &gpu_b;
let _sync = gpu_result.to_vec()?;
let gpu_time = start.elapsed();
let speedup = cpu_time.as_nanos() as f64 / gpu_time.as_nanos() as f64;
println!("Size {}x{}: CPU {:?}, GPU {:?}, Speedup: {:.1}x",
size, size, cpu_time, gpu_time, speedup);
}
}
Ok(())
}
Reduction Operations
Performance Notes:
- CPU: Rayon parallel reduction, cache-efficient
- GPU: Requires multiple kernel launches for large reductions
- Memory-bound for large tensors
fn reduction_performance() -> Result<()> {
let tensor = Tensor::ones(vec![10000, 10000])?; // 100M elements
// Sum reduction timing
let start = Instant::now();
let sum = tensor.sum(None)?;
let cpu_time = start.elapsed();
println!("CPU sum reduction (100M elements): {:?}", cpu_time);
println!("Result: {}", sum.to_vec()?[0]);
Ok(())
}
Memory Performance
Memory Transfer Costs
GPU operations include memory transfer overhead:
fn memory_transfer_analysis() -> Result<()> {
let sizes = vec![1000, 5000, 10000];
for size in sizes {
let tensor = Tensor::ones(vec![size, size])?;
let elements = tensor.numel();
let bytes = elements * 4; // f32 = 4 bytes
#[cfg(feature = "wgpu")]
{
// Time conversion to GPU
let start = Instant::now();
let gpu_tensor = tensor.to_backend(BackendType::Wgpu)?;
let upload_time = start.elapsed();
// Time conversion back to CPU
let start = Instant::now();
let _data = gpu_tensor.to_vec()?;
let download_time = start.elapsed();
let upload_bw = bytes as f64 / upload_time.as_secs_f64() / 1e9; // GB/s
let download_bw = bytes as f64 / download_time.as_secs_f64() / 1e9; // GB/s
println!("Size {}x{} ({} MB):", size, size, bytes / 1024 / 1024);
println!(" Upload: {:?} ({:.1} GB/s)", upload_time, upload_bw);
println!(" Download: {:?} ({:.1} GB/s)", download_time, download_bw);
}
}
Ok(())
}
Memory Layout Optimization
// Efficient: Contiguous memory access
let matrix = Tensor::from_vec(data, vec![rows, cols])?;
let transposed = matrix.transpose()?; // May require memory copy
// Efficient: Operations that preserve layout
let result = (&matrix_a + &matrix_b) * 2.0; // All operations maintain layout
// Less efficient: Operations that break layout
let reshaped = matrix.reshape(vec![cols, rows])?; // May require copy
Optimization Strategies
1. Backend Selection Strategy
fn optimal_backend_for_workload(tensor_size: usize, operation: &str) -> BackendType {
match (tensor_size, operation) {
// Small tensors: CPU always optimal
(0..=10_000, _) => BackendType::Cpu,
// Large reductions: Prefer CUDA
(_, "reduction") if tensor_size > 1_000_000 => {
#[cfg(feature = "cuda")]
{ BackendType::Cuda }
#[cfg(not(feature = "cuda"))]
{ BackendType::Cpu }
}
// Large element-wise: GPU beneficial
(10_001..=1_000_000, "elementwise") => {
#[cfg(feature = "wgpu")]
{ BackendType::Wgpu }
#[cfg(not(feature = "wgpu"))]
{ BackendType::Cpu }
}
// Very large: Prefer CUDA > WGPU > CPU
(1_000_001.., _) => {
#[cfg(feature = "cuda")]
{ BackendType::Cuda }
#[cfg(all(feature = "wgpu", not(feature = "cuda")))]
{ BackendType::Wgpu }
#[cfg(all(not(feature = "wgpu"), not(feature = "cuda")))]
{ BackendType::Cpu }
}
// Default: CPU
_ => BackendType::Cpu,
}
}
2. Operation Fusion
// Efficient: Fused operations
let result = ((a * b) + c) / d; // Single expression, potential fusion
// Less efficient: Separate operations
let temp1 = a * b;
let temp2 = temp1 + c;
let result = temp2 / d; // Multiple temporary allocations
3. Batch Processing
fn efficient_batch_processing(batches: Vec<Tensor>) -> Result<Vec<Tensor>> {
// Convert all to same backend once
let backend = BackendType::Wgpu;
let gpu_batches: Result<Vec<_>> = batches
.into_iter()
.map(|t| t.to_backend(backend))
.collect();
// Process on GPU
gpu_batches?
.into_iter()
.map(|batch| {
// Heavy computation on GPU
(batch * 2.0) + 1.0
})
.collect()
}
4. Memory Pool Usage
// Efficient: Reuse similar-sized tensors
use std::collections::HashMap;
struct TensorPool {
cached_tensors: HashMap<Vec<usize>, Vec<Tensor>>,
}
impl TensorPool {
fn get_or_create(&mut self, shape: Vec<usize>) -> Result<Tensor> {
if let Some(cached) = self.cached_tensors.get_mut(&shape) {
if let Some(tensor) = cached.pop() {
return Ok(tensor);
}
}
// Create new tensor if no cached version
Tensor::zeros(shape)
}
fn return_tensor(&mut self, tensor: Tensor) {
let shape = tensor.shape().dims().to_vec();
self.cached_tensors
.entry(shape)
.or_insert_with(Vec::new)
.push(tensor);
}
}
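A brief usage sketch of the pool above (pooled_computation and the direct field initialization are illustrative, not library code):
fn pooled_computation() -> Result<()> {
    let mut pool = TensorPool { cached_tensors: HashMap::new() };
    for _ in 0..10 {
        // Reuse a scratch buffer of this shape instead of allocating every iteration.
        let scratch = pool.get_or_create(vec![1024, 1024])?;
        // ... write intermediate results into `scratch` here ...
        pool.return_tensor(scratch);
    }
    Ok(())
}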
Profiling and Debugging
CPU Profiling
// Use built-in timing
use std::time::Instant;
let start = Instant::now();
let result = expensive_operation()?;
println!("Operation took: {:?}", start.elapsed());
// Use external profilers
// cargo install flamegraph
// cargo flamegraph --bin your_app
GPU Profiling
NVIDIA Tools (for CUDA backend):
# Nsight Systems for timeline analysis
nsys profile --stats=true ./your_app
# Nsight Compute for kernel analysis
ncu --metrics sm__throughput.avg.pct_of_peak_sustained_elapsed ./your_app
Platform Tools (for WGPU backend):
- Windows: PIX for Windows, RenderDoc
- macOS: Xcode Instruments (GPU Timeline)
- Linux: RenderDoc, Vulkan Tools
Memory Profiling
fn memory_usage_analysis() -> Result<()> {
use std::alloc::{GlobalAlloc, Layout, System};
// Monitor system memory usage
#[cfg(target_os = "linux")]
{
use std::fs;
let status = fs::read_to_string("/proc/self/status")?;
for line in status.lines() {
if line.starts_with("VmRSS:") {
println!("Memory usage: {}", line);
}
}
}
// GPU memory monitoring (platform-specific)
#[cfg(feature = "cuda")]
{
// CUDA memory info
let (free, total) = cuda::memory_info()?;
println!("GPU memory: {} MB free of {} MB total",
free / 1024 / 1024, total / 1024 / 1024);
}
Ok(())
}
Performance Benchmarking
Comprehensive Benchmark Suite
use criterion::{criterion_group, criterion_main, Criterion};
fn bench_tensor_operations(c: &mut Criterion) {
let sizes = vec![100, 500, 1000, 2000];
for size in sizes {
let a = Tensor::ones(vec![size, size]).unwrap();
let b = Tensor::ones(vec![size, size]).unwrap();
// CPU benchmark
c.bench_function(&format!("cpu_add_{}x{}", size, size), |bench| {
bench.iter(|| {
let _result = &a + &b;
});
});
// GPU benchmark (if available)
#[cfg(feature = "wgpu")]
{
let gpu_a = a.to_backend(BackendType::Wgpu).unwrap();
let gpu_b = b.to_backend(BackendType::Wgpu).unwrap();
c.bench_function(&format!("gpu_add_{}x{}", size, size), |bench| {
bench.iter(|| {
let result = &gpu_a + &gpu_b;
let _sync = result.to_vec().unwrap(); // Force sync
});
});
}
}
}
criterion_group!(benches, bench_tensor_operations);
criterion_main!(benches);
Performance Troubleshooting
Common Performance Issues
- Small Tensors on GPU
// Problem: GPU overhead for small operations
let small = Tensor::ones(vec![10, 10])?;
let slow = small.to_backend(BackendType::Wgpu)?; // Overhead > computation
// Solution: Use CPU for small tensors
let fast = small; // Stay on CPU
- Frequent Backend Conversions
// Problem: Repeated conversions
for i in 0..1000 {
let gpu_tensor = cpu_tensor.to_backend(BackendType::Wgpu)?;
let result = gpu_tensor + 1.0;
let back_to_cpu = result.to_backend(BackendType::Cpu)?;
}
// Solution: Convert once
let mut gpu_tensor = cpu_tensor.to_backend(BackendType::Wgpu)?;
for _ in 0..1000 {
    gpu_tensor = gpu_tensor + 1.0; // Stay on GPU
}
let final_result = gpu_tensor.to_backend(BackendType::Cpu)?;
- Memory Fragmentation
// Problem: Large temporary allocations
let huge_temp = (huge_a * huge_b) + huge_c; // 3 large tensors in memory
// Solution: In-place operations (when available)
let result = huge_a.mul_add(&huge_b, &huge_c)?; // Hypothetical in-place op
Performance Debugging Checklist
- Profile first: Measure before optimizing
- Check backend selection: Ensure optimal backend for workload
- Monitor memory transfers: GPU transfer costs often dominate
- Verify operation fusion: Combine operations when possible
- Consider batch size: Larger batches amortize overhead
- Test different tensor sizes: Performance characteristics vary by size
- Use appropriate data types: f32 vs f64 performance difference
- Monitor memory usage: Avoid memory pressure and swapping
Hardware-Specific Optimization
CPU Optimization
- Use all available cores (Rayon handles this automatically; see the sketch after this list)
- Ensure sufficient memory bandwidth
- Consider NUMA topology for large systems
- Link with optimized BLAS (OpenBLAS, Intel MKL)
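For example, Rayon's global thread pool can be sized explicitly when the default (one worker per logical core) is not ideal, e.g. to leave cores free for other work. This sketch uses the standard rayon API and is independent of Tensor Frame:
use rayon::ThreadPoolBuilder;
fn configure_cpu_threads() -> Result<(), rayon::ThreadPoolBuildError> {
    // Must run before any parallel work: the global pool can only be built once.
    ThreadPoolBuilder::new().num_threads(8).build_global()?;
    println!("Rayon worker threads: {}", rayon::current_num_threads());
    Ok(())
}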
GPU Optimization
- Ensure sufficient GPU memory
- Consider tensor sizes that align with GPU architecture
- Use appropriate batch sizes for GPU utilization
- Monitor thermal throttling on mobile/laptop GPUs
Memory Hierarchy
- L1/L2 cache: Small frequently-accessed tensors
- System RAM: Medium tensors and CPU operations
- GPU VRAM: Large tensors for GPU operations
- Storage: Streaming large datasets
Conclusion
Tensor Frame performance optimization requires understanding:
- Workload characteristics: Size, operations, access patterns
- Backend strengths: CPU for small/mixed, GPU for large parallel
- Memory costs: Transfer overhead, allocation patterns
- Platform specifics: Hardware capabilities and limitations
Use profiling tools to guide optimization decisions and always measure performance improvements to ensure they provide real benefits for your specific use case.
Contributing to Tensor Frame
We welcome contributions to Tensor Frame! This guide will help you get started with contributing to the project.
Getting Started
Development Setup
- Clone the repository:
git clone https://github.com/TrainPioneers/Tensor-Frame.git
cd Tensor-Frame
- Install Rust (if not already installed):
curl --proto '=https' --tlsv1.2 -sSf https://sh.rustup.rs | sh
source ~/.cargo/env
- Install development dependencies:
# For documentation building
cargo install mdbook
# For benchmarking
cargo install criterion
# For code formatting
rustup component add rustfmt
# For linting
rustup component add clippy
- Build and test:
# Build with all features
cargo build --all-features
# Run tests
cargo test
# Run with specific backend
cargo test --features wgpu
cargo test --features cuda
Development Workflow
Building the Project
# Quick compilation check
cargo check
# Build with specific backends
cargo build --features wgpu
cargo build --features cuda
cargo build --all-features
# Release build
cargo build --release --all-features
Running Tests
# Run all tests
cargo test
# Test specific backend
make test-wgpu
make test-cuda
# Test with verbose output
cargo test -- --nocapture
# Run specific test
cargo test test_tensor_creation
Code Formatting and Linting
# Format code
cargo fmt
# Check formatting
cargo fmt --check
# Run clippy lints
cargo clippy
# Run clippy with all features
cargo clippy --all-features
# Fix clippy warnings
cargo clippy --fix
Documentation
# Generate API documentation
cargo doc --open
# Build the book
cd docs
mdbook build
# Serve book locally
mdbook serve
Contribution Guidelines
Code Style
- Formatting: Use cargo fmt for consistent formatting
- Linting: Address all cargo clippy warnings
- Naming: Use descriptive names following Rust conventions
- Comments: Document public APIs and complex algorithms
- Error Handling: Use proper Result types and meaningful error messages
Testing
All contributions must include appropriate tests:
#[cfg(test)]
mod tests {
use super::*;
#[test]
fn test_new_feature() {
let tensor = Tensor::zeros(vec![2, 3]).unwrap();
let result = tensor.new_operation().unwrap();
assert_eq!(result.shape().dims(), &[2, 3]);
}
#[test]
fn test_error_handling() {
let tensor = Tensor::zeros(vec![2, 3]).unwrap();
let result = tensor.invalid_operation();
assert!(result.is_err());
}
}
Documentation Requirements
- Public APIs: All public functions, structs, and traits must have documentation
- Examples: Include usage examples in documentation
- Error Cases: Document when functions return errors
- Safety: Document any unsafe code usage
/// Creates a new tensor filled with zeros.
///
/// # Arguments
/// * `shape` - The dimensions of the tensor
///
/// # Returns
/// A new tensor filled with zeros, or an error if the shape is invalid.
///
/// # Examples
/// ```
/// use tensor_frame::Tensor;
///
/// let tensor = Tensor::zeros(vec![2, 3])?;
/// assert_eq!(tensor.numel(), 6);
/// # Ok::<(), tensor_frame::TensorError>(())
/// ```
///
/// # Errors
/// Returns `TensorError::InvalidShape` if any dimension is zero.
pub fn zeros(shape: Vec<usize>) -> Result<Self> {
// Implementation
}
Types of Contributions
Bug Fixes
- Report the issue: Create a GitHub issue with:
- Clear reproduction steps
- Expected vs actual behavior
- Environment details (OS, Rust version, GPU info)
- Minimal code example
- Fix the bug:
- Create a focused fix addressing the specific issue
- Add regression tests to prevent recurrence
- Update documentation if the bug was in documented behavior
New Features
Before implementing new features:
- Discuss the feature: Open a GitHub issue to discuss:
- Use case and motivation
- Proposed API design
- Implementation approach
- Performance implications
- Implementation guidelines:
- Follow existing patterns and conventions
- Implement for all relevant backends
- Add comprehensive tests
- Update documentation and examples
Backend Implementation
New operations should be implemented across all backends:
// src/backend/mod.rs
pub trait Backend {
// Add new operation to trait
fn new_operation(&self, input: &Storage) -> Result<Storage>;
}
// src/backend/cpu.rs
impl Backend for CpuBackend {
fn new_operation(&self, input: &Storage) -> Result<Storage> {
match input {
Storage::Cpu(data) => {
// CPU implementation using Rayon
let result: Vec<f32> = data
.par_iter()
.map(|&x| compute_new_operation(x))
.collect();
Ok(Storage::Cpu(result))
}
_ => Err(TensorError::BackendError("Invalid storage type".to_string())),
}
}
}
// src/backend/wgpu.rs
impl Backend for WgpuBackend {
fn new_operation(&self, input: &Storage) -> Result<Storage> {
match input {
Storage::Wgpu(wgpu_storage) => {
// WGPU implementation using compute shaders
self.execute_compute_shader(
&wgpu_storage.buffer,
include_str!("../shaders/new_operation.wgsl")
)
}
_ => Err(TensorError::BackendError("Invalid storage type".to_string())),
}
}
}
Performance Improvements
- Benchmark first: Establish baseline performance
- Profile the bottleneck: Use profiling tools to identify issues
- Implement optimization: Make targeted improvements
- Measure improvement: Verify performance gains
- Add performance tests: Prevent performance regressions
// Add benchmark for new optimization
use criterion::{criterion_group, criterion_main, Criterion};
fn bench_optimized_operation(c: &mut Criterion) {
let tensor = Tensor::ones(vec![1000, 1000]).unwrap();
c.bench_function("optimized_operation", |b| {
b.iter(|| {
tensor.optimized_operation().unwrap()
});
});
}
criterion_group!(benches, bench_optimized_operation);
criterion_main!(benches);
Documentation Improvements
- API documentation: Improve function/struct documentation
- Examples: Add or improve usage examples
- Guides: Write tutorials for specific use cases
- Book: Contribute to the mdbook documentation
Backend-Specific Contributions
CPU Backend
- Optimization: Improve Rayon parallelization
- BLAS integration: Better integration with optimized BLAS libraries
- Memory layout: Optimize for cache efficiency
WGPU Backend
- Shader optimization: Improve WGSL compute shaders
- New operations: Implement missing operations (matmul, reductions)
- Platform support: Improve compatibility across graphics APIs
CUDA Backend
- Kernel optimization: Improve CUDA kernel performance
- cuBLAS integration: Better integration with cuBLAS/cuDNN
- Memory management: Optimize GPU memory usage
Pull Request Process
Before Submitting
- Ensure tests pass:
cargo test --all-features
- Check formatting and lints:
cargo fmt --check
cargo clippy --all-features
- Update documentation:
cargo doc --all-features
cd docs && mdbook build
- Add changelog entry (if applicable):
## [Unreleased]
### Added
- New tensor operation `my_operation` (#123)
### Fixed
- Fixed broadcasting bug in GPU backend (#124)
Pull Request Template
## Description
Brief description of the changes and motivation.
## Type of Change
- [ ] Bug fix (non-breaking change which fixes an issue)
- [ ] New feature (non-breaking change which adds functionality)
- [ ] Breaking change (fix or feature that would cause existing functionality to not work as expected)
- [ ] Documentation update
## Testing
- [ ] I have added tests that prove my fix is effective or that my feature works
- [ ] New and existing unit tests pass locally with my changes
- [ ] I have tested with different backends (CPU/WGPU/CUDA)
## Checklist
- [ ] My code follows the code style of this project
- [ ] I have performed a self-review of my own code
- [ ] I have commented my code, particularly in hard-to-understand areas
- [ ] I have made corresponding changes to the documentation
- [ ] My changes generate no new warnings
- [ ] Any dependent changes have been merged and published
Review Process
- Automated checks: CI will run tests, linting, and formatting checks
- Code review: Maintainers will review for:
- Code quality and style
- Test coverage
- Documentation completeness
- Performance implications
- API design consistency
- Feedback: Address review feedback and update the PR
- Approval: Once approved, maintainers will merge the PR
Issue Reporting
Bug Reports
Use the bug report template:
**Describe the bug**
A clear and concise description of what the bug is.
**To Reproduce**
Steps to reproduce the behavior:
1. Create tensor with '...'
2. Call operation '....'
3. See error
**Expected behavior**
A clear and concise description of what you expected to happen.
**Code Example**
use tensor_frame::Tensor;
let tensor = Tensor::zeros(vec![2, 3])?;
let result = tensor.problematic_operation()?; // This fails
**Environment:**
- OS: [e.g. Ubuntu 20.04]
- Rust version: [e.g. 1.75.0]
- Tensor Frame version: [e.g. 0.1.0]
- GPU info: [if applicable]
- Backend: [CPU/WGPU/CUDA]
**Additional context**
Add any other context about the problem here.
Feature Requests
Use the feature request template:
**Is your feature request related to a problem?**
A clear and concise description of what the problem is.
**Describe the solution you'd like**
A clear and concise description of what you want to happen.
**Describe alternatives you've considered**
A clear and concise description of any alternative solutions or features you've considered.
**Use case**
Describe how this feature would be used in practice.
**API Design** (if applicable)
// Proposed API
let result = tensor.new_operation(parameters)?;
**Additional context**
Add any other context about the feature request here.
Community Guidelines
Code of Conduct
- Be respectful and inclusive
- Focus on constructive feedback
- Help newcomers learn and contribute
- Celebrate diverse perspectives and backgrounds
Communication
- GitHub Issues: Bug reports, feature requests, design discussions
- GitHub Discussions: General questions, show and tell, ideas
- Pull Requests: Code contributions and reviews
Recognition
Contributors are recognized in:
- CONTRIBUTORS.md file
- Release notes for significant contributions
- GitHub contributor statistics
Getting Help
If you need help contributing:
- Read existing code: Look at similar implementations for patterns
- Check documentation: API docs and this book contain guidance
- Ask questions: Open a GitHub issue or discussion
- Start small: Begin with bug fixes or documentation improvements
Thank you for contributing to Tensor Frame!