golang-fft/README.md
Sean Sube 2026148ba3
Some checks failed
Build and Test / test (push) Has been cancelled
Build and Test / docker-test (push) Has been cancelled
Build and Test / lint (push) Has been cancelled
Build and Test / security (push) Has been cancelled
raw robot output
2025-08-11 16:23:29 -05:00

129 lines
3.6 KiB
Markdown

# Golang AVX512 Fast Fourier Transform
This project implements a Fast Fourier Transform (FFT) using Go's x86 assembly dialect with AVX512 instructions for maximum performance on modern Intel processors.
## Features
- **AVX512 Optimized**: Uses the latest AVX512 vector instructions for maximum performance
- **Automatic Fallback**: Falls back to pure Go implementation if AVX512 is not available
- **Power of 2 Support**: Automatically pads input to the next power of 2 for optimal FFT performance
- **Complex Number Support**: Full support for complex128 data types
- **Inverse FFT**: Includes IFFT implementation for complete FFT functionality
## Requirements
- Go 1.21 or later
- Intel processor with AVX512 support (Skylake-X, Cascade Lake, Ice Lake, or newer)
- Linux x86_64 environment
## Installation
```bash
go mod tidy
```
## Usage
```go
package main
import (
"fmt"
"complex128"
)
func main() {
// Create test data
data := []complex128{
complex(1, 0),
complex(2, 0),
complex(3, 0),
complex(4, 0),
complex(5, 0),
complex(6, 0),
complex(7, 0),
complex(8, 0),
}
// Perform forward FFT
fftResult := FFT(data)
fmt.Println("FFT Result:", fftResult)
// Perform inverse FFT
ifftResult := IFFT(fftResult)
fmt.Println("IFFT Result:", ifftResult)
}
```
## API
### `FFT(data []complex128) []complex128`
Performs Fast Fourier Transform on the input data. Automatically detects AVX512 support and uses the optimized assembly implementation when available.
### `IFFT(data []complex128) []complex128`
Performs Inverse Fast Fourier Transform to recover the original signal from the frequency domain.
## Performance
The AVX512 implementation provides significant performance improvements over the pure Go version:
- **Vectorization**: Processes 8 complex numbers simultaneously using 512-bit ZMM registers
- **Optimized Memory Access**: Uses aligned memory operations and efficient data movement
- **Reduced Function Call Overhead**: Critical loops are implemented entirely in assembly
## Implementation Details
### Algorithm
The implementation uses the Cooley-Tukey FFT algorithm with the following optimizations:
1. **Bit-Reversal Permutation**: Efficiently reorders input data for optimal memory access patterns
2. **Radix-2 Decimation**: Processes data in powers of 2 for maximum efficiency
3. **Twiddle Factor Optimization**: Pre-computes and broadcasts trigonometric values using AVX512
### Assembly Features
- **ZMM Registers**: Uses 512-bit vector registers for maximum throughput
- **SIMD Operations**: Leverages AVX512 instructions like `VMOVUPD`, `VADDPD`, `VSUBPD`
- **Broadcasting**: Uses `VBROADCASTSD` for efficient twiddle factor distribution
- **Memory Alignment**: Ensures optimal memory access patterns
## Building
```bash
# Build with optimizations
go build -ldflags="-s -w" -o fft
# Run
./fft
```
## Testing
```bash
# Run tests
go test -v
# Benchmark performance
go test -bench=.
```
## Limitations
- Input length must be a power of 2 (automatically padded if necessary)
- Requires AVX512-capable processor
- Currently optimized for complex128 data types
- Assembly implementation is x86_64 specific
## Future Improvements
- Support for non-power-of-2 lengths using mixed-radix FFT
- Real-to-complex FFT optimization
- Multi-threaded implementation for very large datasets
- Support for other data types (float64, complex64)
## License
This project is open source and available under the MIT License.
## Contributing
Contributions are welcome! Please feel free to submit pull requests or open issues for bugs and feature requests.