.github/workflows | ||
build_and_test.sh | ||
fft_avx512_final.s | ||
fft_avx512_optimized.s | ||
fft_avx512_working.s | ||
fft_avx512.s | ||
fft_test.go | ||
fft.go | ||
go.mod | ||
Makefile | ||
QUICKSTART.md | ||
README.md | ||
simple_build.sh |
Golang AVX512 Fast Fourier Transform
This project implements a Fast Fourier Transform (FFT) using Go's x86 assembly dialect with AVX512 instructions for maximum performance on modern Intel processors.
Features
- AVX512 Optimized: Uses the latest AVX512 vector instructions for maximum performance
- Automatic Fallback: Falls back to pure Go implementation if AVX512 is not available
- Power of 2 Support: Automatically pads input to the next power of 2 for optimal FFT performance
- Complex Number Support: Full support for complex128 data types
- Inverse FFT: Includes IFFT implementation for complete FFT functionality
Requirements
- Go 1.21 or later
- Intel processor with AVX512 support (Skylake-X, Cascade Lake, Ice Lake, or newer)
- Linux x86_64 environment
Installation
go mod tidy
Usage
package main
import (
"fmt"
"complex128"
)
func main() {
// Create test data
data := []complex128{
complex(1, 0),
complex(2, 0),
complex(3, 0),
complex(4, 0),
complex(5, 0),
complex(6, 0),
complex(7, 0),
complex(8, 0),
}
// Perform forward FFT
fftResult := FFT(data)
fmt.Println("FFT Result:", fftResult)
// Perform inverse FFT
ifftResult := IFFT(fftResult)
fmt.Println("IFFT Result:", ifftResult)
}
API
FFT(data []complex128) []complex128
Performs Fast Fourier Transform on the input data. Automatically detects AVX512 support and uses the optimized assembly implementation when available.
IFFT(data []complex128) []complex128
Performs Inverse Fast Fourier Transform to recover the original signal from the frequency domain.
Performance
The AVX512 implementation provides significant performance improvements over the pure Go version:
- Vectorization: Processes 8 complex numbers simultaneously using 512-bit ZMM registers
- Optimized Memory Access: Uses aligned memory operations and efficient data movement
- Reduced Function Call Overhead: Critical loops are implemented entirely in assembly
Implementation Details
Algorithm
The implementation uses the Cooley-Tukey FFT algorithm with the following optimizations:
- Bit-Reversal Permutation: Efficiently reorders input data for optimal memory access patterns
- Radix-2 Decimation: Processes data in powers of 2 for maximum efficiency
- Twiddle Factor Optimization: Pre-computes and broadcasts trigonometric values using AVX512
Assembly Features
- ZMM Registers: Uses 512-bit vector registers for maximum throughput
- SIMD Operations: Leverages AVX512 instructions like
VMOVUPD
,VADDPD
,VSUBPD
- Broadcasting: Uses
VBROADCASTSD
for efficient twiddle factor distribution - Memory Alignment: Ensures optimal memory access patterns
Building
# Build with optimizations
go build -ldflags="-s -w" -o fft
# Run
./fft
Testing
# Run tests
go test -v
# Benchmark performance
go test -bench=.
Limitations
- Input length must be a power of 2 (automatically padded if necessary)
- Requires AVX512-capable processor
- Currently optimized for complex128 data types
- Assembly implementation is x86_64 specific
Future Improvements
- Support for non-power-of-2 lengths using mixed-radix FFT
- Real-to-complex FFT optimization
- Multi-threaded implementation for very large datasets
- Support for other data types (float64, complex64)
License
This project is open source and available under the MIT License.
Contributing
Contributions are welcome! Please feel free to submit pull requests or open issues for bugs and feature requests.