# Golang AVX512 Fast Fourier Transform This project implements a Fast Fourier Transform (FFT) using Go's x86 assembly dialect with AVX512 instructions for maximum performance on modern Intel processors. ## Features - **AVX512 Optimized**: Uses the latest AVX512 vector instructions for maximum performance - **Automatic Fallback**: Falls back to pure Go implementation if AVX512 is not available - **Power of 2 Support**: Automatically pads input to the next power of 2 for optimal FFT performance - **Complex Number Support**: Full support for complex128 data types - **Inverse FFT**: Includes IFFT implementation for complete FFT functionality ## Requirements - Go 1.21 or later - Intel processor with AVX512 support (Skylake-X, Cascade Lake, Ice Lake, or newer) - Linux x86_64 environment ## Installation ```bash go mod tidy ``` ## Usage ```go package main import ( "fmt" "complex128" ) func main() { // Create test data data := []complex128{ complex(1, 0), complex(2, 0), complex(3, 0), complex(4, 0), complex(5, 0), complex(6, 0), complex(7, 0), complex(8, 0), } // Perform forward FFT fftResult := FFT(data) fmt.Println("FFT Result:", fftResult) // Perform inverse FFT ifftResult := IFFT(fftResult) fmt.Println("IFFT Result:", ifftResult) } ``` ## API ### `FFT(data []complex128) []complex128` Performs Fast Fourier Transform on the input data. Automatically detects AVX512 support and uses the optimized assembly implementation when available. ### `IFFT(data []complex128) []complex128` Performs Inverse Fast Fourier Transform to recover the original signal from the frequency domain. ## Performance The AVX512 implementation provides significant performance improvements over the pure Go version: - **Vectorization**: Processes 8 complex numbers simultaneously using 512-bit ZMM registers - **Optimized Memory Access**: Uses aligned memory operations and efficient data movement - **Reduced Function Call Overhead**: Critical loops are implemented entirely in assembly ## Implementation Details ### Algorithm The implementation uses the Cooley-Tukey FFT algorithm with the following optimizations: 1. **Bit-Reversal Permutation**: Efficiently reorders input data for optimal memory access patterns 2. **Radix-2 Decimation**: Processes data in powers of 2 for maximum efficiency 3. **Twiddle Factor Optimization**: Pre-computes and broadcasts trigonometric values using AVX512 ### Assembly Features - **ZMM Registers**: Uses 512-bit vector registers for maximum throughput - **SIMD Operations**: Leverages AVX512 instructions like `VMOVUPD`, `VADDPD`, `VSUBPD` - **Broadcasting**: Uses `VBROADCASTSD` for efficient twiddle factor distribution - **Memory Alignment**: Ensures optimal memory access patterns ## Building ```bash # Build with optimizations go build -ldflags="-s -w" -o fft # Run ./fft ``` ## Testing ```bash # Run tests go test -v # Benchmark performance go test -bench=. ``` ## Limitations - Input length must be a power of 2 (automatically padded if necessary) - Requires AVX512-capable processor - Currently optimized for complex128 data types - Assembly implementation is x86_64 specific ## Future Improvements - Support for non-power-of-2 lengths using mixed-radix FFT - Real-to-complex FFT optimization - Multi-threaded implementation for very large datasets - Support for other data types (float64, complex64) ## License This project is open source and available under the MIT License. ## Contributing Contributions are welcome! Please feel free to submit pull requests or open issues for bugs and feature requests.