Go to file

Sean Sube 2026148ba3 Some checks failed Build and Test / test (push) Has been cancelled Details Build and Test / docker-test (push) Has been cancelled Details Build and Test / lint (push) Has been cancelled Details Build and Test / security (push) Has been cancelled Details raw robot output		2025-08-11 16:23:29 -05:00
.github/workflows	raw robot output	2025-08-11 16:23:29 -05:00
build_and_test.sh	raw robot output	2025-08-11 16:23:29 -05:00
fft_avx512_final.s	raw robot output	2025-08-11 16:23:29 -05:00
fft_avx512_optimized.s	raw robot output	2025-08-11 16:23:29 -05:00
fft_avx512_working.s	raw robot output	2025-08-11 16:23:29 -05:00
fft_avx512.s	raw robot output	2025-08-11 16:23:29 -05:00
fft_test.go	raw robot output	2025-08-11 16:23:29 -05:00
fft.go	raw robot output	2025-08-11 16:23:29 -05:00
go.mod	raw robot output	2025-08-11 16:23:29 -05:00
Makefile	raw robot output	2025-08-11 16:23:29 -05:00
QUICKSTART.md	raw robot output	2025-08-11 16:23:29 -05:00
README.md	raw robot output	2025-08-11 16:23:29 -05:00
simple_build.sh	raw robot output	2025-08-11 16:23:29 -05:00

README.md

Golang AVX512 Fast Fourier Transform

This project implements a Fast Fourier Transform (FFT) using Go's x86 assembly dialect with AVX512 instructions for maximum performance on modern Intel processors.

Features

AVX512 Optimized: Uses the latest AVX512 vector instructions for maximum performance
Automatic Fallback: Falls back to pure Go implementation if AVX512 is not available
Power of 2 Support: Automatically pads input to the next power of 2 for optimal FFT performance
Complex Number Support: Full support for complex128 data types
Inverse FFT: Includes IFFT implementation for complete FFT functionality

Requirements

Go 1.21 or later
Intel processor with AVX512 support (Skylake-X, Cascade Lake, Ice Lake, or newer)
Linux x86_64 environment

Installation

go mod tidy

Usage

package main

import (
    "fmt"
    "complex128"
)

func main() {
    // Create test data
    data := []complex128{
        complex(1, 0),
        complex(2, 0),
        complex(3, 0),
        complex(4, 0),
        complex(5, 0),
        complex(6, 0),
        complex(7, 0),
        complex(8, 0),
    }

    // Perform forward FFT
    fftResult := FFT(data)
    fmt.Println("FFT Result:", fftResult)

    // Perform inverse FFT
    ifftResult := IFFT(fftResult)
    fmt.Println("IFFT Result:", ifftResult)
}

API

`FFT(data []complex128) []complex128`

Performs Fast Fourier Transform on the input data. Automatically detects AVX512 support and uses the optimized assembly implementation when available.

`IFFT(data []complex128) []complex128`

Performs Inverse Fast Fourier Transform to recover the original signal from the frequency domain.

Performance

The AVX512 implementation provides significant performance improvements over the pure Go version:

Vectorization: Processes 8 complex numbers simultaneously using 512-bit ZMM registers
Optimized Memory Access: Uses aligned memory operations and efficient data movement
Reduced Function Call Overhead: Critical loops are implemented entirely in assembly

Implementation Details

Algorithm

The implementation uses the Cooley-Tukey FFT algorithm with the following optimizations:

Bit-Reversal Permutation: Efficiently reorders input data for optimal memory access patterns
Radix-2 Decimation: Processes data in powers of 2 for maximum efficiency
Twiddle Factor Optimization: Pre-computes and broadcasts trigonometric values using AVX512

Assembly Features

ZMM Registers: Uses 512-bit vector registers for maximum throughput
SIMD Operations: Leverages AVX512 instructions like VMOVUPD, VADDPD, VSUBPD
Broadcasting: Uses VBROADCASTSD for efficient twiddle factor distribution
Memory Alignment: Ensures optimal memory access patterns

Building

# Build with optimizations
go build -ldflags="-s -w" -o fft

# Run
./fft

Testing

# Run tests
go test -v

# Benchmark performance
go test -bench=.

Limitations

Input length must be a power of 2 (automatically padded if necessary)
Requires AVX512-capable processor
Currently optimized for complex128 data types
Assembly implementation is x86_64 specific

Future Improvements

Support for non-power-of-2 lengths using mixed-radix FFT
Real-to-complex FFT optimization
Multi-threaded implementation for very large datasets
Support for other data types (float64, complex64)

License

This project is open source and available under the MIT License.

Contributing

Contributions are welcome! Please feel free to submit pull requests or open issues for bugs and feature requests.