golang-fft/README.md

# Golang AVX512 Fast Fourier Transform

This project implements a Fast Fourier Transform (FFT) using Go's x86 assembly dialect with AVX512 instructions for maximum performance on modern Intel processors.

## Features

- **AVX512 Optimized**: Uses the latest AVX512 vector instructions for maximum performance
- **Automatic Fallback**: Falls back to pure Go implementation if AVX512 is not available
- **Power of 2 Support**: Automatically pads input to the next power of 2 for optimal FFT performance
- **Complex Number Support**: Full support for complex128 data types
- **Inverse FFT**: Includes IFFT implementation for complete FFT functionality

## Requirements

- Go 1.21 or later
- Intel processor with AVX512 support (Skylake-X, Cascade Lake, Ice Lake, or newer)
- Linux x86_64 environment

## Installation

```bash
go mod tidy
```

## Usage

```go
package main

import (
    "fmt"
    "complex128"
)

func main() {
    // Create test data
    data := []complex128{
        complex(1, 0),
        complex(2, 0),
        complex(3, 0),
        complex(4, 0),
        complex(5, 0),
        complex(6, 0),
        complex(7, 0),
        complex(8, 0),
    }

    // Perform forward FFT
    fftResult := FFT(data)
    fmt.Println("FFT Result:", fftResult)

    // Perform inverse FFT
    ifftResult := IFFT(fftResult)
    fmt.Println("IFFT Result:", ifftResult)
}
```

## API

### `FFT(data []complex128) []complex128`
Performs Fast Fourier Transform on the input data. Automatically detects AVX512 support and uses the optimized assembly implementation when available.

### `IFFT(data []complex128) []complex128`
Performs Inverse Fast Fourier Transform to recover the original signal from the frequency domain.

## Performance

The AVX512 implementation provides significant performance improvements over the pure Go version:

- **Vectorization**: Processes 8 complex numbers simultaneously using 512-bit ZMM registers
- **Optimized Memory Access**: Uses aligned memory operations and efficient data movement
- **Reduced Function Call Overhead**: Critical loops are implemented entirely in assembly

## Implementation Details

### Algorithm
The implementation uses the Cooley-Tukey FFT algorithm with the following optimizations:

1. **Bit-Reversal Permutation**: Efficiently reorders input data for optimal memory access patterns
2. **Radix-2 Decimation**: Processes data in powers of 2 for maximum efficiency
3. **Twiddle Factor Optimization**: Pre-computes and broadcasts trigonometric values using AVX512

### Assembly Features
- **ZMM Registers**: Uses 512-bit vector registers for maximum throughput
- **SIMD Operations**: Leverages AVX512 instructions like `VMOVUPD`, `VADDPD`, `VSUBPD`
- **Broadcasting**: Uses `VBROADCASTSD` for efficient twiddle factor distribution
- **Memory Alignment**: Ensures optimal memory access patterns

## Building

```bash
# Build with optimizations
go build -ldflags="-s -w" -o fft

# Run
./fft
```

## Testing

```bash
# Run tests
go test -v

# Benchmark performance
go test -bench=.
```

## Limitations

- Input length must be a power of 2 (automatically padded if necessary)
- Requires AVX512-capable processor
- Currently optimized for complex128 data types
- Assembly implementation is x86_64 specific

## Future Improvements

- Support for non-power-of-2 lengths using mixed-radix FFT
- Real-to-complex FFT optimization
- Multi-threaded implementation for very large datasets
- Support for other data types (float64, complex64)

## License

This project is open source and available under the MIT License.

## Contributing

Contributions are welcome! Please feel free to submit pull requests or open issues for bugs and feature requests.