Wideband FFT¶

Purpose¶

This FFT was originally sourced from ASTRON via OpenCores. It performs an N-Point Wideband FFT on data that is partly applied in serial and partly applied in parallel. This FFT specifically suits applications where the sample clock is higher than the DSP processing clock. For each output stream a subband statistic unit is included which can be read via the memory mapped interface.

This unit connects an incoming array of streaming interfaces to the wideband fft. The output of the wideband fft is connected to a set of subband statistics units. The statistics can be read via the memory mapped interface (TODO). A control unit takes care of the correct composition of the output streams(sop,eop,sync,bsn,err). These signals are optional and can be removed to only use the sync signal.

This unit only handles one sync at a time. Therefore the sync interval should be larger than the total pipeline stages of the wideband FFT.

Module Overview¶

An overview of the fft_wide unit is shown in Figure 1. The fft_wide unit calculates a N-point FFT and has P number of input streams. Data of each input is offered to a M-point pipelined FFT, where M=N/P. The output of all pipelined FFTs is then connected to a P-point parallel FFT that performs the final stage of the wideband FFT. Each output of the parallel FFT is connected to a subband statistics unit that calculates the power in each subband. The MM interface is used to read out the subband statistics. The rTwoSDF pipelined FFT (see R2SDF FFT) design is used as building block for the development of the wideband extension.

Simulink wideband FFT in base configuration.

Firmware Interface¶

Clock Domains¶

There are two clock domains used in the fft_wide unit: the mm_clk and the dp_clk domain. Figure 2 shows an overview of the clock domains in the fft_wide unit. The only unit that is connected to both clock domains is the memory of the subband statistics module. This memory is a dual ported ram that holds the results of the subband statistics. Table 1 lists both clocks and their characteristics.

Name	Frequency (MHz)	Description
DP_CLK	200 MHz	Clock for datapath
MM_CLK	125 MHz	Clock for mm interface

Interface signals¶

Figure 2 shows the Simulink Wideband FFT block in its base configuration. In this form, it offers a minimal set of input/output ports that are comparable with what the CASPER FFT offers.

Figure 3 shows the Simulink Wideband FFT block in its expanded configuration which offers the bsn, sop, eop, error, empty and channel control signals. Their function is explained by the graph below.

Simulink wideband FFT in its expanded configuration.

The full set of signals available to the Simulink block are detailed in the table below.

Signal	Type	Size	Description
Reset	std_logic	1	Datapath reset.
Clock enable	std_logic	1	Clock enable signal (used by Xilinx black box).
Clock	std_logic	1	Datapath clock (used by Xilinx black box).
Sync	std_logic	1	In/out sync pulse, preceeds data by 1 clock cycle.
Valid	std_logic	1	In/out valid data signal. Goes high with first valid data sample.
Shiftreg	std_logic_vector	\(\log2(nof\_points)\)	Bit vector dictating at which stages to shift in an N-point FFT. A ‘1’ indicates a shift while a ‘0’ indicates no shift at that stage.
Ovflw	std_logic_vector	\(\log2(nof\_points)\)	Bit vector dictating at which stages overflow occured in an N-point FFT. A ‘1’ indicates an overflow while a ‘0’ indicates no overflow at that stage.
bsn	std_logic_vector	64	A timestamp identification port for the data.
sop	std_logic	1	A start-of-packet indicator (see figure 4 for detail).
eop	std_logic	1	An end-of-packet indicator (see figure 4 for detail).
Empty	std_logic_vector	16	Empty signal for the sosi data packet.
Error	std_logic_vector	32	Error indicator giving 32 different one-hot encoded errors.
Channels	std_logic_vector	32	An indicator for mapping of channels to streams.
Im	std_logic_vector	in_dat_w or out_dat_w	Data port for either one polarisation (when doing a dual pol FFT), or the imaginary part (when doing a complex FFT).
Re	std_logic_vector	in_dat_w or out_dat_w	Data port for either one polarisation (when doing a dual pol FFT), or the real part (when doing a complex FFT).

Complex FFT¶

For complex input use_separate = false. When use_reorder=true then the output bins of the FFT are re-ordered to undo the bit-reversed (or bit-flipped) default radix 2 FFT output order. The fft_r2_wide then outputs first 0 Hz and the positive frequencies and then the negative frequencies. The use_reorder is performed at both the pipelined stage and the parallel stage.

When use_fft_shift=true then the fft_r2_wide then outputs the frequency bins in incrementing order, so first the negative frequencies, then 0 Hz and then the positive frequencies. When use_fft_shift = true then also use_reorder must be true.

Two Real FFT’s¶

When use_separate=true then the fft_r2_wide can be used to process two real streams. The first real stream (A) presented on the real input, the second real stream (B) presented on the imaginary input. The separation unit outputs the spectrum of A and B in an alternating way. When use_separate = true then also use_reorder must be true. When use_separate = true then the use_fft_shift must be false, because fft_shift() only applies to spectra for complex input.

Remarks¶

This FFT supports a wb_factor = 1 (= only a fft_r2_pipe instance) or wb_factor = g_fft.nof_points (= only a fft_r2_par instance). Care must be taken to properly account for guard_w and out_gain_w, therefore it is best to simply use a structural approach that generates seperate instances for each case:

wb_factor = 1 –> pipelined FFT
wb_factor > 1 AND wb_factor < g_fft.nof_points –> wideband FFT
wb_factor = g_fft.nof_points –> parallel FFT

This FFT uses the use_reorder in the pipeline FFT, in the parallel FFT and also has reorder memory in the fft_sepa_wide instance. The reorder memories in the FFTs can maybe be saved by using only the reorder memory in the fft_sepa_wide instance. This would require changing the indexing in fft_sepa_wide instance (TODO).

The reorder memory in the pipeline FFT, parallel FFT and in the fft_sepa_wide could make reuse of a reorder component from the reorder library instead of using a dedicated local solution (TODO).

Parameters¶

Both the wideband and pipelined FFT’s offer a set of parameters for control over the FFT’s characteristics, data handling and implementation on the FPGA. These are tabulated below.

FFT parameters¶

Generic	Type	Value	Description
Bit-reverse output	Boolean	true	When set to ‘true’, the output bins of the FFT are reordered in such a way that the first bin represents the lowest frequency and the highest bin represents the highest frequency.
Reorder frequencies	Boolean	false	False for [0, pos, neg] bin frequencies order, true for [neg, 0, pos] bin frequencies order in case of complex input
Separate complex in/out ports?	Boolean	true	When set to ‘true’ a separate algorithm will be enabled in order to retrieve two separate spectra from the output of the complex FFT in case both the real and imaginary input of the complex FFT are fed with two independent real signals.
Nof channels	Natural	0	Defines the number of channels (=time-multiplexed input signals). The number of channels is \(2^{nof\_channels}\). Multiple channels is only supported by the pipelined FFT.
Wideband factor	Natural	4	The number that defines the wideband factor. It defines the number of parallel pipelined FFTs.
Number of points	Natural	1024	The number of points of the FFT. The number of points is \(2^{nof\_points}\).
Extra control signals	Boolean	false	Checking this box enables the usage of addition signals. See WB Arch for detail on these ports.

Data parameters¶

Generic	Type	Value	Description
Input data width	Natural	8	Width in bits of the input data. This value specifies the width of both the real and the imaginary part.
out_dat_w	Natural	14	The bitwidth of the real and imaginary part of the output of the FFT. The relation with the in_dat_w is as follows: \(out\_dat\_w=in\_dat\_w+(\log2(nof\_N))/{2+1}\).
stage_dat_w	Natural	18	The bitwidth of the data that is used between the stages (=DSP multiplier-width).
guard_w	Natural	2	Number of bits that function as guard bits. The guard bits are required to avoid overflow in the first two stages of the FFT.
guard_enable	Boolean	true	When set to ‘true’ the input is guarded during the input resize function, when set to ‘false’ the input is not guarded, but the scaling is not skipped on the last stages of the FFT (based on the value of guard_w).
Rounding behaviour	String	“ROUND”	Gives control over the removal of the least significant bits when requantising. See WB Quant for further detail. Options are “ROUND” or “TRUNCATE”.
Overflow behaviour	String	“WRAP”	Gives control over the removal of the most significant bits when requantising. See WB Quant for further detail. Options are “WRAP” and “SATURATE”.

Synth/Imp Parameters¶

Generic	Type	Value	Description
Use DSP for Cmults	String	“YES”	Sets the Xilinx use_dsp directive to force usage or non-usage of DSP48 elements when synthesizing/implementing the complex multipliers.
Cmult options	String	“4DSP”	Sets which complex multipliers are used in the FFT. Options are “3DSP” for a Gaussian complex multiplication instantiation that uses 3 DSP48 elements or “4DSP” for a classic complex multiplication that uses 4 DSP48 elements.
Vendor	Natural	0	0 for Xilinx FPGA’s, 1 for Intel FPGA’s.
RAM primitive	STRING	“auto”	Parameter for the xpm BRAM module which will dictate how BRAM’s are implemented on the FPGA. Options are “auto”, “distributed”, “ultra” and “block”.
FIFO primitive	STRING	“auto”	Parameter for the xpm FIFO module which will dictate how FIFO’s are implemented on the FPGA. Options are “auto”, “distributed”, “ultra” and “block”.

Module Architecture¶

Several subdesigns were defined in order to create the eventual wideband decimation in frequency (DIF) FFT. These sub-designs are:

Complex Pipelined FFT for two real inputs (fft_r2_pipe).
Complex Parallel FFT for two real inputs (fft_r2_par).
Complex Wideband FFT for two real inputs (fft_r2_wide).

fft_r2_pipe¶

The architecture for a pipelined FFT is based on design units from the rTwoSDF_lib and is basically the same as the rTwoSDF unit. The difference with respect to the rTwoSDF unit is that the fft_r2_pipe unit must be capable of processing two real inputs as well. Therefore the archtectural block diagram is extended with an optional separate function. Figure 4 gives an archtectural overview of the design.

fft_r2_par¶

In the case of a parallel FFT, all time domain samples for a slice come in parallel and therefore all multiplications and additions have to be performed in parallel as well. The architecture for a parallel for a parallel FFT is shown in Figure 5. In Figure 5, the number of points is set to 16. Each square represents an optimised complex butterfly. The numbers in the butterfly refer to the exponent k in \(W_N^k\) (the twiddle factors). The parallel FFT is also capable of reordering the output data and processing two real inputs. Therefore a parallel reorder and parallel separate function are defined as well.

fft_r2_wide¶

The wideband variant of the FFT is partly pipelined and partly composed in parallel. The amount of parallelization is specified by P (wideband factor). The architecture is shown in Figure 6. The reorder functionality is inherited from both the fft_r2_par and fft_r2_pipe units, but for the separation functionality a dedicated wideband variant must be designed.

Quantisation¶

Requantisation is required for every butterfly in the FFT. With the FFT being a DIF FFT, the butterfly operation is shown in Figure 7 and it’s algorithmic operation is detailed below:

\[x' = x + y\]

\[y' = W_N^k \times (x - y)\]

Each butterfly performs an addition and multiplication. For fixed point number systems of values in the range \((0.5, -0.5]\) we get MSB growth from additions and LSB growth from multiplications. The LSB growth from multiplication may be sliced away (which introduces minor effects due to the rounding scheme). This wideband FFT offers the option to truncate or round away from zero (TODO: introduce even rounding). MSB growth from addition will cause overflow that may be handled in two ways: SATURATE the value or WRAP it. Wrapping will use no logic while saturation will require logic to prevent natural wrapping. Ideally however, the effect of overflow in the FFT is irreversible and as such should be prevented. This is done by scaling the data by 2 before entry into each stage, and by extension each butterfly operating in parallel for that stage. Fine-control over which stages the FFT should apply a shift for is possible by populating the shiftregister port. Should overflow occur in any of the butterflies in the FFT, the overflow register will populate that bit-index with a ‘1’. Figure 8 shows the location of the overflow reporting and shift within the butterfly.

The shift operation acts on value \(A\) as:

\[A' = round(A >> 1, "Rounding behaviour")\]

where \(round()\) is a function that rounds the value \(A\) according to the “Rounding behaviour” specified see WB FFT Params.

Overflow is detected in a stage by inspecting the sign of the input to the adder and sign of the output. This check differs depending on whether a subtraction or addition is being performed. The following VHDL snippets make this check (TODO):