Kernels are simple algorithms, typically components of more complex applications. The kernels included here highlight features of the Mireplica™ Visual-Processing Architecture. Their simplicity amplifies the potential impact of overhead that is sometimes hidden by heavy computational workloads. Together they illustrate how this architecture overcomes limitations to highly parallel processing, particularly with respect to conventional SIMD- and/or SMP-based implementations. Here’s a brief description of each kernel, and the particular challenge addressed by each. The links open documents with implementation and performance details:
Sobel Edge Detect
Erosion can be considered as generating binary pixels (white = “true”) for pixels in the source image that meet certain criteria. In this example, the output pixels are based on the input neighborhood having a gradient of values (in any direction) that exceeds a specified threshold. The range in context size from CIF to HD-1080 resolution is over a factor of 20. Despite this, throughput, in terms of mega-pixels per second (Mpix/sec), scales linearly. Typically, beyond a certain granularity, overhead is required to re-form shared context in a global memory, which in turn requires synchronizing processing elements to have a consistent view of this context.
Sobel Edge Detect implements a filter whose output brightness at a given position is a function of the gradient at that position. In this respect it is similar to erosion, except that the output pixels are continuous – not binary – based on the magnitude of the gradient vector. Since the gradient vector is in any direction, the magnitude is given by the square root of the sum of squares of horizontal and vertical gradients. In this example, the square root is provided by a 64k-entry lookup table (pixels are 16-bit values). This avoids the cycles to do the square-root computation, but requires a large table that is shared by up to 240 processing elements. In conventional systems, either the table is replicated and distributed, or processing serializes on table access – the wider the processing elements, the more severe this serialization is. With the Mireplica architecture, scaling is linear with context size even though all processing elements share a single global table.
Defective Pixel Correction
Fast Fourier Transform
Defective Pixel Correction adjusts the values of pixels whose magnitudes are beyond a threshold of the minimum or maximum of the neighboring 3x3 region. The adjusted value is the value in the neighborhood just below the maximum or just above the minimum. This particular kernel operates on red-green-blue (RGB) values, and so has three times the context size and input/output bandwidth of erosion and Sobel edge detect. Linear scaling of throughput still is achieved even though the ratio of input/output to computation is much higher. This also demonstrates the ability to scale performance even though the input format is interleaved: red, green, and blue pixels are contiguous, though the kernel operates on them independently.
Fast Fourier Transform (FFT) has a very small number of operations per input sample pair, but places enormous demand on the interconnect bandwidth between parallel processing elements in order to implement the “butterfly” (a simple 16-point butterfly is shown in the diagram: the smallest size in the table is 128-point). The final stage of the butterfly requires communication across the entire vector of samples, so a 1024-point FFT requires communicating 512 vector elements in each direction across the processing elements. In this case, scaling is good, but not linear, because the interconnect bandwidth is only about 20% of that required for linear scaling. While this bandwidth could be implemented, it’s generally not the case that applications have such high communication demands for such little computation. Even in the case of FFT, scaling would be closer to linear if the FFT were just a component of a larger application-processing chain.
Mandelbrot Sets have limited application in visual processing, but this example is included to represent a couple of significant challenges to parallel execution. This algorithm conceptually implements a small inner loop per pixel, with the loop iterating from 1 to 256 times based on conditions related to the pixel position. A straightforward mapping is to implement this loop in parallel across a large vector of pixels (this example uses vectors of the width of scan-lines). However, loop termination conditions also must be implemented in parallel. This requires a means to detect that all positions in the vector have reached the termination condition, which is normally time-consuming and difficult to implement, using pair-wise comparisons from the starting vector of conditions to a single value. This inefficiency is compounded by the fact that, on many lines, most of the positions meet the termination condition very early, but there can be a few positions on the line that iterate the maximum number of times (the lighter regions in the figure represent higher numbers of iterations, with white being the maximum). Thus, in the straightforward case, most positions on the line are idle while a few positions are active, causing a significant loss in parallelism. As explained in the accompanying detailed description, the results shown here use features of the architecture to overcome both of these limitations.