Applications are differentiated from kernels by two factors: they contain multiple iterations or multiple stages of smaller kernel components, and typically represent a much higher computational workload in comparison to kernels. A wide variety of applications have been ported and profiled on the Mireplica™ Visual-Processing Architecture, presenting a wide range of potential computing challenges. As a result, the architecture is stable and robust despite having a general-purpose application space.
The applications highlighted here are a broadly diverse sample of applications, and each presents different challenges not only to programmability, but, more significantly, to programming on a highly parallel implementation. The links open documents with implementation and performance details.
Scale-Invariant Feature Transform
Disparity Mapping extracts 3D information from two stereoscopic, 2D images. It is based on parallax: that is, the fact that objects nearer to the foreground have higher apparent motion between the two images than objects nearer to the background (in the image above, lighter pixels have higher apparent motion).
Disparity mapping is similar to motion estimation used in video compression. The left image is kept fixed, and the right image is scanned across this image, starting out overlaid (no displacement offset), up to a displacement of 64 pixels, with a displacement step of a single pixel per iteration. At each iteration, for each pixel in the image, a squared-sum-of-differences (SSD) is performed to compare the left and right images, in an 8x8 pixel region surrounding the pixel. The final result captures the displacement with the best match between the two images (minimum SSD). Higher displacements correspond to pixels closer to the foreground, and higher brightness.
The challenges presented by this application are that the contexts used are different, by one pixel, at each iteration, and that the neighborhoods for the SSD contain a total of 64 pixels. An integral image of the SSD is generated at each displacement, requiring summing all SSD values in the image to the left and above a given pixel, across the entire image. This requires summing pixels both across scan-line vectors and over all scan-lines, but simplifies the SSD operation, reducing it to four arithmetic operations instead of 63 for the 8x8 neighborhood (how is beyond the scope here).
As should be obvious from this description, it is very difficult (or, impossible) to partition this workload across multiple cores without a significant amount of overhead to form the required contexts and manage 64 different versions of the integral image, one for each displacement.
Connected-Component Labeling (CCL) assigns a unique label to structures of pixels (components) that abut each other in any direction (this doesn’t include being adjacent at corners). In the figure, the label is represented by a unique color. It is difficult to perform this labeling in a single pass over the image because, as illustrated by the figure, connectivity is based on relationships that exist far removed from labeled components. Also, the most straightforward implementation is based on some form of an exhaustive search of potential pixel connectivity.
This implementation performs the labeling shown using a single top-to-bottom pass over the image, and handles all forms of connectivity in terms of direction of the connection and the distance from the labeled pixel to the connection that determines the label.
Throughput scaling is not linear for this application because it makes very heavy use of shared histogram tables to make the equivalent of labeling decisions. These tables create read-write dependencies within the inner loop, because, to achieve a single pass, all labeling decisions must be made between a pair of scan-lines in an iteration before the next pair of scan-lines can be considered in the next iteration. Despite this, performance is very good considering the heavy use of the global shared memory, and the application is robust in considering all classes of connectivity.
Scale-Invariant Feature Transform (SIFT) is used to detect characteristics of object features (for example, locations of edges and corners) independent, within limits, of location, distance, and rotation of the object in the image. These features are used to classify the object, for example supporting searching a number of images or video for an object that matches characteristics of the searched object.
The input to this application is a pre-blurred image (as shown), which is upsampled to an image of twice the dimensions. This stabilizes the detection of features (makes them less sensitive to small perturbations), but quadruples the effective input context – for example, requiring a context size of 3840x2160 for full 1920x1080 HD resolution. Successive stages, described briefly below, are motivated by the requirement to detect object features regardless of relative sizes and positions in the image (i.e., location and rotation).
From the input, the first step is to form the base of a Gaussian pyramid, with 5 levels of 2D Gaussian blurring applied to form an octave (4 levels) and a scale above and below the octave (2 levels). From this base, the pyramid is built containing a successive number of octaves by down-sampling the appropriate level of the previous octave, to half the dimensions, to form the input level (one below the new octave), then generating another 5 levels of blur. This is repeated up to the pyramid level having less than 64 scan-lines.
To maintain precision, the Gaussian blur is applied to a larger region at each level: 11x11 pixels from the first level, then, for successive levels, 15x15, 17x17, 23x23, and 27x27. This requires a context, for each pixel, and in succession, of 121, 225, 289, 529, and 729 pixels. It also requires, for the blur operation, a respective total of multiply-accumulate (MAC) operations, per pixel. For perspective on the computational complexity, 1920x1080 resolution has 7 pyramid octaves, starting at 8M pixels per scale (50M total in the octave), with 16 billion MACs. Successive octaves divide these numbers by 4 from the previous octave, for a total context of 66M pixels and 22 billion MACs.
The Gaussian pyramid is then used to form a difference-of-Gaussian (DOG) pyramid, with 5 levels at each octave, formed by subtracting values in adjacent scales in the Gaussian pyramid. Features are detected by comparing all pixels in the inner 3 levels to pixels in the local 3x3 neighborhood within the scale, and the corresponding 3x3 neighborhoods in the adjacent scales above and below. This requires a total of up to 25 min/max comparisons per pixel (processing can stop at any point if the min/max test fails). Raw features are based on the octave, scale, and 2D location of local min/max pixels, and these are then used to interpolate fractional pixel locations and to filter points based on meeting certain relevancy criteria. The final result consists of these feature points and the DOG pyramid, which can be used in more detailed (inherently serial) analysis.
Obviously, it’s not sufficient to provide just the raw computational power, because it’s extremely difficult to maintain the contexts required, given the varying scope of the blur operations and the upsampling and downsampling required to form pyramid octaves. Even solving this problem, parallelism is limited by pyramid dependencies on lower levels, the fact that upper pyramid levels are relatively narrow, and, for the most part, that feature detection results in just a few number of detected features despite all the work required to detect them (in this example, less than 30 total for HD).
SIFT throughput does not scale linearly with processing because increasing the resolution increases the workload (number of octaves). However, all solutions have this issue. For example, for SIFT at HD resolution, an Intel i7 processor takes over 40 billion cycles. The Mireplica Architecture takes about 25M cycles, representing a speedup of 1600x using 960 datapaths. The speedup is higher than the number of processing elements because the Mireplica Architecture doesn’t have issues related to cache misses and load/store latency, as well as using a very large number of registers efficiently.
It does have to be observed, though, that this is the only context in which the phrase “scaling a DOG pyramid” not only is meaningful, but something to be concerned about…