The Mireplica® Visual-Processing Architecture employs a radically new form of instruction generation and execution to address one of the remaining frontiers in computer architecture: programmable vision. It enables programmable solutions that are three orders of magnitude faster than conventional processors, yet are easily programmed using standard C++ and adapted to existing instruction sets and toolchains.
Mireplica® Technology: Efficient Visual Processing
Computer vision, such as feature and gesture recognition, is an important emerging application area, particularly for battery-operated devices including mobile handsets, tablets, and untethered robotics. Software programmability and energy efficiency are essential – but conflicting – goals. It’s undesirable to commit algorithms to hardware, especially since many are based on heuristics and continue to evolve and improve. However, typical programmable solutions are orders of magnitude less energy efficient than hardware-based solutions. Mireplica Technology has developed a fundamentally new programmable architecture for this application area that is easy to program and nearly optimal in performance and energy efficiency.
What Does Efficiency Require?
A key figure of merit for energy-efficient computing is “throughput density,” expressed in terms of application operations (not necessarily instructions) per second per unit area of silicon. Ideally: computing area comprises nothing but functional units and memory, instructions are only those directly required by the algorithms (for example, arithmetic operations only, with no branches or function calls), and the instructions-per-cycle (IPC) rate per datapath is 1. This minimizes power and energy: power is related to switched capacitance, which is proportional to area, and energy is related to the number of switching transitions, which is inversely proportional to throughput.
Why Do Conventional Techniques Fall Short?
Optimizing throughput density argues for efficiently exploiting the parallelism that is inherent in pixel frames and related data. However, high efficiency isn’t possible using conventional parallel programmable architectures. Symmetric multi-processing (SMP) organizations have a large area overhead with respect to the functional units that perform operations, as well as a large cycle overhead for communication and synchronization, especially given the fine-grained, pixel-level dataflow required for visual processing. A typical alternative is to exploit parallelism using single-instruction, multiple-data (SIMD) organizations, which increase the number of operations per instruction, typically to 32 or 64 operations. However, most images are much wider than 32-64 pixels, and processing has no natural limits on the span of data dependencies across an image frame.
For example, feature detection using scale-invariant feature transform (SIFT) relies on a series of upsampling, Gaussian blur, subsampling, difference operations, and min/max comparisons. At high-definition resolution (1920x1080), the total context size is over 66M pixels, and pixel regions involved in computations range from 26 to over 700 pixels – with a unique region for each pixel in the source frame. A detected feature at any location can depend on pixel information from any or all other positions in the image. Each location has a unique set of dependencies, so there is no natural partition of the image data that can be allocated across multiple SIMD units, and SIMDs have physical limitations that prevent them from efficiently sharing data across entire image frames. Consequently, though SIMDs somewhat improve throughput density, they essentially degenerate into single-processor or SMP organizations beyond a small scope of the image.
A Radically New Approach
The Mireplica® Visual-Processing Architecture employs a form of instruction generation and execution that is unlike any existing computing architecture. It allows an effectively unlimited number of functional units to operate in parallel on visual data, with effectively unlimited sharing of data across image frames. Despite the novel implementation, it retains the sequential semantics of SIMD execution, and the hardware is fully abstracted from the programmer using C++ object classes to represent image frames and related data. It doesn’t require new compiler or processor development, because the implementation appears as a custom functional unit in any processor that supports instruction customization. Custom instructions are completely abstracted by the C++ class library; they are analogous to a compiler’s intermediate language, generating instructions for the target processing platform while hiding the platform’s details from users.
Performance metrics described elsewhere on this site are derived from a cycle-accurate, functionally accurate C++ model. These typically demonstrate linear scaling of throughput with processing elements, and also linear scaling of cycles with the vertical dimension, instead of quadratic scaling (vertical times horizontal). This results in speedups of over two orders of magnitude compared to a sequential processor. Datapaths and memory are dynamically allocated according to resolution, another important aspect of minimizing power and energy.