The biggest challenge to broader commercialization of AI has been the computational cost of running inference and training with a particular model and set of data structures. AI acceleration aims to speed up AI computation and reduce the sizes of the required models and data structures, thereby increasing computational throughput and the rate at which results are produced in real applications.
This involves two approaches: selecting the right hardware architecture to accelerate AI, and designing models that require less computation. The former is the traditional approach to AI acceleration; it involved running computation in parallel on progressively larger processors (and eventually GPUs) in order to keep compute times low. The latter is newer and is enabled by software-based techniques that reduce workloads without requiring larger processors.
Today’s most advanced implementations of on-device AI inference and cloud-based model training combine both approaches to achieve the highest possible throughput. In this article, we’ll outline some of the AI acceleration steps that can be implemented on common hardware platforms and in common languages for building AI models as part of a larger embedded application.
AI Acceleration Happens in Hardware and Software
The goal in AI model acceleration is two-fold: to reduce the size of models and data structures involved in AI computation, and to speed up inference to produce useful results. The same idea applies in training. Throughout the majority of AI history, hardware acceleration was rudimentary, basically involving throwing computing resources at problems until the computation time became reasonable. Now that so much research attention has shifted to neural network development, software-level acceleration techniques are also commonplace and are aided by open-source/vendor libraries and code examples.
For AI in embedded systems, the goal has been to make computation fast enough to run reliably on small microcontrollers. While we may not have reasonably fast training in such small systems, something as simple as an ESP32 or an Arduino board can perform very fast inference with simple models using the software acceleration steps outlined below. Other popular microcontrollers like STM32 have plenty of vendor support for implementing software-based AI acceleration to speed up inference tasks.
Simpler AI inference models can be run with small data structures on an Arduino as long as the necessary model acceleration steps are implemented.
Training vs. Inference
The other side of implementing AI in embedded systems is model training, which requires processing significant amounts of data and building a predictive model based on that data. The processing steps required for model training are typically performed in the cloud or at the edge rather than on the device. While on-device training may be fine for simpler models with small input datasets, for anything larger the training time becomes far too long to be practically useful on an embedded device with a single small processor.
The acceleration methods outlined below are most often looked at for inference, but they can apply just as well in training.
Quantization
Formally, quantization refers to a broad set of mathematical techniques that map a large set of input values to a smaller set of output values. In AI computation, quantization refers to approximating the weights, inputs, outputs, and intermediate results from multiply-accumulate operations with low-precision integer numbers. Converting to a lower bit depth requires normalization, so that the range of values across the neural network inputs maps onto the available integer range.
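As a rough illustration of this mapping, the sketch below quantizes a handful of floating-point weights to 8-bit signed integers using a scale and a zero point, then converts them back to see how much precision was lost. The affine scheme, the function names, and the sample weights are assumptions chosen for illustration; the article does not prescribe a particular quantization method.

```python
def quantize(values, num_bits=8):
    """Map floats to signed integers with an affine (scale + zero-point) scheme."""
    qmin, qmax = -(2 ** (num_bits - 1)), 2 ** (num_bits - 1) - 1
    # The represented range must include 0 so that zero maps to an exact integer.
    lo, hi = min(min(values), 0.0), max(max(values), 0.0)
    scale = (hi - lo) / (qmax - qmin) or 1.0  # fall back to 1.0 for all-zero input
    zero_point = round(qmin - lo / scale)
    q = [max(qmin, min(qmax, round(v / scale) + zero_point)) for v in values]
    return q, scale, zero_point


def dequantize(q, scale, zero_point):
    """Recover approximate float values from the quantized integers."""
    return [(x - zero_point) * scale for x in q]


weights = [-0.62, -0.05, 0.0, 0.31, 1.27]  # hypothetical layer weights
q, scale, zero_point = quantize(weights)
restored = dequantize(q, scale, zero_point)
max_error = max(abs(a - b) for a, b in zip(weights, restored))
print(q, scale, zero_point, max_error)
```

The scale and zero point play the role of the normalization step: they are the only extra values that need to be stored alongside the integer weights.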
Quantization can introduce errors into the AI computation, particularly when continuous signals are used as inputs. These signals are normally digitized with an ADC, so they are already quantized, but further quantization introduces additional error in the digital representation of the input signal. Although quantization can be applied after the network has been trained, quantization of inputs and weights should be applied from the beginning of training to ensure the highest accuracy.
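To make the added error concrete, here is a small sketch that digitizes a sine wave at an assumed 12-bit ADC resolution and then re-quantizes it to an assumed 8-bit model input. The bit depths, the test signal, and the helper names are illustrative assumptions rather than values from any particular device.

```python
import math

ADC_BITS = 12    # assumed ADC resolution
MODEL_BITS = 8   # assumed bit depth of the model input

# One cycle of a sine wave, normalized to [0, 1], stands in for the analog signal.
signal = [0.5 + 0.5 * math.sin(2 * math.pi * n / 64) for n in range(64)]


def digitize(x, bits):
    """Round a value in [0, 1] to the nearest of 2**bits uniform levels."""
    levels = 2 ** bits - 1
    return round(x * levels) / levels


def rms_error(reference, approx):
    """Root-mean-square difference between two equal-length sequences."""
    total = sum((r - a) ** 2 for r, a in zip(reference, approx))
    return math.sqrt(total / len(reference))


adc = [digitize(x, ADC_BITS) for x in signal]
requantized = [digitize(x, MODEL_BITS) for x in adc]

adc_error = rms_error(signal, adc)
total_error = rms_error(signal, requantized)
print(f"ADC only: {adc_error:.2e}, after re-quantization: {total_error:.2e}")
```

The second figure is larger than the first, which is the extra error the model has to tolerate when its inputs are stored at a lower bit depth than the ADC provides.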
Pruning
The idea behind pruning is to reduce the overall size of a model by removing some neurons, or possibly entire layers. As a side effect of quantization, there is an upper limit on the number of layers and weights beyond which accuracy will not improve, so it makes sense to limit the number of parameters in a neural network.
Pruning is performed by eliminating the weights in a neural network that are deemed to be least significant. One definition of “least significant” is any neuron whose weights fall below a certain threshold. Reducing the model size reduces the total number of compute operations in inference. Pruning can also be used to reduce the number of training iterations by dynamically adjusting the number of neurons and layers when designing the neural network architecture.
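A minimal sketch of threshold-based pruning is shown below, assuming each row of a weight matrix holds the weights of one neuron. The threshold value, the layer contents, and both function names are hypothetical.

```python
def prune_by_magnitude(weights, threshold):
    """Zero out every weight whose absolute value is below the threshold."""
    return [[w if abs(w) >= threshold else 0.0 for w in row] for row in weights]


def prune_neurons(weights, threshold):
    """Drop whole neurons (rows) whose weights all fall below the threshold."""
    return [row for row in weights if any(abs(w) >= threshold for w in row)]


# Hypothetical layer: three neurons with three input weights each.
layer = [
    [0.80, -0.02, 0.45],
    [0.01, 0.03, -0.04],
    [-0.60, 0.07, 0.00],
]

pruned = prune_by_magnitude(layer, threshold=0.05)
compact = prune_neurons(layer, threshold=0.05)
print(pruned)
print(len(layer), "neurons before,", len(compact), "after")
```

In a real network, removing a neuron also means removing the matching input column in the next layer, and the model is usually retrained or fine-tuned afterward to recover accuracy; neither step is shown here.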
Sparsity
Sparsity is a property of matrices and tensors that contain zero-valued elements, which lead to redundant or meaningless computation. Because these elements produce zero-valued products that do not add to the accuracy of the model, they can be removed from the multiply-accumulate operations. This involves a simple logical check that runs through the integer values stored in a register. Removing these zeroes from tensor/matrix operations reduces the overall memory and processing requirements in both training and inference.
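The sketch below shows the idea for a single dot product with integer weights: the zero check is done once when the weights are compressed, and the multiply-accumulate loop then touches only the nonzero entries. The (index, value) storage format and the function names are assumptions for illustration; hardware and library implementations use their own sparse formats.

```python
def dense_dot(weights, inputs):
    """Multiply-accumulate over every element; returns (result, MAC count)."""
    acc = 0
    for w, x in zip(weights, inputs):
        acc += w * x
    return acc, len(weights)


def compress(weights):
    """Keep only the nonzero weights as (index, value) pairs."""
    return [(i, w) for i, w in enumerate(weights) if w != 0]


def sparse_dot(compressed, inputs):
    """Multiply-accumulate over nonzero weights only; returns (result, MAC count)."""
    acc = 0
    for i, w in compressed:
        acc += w * inputs[i]
    return acc, len(compressed)


weights = [3, 0, 0, -2, 0, 5, 0, 0]   # hypothetical quantized weights
inputs = [7, 4, 1, 6, 9, 2, 8, 3]     # hypothetical quantized inputs

dense_result = dense_dot(weights, inputs)
sparse_result = sparse_dot(compress(weights), inputs)
print(dense_result, sparse_result)  # (19, 8) (19, 3)
```

Both versions give the same result, but the sparse version performs 3 multiply-accumulate operations instead of 8 and stores 3 weights instead of 8.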
Pre-Processing
The steps outlined above can technically be considered pre-processing at the model level, but the entries in the dataset can also be pre-processed. When applied to a dataset, pre-processing can involve multiple techniques that aim to improve the accuracy of inference results when the model is applied to new data. If pre-processing is used in inference, then the same pre-processing should be applied to the dataset used in training. This ensures maximum correspondence between the model, inference results, and the training dataset.
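One common way to keep training and inference consistent is to compute the pre-processing parameters once from the training data and reuse them unchanged at inference time. The sketch below does this for simple standardization (subtract the mean, divide by the standard deviation); the choice of standardization, the sample readings, and the function names are assumptions for illustration.

```python
import statistics


def fit_scaler(samples):
    """Compute normalization parameters from the training data only."""
    mean = statistics.fmean(samples)
    std = statistics.pstdev(samples) or 1.0  # avoid dividing by zero
    return mean, std


def apply_scaler(samples, mean, std):
    """Apply previously fitted parameters; used for both training and inference."""
    return [(x - mean) / std for x in samples]


training_data = [20.0, 22.0, 24.0, 26.0, 28.0]  # hypothetical sensor readings
mean, std = fit_scaler(training_data)

train_scaled = apply_scaler(training_data, mean, std)
new_reading = apply_scaler([25.0], mean, std)   # same parameters at inference
print(train_scaled, new_reading)
```

On an embedded device, `mean` and `std` would typically be stored as constants alongside the model so that the inference path cannot drift from what was used in training.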
Workflow Optimization
Workflow optimization does not refer to the actual processing being done in inference and training within an AI-capable system. Instead, it refers to everything else the system has to do in order to access and use an AI model. This makes workflow optimization difficult to generalize, as the exact definition of “optimal workflow” is different for every system, and it requires optimizing multiple steps or processes that might have nothing to do with computation in inference or training.
In short, an optimized workflow should minimize the amount of processing a device must perform in order to capture data, pre-process it, pass it to an AI model, execute inference/training, and interpret the results. This could involve a combination of software and hardware factors that have little to do with AI. Some examples include:
- Examining system processing and tagging algorithms for efficiency improvements
- Implementing a more efficient tagging mechanism (for data capture)
- Consolidating ASICs required for processing into a single chip like an FPGA
- Parallelizing AI-critical processing tasks and non-critical tasks
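As a rough sketch of the last item, the example below decouples data capture from inference with a bounded queue, so that neither task has to wait for the other to finish a full batch. The two-thread layout, the stand-in pre-processing step, and the one-line "model" are all hypothetical; on a microcontroller the same pattern would more likely be built from DMA transfers, interrupts, or RTOS tasks.

```python
import queue
import threading


def capture(samples, out_queue):
    """Stand-in for data capture: push raw samples, then a sentinel."""
    for sample in samples:
        out_queue.put(sample)
    out_queue.put(None)  # signals that no more data is coming


def infer(in_queue, results):
    """Stand-in for pre-processing plus inference on each captured sample."""
    while True:
        sample = in_queue.get()
        if sample is None:
            break
        scaled = sample / 255          # hypothetical pre-processing step
        results.append(scaled > 0.5)   # hypothetical one-line "model"


buffer = queue.Queue(maxsize=4)  # bounded so capture cannot outrun inference
results = []

producer = threading.Thread(target=capture, args=([12, 200, 130, 90], buffer))
consumer = threading.Thread(target=infer, args=(buffer, results))
producer.start()
consumer.start()
producer.join()
consumer.join()

print(results)  # [False, True, True, False]
```

The bounded queue is the design choice that matters here: it caps memory use and applies back-pressure to the capture task when inference falls behind.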
More on AI Acceleration in Hardware
There are ways to accelerate AI implementations in hardware. In the not-too-distant past, there were no hardware implementations of AI; everything was run as embedded software, so the only way to really accelerate AI computation was to throw more computing resources at AI problems and to optimize the models themselves. This meant a larger/faster processor, an additional general-purpose processor, or using dedicated GPUs to run AI models.
AI Accelerator Chipsets for Inference
Today, there are hardware platforms that provide AI acceleration as a direct hardware implementation. There are three options currently on the market:
- AI accelerator ASICs
- New SoCs with an in-package AI core (equivalent to #2)
- FPGAs
AI accelerator chip vendors and FPGA vendors now offer tools that allow instantiation of models in hardware using their custom interconnect fabrics, which greatly reduces the number of combinational + sequential logic steps involved in traditional AI computation. For AI accelerator chips, developers can use open-source or vendor-provided libraries to run drivers for implementing and running models with the software steps outlined above. For FPGAs, vendor tools are used to reduce a TensorFlow + TinyML model to a VHDL implementation, so the model processing is performed directly on silicon.
One of the better-known AI accelerator chip and module product lines is the Coral platform from Google. This AI accelerator chip, which Google calls a “Tensor Processing Unit (TPU)”, implements the logic required for tensor arithmetic directly on silicon. Users can pass TensorFlow Lite models to the chip along with input data for inference, and the results are passed back to a system host processor for use in an embedded application.
Image of the Google Coral TPU chip. [Image credit: Google]
Hardware Acceleration for Training
Hardware acceleration and software acceleration also play an important role in training. In the cloud or at the edge, the system that will oversee training generally has multiple tasks to perform, so an accelerator chipset or module will be dedicated to processing data and training models. The software-based AI acceleration steps outlined above are also used as part of model training to reduce the overall training time by reducing the number of logic operations required per training iteration.
Given the huge amount of data required for training these models, and the increasing frequency with which models are retrained, dedicated processing modules for AI training tasks make the most sense. Servers in a data center have multiple tasks to perform at any given moment, so dedicated hardware modules for training are installed in these servers in PCIe card slots. GPUs are typically used, but FPGAs are also highly competitive for dedicated training tasks in a data center environment.
Taken together, the hardware and software approaches outlined here are highly effective for AI acceleration. When you’re ready to design your embedded system with AI acceleration, use Allegro PCB Designer, the industry’s best PCB design and analysis software from Cadence. Allegro users can access a complete set of schematic capture features, mixed-signal simulations in PSpice, powerful CAD features, and much more.