November 1, 2023 by Debajyoti Ray

Scaling Useful Magic: Optimizing Transformer Models for the Enterprise World

Large language models have created magic. ChatGPT has been the fastest-adopted technology to date, and the conversation has now shifted to how it can be made useful. For it to be adopted in enterprise applications, it must also be scalable.

I began my foray into Generative AI in 2018 by creating the world’s first AI-written movie, which felt like magic. Over the next several years, my focus shifted to finding where AI would actually be useful, especially in the Marketing and Communications domain. The MarComms market is huge, and it already had visionaries inside large enterprises, like my cofounder Mark Seall, then at Siemens, who was thinking about how Generative AI would change their world. Now, at InferenceCloud, our focus is on building a scalable platform on which we can deliver useful applications for enterprise customers.

Over the past few years, most commercial ventures have focused on training foundation models. To make those models useful, we must deal with the computational cost of running inference on LLMs. The good news is that the world of AI research is rising to the challenge, developing optimization techniques that make these models more accessible and efficient.

The Art of Optimization

Optimization techniques for transformer models primarily focus on reducing model size and enhancing performance. The relationship between the number of parameters in a model and its accuracy is not linear, and there’s a saturation point beyond which adding more parameters doesn’t necessarily improve accuracy. This understanding forms the basis of many optimization techniques.

One such technique is pruning, which removes less important weights, attention heads, or tokens while preserving the regular structure of the model. Another is quantization, which reduces the numerical precision of the model’s parameters (for example, from 32-bit floating point to 8-bit integers) without significantly affecting its performance.
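To make these ideas concrete, here is a minimal sketch using PyTorch’s built-in pruning and dynamic-quantization utilities on a toy feed-forward block. The layer sizes, the 30% pruning ratio, and the int8 choice are illustrative assumptions, not the settings we use in production.

```python
# Minimal sketch: magnitude pruning + dynamic quantization (PyTorch, assumed stack).
import torch
import torch.nn as nn
import torch.nn.utils.prune as prune

# Toy stand-in for one transformer feed-forward block (sizes are illustrative).
model = nn.Sequential(
    nn.Linear(512, 2048),
    nn.GELU(),
    nn.Linear(2048, 512),
)

# Pruning: zero out the 30% of weights with the smallest magnitude in each linear layer.
for module in model:
    if isinstance(module, nn.Linear):
        prune.l1_unstructured(module, name="weight", amount=0.3)
        prune.remove(module, "weight")  # make the sparsity permanent

# Quantization: store weights as int8 and dequantize on the fly at inference time.
quantized = torch.quantization.quantize_dynamic(model, {nn.Linear}, dtype=torch.qint8)

with torch.no_grad():
    out = quantized(torch.randn(1, 512))
print(out.shape)  # torch.Size([1, 512])
```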

Other techniques include neural architecture search, which involves using machine learning to find the most efficient model architecture, and lightweight network design, which aims to create models that are less computationally intensive.

The Role of Hardware

These optimizations pay off most when they are matched to the hardware that will run the model. Hardware-aware pruning, for instance, ensures that the full performance benefit of pruning is actually realized, because the resulting sparsity pattern maps onto work the hardware can skip. In addition, hardware-specific techniques such as pipelining and optimized matrix-multiplication kernels can significantly boost efficiency.
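The sketch below illustrates why structure matters to the hardware, again using PyTorch’s pruning utilities on a hypothetical layer: by removing entire output rows rather than scattered individual weights, the remaining matrix stays dense but smaller, so a standard matrix-multiplication kernel sees the speedup directly. The sizes and the 25% ratio are made up for illustration.

```python
# Sketch of structured (hardware-friendly) pruning: drop whole output rows.
import torch
import torch.nn as nn
import torch.nn.utils.prune as prune

layer = nn.Linear(1024, 1024)

# Zero out the 25% of output neurons (rows of the weight matrix) with the
# smallest L2 norm; whole rows, not scattered weights.
prune.ln_structured(layer, name="weight", amount=0.25, n=2, dim=0)
prune.remove(layer, "weight")

# Rows that are entirely zero can then be physically dropped, shrinking the
# matrix multiplication the hardware actually executes.
keep = layer.weight.abs().sum(dim=1) != 0
compact = nn.Linear(1024, int(keep.sum()))
compact.weight.data = layer.weight.data[keep]
compact.bias.data = layer.bias.data[keep]

print(layer.weight.shape, "->", compact.weight.shape)
```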

Hardware-aware Neural Architecture Search (HW-NAS) methods incorporate performance metrics of the underlying hardware platform into the search as a multi-objective optimization problem. This ensures that the resulting model is optimized for the specific hardware it will run on, further enhancing efficiency.
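Here is a hedged sketch of that multi-objective idea: each candidate architecture is scored by accuracy and by latency measured on the machine running the search. The candidate models, placeholder accuracies, latency budget, and weighting are all invented for illustration; this is the general shape of a hardware-aware objective, not a specific HW-NAS algorithm from the literature.

```python
# Sketch of a hardware-aware, multi-objective architecture score.
import time
import torch
import torch.nn as nn

def measure_latency(model: nn.Module, sample: torch.Tensor, runs: int = 20) -> float:
    """Average wall-clock latency (seconds) on the deployment hardware."""
    with torch.no_grad():
        start = time.perf_counter()
        for _ in range(runs):
            model(sample)
    return (time.perf_counter() - start) / runs

def hw_aware_score(accuracy: float, latency: float, budget: float, alpha: float = 1.0) -> float:
    """Reward accuracy, penalize exceeding the latency budget."""
    return accuracy - alpha * max(0.0, latency / budget - 1.0)

# Two toy candidate architectures differing only in hidden width.
candidates = {
    "wide": nn.Sequential(nn.Linear(512, 4096), nn.GELU(), nn.Linear(4096, 512)),
    "narrow": nn.Sequential(nn.Linear(512, 1024), nn.GELU(), nn.Linear(1024, 512)),
}
sample = torch.randn(1, 512)
# Placeholder accuracies; in practice these come from validation data.
accuracies = {"wide": 0.90, "narrow": 0.87}

best = max(
    candidates,
    key=lambda name: hw_aware_score(
        accuracies[name], measure_latency(candidates[name], sample), budget=0.002
    ),
)
print("selected architecture:", best)
```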

The Future is Optimized

The high computational cost associated with training and fine-tuning large models has led to a growing demand for more robust and scalable computing infrastructure. This has motivated researchers to propose techniques that reduce transformers’ size, latency, and energy consumption, enabling efficient inference across a wide range of applications.

For instance, the SpAtten technique prunes unimportant tokens and attention heads during inference, making transformer models more efficient for tasks such as content generation. Similarly, techniques like PSAQ-ViT and LightHuBERT compress large models, through quantization and through distillation from large teacher networks into smaller student models respectively, making them more efficient and accessible for enterprise applications.
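The teacher–student idea is simple enough to show in a few lines. The sketch below uses a generic distillation loss (a softened KL divergence between teacher and student logits, plus the usual task loss) on toy models; it is not the specific procedure of PSAQ-ViT or LightHuBERT, and the temperature and loss weighting are illustrative assumptions.

```python
# Sketch of teacher -> student knowledge distillation with toy models.
import torch
import torch.nn as nn
import torch.nn.functional as F

teacher = nn.Sequential(nn.Linear(128, 512), nn.GELU(), nn.Linear(512, 10))
student = nn.Sequential(nn.Linear(128, 64), nn.GELU(), nn.Linear(64, 10))

optimizer = torch.optim.Adam(student.parameters(), lr=1e-3)
temperature = 2.0

x = torch.randn(32, 128)                 # a batch of inputs
labels = torch.randint(0, 10, (32,))     # hard labels for the task loss

with torch.no_grad():
    teacher_logits = teacher(x)          # the large model's "soft" targets

student_logits = student(x)

# Match the student's softened distribution to the teacher's, plus the usual
# cross-entropy against the hard labels.
distill_loss = F.kl_div(
    F.log_softmax(student_logits / temperature, dim=-1),
    F.softmax(teacher_logits / temperature, dim=-1),
    reduction="batchmean",
) * temperature ** 2
task_loss = F.cross_entropy(student_logits, labels)
loss = 0.5 * distill_loss + 0.5 * task_loss

loss.backward()
optimizer.step()
print(f"distillation loss: {loss.item():.4f}")
```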

Optimized transformer models make real-time chained inference practical, where the output of several models (Notebooks, in our case) is used to train or fine-tune other models downstream. They can also power collaborative content generation, which demands a high level of efficiency, speed, and accuracy. By chaining inference models, we can learn from the outputs of multiple users across different divisions. This is the foundation for supercharging organizational intelligence, which is the mission of InferenceCloud.
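As a purely hypothetical illustration of chaining, the sketch below feeds the output of one off-the-shelf optimized model (a distilled summarizer) into a second model downstream, using the Hugging Face pipeline API. The models, function names, and sample text are placeholders, not InferenceCloud’s actual Notebooks or API.

```python
# Hypothetical two-stage chained inference: one model's output feeds the next.
from transformers import pipeline

summarizer = pipeline("summarization", model="sshleifer/distilbart-cnn-12-6")
classifier = pipeline("sentiment-analysis")

def chained_inference(document: str) -> dict:
    # Stage 1: a distilled (optimized) model condenses the raw content.
    summary = summarizer(document, max_length=60, min_length=10)[0]["summary_text"]
    # Stage 2: a second model consumes the first model's output downstream.
    sentiment = classifier(summary)[0]
    return {"summary": summary, "sentiment": sentiment}

sample_report = (
    "The regional marketing teams reported strong engagement on the new campaign, "
    "with social reach up across all divisions, although conversion in the enterprise "
    "segment lagged expectations and will need a revised content strategy next quarter."
)
print(chained_inference(sample_report))
```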