Efficient Peripheral Language Models Hardware-Co-Designed for Ubiquitous Computing
Cheng Deng¹, Luoyang Sun², Jiwen Jiang², Yongcheng Zeng²,
Xinjian Wu³, Wenxin Zhao¹, Qingfa Xiao¹, Jiachuan Wang⁴, Haoyang Li⁵,
Lei Chen¹˒⁴, Lionel M. Ni¹, Haifeng Zhang², Jun Wang³˒⁶
¹ The Hong Kong University of Science and Technology (Guangzhou)
² Institute of Automation, Chinese Academy of Sciences
³ University College London
⁴ The Hong Kong University of Science and Technology
⁵ The Hong Kong Polytechnic University
⁶ UCL Centre for Artificial Intelligence
Corresponding authors: Cheng Deng, Lei Chen, Jun Wang
The PLM (Peripheral Language Model) series introduces a novel model architecture to peripheral computing, delivering powerful language capabilities within the constraints of resource-limited devices. Through a modeling and system co-design strategy, PLM optimizes model performance while meeting edge-system requirements. PLM employs Multi-head Latent Attention and squared ReLU (ReLU²) activation to achieve sparsity, significantly reducing memory footprint and computational demands. Coupled with a meticulously crafted training regimen using curated datasets and a Warmup-Stable-Decay-Constant learning rate scheduler, PLM outperforms existing small language models while activating the fewest parameters, making it well suited for deployment on diverse peripheral platforms such as mobile phones and Raspberry Pi boards.
Experimental cases
Raspberry Pi
NVIDIA Jetson Orin
MacBook Air M3
OnePlus 12 Pro (Snapdragon 8 Gen 3)
PLM uses Multi-head Latent Attention (MLA), an architectural enhancement that improves the efficiency of large language models (LLMs) by significantly reducing the memory footprint of key-value (KV) caching. Unlike the original MLA formulation, PLM preserves the full-rank structure of the query projections to balance training efficiency and inference speed effectively.
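Below is a minimal PyTorch sketch of this attention layout, in which keys and values are reconstructed from a single cached latent vector while queries keep a full-rank projection. The class name SimplifiedMLA and all dimensions are illustrative assumptions rather than PLM's actual configuration, and the decoupled rotary-embedding path used in practice is omitted for brevity.

import torch
import torch.nn as nn


class SimplifiedMLA(nn.Module):
    """Illustrative MLA block: compressed KV latent cache, full-rank queries."""

    def __init__(self, d_model=2048, n_heads=16, head_dim=128, kv_latent_dim=512):
        super().__init__()
        self.n_heads, self.head_dim = n_heads, head_dim
        # Full-rank query projection (kept full-rank in PLM).
        self.q_proj = nn.Linear(d_model, n_heads * head_dim, bias=False)
        # Keys and values are reconstructed from one small latent, which is all we cache.
        self.kv_down = nn.Linear(d_model, kv_latent_dim, bias=False)
        self.k_up = nn.Linear(kv_latent_dim, n_heads * head_dim, bias=False)
        self.v_up = nn.Linear(kv_latent_dim, n_heads * head_dim, bias=False)
        self.o_proj = nn.Linear(n_heads * head_dim, d_model, bias=False)

    def forward(self, x, kv_cache=None):
        # kv_cache holds only the compressed latents of previously seen tokens.
        b, t, _ = x.shape
        c_kv = self.kv_down(x)
        if kv_cache is not None:
            c_kv = torch.cat([kv_cache, c_kv], dim=1)
        q = self.q_proj(x).view(b, t, self.n_heads, self.head_dim).transpose(1, 2)
        k = self.k_up(c_kv).view(b, -1, self.n_heads, self.head_dim).transpose(1, 2)
        v = self.v_up(c_kv).view(b, -1, self.n_heads, self.head_dim).transpose(1, 2)
        # Causal mask during prefill; single-token decode steps attend to the whole cache.
        out = nn.functional.scaled_dot_product_attention(q, k, v, is_causal=kv_cache is None)
        out = out.transpose(1, 2).reshape(b, t, -1)
        return self.o_proj(out), c_kv  # updated latent cache for the next step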
We adopt a streamlined Feed-Forward Network (FFN) architecture that omits gating mechanisms, thereby reducing computational complexity and memory usage. ReLU² introduces quadratic growth in activation values, enabling the network to capture richer feature representations. ReLU² demonstrates exceptional performance by balancing sparsity and computational efficiency.
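A minimal PyTorch sketch of this gate-free FFN with ReLU² activation follows; the dimensions are illustrative assumptions, not PLM's actual configuration.

import torch
import torch.nn as nn


class ReLU2FFN(nn.Module):
    """Illustrative gate-free FFN with squared ReLU activation."""

    def __init__(self, d_model=2048, d_hidden=8192):
        super().__init__()
        # Only an up and a down projection: no gate branch as in SwiGLU-style FFNs.
        self.up = nn.Linear(d_model, d_hidden, bias=False)
        self.down = nn.Linear(d_hidden, d_model, bias=False)

    def forward(self, x):
        # ReLU²: negative pre-activations stay exactly zero, giving high activation sparsity.
        h = torch.relu(self.up(x)) ** 2
        return self.down(h)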
We train PLM on a total of 2.48 trillion tokens. During training, we employ a Warmup-Stable-Decay-Constant (WSDC) learning rate scheduler, which introduces an additional constant-learning-rate phase following the decay stage.
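The schedule can be sketched as a piecewise function of the training step; the phase boundaries and learning-rate values below are placeholder assumptions, not the ones used to train PLM.

def wsdc_lr(step, warmup_end=2_000, stable_end=80_000, decay_end=100_000,
            peak_lr=3e-4, final_lr=3e-5):
    """Illustrative Warmup-Stable-Decay-Constant (WSDC) schedule."""
    if step < warmup_end:                                   # linear warmup
        return peak_lr * step / warmup_end
    if step < stable_end:                                   # stable phase at the peak rate
        return peak_lr
    if step < decay_end:                                    # linear decay toward the final rate
        frac = (step - stable_end) / (decay_end - stable_end)
        return peak_lr + frac * (final_lr - peak_lr)
    return final_lr                                         # constant phase after the decay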
During supervised fine-tuning (SFT), we conduct training in two progressive phases utilizing the LLaMA-Factory toolkit.
In Phase 1, we leverage diverse, general-purpose instruction-following datasets covering chat interactions, web-based queries, and general instructions. In Phase 2, we extend this training by placing a substantial emphasis on specialized domains, particularly mathematics and coding.
During the preference training phase, we adopt an approach similar to our previously proposed method, ARIES, leveraging its self-refinement loss to perform preference training on PLM; this refinement loss is employed to stimulate and strengthen the model's self-refinement capability.
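The exact self-refinement objective is defined in the ARIES paper. Purely as an illustration of preference training over log-probability ratios, the sketch below shows a standard DPO-style loss; it is a generic stand-in, not the loss actually used for PLM.

import torch.nn.functional as F


def preference_loss(policy_chosen_logp, policy_rejected_logp,
                    ref_chosen_logp, ref_rejected_logp, beta=0.1):
    """Generic DPO-style preference loss (stand-in, not the ARIES objective)."""
    # Implicit rewards are log-probability ratios between the policy and a frozen reference model.
    chosen_margin = policy_chosen_logp - ref_chosen_logp
    rejected_margin = policy_rejected_logp - ref_rejected_logp
    return -F.logsigmoid(beta * (chosen_margin - rejected_margin)).mean()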
PLM-1.8B is a strong and reliable model, particularly in basic knowledge understanding, coding and simple reasoning tasks.
PLM demonstrates highly competitive performance along with a series of advantages stemming from its modeling and system co-design. These benefits include impressive inference speed, extreme sparsity, and reduced KV cache due to MLA, enabling it to outperform models with the same number of layers when handling long-context inference tasks at certain sequence lengths.
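As a rough illustration of where the cache savings come from, the per-token, per-layer KV cache of MHA, GQA, and MLA can be compared in bytes; all dimensions below are assumptions chosen for illustration, not the actual PLM or baseline configurations.

BYTES_FP16 = 2
n_heads, head_dim, n_kv_heads, kv_latent_dim = 16, 128, 4, 512

mha_bytes = 2 * n_heads * head_dim * BYTES_FP16     # full keys and values for every head
gqa_bytes = 2 * n_kv_heads * head_dim * BYTES_FP16  # keys and values shared across head groups
mla_bytes = kv_latent_dim * BYTES_FP16              # one compressed latent per token

print(mha_bytes, gqa_bytes, mla_bytes)              # 8192, 2048, 1024 bytes per token per layer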
Evaluations on LLM benchmarks demonstrate that our sparsified MLA model, PLM, maintains competitive performance. This is evidenced by averaged benchmark scores spanning general knowledge comprehension, math problem solving, coding proficiency, and commonsense and logical reasoning.
Meanwhile, the number of activated parameters reflects the minimal computation required to preserve modeling performance; among the compared models, PLM activates the fewest parameters.
We evaluate efficiency metrics of LLMs across various hardware configurations and quantization levels, clearly delineating optimal hardware-quantization combinations. PLM effectively strikes a balance between robust performance and high inference speed.
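As a back-of-the-envelope guide to these quantization levels, weight memory can be estimated from the parameter count and bits per weight; the bits-per-weight figures below are rough assumptions for common GGUF formats.

params = 1.8e9  # PLM-1.8B nominal parameter count
for fmt, bits in [("FP16", 16.0), ("Q8_0", 8.5), ("Q4_K_M", 4.85)]:
    gib = params * bits / 8 / 2**30
    print(f"{fmt}: ~{gib:.2f} GiB of weights")  # excludes KV cache and runtime buffers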
Effect of token length on tokens per second during prefill.
Effect of token length on tokens per second during generation.
Effect of the number of GPU-offloaded layers on tokens per second during prefill.
Effect of the number of GPU-offloaded layers on tokens per second during generation.
Deploying models on edge devices involves challenges such as limited computational resources, constrained memory capacity, and stringent latency requirements. While Multi-head Latent Attention (MLA) incurs higher computational overhead than traditional Multi-Head Attention (MHA) and Grouped-Query Attention (GQA), its optimized cache utilization and inference efficiency make it particularly well suited for edge deployment.
We provide a GGUF version of PLM and have opened a pull request to llama.cpp; until it is merged, we maintain a fork with PLM support.
Coming soon.