Efficient Peripheral Language Models Hardware-Co-Designed for Ubiquitous Computing
Cheng Deng¹, Luoyang Sun², Jiwen Jiang², Yongcheng Zeng²,
Xinjian Wu³, Wenxin Zhao¹, Qingfa Xiao¹, Jiachuan Wang⁴, Haoyang Li⁵,
Lei Chen¹˒⁴, Lionel M. Ni¹, Haifeng Zhang², Jun Wang³˒⁶
¹ The Hong Kong University of Science and Technology (Guangzhou)
² Institute of Automation, Chinese Academy of Sciences
³ University College London
⁴ The Hong Kong University of Science and Technology
⁵ The Hong Kong Polytechnic University
⁶ UCL Centre for Artificial Intelligence
Corresponding authors: Cheng Deng, Lei Chen, Jun Wang
The PLM (Peripheral Language Model) series introduces a novel model architecture to peripheral computing, delivering powerful language capabilities within the constraints of resource-limited devices. Through a modeling and system co-design strategy, PLM optimizes model performance while meeting edge-system requirements. PLM employs Multi-head Latent Attention and squared ReLU (ReLU²) activation to achieve sparsity, significantly reducing memory footprint and computational demands. Coupled with a meticulously crafted training regimen using curated datasets and a Warmup-Stable-Decay-Constant learning rate scheduler, PLM outperforms existing small language models while activating the fewest parameters, making it well suited for deployment on diverse peripheral platforms such as mobile phones and Raspberry Pi boards.
Experimental cases
Raspberry Pi
NVIDIA Jetson Orin
MacBook Air M3
OnePlus 12 Pro (Snapdragon 8 Gen 3)
PLM uses Multi-head Latent Attention (MLA), an architectural enhancement that improves the efficiency of large language models (LLMs) by significantly reducing the memory footprint of key-value (KV) caching. Unlike the original MLA formulation, PLM preserves the full-rank structure of the query projections to balance training efficiency and inference speed effectively.
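Below is a minimal PyTorch sketch of this attention layout, in which keys and values are reconstructed from a single cached latent vector while queries keep a full-rank projection. The class name SimplifiedMLA and all dimensions are illustrative assumptions rather than PLM's actual configuration, and the decoupled rotary-embedding path used in practice is omitted for brevity.

import torch
import torch.nn as nn


class SimplifiedMLA(nn.Module):
    """Illustrative MLA block: compressed KV latent cache, full-rank queries."""

    def __init__(self, d_model=2048, n_heads=16, head_dim=128, kv_latent_dim=512):
        super().__init__()
        self.n_heads, self.head_dim = n_heads, head_dim
        # Full-rank query projection (kept full-rank in PLM).
        self.q_proj = nn.Linear(d_model, n_heads * head_dim, bias=False)
        # Keys and values are reconstructed from one small latent, which is all we cache.
        self.kv_down = nn.Linear(d_model, kv_latent_dim, bias=False)
        self.k_up = nn.Linear(kv_latent_dim, n_heads * head_dim, bias=False)
        self.v_up = nn.Linear(kv_latent_dim, n_heads * head_dim, bias=False)
        self.o_proj = nn.Linear(n_heads * head_dim, d_model, bias=False)

    def forward(self, x, kv_cache=None):
        # kv_cache holds only the compressed latents of previously seen tokens.
        b, t, _ = x.shape
        c_kv = self.kv_down(x)
        if kv_cache is not None:
            c_kv = torch.cat([kv_cache, c_kv], dim=1)
        q = self.q_proj(x).view(b, t, self.n_heads, self.head_dim).transpose(1, 2)
        k = self.k_up(c_kv).view(b, -1, self.n_heads, self.head_dim).transpose(1, 2)
        v = self.v_up(c_kv).view(b, -1, self.n_heads, self.head_dim).transpose(1, 2)
        # Causal mask during prefill; single-token decode steps attend to the whole cache.
        out = nn.functional.scaled_dot_product_attention(q, k, v, is_causal=kv_cache is None)
        out = out.transpose(1, 2).reshape(b, t, -1)
        return self.o_proj(out), c_kv  # updated latent cache for the next step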
We adopt a streamlined Feed-Forward Network (FFN) architecture that omits gating mechanisms, thereby reducing computational complexity and memory usage. ReLU² introduces quadratic growth in activation values, enabling the network to capture richer feature representations. ReLU² demonstrates exceptional performance by balancing sparsity and computational efficiency.
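A minimal PyTorch sketch of this gate-free FFN with ReLU² activation follows; the dimensions are illustrative assumptions, not PLM's actual configuration.

import torch
import torch.nn as nn


class ReLU2FFN(nn.Module):
    """Illustrative gate-free FFN with squared ReLU activation."""

    def __init__(self, d_model=2048, d_hidden=8192):
        super().__init__()
        # Only an up and a down projection: no gate branch as in SwiGLU-style FFNs.
        self.up = nn.Linear(d_model, d_hidden, bias=False)
        self.down = nn.Linear(d_hidden, d_model, bias=False)

    def forward(self, x):
        # ReLU²: negative pre-activations stay exactly zero, giving high activation sparsity.
        h = torch.relu(self.up(x)) ** 2
        return self.down(h)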
We train PLM on a total of 2.48 trillion tokens. During training, we employ a Warmup-Stable-Decay-Constant (WSDC) learning rate scheduler, which introduces an additional constant-learning-rate phase following the decay stage.
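The schedule can be sketched as a piecewise function of the training step; the phase boundaries and learning-rate values below are placeholder assumptions, not the ones used to train PLM.

def wsdc_lr(step, warmup_end=2_000, stable_end=80_000, decay_end=100_000,
            peak_lr=3e-4, final_lr=3e-5):
    """Illustrative Warmup-Stable-Decay-Constant (WSDC) schedule."""
    if step < warmup_end:                                   # linear warmup
        return peak_lr * step / warmup_end
    if step < stable_end:                                   # stable phase at the peak rate
        return peak_lr
    if step < decay_end:                                    # linear decay toward the final rate
        frac = (step - stable_end) / (decay_end - stable_end)
        return peak_lr + frac * (final_lr - peak_lr)
    return final_lr                                         # constant phase after the decay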
During supervised fine-tuning (SFT), we conduct training in two progressive phases utilizing the LLaMA-Factory toolkit.
In Phase 1, we leverage diverse, general-purpose instruction-following datasets covering chat interactions, web-based queries, and general instructions. In Phase 2, we extend this training by placing a substantial emphasis on specialized domains, particularly mathematics and coding.
During the preference training phase, we adopt an approach similar to our previously proposed method, ARIES, leveraging its self-refinement loss to perform preference training on PLM; this refinement loss is employed to stimulate and strengthen the model's self-refinement capability.
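The exact self-refinement objective is defined in the ARIES paper. Purely as an illustration of preference training over log-probability ratios, the sketch below shows a standard DPO-style loss; it is a generic stand-in, not the loss actually used for PLM.

import torch.nn.functional as F


def preference_loss(policy_chosen_logp, policy_rejected_logp,
                    ref_chosen_logp, ref_rejected_logp, beta=0.1):
    """Generic DPO-style preference loss (stand-in, not the ARIES objective)."""
    # Implicit rewards are log-probability ratios between the policy and a frozen reference model.
    chosen_margin = policy_chosen_logp - ref_chosen_logp
    rejected_margin = policy_rejected_logp - ref_rejected_logp
    return -F.logsigmoid(beta * (chosen_margin - rejected_margin)).mean()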
PLM-1.8B is a strong and reliable model, particularly in basic knowledge understanding, coding and simple reasoning tasks.
PLM demonstrates highly competitive performance along with a series of advantages stemming from its modeling and system co-design. These benefits include impressive inference speed, extreme sparsity, and reduced KV cache due to MLA, enabling it to outperform models with the same number of layers when handling long-context inference tasks at certain sequence lengths.
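As a rough illustration of where the cache savings come from, the per-token, per-layer KV cache of MHA, GQA, and MLA can be compared in bytes; all dimensions below are assumptions chosen for illustration, not the actual PLM or baseline configurations.

BYTES_FP16 = 2
n_heads, head_dim, n_kv_heads, kv_latent_dim = 16, 128, 4, 512

mha_bytes = 2 * n_heads * head_dim * BYTES_FP16     # full keys and values for every head
gqa_bytes = 2 * n_kv_heads * head_dim * BYTES_FP16  # keys and values shared across head groups
mla_bytes = kv_latent_dim * BYTES_FP16              # one compressed latent per token

print(mha_bytes, gqa_bytes, mla_bytes)              # 8192, 2048, 1024 bytes per token per layer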
Evaluations on LLM benchmarks demonstrate that our sparsified MLA model, PLM, maintains competitive performance. This is evidenced by averaged benchmark scores spanning general knowledge comprehension, math problem solving, coding proficiency, and commonsense and logical reasoning.
Meanwhile, the number of activated parameters reflects the minimal computation required to preserve modeling performance; among the compared models, PLM activates the fewest parameters.
We evaluate efficiency metrics of LLMs across various hardware configurations and quantization levels, clearly delineating optimal hardware-quantization combinations. PLM effectively strikes a balance between robust performance and high inference speed.
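As a back-of-the-envelope guide to these quantization levels, weight memory can be estimated from the parameter count and bits per weight; the bits-per-weight figures below are rough assumptions for common GGUF formats.

params = 1.8e9  # PLM-1.8B nominal parameter count
for fmt, bits in [("FP16", 16.0), ("Q8_0", 8.5), ("Q4_K_M", 4.85)]:
    gib = params * bits / 8 / 2**30
    print(f"{fmt}: ~{gib:.2f} GiB of weights")  # excludes KV cache and runtime buffers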
Effect of token length on tokens per second during prefill.
Effect of token length on tokens per second during generation.
Effect of the number of GPU-offloaded layers on tokens per second during prefill.
Effect of the number of GPU-offloaded layers on tokens per second during generation.
Deploying models on edge devices involves challenges such as limited computational resources, constrained memory capacity, and stringent latency requirements. While Multi-head Latent Attention (MLA) incurs higher computational overhead than traditional Multi-Head Attention (MHA) and Grouped-Query Attention (GQA), its optimized cache utilization and inference efficiency make it particularly well suited for edge deployment.
We provide a GGUF version of PLM and have opened a pull request to llama.cpp; until it is merged, we maintain a fork with PLM support.
Coming soon.