Locality-Aware Redundancy Pruning for LLM Depth Compression
Researchers have proposed a new pruning framework, Locality-Aware Redundancy Pruning (LoRP), to improve the inference efficiency of large language models by removing redundant Transformer blocks.
Large language models contain representational redundancy across network depth, according to a study submitted to arXiv on May 27, 2026[1]. Existing one-shot pruning methods rely on local layer importance or fixed redundancy assumptions, whereas LoRP uses representation locality to guide depth pruning. The framework computes pairwise layer similarity, clusters layers by representational similarity, and allocates pruning according to residual intra-cluster redundancy. The authors report that experiments across diverse LLM families show improvements in both perplexity and downstream task accuracy[1]. A related study, submitted to arXiv on April 27, 2026, and revised on May 26, 2026, noted that prior work typically treats layer redundancy as an inherent structural property of pretrained networks[2]. That study found, however, that different calibration configurations produce different pruning patterns, and that complex search algorithms yield only marginal performance improvements over simple one-shot methods[2].
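The three steps attributed to LoRP above (pairwise layer similarity, clustering by similarity, allocation by residual intra-cluster redundancy) can be illustrated with a minimal NumPy sketch. The summary does not specify the paper's similarity metric, clustering rule, or allocation formula, so everything here is an assumption made for illustration: cosine similarity over flattened activations, contiguous clusters split at a hypothetical `threshold`, and a greedy allocation that scores each cluster by its mean similarity times the number of layers it could still lose. The function names are hypothetical and are not taken from the paper.

```python
import numpy as np


def pairwise_layer_similarity(hidden):
    """Cosine similarity between layers (assumed metric, not the paper's).

    hidden: array of shape (num_layers, num_tokens, dim) holding each
    layer's output on a calibration batch.
    """
    flat = hidden.reshape(hidden.shape[0], -1)
    flat = flat / np.linalg.norm(flat, axis=1, keepdims=True)
    return flat @ flat.T


def cluster_contiguous(sim, threshold):
    """Group adjacent layers; start a new cluster whenever the similarity
    to the previous layer drops below `threshold` (assumed rule)."""
    clusters = [[0]]
    for i in range(1, sim.shape[0]):
        if sim[i - 1, i] >= threshold:
            clusters[-1].append(i)
        else:
            clusters.append([i])
    return clusters


def allocate_pruning(sim, clusters, budget):
    """Greedily assign up to `budget` layer removals to clusters.

    Residual redundancy of a cluster is modelled (assumption) as its mean
    off-diagonal similarity times (remaining layers - 1), so every cluster
    keeps at least one layer.
    """
    mean_sim = []
    for c in clusters:
        if len(c) < 2:
            mean_sim.append(0.0)
            continue
        block = sim[np.ix_(c, c)]
        off_diag = ~np.eye(len(c), dtype=bool)
        mean_sim.append(float(block[off_diag].mean()))

    remaining = [len(c) for c in clusters]
    pruned = [0] * len(clusters)
    for _ in range(budget):
        residual = [m * (r - 1) for m, r in zip(mean_sim, remaining)]
        k = int(np.argmax(residual))
        if residual[k] <= 0:
            break  # nothing redundant left to remove
        remaining[k] -= 1
        pruned[k] += 1
    return pruned


# Synthetic activations: layers 0-3 share one pattern, layers 4-5 another.
base_a = np.zeros((4, 8))
base_a[:, :4] = 1.0
base_b = np.zeros((4, 8))
base_b[:, 4:] = 1.0
hidden = np.stack(
    [base_a + 0.01 * i for i in range(4)] + [base_b + 0.01 * i for i in range(2)]
)

sim = pairwise_layer_similarity(hidden)
clusters = cluster_contiguous(sim, threshold=0.9)
plan = allocate_pruning(sim, clusters, budget=2)
print(clusters, plan)
```

On this toy input the six layers fall into two clusters, `[0, 1, 2, 3]` and `[4, 5]`, and a budget of two removals is spent entirely on the larger cluster. A real pipeline would collect `hidden` from a calibration set, which matters because the related study[2] found that calibration configuration changes the resulting pruning pattern.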
Background sources we checked (2)
- arxiv.org: Large language models are known to contain representational redundancy across network depth, making depth pruning an effective approach for improving inference efficiency. Existing one-shot pruning methods rely on local layer importance or fixed redundancy assumptions across arch…
- en.wikipedia.org: An algorithm is a fundamental set of rules or defined procedures that are typically designed and used to be a simpler way to solve a specific problem or a broad set of problems. Simply speaking, algorithms define different processes, sets of rules and regulations, or methodologie…