Mher Safaryan

Bio

I am an Assistant Professor (UK Lecturer) in the School of Mathematical Sciences at Lancaster University, within the MARS (Mathematics for AI in Real-world Systems) section.

Previously, I was a postdoctoral researcher at the Institute of Science and Technology Austria (ISTA), working in the group led by Prof. Dan Alistarh. I was awarded a Marie Skłodowska-Curie Fellowship through the MSCA COFUND IST-BRIDGE program at ISTA. As part of the fellowship, I worked with Dr. Alexandre Marques during an industrial secondment at Neural Magic, Inc. (since acquired by Red Hat) in the USA. Before joining ISTA in 2022, I was a postdoctoral research fellow at King Abdullah University of Science and Technology (KAUST) in Saudi Arabia from 2019 to 2022, in the group led by Prof. Peter Richtárik. Prior to that, I worked with Prof. Diogo Gomes at KAUST as a research technician from 2016 to 2019. I obtained my Ph.D. in Mathematics in 2018 from Yerevan State University (YSU) in Armenia, under the supervision of Prof. Grigori Karagulyan.

Research Interests

optimization (theory and algorithms), machine learning, federated learning
large-scale, convex/non-convex, stochastic/deterministic optimization, variance reduction
communication/computation/memory efficient and scalable optimization algorithms
collaborative learning (asynchronous, adversarial, local training, heterogeneity, etc.)
model compression (knowledge distillation, pruning, sparse optimization, quantization)
information theory (compression, encoding schemes, vector quantization)

Research

My research focuses on optimization theory and algorithms for machine learning, with an emphasis on efficiency, scalability, and the theoretical understanding of optimization methods. These methods are particularly relevant for large-scale machine learning training and federated learning.

I completed my Ph.D. in real harmonic analysis, a branch of mathematics that explores the relationship between functions or signals and their frequency domain representations. My thesis investigated the convergence and divergence properties of certain convolution-type integral operators. In addition, I have done some research in algebra. During my undergraduate studies at YSU, I completed a research project on universal algebraic structures called dimonoids, which led to a publication in Algebra and Discrete Mathematics. Later, at KAUST, I worked on symbolic computation, specifically on developing computer algebra techniques for automating certain aspects of PDE analyses.

For the complete list of my publications, please visit my Google Scholar page.

News

February 2026

3 new papers on arXiv.

Towards Robust Scaling Laws for Optimizers - joint work with Alexandra Volkova, Christoph H. Lampert, Dan Alistarh.

LoRDO: Distributed Low-Rank Optimization with Infrequent Communication - joint work with Andrej Jovanović, Alex Iacob, Ionut-Vlad Modoranu, Lorenzo Sani, William F. Shen, Xinchi Qiu, Dan Alistarh, Nicholas D. Lane.

DASH: Faster Shampoo via Batched Block Preconditioning and Efficient Inverse-Root Solvers - joint work with Ionut-Vlad Modoranu, Philip Zmushko, Erik Schultheis, Dan Alistarh.

January 2026

Paper accepted to MLSys 2026.

CAGE: Curvature-Aware Gradient Estimation For Accurate Quantization-Aware Training - joint work with Soroush Tabesh, Andrei Panferov, Alexandra Volkova, Dan Alistarh.

January 2026

3 papers accepted to ICLR 2026.

FFT-based Dynamic Subspace Selection for Low-Rank Adaptive Optimization of Large Language Models - joint work with Ionut-Vlad Modoranu, Erik Schultheis, Max Ryabinin, Artem Chumachenko, Dan Alistarh.

MT-DAO: Multi-Timescale Distributed Adaptive Optimizers with Local Updates - joint work with Alex Iacob, Andrej Jovanovic, Meghdad Kurmanji, Lorenzo Sani, Samuel Horváth, William F. Shen, Xinchi Qiu, Nicholas D. Lane.

DES-LOC: Desynced Low Communication Adaptive Optimizers for Training Foundation Models - joint work with Alex Iacob, Lorenzo Sani, Paris Giampouras, Samuel Horváth, Andrej Jovanovic, Meghdad Kurmanji, Preslav Aleksandrov, William F. Shen, Xinchi Qiu, Nicholas D. Lane.

December 2025

New Position: Assistant Professor (UK Lecturer) at Lancaster University.

I'm excited to announce that I've started my new position as an Assistant Professor (UK Lecturer) in the School of Mathematical Sciences at Lancaster University, joining the MARS (Mathematics for AI in Real-world Systems) section.

November 2025

New paper on arXiv.

CAGE: Curvature-Aware Gradient Estimation For Accurate Quantization-Aware Training - joint work with Soroush Tabesh, Andrei Panferov, Alexandra Volkova, Dan Alistarh.

Abstract: Despite significant work on low-bit quantization-aware training (QAT), there is still an accuracy gap between such techniques and native training. To address this, we introduce CAGE (Curvature-Aware Gradient Estimation), a new QAT method that augments the straight-through estimator (STE) gradient with a curvature-aware correction designed to counteract the loss increase induced by quantization. CAGE is derived from a multi-objective view of QAT that balances loss minimization with the quantization constraints, yielding a principled correction term that depends on local curvature information. On the theoretical side, we introduce the notion of Pareto-optimal solutions for quantized optimization, and establish that CAGE yields strong convergence guarantees in the smooth non-convex setting. In terms of implementation, our approach is optimizer-agnostic, but we provide a highly efficient implementation that leverages Adam statistics. CAGE significantly improves upon the prior state-of-the-art methods in terms of accuracy, for similar computational cost: for QAT fine-tuning, it halves the compression accuracy loss relative to the prior best method, while for QAT pre-training of Llama models, its accuracy for 3-bit weights-and-activations (W3A3) matches the accuracy achieved at 4 bits (W4A4) with the prior best method. The official implementation is publicly available.

October 2025

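New paper on arXiv.

MT-DAO: Multi-Timescale Distributed Adaptive Optimizers with Local Updates - joint work with Alex Iacob, Andrej Jovanovic, Meghdad Kurmanji, Lorenzo Sani, Samuel Horváth, William F. Shen, Xinchi Qiu, Nicholas D. Lane.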

Abstract: Training large models with distributed data parallelism (DDP) requires frequent communication of gradients across workers, which can saturate bandwidth. Infrequent communication strategies (e.g., Local SGD) reduce this overhead but, when applied to adaptive optimizers, often suffer a performance gap relative to fully synchronous DDP. We trace this gap to a time-scale mismatch: the optimizer's fast-moving momentum, tuned for frequent updates, decays too quickly to smooth gradients over long intervals, leading to noise-dominated optimization. To address this, we propose MT-DAO, a family of optimizers that employs multiple slow- and fast-moving first momenta or the gradient to track update dynamics across different time scales, for which we provide the first convergence guarantees. Empirically, for language-model pre-training, this eliminates the performance gap with DDP, outperforming infrequent-communication baselines in perplexity and reducing iso-token wall-clock time by 6-27% on Ethernet interconnects. At the 720M scale, MT-DAO reaches a target perplexity in 24% fewer steps and 35% less time than the single-momentum DDP baseline. MT-DAO enables effective cross-datacenter training and training over wide geographic areas.

September 2025

Paper accepted to NeurIPS 2025.

Unified Scaling Laws for Compressed Representations - joint work with Andrei Panferov, Alexandra Volkova, Ionut-Vlad Modoranu, Vage Egiazarian, Dan Alistarh.

June 2025

Featured in the ISTA Campus Update newsletter.

June 2025

Paper accepted to TMLR.

GradSkip: Communication-Accelerated Local Gradient Methods with Better Computational Complexity - joint work with Artavazd Maranjyan and Peter Richtárik.

June 2025

New paper on arXiv.

Unified Scaling Laws for Compressed Representations - joint work with Andrei Panferov, Alexandra Volkova, Ionut-Vlad Modoranu, Vage Egiazarian, Dan Alistarh.

Abstract: Scaling laws have shaped recent advances in machine learning by enabling predictable scaling of model performance based on model size, computation, and data volume. Concurrently, the rise in computational cost for AI has motivated model compression techniques, notably quantization and sparsification, which have emerged to mitigate the steep computational demands associated with large-scale training and inference. This paper investigates the interplay between scaling laws and compression formats, exploring whether a unified scaling framework can accurately predict model performance when training occurs over various compressed representations, such as sparse, scalar-quantized, sparse-quantized or even vector-quantized formats. Our key contributions include validating a general scaling law formulation and showing that it applies both individually and composably across compression types. Based on this, our main finding is demonstrating both theoretically and empirically that there exists a simple "capacity" metric -- based on the representation's ability to fit random Gaussian data -- which can robustly predict parameter efficiency across multiple compressed representations. On the practical side, we extend our formulation to directly compare the accuracy potential of different compressed formats, and to derive better algorithms for training over sparse-quantized formats.

May 2025

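New paper on arXiv.

DES-LOC: Desynced Low Communication Adaptive Optimizers for Training Foundation Models - joint work with Alex Iacob, Lorenzo Sani, Paris Giampouras, Samuel Horváth, Andrej Jovanovic, Meghdad Kurmanji, Preslav Aleksandrov, William F. Shen, Xinchi Qiu, Nicholas D. Lane.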

Abstract: Scaling foundation model training with Distributed Data Parallel (DDP) methods is bandwidth-limited. Existing infrequent communication methods like Local SGD were designed to synchronize only model parameters and cannot be trivially applied to adaptive optimizers due to additional optimizer states. Current approaches extending Local SGD either lack convergence guarantees or require synchronizing all optimizer states, tripling communication costs. We propose Desynced Low Communication Adaptive Optimizers (DES-LOC), a family of optimizers assigning independent synchronization periods to parameters and momenta, enabling lower communication costs while preserving convergence. Through extensive experiments on language models of up to 1.7B, we show that DES-LOC can communicate 170x less than DDP and 2x less than the previous state-of-the-art Local ADAM. Furthermore, unlike previous heuristic approaches, DES-LOC is suited for practical training scenarios prone to system failures. DES-LOC offers a scalable, bandwidth-efficient, and fault-tolerant solution for foundation model training.
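The idea of giving parameters and momenta independent synchronization periods can be illustrated with a toy sketch. This is not the paper's implementation: the state names, the period values, the plain averaging rule, and the helpers `states_due` and `sync` are all hypothetical and chosen only for illustration.

```python
import numpy as np

# Hypothetical sync periods (in optimizer steps) for each kind of state.
# The values are illustrative, not taken from the paper.
PERIODS = {"params": 4, "m": 8, "v": 16}


def states_due(step, periods):
    """Return the names of the states whose sync period divides `step`."""
    return [name for name, period in periods.items() if step % period == 0]


def sync(workers, names):
    """Overwrite each named state on every worker with the cross-worker mean."""
    for name in names:
        mean = np.mean([w[name] for w in workers], axis=0)
        for w in workers:
            w[name] = mean.copy()


# Two toy workers, each holding parameters and two Adam-style moment buffers.
workers = [
    {"params": np.array([0.0, 2.0]), "m": np.zeros(2), "v": np.ones(2)},
    {"params": np.array([2.0, 4.0]), "m": np.ones(2), "v": np.ones(2)},
]

for step in range(1, 17):
    # (local optimizer updates on each worker would happen here)
    sync(workers, states_due(step, PERIODS))
```

With these periods, parameters are averaged four times as often as the second-moment buffer, so most steps involve no communication at all and the rarely synced states add little traffic.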

May 2025

New paper on arXiv.

SVD-Free Low-Rank Adaptive Gradient Optimization for Large Language Models - joint work with Ionut-Vlad Modoranu, Erik Schultheis, Dan Alistarh.

Abstract: Low-rank optimization has emerged as a promising direction in training large language models (LLMs) to reduce the memory usage of adaptive optimizers by constraining learning to a lower-dimensional space. Prior work typically projects gradients of linear layers using approaches based on Singular Value Decomposition (SVD). However, applying SVD-based procedures individually to each layer in large models is computationally expensive and incurs additional memory costs due to storing the projection matrices. In this work, we propose a computationally efficient and conceptually simple two-step procedure to approximate SVD-based gradient projections into lower-dimensional spaces. First, we construct a complete orthogonal basis using predefined orthogonal matrices of the Discrete Cosine Transform (DCT). Second, we adaptively select basis columns based on their alignment with the gradient of each layer. Each projection matrix in our method is obtained via a single matrix multiplication followed by a lightweight sorting step to identify the most relevant basis vectors. Due to the predefined nature of the orthogonal bases, they are computed once at the start of training. During training, we store only the indices of the selected columns, avoiding the need to store full projection matrices for each layer. Our numerical experiments on both pre-training and fine-tuning tasks demonstrate the effectiveness of our dual strategy in approximating optimal low-rank projections, matching the performance of costly SVD-based methods while achieving faster runtime and reduced memory usage.
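The two-step procedure in the abstract (a fixed DCT basis, then column selection by alignment with the gradient) can be sketched in NumPy. This is a minimal illustration under stated assumptions, not the paper's code: the alignment score used here (the Euclidean norm of each column of the gradient-times-basis product) and the function names `dct_basis`, `select_columns`, `project`, and `reconstruct` are assumptions made for this example.

```python
import numpy as np


def dct_basis(n):
    """Orthonormal DCT-II matrix of size n x n; its columns are the basis vectors."""
    k = np.arange(n)[:, None]  # frequency index
    i = np.arange(n)[None, :]  # sample index
    d = np.sqrt(2.0 / n) * np.cos(np.pi * (2 * i + 1) * k / (2 * n))
    d[0, :] /= np.sqrt(2.0)    # rescale the constant row so all rows have unit norm
    return d.T


def select_columns(grad, basis, rank):
    """One matrix product, then a sort: keep the `rank` best-aligned columns.

    Assumption: alignment is scored by the norm of each column of grad @ basis.
    """
    scores = np.linalg.norm(grad @ basis, axis=0)
    return np.sort(np.argsort(scores)[-rank:])


def project(grad, basis, idx):
    """Project the gradient onto the selected basis columns (m x n -> m x rank)."""
    return grad @ basis[:, idx]


def reconstruct(low, basis, idx):
    """Map a low-rank quantity back to the original space (m x rank -> m x n)."""
    return low @ basis[:, idx].T


# The basis is computed once; per layer, only the selected indices are kept.
basis = dct_basis(8)
grad = np.random.default_rng(0).standard_normal((4, 8))
idx = select_columns(grad, basis, rank=3)
grad_low = project(grad, basis, idx)
```

Because the basis is fixed and shared, the only per-layer state in this sketch is the index array `idx`, which mirrors the memory argument made in the abstract.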

January 2025

Paper accepted to ICLR 2025.

LDAdam: Adaptive Optimization from Low-dimensional Gradient Statistics - joint work with Thomas Robert, Ionut-Vlad Modoranu, Dan Alistarh.

October 2024

New paper on arXiv.

LDAdam: Adaptive Optimization from Low-dimensional Gradient Statistics - joint work with Thomas Robert, Ionut-Vlad Modoranu, Dan Alistarh.

Abstract: We introduce LDAdam, a memory-efficient optimizer for training large models, that performs adaptive optimization steps within lower dimensional subspaces, while consistently exploring the full parameter space during training. This strategy keeps the optimizer's memory footprint to a fraction of the model size. LDAdam relies on a new projection-aware update rule for the optimizer states that allows for transitioning between subspaces, i.e., estimation of the statistics of the projected gradients. To mitigate the errors due to low-rank projection, LDAdam integrates a new generalized error feedback mechanism, which explicitly accounts for both gradient and optimizer state compression. We prove the convergence of LDAdam under standard assumptions, and show that LDAdam allows for accurate and efficient fine-tuning and pre-training of language models.

October 2024

Secondment with Neural Magic, Inc.

As part of the fellowship, I have started a 6-month secondment with Neural Magic, Inc. in the USA, working with Dr. Alexandre Marques on LLM quantization.

September 2024

Two papers accepted to NeurIPS 2024.

MicroAdam: Accurate Adaptive Optimization with Low Space Overhead and Provable Convergence - joint work with Ionut-Vlad Modoranu, Grigory Malinovsky, Eldar Kurtic, Thomas Robert, Peter Richtárik, Dan Alistarh.

The Iterative Optimal Brain Surgeon: Faster Sparse Recovery by Leveraging Second-Order Information - joint work with Diyuan Wu, Ionut-Vlad Modoranu, Denis Kuznedelev, Dan Alistarh.

Contacts

m.{lastname}@lancaster.ac.uk

Fylde College, Bailrigg, Lancaster LA1 4YW, UK