Bio

I am a postdoctoral researcher at the Institute of Science and Technology Austria (ISTA), working in the group led by Prof. Dan Alistarh. I was awarded a Marie Skłodowska-Curie Fellowship through the MSCA COFUND IST-BRIDGE program at ISTA. As part of the fellowship, I had the opportunity to work with Dr. Alexandre Marques during an industrial secondment at Neural Magic, Inc. (since acquired by Red Hat) in the USA.

Before joining ISTA in 2022, I was a postdoctoral research fellow at King Abdullah University of Science and Technology (KAUST) in Saudi Arabia from 2019 to 2022, in the group led by Prof. Peter Richtárik. Prior to that, I worked with Prof. Diogo Gomes at KAUST as a research technician from 2016 to 2019. I obtained my Ph.D. in Mathematics in 2018 from Yerevan State University (YSU) in Armenia, under the supervision of Prof. Grigori Karagulyan.

Research Interests

  • optimization (theory and algorithms), machine learning, federated learning
  • large-scale, convex/non-convex, stochastic/deterministic optimization, variance reduction
  • communication/computation/memory-efficient and scalable optimization algorithms
  • collaborative learning (asynchronous, adversarial, local training, heterogeneity, etc.)
  • model compression (knowledge distillation, pruning, sparse optimization, quantization)
  • information theory (compression, encoding schemes, vector quantization)

Research

My current research focuses on optimization theory and algorithms for machine learning, with an emphasis on efficiency, scalability, and the theoretical understanding of optimization methods. These methods are particularly relevant for large-scale machine learning training and federated learning. Because my work is driven by applications in machine learning, my publications appear primarily in leading machine learning conferences such as NeurIPS, ICML, AISTATS, and ICLR, as well as in journals such as JMLR and TMLR.

I completed my Ph.D. in real harmonic analysis, a branch of mathematics that explores the relationship between functions or signals and their frequency-domain representations. My thesis investigated the convergence and divergence properties of certain convolution-type integral operators. I defended my thesis in 2018, with a total of five journal publications, including two single-authored papers and two published in The Journal of Geometric Analysis.

In addition, I have done some research in algebra. During my undergraduate studies at YSU, I completed a research project on universal algebraic structures called dimonoids, which led to a publication in Algebra and Discrete Mathematics. Later, at KAUST, I worked on symbolic computation, specifically on developing computer algebra techniques for automating certain aspects of PDE analyses.

For the complete list of my publications, please visit my Google Scholar page.

News

October 2025

New paper on arXiv.

MT-DAO: Multi-Timescale Distributed Adaptive Optimizers with Local Updates - joint work with Alex Iacob, Andrej Jovanovic, Meghdad Kurmanji, Lorenzo Sani, Samuel Horváth, William F. Shen, Xinchi Qiu, Nicholas D. Lane.

Abstract: Training large models with distributed data parallelism (DDP) requires frequent communication of gradients across workers, which can saturate bandwidth. Infrequent communication strategies (e.g., Local SGD) reduce this overhead but, when applied to adaptive optimizers, often suffer a performance gap relative to fully synchronous DDP. We trace this gap to a time-scale mismatch: the optimizer's fast-moving momentum, tuned for frequent updates, decays too quickly to smooth gradients over long intervals, leading to noise-dominated optimization. To address this, we propose MT-DAO, a family of optimizers that employs multiple slow- and fast-moving first momenta or the gradient to track update dynamics across different time scales, for which we provide the first convergence guarantees. Empirically, for language-model pre-training, this eliminates the performance gap with DDP, outperforming infrequent-communication baselines in perplexity and reducing iso-token wall-clock time by 6-27% on Ethernet interconnects. At the 720M scale, MT-DAO reaches a target perplexity in 24% fewer steps and 35% less time than the single-momentum DDP baseline. MT-DAO enables effective cross-datacenter training and training over wide geographic areas.
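To give a flavor of the multi-timescale idea described above, here is a minimal single-worker sketch in Python using PyTorch. The function name, the equal weighting of the two momenta, and the decay rates are illustrative assumptions on my part, not the MT-DAO update from the paper.

```python
import torch

def two_timescale_step(param, grad, state, lr=1e-3,
                       beta_fast=0.9, beta_slow=0.999, eps=1e-8):
    """Illustrative step with two first momenta on different time scales.
    'state' holds tensors m_fast, m_slow, v, all initialized to zeros."""
    m_fast, m_slow, v = state["m_fast"], state["m_slow"], state["v"]

    # The fast momentum reacts quickly to new gradients; the slow one
    # smooths them over long intervals between synchronizations.
    m_fast.mul_(beta_fast).add_(grad, alpha=1 - beta_fast)
    m_slow.mul_(beta_slow).add_(grad, alpha=1 - beta_slow)
    v.mul_(0.999).addcmul_(grad, grad, value=0.001)

    # Combine the two time scales (equal weighting is an arbitrary choice,
    # made here only for illustration) and take an Adam-style step.
    m = 0.5 * m_fast + 0.5 * m_slow
    param.add_(m / (v.sqrt() + eps), alpha=-lr)
```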

June 2025

New paper on arXiv.

Unified Scaling Laws for Compressed Representations - joint work with Andrei Panferov, Alexandra Volkova, Ionut-Vlad Modoranu, Vage Egiazarian, Dan Alistarh.

Abstract: Scaling laws have shaped recent advances in machine learning by enabling predictable scaling of model performance based on model size, computation, and data volume. Concurrently, the rise in computational cost for AI has motivated model compression techniques, notably quantization and sparsification, which have emerged to mitigate the steep computational demands associated with large-scale training and inference. This paper investigates the interplay between scaling laws and compression formats, exploring whether a unified scaling framework can accurately predict model performance when training occurs over various compressed representations, such as sparse, scalar-quantized, sparse-quantized or even vector-quantized formats. Our key contributions include validating a general scaling law formulation and showing that it applies both individually and composably across compression types. Based on this, our main finding is demonstrating both theoretically and empirically that there exists a simple "capacity" metric -- based on the representation's ability to fit random Gaussian data -- which can robustly predict parameter efficiency across multiple compressed representations. On the practical side, we extend our formulation to directly compare the accuracy potential of different compressed formats, and to derive better algorithms for training over sparse-quantized formats.
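The "capacity" idea mentioned in the abstract can be illustrated with a small Python sketch: compress a random Gaussian vector and measure how much of its variance the compressed representation retains. The function names and the toy 4-bit quantizer below are my own illustrative choices; the paper's exact definition and fitting procedure may differ.

```python
import numpy as np

def gaussian_fit_capacity(compress, dim=4096, seed=0):
    """Proxy for 'capacity': the fraction of variance of random Gaussian
    data that a compressed representation retains (1.0 = lossless)."""
    rng = np.random.default_rng(seed)
    x = rng.standard_normal(dim)
    x_hat = compress(x)
    mse = np.mean((x - x_hat) ** 2)
    return 1.0 - mse / np.mean(x ** 2)

def int4_round_to_nearest(x):
    """Toy symmetric 4-bit scalar quantizer, used only as an example."""
    scale = np.max(np.abs(x)) / 7.0
    return np.clip(np.round(x / scale), -8, 7) * scale

print(gaussian_fit_capacity(int4_round_to_nearest))  # prints a value close to 1
```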

May 2025

New paper on arXiv.

DES-LOC: Desynced Low Communication Adaptive Optimizers for Training Foundation Models - joint work with Alex Iacob, Lorenzo Sani, Paris Giampouras, Samuel Horváth, Andrej Jovanovic, Meghdad Kurmanji, Preslav Aleksandrov, William F. Shen, Xinchi Qiu, Nicholas D. Lane.

Abstract: Scaling foundation model training with Distributed Data Parallel (DDP) methods is bandwidth-limited. Existing infrequent communication methods like Local SGD were designed to synchronize only model parameters and cannot be trivially applied to adaptive optimizers due to additional optimizer states. Current approaches extending Local SGD either lack convergence guarantees or require synchronizing all optimizer states, tripling communication costs. We propose Desynced Low Communication Adaptive Optimizers (DES-LOC), a family of optimizers assigning independent synchronization periods to parameters and momenta, enabling lower communication costs while preserving convergence. Through extensive experiments on language models of up to 1.7B, we show that DES-LOC can communicate 170x less than DDP and 2x less than the previous state-of-the-art Local ADAM. Furthermore, unlike previous heuristic approaches, DES-LOC is suited for practical training scenarios prone to system failures. DES-LOC offers a scalable, bandwidth-efficient, and fault-tolerant solution for foundation model training.
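As a rough illustration of the desynchronized idea, the sketch below simulates, in-process, a set of workers whose parameters and optimizer states are averaged on independent periods, so most steps communicate nothing at all. The periods, the helper name, and the in-process averaging (standing in for an all-reduce across machines) are assumptions for illustration, not the paper's configuration.

```python
import numpy as np

def desynced_sync(workers, step, period_param=64, period_m=256, period_v=1024):
    """Average parameters and each optimizer state on their own schedule.
    'workers' is a list of dicts with 'param', 'm', 'v' numpy arrays."""
    for key, period in (("param", period_param), ("m", period_m), ("v", period_v)):
        if step % period == 0:
            mean = np.mean([w[key] for w in workers], axis=0)
            for w in workers:
                w[key] = mean.copy()
```

The point of the separate periods is that the optimizer states (here 'm' and 'v') can be synchronized far less often than the parameters, keeping total communication close to that of parameter-only methods such as Local SGD.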

May 2025

New paper on arXiv.

SVD-Free Low-Rank Adaptive Gradient Optimization for Large Language Models - joint work with Ionut-Vlad Modoranu, Erik Schultheis, Dan Alistarh.

Abstract: Low-rank optimization has emerged as a promising direction in training large language models (LLMs) to reduce the memory usage of adaptive optimizers by constraining learning to a lower-dimensional space. Prior work typically projects gradients of linear layers using approaches based on Singular Value Decomposition (SVD). However, applying SVD-based procedures individually to each layer in large models is computationally expensive and incurs additional memory costs due to storing the projection matrices. In this work, we propose a computationally efficient and conceptually simple two-step procedure to approximate SVD-based gradient projections into lower-dimensional spaces. First, we construct a complete orthogonal basis using predefined orthogonal matrices of the Discrete Cosine Transform (DCT). Second, we adaptively select basis columns based on their alignment with the gradient of each layer. Each projection matrix in our method is obtained via a single matrix multiplication followed by a lightweight sorting step to identify the most relevant basis vectors. Due to the predefined nature of the orthogonal bases, they are computed once at the start of training. During training, we store only the indices of the selected columns, avoiding the need to store full projection matrices for each layer. Our numerical experiments on both pre-training and fine-tuning tasks demonstrate the effectiveness of our dual strategy in approximating optimal low-rank projections, matching the performance of costly SVD-based methods while achieving faster runtime and reduced memory usage.
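The two-step procedure can be illustrated compactly: build the orthonormal DCT basis once, then, for a given layer gradient, score each basis column by its alignment with the gradient and keep only the top-scoring indices. The code below is a minimal numpy sketch under my own naming and scoring choices, not the released implementation.

```python
import numpy as np

def dct_basis(n):
    """Columns form an orthonormal DCT-II basis of R^n (computed once)."""
    k = np.arange(n)[:, None]                    # frequency index
    i = np.arange(n)[None, :]                    # position index
    C = np.cos(np.pi * (2 * i + 1) * k / (2 * n))
    C[0] *= np.sqrt(1.0 / n)
    C[1:] *= np.sqrt(2.0 / n)
    return C.T

def select_columns(grad, basis, rank):
    """Indices of the 'rank' basis columns best aligned with the gradient."""
    scores = np.linalg.norm(basis.T @ grad, axis=1)   # one score per column
    return np.argsort(scores)[-rank:]                 # only indices are stored

# Toy usage for a 256 x 128 linear layer.
basis = dct_basis(256)
grad = np.random.randn(256, 128)
idx = select_columns(grad, basis, rank=32)
grad_low = basis[:, idx].T @ grad                     # 32 x 128 projection
```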

October 2024

New paper on arXiv.

LDAdam: Adaptive Optimization from Low-dimensional Gradient Statistics - joint work with Thomas Robert, Ionut-Vlad Modoranu, Dan Alistarh.

Abstract: We introduce LDAdam, a memory-efficient optimizer for training large models that performs adaptive optimization steps within lower-dimensional subspaces, while consistently exploring the full parameter space during training. This strategy keeps the optimizer's memory footprint to a fraction of the model size. LDAdam relies on a new projection-aware update rule for the optimizer states that allows for transitioning between subspaces, i.e., estimation of the statistics of the projected gradients. To mitigate the errors due to low-rank projection, LDAdam integrates a new generalized error feedback mechanism, which explicitly accounts for both gradient and optimizer state compression. We prove the convergence of LDAdam under standard assumptions, and show that LDAdam allows for accurate and efficient fine-tuning and pre-training of language models.
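A heavily simplified numpy sketch of the two ingredients named above, a low-rank Adam-type step plus error feedback for what the projection misses, is given below. The function name, the fixed projection matrix P, and the plain error-feedback form are simplifying assumptions; the actual LDAdam update, in particular its projection-aware state transition between subspaces, is more involved.

```python
import numpy as np

def low_rank_adam_step(param, grad, state, P, lr=1e-3,
                       beta1=0.9, beta2=0.999, eps=1e-8):
    """One Adam-type step restricted to a subspace, with error feedback.
    P: (d, r) matrix with orthonormal columns; state holds 'm', 'v' of
    size r and 'error' of size d, all initialized to zeros."""
    # Re-inject the part of previous gradients that the subspace missed.
    g = grad + state["error"]
    g_low = P.T @ g                         # r-dimensional projected gradient
    state["error"] = g - P @ g_low          # what the projection left behind

    # Adam moments live only in the r-dimensional space (the memory saving).
    state["m"] = beta1 * state["m"] + (1 - beta1) * g_low
    state["v"] = beta2 * state["v"] + (1 - beta2) * g_low**2
    step_low = state["m"] / (np.sqrt(state["v"]) + eps)

    # Map the low-dimensional step back to the full parameter space.
    param -= lr * (P @ step_low)
```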

October 2024

Secondment with Neural Magic, Inc.

As part of the fellowship, I have started a six-month secondment with Neural Magic, Inc. in the USA, working with Dr. Alexandre Marques on LLM quantization.

Contacts

Building West, Level 1, 21-01-122, ISTA, Am Campus 1, 3400 Klosterneuburg, Austria