Bio

I am a postdoctoral researcher at the Institute of Science and Technology Austria (ISTA), working in the group led by Prof. Dan Alistarh. I was awarded a Marie Skłodowska-Curie Fellowship through the MSCA COFUND IST-BRIDGE program at ISTA. As part of the fellowship, I had the opportunity to work with Dr. Alexandre Marques during an industrial secondment at Neural Magic, Inc. (since acquired by Red Hat) in the USA.

Before joining ISTA in 2022, I was a postdoctoral research fellow at King Abdullah University of Science and Technology (KAUST) in Saudi Arabia from 2019 to 2022, in the group led by Prof. Peter Richtárik. Prior to that, I worked with Prof. Diogo Gomes at KAUST as a research technician from 2016 to 2019. I obtained my Ph.D. in Mathematics in 2018 from Yerevan State University (YSU) in Armenia, under the supervision of Prof. Grigori Karagulyan.

Research Interests

  • optimization (theory and algorithms), machine learning, federated learning
  • large-scale, convex/non-convex, stochastic/deterministic optimization, variance reduction
  • communication-, computation-, and memory-efficient and scalable optimization algorithms
  • collaborative learning (asynchronous, adversarial, local training, heterogeneity, etc.)
  • model compression (knowledge distillation, pruning, sparse optimization, quantization)
  • information theory (compression, encoding schemes, vector quantization)

Research

My current research focuses on optimization theory and algorithms for machine learning, with an emphasis on efficiency, scalability, and the theoretical understanding of optimization methods. These methods are particularly relevant for large-scale machine learning training and federated learning. Driven by applications in machine learning, my publications appear primarily in leading machine learning conferences such as NeurIPS, ICML, AISTATS, and ICLR, as well as journals like JMLR and TMLR.

I completed my Ph.D. in real harmonic analysis, a branch of mathematics that explores the relationship between functions or signals and their frequency domain representations. My thesis investigated the convergence and divergence properties of certain convolution-type integral operators. I defended my thesis in 2018, with a total of five journal publications, including two single-authored papers and two published in The Journal of Geometric Analysis.

In addition, I have done some research in algebra. During my undergraduate studies at YSU, I completed a research project on universal algebraic structures called dimonoids, which led to a publication in Algebra and Discrete Mathematics. Later, at KAUST, I worked on symbolic computation, specifically on developing computer algebra techniques for automating certain aspects of PDE analyses.

For the complete list of my publications, please visit my Google Scholar page.

News

June 2025

New paper on arXiv.

Unified Scaling Laws for Compressed Representations - joint work with Andrei Panferov, Alexandra Volkova, Ionut-Vlad Modoranu, Vage Egiazarian, Dan Alistarh.

Abstract: Scaling laws have shaped recent advances in machine learning by enabling predictable scaling of model performance based on model size, computation, and data volume. Concurrently, the rise in computational cost for AI has motivated model compression techniques, notably quantization and sparsification, which have emerged to mitigate the steep computational demands associated with large-scale training and inference. This paper investigates the interplay between scaling laws and compression formats, exploring whether a unified scaling framework can accurately predict model performance when training occurs over various compressed representations, such as sparse, scalar-quantized, sparse-quantized, or even vector-quantized formats. Our key contributions include validating a general scaling law formulation and showing that it applies both individually and composably across compression types. Based on this, our main finding is demonstrating, both theoretically and empirically, that there exists a simple "capacity" metric -- based on the representation's ability to fit random Gaussian data -- which can robustly predict parameter efficiency across multiple compressed representations. On the practical side, we extend our formulation to directly compare the accuracy potential of different compressed formats, and to derive better algorithms for training over sparse-quantized formats.
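
To give a concrete flavor of the capacity idea, here is a minimal sketch, assuming an illustrative Chinchilla-style loss form and a toy scalar quantizer; the constants, function names, and the exact normalization below are my own placeholders, not the paper's code:

    import numpy as np

    def scaling_law(N, D, A=400.0, alpha=0.34, B=410.0, beta=0.28, E=1.7, c=1.0):
        # Predicted loss for N parameters trained on D tokens; the capacity c of
        # the compressed format rescales the effective parameter count (c = 1
        # for dense weights). All constants here are illustrative placeholders.
        return A / (c * N) ** alpha + B / D ** beta + E

    def gaussian_fit_capacity(compress, dim=4096, trials=8, seed=0):
        # Hypothetical capacity proxy: how well the compressed representation
        # fits i.i.d. standard-Gaussian data, measured as 1 - normalized MSE.
        rng = np.random.default_rng(seed)
        ratios = []
        for _ in range(trials):
            x = rng.standard_normal(dim)
            ratios.append(np.mean((compress(x) - x) ** 2) / np.mean(x ** 2))
        return 1.0 - float(np.mean(ratios))

    # Example: a coarse scalar quantizer has capacity < 1, which shrinks the
    # effective parameter count in the loss prediction.
    quantize = lambda x: np.round(x * 4) / 4
    c = gaussian_fit_capacity(quantize)
    print(c, scaling_law(N=1e9, D=2e10, c=c), scaling_law(N=1e9, D=2e10, c=1.0))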

May 2025

New paper on arXiv.

DES-LOC: Desynced Low Communication Adaptive Optimizers for Training Foundation Models - joint work with Alex Iacob, Lorenzo Sani, Paris Giampouras, Samuel Horváth, Andrej Jovanovic, Meghdad Kurmanji, Preslav Aleksandrov, William F. Shen, Xinchi Qiu, Nicholas D. Lane.

Abstract: Scaling foundation model training with Distributed Data Parallel (DDP) methods is bandwidth-limited. Existing infrequent communication methods like Local SGD were designed to synchronize only model parameters and cannot be trivially applied to adaptive optimizers due to additional optimizer states. Current approaches extending Local SGD either lack convergence guarantees or require synchronizing all optimizer states, tripling communication costs. We propose Desynced Low Communication Adaptive Optimizers (DES-LOC), a family of optimizers assigning independent synchronization periods to parameters and momenta, enabling lower communication costs while preserving convergence. Through extensive experiments on language models with up to 1.7B parameters, we show that DES-LOC can communicate 170x less than DDP and 2x less than the previous state-of-the-art Local ADAM. Furthermore, unlike previous heuristic approaches, DES-LOC is suited for practical training scenarios prone to system failures. DES-LOC offers a scalable, bandwidth-efficient, and fault-tolerant solution for foundation model training.
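
A minimal sketch of the desynchronized-synchronization idea, based only on my reading of the abstract; the periods, the choice of which state is averaged more or less often, and the function names are illustrative, not the authors' implementation. It assumes an Adam-style optimizer whose state has already been initialized:

    import torch.distributed as dist

    def maybe_average(tensors, step, period):
        # All-reduce-average a list of tensors every `period` local steps.
        if step % period == 0:
            world = dist.get_world_size()
            for t in tensors:
                dist.all_reduce(t, op=dist.ReduceOp.SUM)
                t.div_(world)

    def desynced_sync(model, optimizer, step, p_period=64, m_period=128, v_period=256):
        # Call after optimizer.step(); each quantity gets its own period, so
        # most steps communicate nothing at all (periods here are placeholders).
        params   = [p.data for p in model.parameters()]
        exp_avgs = [optimizer.state[p]["exp_avg"]    for p in model.parameters()]
        exp_sqs  = [optimizer.state[p]["exp_avg_sq"] for p in model.parameters()]
        maybe_average(params,   step, p_period)
        maybe_average(exp_avgs, step, m_period)
        maybe_average(exp_sqs,  step, v_period)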

May 2025

New paper on arXiv.

SVD-Free Low-Rank Adaptive Gradient Optimization for Large Language Models - joint work with Ionut-Vlad Modoranu, Erik Schultheis, Dan Alistarh.

Abstract: Low-rank optimization has emerged as a promising direction in training large language models (LLMs) to reduce the memory usage of adaptive optimizers by constraining learning to a lower-dimensional space. Prior work typically projects gradients of linear layers using approaches based on Singular Value Decomposition (SVD). However, applying SVD-based procedures individually to each layer in large models is computationally expensive and incurs additional memory costs due to storing the projection matrices. In this work, we propose a computationally efficient and conceptually simple two-step procedure to approximate SVD-based gradient projections into lower-dimensional spaces. First, we construct a complete orthogonal basis using predefined orthogonal matrices of the Discrete Cosine Transform (DCT). Second, we adaptively select basis columns based on their alignment with the gradient of each layer. Each projection matrix in our method is obtained via a single matrix multiplication followed by a lightweight sorting step to identify the most relevant basis vectors. Due to the predefined nature of the orthogonal bases, they are computed once at the start of training. During training, we store only the indices of the selected columns, avoiding the need to store full projection matrices for each layer. Our numerical experiments on both pre-training and fine-tuning tasks demonstrate the effectiveness of our dual strategy in approximating optimal low-rank projections, matching the performance of costly SVD-based methods while achieving faster runtime and reduced memory usage.
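
A rough sketch of the DCT-plus-selection idea as described in the abstract; the basis construction, scoring rule, and shapes below are my own illustrative choices, not the paper's implementation:

    import numpy as np
    from scipy.fft import dct

    def dct_basis(n):
        # Orthonormal DCT-II basis (n x n), computed once before training.
        return dct(np.eye(n), type=2, norm="ortho", axis=0)

    def select_columns(grad, basis, rank):
        # Pick the `rank` basis columns best aligned with the layer gradient:
        # one matrix multiplication plus a sort, instead of a per-layer SVD.
        scores = np.linalg.norm(basis.T @ grad, axis=1)
        return np.argsort(scores)[::-1][:rank]  # only these indices are stored

    # Example on a random "gradient" of a 1024 x 4096 linear layer.
    rng = np.random.default_rng(0)
    G = rng.standard_normal((1024, 4096))
    Q = dct_basis(1024)
    idx = select_columns(G, Q, rank=64)
    G_low  = Q[:, idx].T @ G    # gradient statistics live in this 64-dim space
    G_back = Q[:, idx] @ G_low  # map the update back to the full space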

October 2024

New paper on arXiv.

LDAdam: Adaptive Optimization from Low-dimensional Gradient Statistics - joint work with Thomas Robert, Ionut-Vlad Modoranu, Dan Alistarh.

Abstract: We introduce LDAdam, a memory-efficient optimizer for training large models that performs adaptive optimization steps within lower-dimensional subspaces, while consistently exploring the full parameter space during training. This strategy keeps the optimizer's memory footprint to a fraction of the model size. LDAdam relies on a new projection-aware update rule for the optimizer states that allows for transitioning between subspaces, i.e., estimation of the statistics of the projected gradients. To mitigate the errors due to low-rank projection, LDAdam integrates a new generalized error feedback mechanism, which explicitly accounts for both gradient and optimizer state compression. We prove the convergence of LDAdam under standard assumptions, and show that LDAdam allows for accurate and efficient fine-tuning and pre-training of language models.
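
As a toy illustration of the error-feedback component only (a simplified sketch based on the abstract, not the LDAdam update rule itself): the part of the gradient lost by the low-rank projection is stored and re-injected at the next step.

    import numpy as np

    def random_orthonormal_basis(dim, rank, rng):
        # Illustrative projector; LDAdam's actual subspace choice differs.
        Q, _ = np.linalg.qr(rng.standard_normal((dim, rank)))
        return Q

    def project_with_error_feedback(G, error, rank, rng):
        # Re-inject the residual lost at the previous step, project the
        # corrected gradient, and carry the new residual forward.
        corrected = G + error
        Q = random_orthonormal_basis(G.shape[0], rank, rng)
        G_low = Q.T @ corrected             # what the adaptive optimizer sees
        new_error = corrected - Q @ G_low   # mass lost by the projection
        return G_low, Q, new_error

    # Usage: the error buffer starts at zero and is threaded through steps.
    rng = np.random.default_rng(0)
    G, error = rng.standard_normal((1024, 4096)), np.zeros((1024, 4096))
    G_low, Q, error = project_with_error_feedback(G, error, rank=64, rng=rng)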

October 2024

Secondment with Neural Magic, Inc.

As part of the fellowship, I have started a 6-month secondment with Neural Magic, Inc. in the USA, working with Dr. Alexandre Marques on LLM quantization.

Contacts

Building West, Level 1, 21-01-122, ISTA, Am Campus 1, 3400 Klosterneuburg, Austria