Daily Papers

by AK and the research community

Jul 3

Mediocrity is the key for LLM as a Judge Anchor Selection

The "LLM-as-a-judge" paradigm has become a standard method for evaluating open-ended generation. To address the quadratic scalability costs of pairwise comparisons, popular benchmarks like Arena-Hard and AlpacaEval compare all models against a single anchor. However, despite its widespread use, the impact of anchor selection on the reliability of the results remains largely unexplored. In this work, we systematically investigate the effect of anchor selection by evaluating 22 different anchors on the Arena-Hard-v2.0 dataset. We find that the choice of anchor is critical: a poor anchor can dramatically reduce correlation with human rankings. We identify that common anchor choices (best-performing and worst-performing models) make poor anchors. Because these extreme anchors are consistently better or worse than all other models, they are seldom indicative of the relative ranking of the models. We further quantify the effect size of anchor selection, showing it is comparable to the selection of a judge model. We conclude with actionable recommendations. First, we conduct a power analysis, and compute sufficient benchmark sizes for anchor-based evaluation, finding that standard benchmark sizes are insufficient for pairwise evaluation and fail to distinguish between competitive models reliably. Second, we provide guidelines for selecting informative anchors to ensure reliable and efficient evaluation practices.

  • 4 authors
·
Mar 17

On Randomness in Agentic Evals

Agentic systems are evaluated on benchmarks where agents interact with environments to solve tasks. Most papers report a pass@1 score computed from a single run per task, assuming this gives a reliable performance estimate. We test this assumption by collecting 60,000 agentic trajectories on SWE-Bench-Verified, spanning three models and two scaffolds. We find substantial variance: single-run pass@1 estimates vary by 2.2 to 6.0 percentage points depending on which run is selected, with standard deviations exceeding 1.5 percentage points even at temperature 0. This variance has critical implications: reported improvements of 2 to 3 percentage points may reflect evaluation noise rather than genuine algorithmic progress. Through token-level analysis, we show that trajectories diverge early, often within the first few percent of tokens, and that these small differences cascade into different solution strategies. To enable reliable evaluation of agentic systems, we recommend three concrete practices: (1) estimate pass@1 from multiple independent runs per task, especially when measuring small improvements, (2) use statistical power analysis to determine the number of runs needed to detect expected effect sizes, and (3) consider metrics like pass@k (optimistic bound) and pass^k (pessimistic bound) with k>1 to better characterize the full performance envelope. While these practices increase evaluation cost, they are essential for distinguishing genuine scientific progress from statistical noise.
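The pass@k and pass^k metrics named in this abstract can be estimated from repeated runs of a task. The sketch below is a minimal illustration, not the paper's code. It assumes the common combinatorial estimators: with n runs of which c succeed, pass@k is the chance that at least one of k runs drawn without replacement succeeds, and pass^k is the chance that all k succeed. The function names `pass_at_k`, `pass_hat_k`, and `mean_over_tasks` are hypothetical.

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Optimistic metric: P(at least one of k runs drawn from n passes).

    n = total runs for the task, c = runs that passed, 1 <= k <= n.
    """
    if not (0 <= c <= n and 1 <= k <= n):
        raise ValueError("need 0 <= c <= n and 1 <= k <= n")
    # comb(n - c, k) counts the draws of k runs that contain no success.
    return 1.0 - comb(n - c, k) / comb(n, k)


def pass_hat_k(n: int, c: int, k: int) -> float:
    """Pessimistic metric (pass^k): P(all k runs drawn from n pass)."""
    if not (0 <= c <= n and 1 <= k <= n):
        raise ValueError("need 0 <= c <= n and 1 <= k <= n")
    # comb(c, k) counts the draws of k runs that are all successes.
    return comb(c, k) / comb(n, k)


def mean_over_tasks(metric, results, k):
    """Average a per-task metric over (n, c) pairs, one pair per task."""
    return sum(metric(n, c, k) for n, c in results) / len(results)


if __name__ == "__main__":
    # Hypothetical results: (runs, successes) for three tasks.
    results = [(10, 10), (10, 5), (10, 0)]
    for k in (1, 2, 5):
        print(k,
              mean_over_tasks(pass_at_k, results, k),
              mean_over_tasks(pass_hat_k, results, k))
```

For k = 1 both estimators reduce to the plain success rate c/n, so the gap between them at k > 1 gives a rough picture of how much a task's outcome depends on which runs happen to be sampled.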

Full-Shape analysis of the power spectrum and bispectrum of DESI DR1 LRG and QSO samples

We present the first joint analysis of the power spectrum and bispectrum using the Data Release 1 (DR1) of the Dark Energy Spectroscopic Instrument (DESI), focusing on Luminous Red Galaxies (LRGs) and quasars (QSOs) across a redshift range of $0.4 \leq z \leq 2.1$. By combining the two- and three-point statistics, we are able to partially break the degeneracy between the logarithmic growth rate, $f(z)$, and the amplitude of dark matter fluctuations, $\sigma_{s8}(z)$, which cannot be measured separately in analyses that only involve the power spectrum. In comparison with the (fiducial) Planck ΛCDM cosmology we obtain $f/f^{\rm fid} = \{0.888_{-0.089}^{+0.186},\ 0.977_{-0.220}^{+0.182},\ 1.030_{-0.085}^{+0.368}\}$ and $\sigma_{s8}/\sigma_{s8}^{\rm fid} = \{1.224_{-0.133}^{+0.091},\ 1.071_{-0.163}^{+0.278},\ 1.000_{-0.223}^{+0.088}\}$, respectively, for the three LRG redshift bins, corresponding to a cumulative 10.1% constraint on $f$ and 8.4% on $\sigma_{s8}$, including the systematic error budget. The cumulative constraints for the ShapeFit compressed parameters from our joint power spectrum-bispectrum analysis are respectively $\sigma_{\alpha_{\rm iso}} = 0.9\%$ (9% improvement with respect to our power spectrum-only analysis); $\sigma_{\alpha_{\rm AP}} = 2.3\%$ (no improvement with respect to power spectrum-only analysis, which is expected given that the bispectrum monopole has no significant anisotropic signal); $\sigma_{f\sigma_{s8}} = 5.1\%$ (9% improvement); $\sigma_{m+n} = 2.3\%$ (11% improvement). These results are fully consistent with the main DESI power spectrum analysis, demonstrating the robustness of the DESI cosmological constraints, and compatible with Planck ΛCDM cosmology.

  • 69 authors
·
Jun 5, 2025

Optimised angular power spectra for spectroscopic galaxy surveys

The angular power spectrum is a gauge-independent observable that is in principle the natural tool for analysing galaxy number counts. In practice, the problem is that the computational requirements for next-generation spectroscopic surveys such as Euclid and the Square Kilometre Array are currently unfeasible. We propose a new method to save computational time for spectroscopic angular power spectra. This hybrid method is modelled on the Fourier power spectrum approach of treating relatively thick redshift bins (redshift width ~0.1) as separate surveys. In the hybrid method, each thick bin is further subdivided into thin bins (redshift width ~0.01); all the correlations within each thick bin are computed, while cross-bin correlations beyond the thick bins are neglected. Constraints on cosmological parameters from the hybrid method are comparable to those from the standard galaxy power spectrum analysis - but they have the advantage that cosmic evolution, wide-angle and lensing effects are naturally included, while no Alcock-Paczynski correction is needed. The hybrid method delivers much tighter constraints than a 2D tomographic approach that is typical for photometric surveys, which considers only thick bins and the correlations between them. Furthermore, for standard cosmological parameters our method is not biased by neglecting the effects of lensing on number counts, while the tomographic method is strongly biased.
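The saving from the hybrid binning described in this abstract can be illustrated by counting distinct angular spectra. The sketch below is only a counting exercise under stated assumptions, not the paper's cost model. It takes 10 thin bins per thick bin, which follows from the quoted widths of ~0.1 and ~0.01. The choice of 9 thick bins is purely hypothetical, as are the function names.

```python
def n_spectra_full(n_thin: int) -> int:
    """Distinct auto- and cross-spectra when every thin bin is
    correlated with every other thin bin across the whole survey."""
    return n_thin * (n_thin + 1) // 2


def n_spectra_hybrid(n_thick: int, thin_per_thick: int) -> int:
    """Distinct spectra when all correlations are kept inside each
    thick bin and correlations between thick bins are neglected."""
    return n_thick * n_spectra_full(thin_per_thick)


if __name__ == "__main__":
    n_thick = 9          # hypothetical number of thick bins (assumption)
    thin_per_thick = 10  # width ~0.1 split into widths of ~0.01
    print("full:  ", n_spectra_full(n_thick * thin_per_thick))
    print("hybrid:", n_spectra_hybrid(n_thick, thin_per_thick))
```

With these assumed numbers the full thin-bin analysis needs 4095 distinct spectra and the hybrid scheme needs 495. The full count grows quadratically with the number of thin bins, while the hybrid count grows only linearly with the number of thick bins.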

  • 4 authors
·
Mar 28, 2018

The NANOGrav Nine-year Data Set: Limits on the Isotropic Stochastic Gravitational Wave Background

We compute upper limits on the nanohertz-frequency isotropic stochastic gravitational wave background (GWB) using the 9-year data release from the North American Nanohertz Observatory for Gravitational Waves (NANOGrav) collaboration. We set upper limits for a GWB from supermassive black hole binaries under power law, broken power law, and free spectral coefficient GW spectrum models. We place a 95% upper limit on the strain amplitude (at a frequency of ${\rm yr}^{-1}$) in the power law model of $A_{\rm gw} < 1.5 \times 10^{-15}$. For a broken power law model, we place priors on the strain amplitude derived from simulations of Sesana (2013) and McWilliams et al. (2014). We find that the data favor a broken power law to a pure power law with odds ratios of 22 and 2.2 to one for the McWilliams and Sesana prior models, respectively. The McWilliams model is essentially ruled out by the data, and the Sesana model is in tension with the data under the assumption of a pure power law. Using the broken power-law analysis we construct posterior distributions on environmental factors that drive the binary to the GW-driven regime including the stellar mass density for stellar-scattering, mass accretion rate for circumbinary disk interaction, and orbital eccentricity for eccentric binaries, marking the first time that the shape of the GWB spectrum has been used to make astrophysical inferences. We then place the most stringent limits so far on the energy density of relic GWs, $\Omega_{\rm gw}(f)\,h^2 < 4.2 \times 10^{-10}$, yielding a limit on the Hubble parameter during inflation of $H_* = 1.6 \times 10^{-2}\,m_{\rm Pl}$, where $m_{\rm Pl}$ is the Planck mass. Our limit on the cosmic string GWB, $\Omega_{\rm gw}(f)\,h^2 < 2.2 \times 10^{-10}$, translates to a conservative limit of $G\mu < 3.3 \times 10^{-8}$, a factor of 4 better than the joint Planck and high-$l$ CMB data from other experiments.

  • 48 authors
·
Aug 12, 2015

Open High-Resolution Satellite Imagery: The WorldStrat Dataset -- With Application to Super-Resolution

Analyzing the planet at scale with satellite imagery and machine learning is a dream that has been constantly hindered by the cost of difficult-to-access highly-representative high-resolution imagery. To remediate this, we introduce here the WorldStrat dataset. The largest and most varied such publicly available dataset, at Airbus SPOT 6/7 satellites' high resolution of up to 1.5 m/pixel, empowered by European Space Agency's Phi-Lab as part of the ESA-funded QueryPlanet project, we curate nearly 10,000 sqkm of unique locations to ensure stratified representation of all types of land-use across the world: from agriculture to ice caps, from forests to multiple urbanization densities. We also enrich those with locations typically under-represented in ML datasets: sites of humanitarian interest, illegal mining sites, and settlements of persons at risk. We temporally-match each high-resolution image with multiple low-resolution images from the freely accessible lower-resolution Sentinel-2 satellites at 10 m/pixel. We accompany this dataset with an open-source Python package to: rebuild or extend the WorldStrat dataset, train and infer baseline algorithms, and learn with abundant tutorials, all compatible with the popular EO-learn toolbox. We hereby hope to foster broad-spectrum applications of ML to satellite imagery, and possibly develop from free public low-resolution Sentinel2 imagery the same power of analysis allowed by costly private high-resolution imagery. We illustrate this specific point by training and releasing several highly compute-efficient baselines on the task of Multi-Frame Super-Resolution. High-resolution Airbus imagery is CC BY-NC, while the labels and Sentinel2 imagery are CC BY, and the source code and pre-trained models under BSD. The dataset is available at https://zenodo.org/record/6810791 and the software package at https://github.com/worldstrat/worldstrat .

  • 3 authors
·
May 30, 2025

A mechanism to generate varying speed of light via Higgs-dilaton coupling: Theory and cosmological applications

We allow the Higgs field $\Phi$ to interact with a dilaton field $\chi$ of the background spacetime via the coupling $\chi^2\,\Phi^\dagger\Phi$. Upon spontaneous gauge symmetry breaking, the Higgs VEV becomes proportional to $\chi$. While traditionally this linkage is employed to make the Planck mass and particle masses dependent on $\chi$, we present an alternative mechanism: the Higgs VEV will be used to construct Planck's constant $\hbar$ and speed of light $c$. Specifically, each open set vicinity of a given point $x^*$ on the spacetime manifold is equipped with a replica of the Glashow-Weinberg-Salam action operating with its own effective values of $\hbar_*$ and $c_*$ per $\hbar_* \propto \chi^{-1/2}(x^*)$ and $c_* \propto \chi^{1/2}(x^*)$, causing these "fundamental constants" to vary alongside the dynamical field $\chi$. Moreover, in each open set around $x^*$, the prevailing value $\chi(x^*)$ determines the length and time scales for physical processes occurring in this region as $l \propto \chi^{-1}(x^*)$ and $\tau \propto \chi^{-3/2}(x^*)$. This leads to an anisotropic relation $\tau^{-1} \propto l^{-3/2}$ between the rate of clocks and the length of rods, resulting in a distinct set of novel physical phenomena. For late-time cosmology, the variation of $c$ along the trajectory of light waves from distant supernovae towards the Earth-based observer necessitates modifications to the Lemaître redshift relation and the Hubble law. These modifications are capable of: (1) Accounting for the Pantheon Catalog of SNeIa through a declining speed of light in an expanding Einstein-de Sitter universe, thus avoiding the need for dark energy; (2) Revitalizing Blanchard-Douspis-Rowan-Robinson-Sarkar's CMB power spectrum analysis that bypassed dark energy [A&A 412, 35 (2003)]; and (3) Resolving the $H_0$ tension without requiring a dynamical dark energy component.

  • 1 author
·
Aug 5, 2024

Rep-MTL: Unleashing the Power of Representation-level Task Saliency for Multi-Task Learning

Despite the promise of Multi-Task Learning in leveraging complementary knowledge across tasks, existing multi-task optimization (MTO) techniques remain fixated on resolving conflicts via optimizer-centric loss scaling and gradient manipulation strategies, yet fail to deliver consistent gains. In this paper, we argue that the shared representation space, where task interactions naturally occur, offers rich information and potential for operations complementary to existing optimizers, especially for facilitating the inter-task complementarity, which is rarely explored in MTO. This intuition leads to Rep-MTL, which exploits the representation-level task saliency to quantify interactions between task-specific optimization and shared representation learning. By steering these saliencies through entropy-based penalization and sample-wise cross-task alignment, Rep-MTL aims to mitigate negative transfer by maintaining the effective training of individual tasks instead of pure conflict-solving, while explicitly promoting complementary information sharing. Experiments are conducted on four challenging MTL benchmarks covering both task-shift and domain-shift scenarios. The results show that Rep-MTL, even paired with the basic equal weighting policy, achieves competitive performance gains with favorable efficiency. Beyond standard performance metrics, Power Law exponent analysis demonstrates Rep-MTL's efficacy in balancing task-specific learning and cross-task sharing. The project page is available at HERE.

  • 3 authors
·
Jul 28, 2025 4

PLUTO: Pathology-Universal Transformer

Pathology is the study of microscopic inspection of tissue, and a pathology diagnosis is often the medical gold standard to diagnose disease. Pathology images provide a unique challenge for computer-vision-based analysis: a single pathology Whole Slide Image (WSI) is gigapixel-sized and often contains hundreds of thousands to millions of objects of interest across multiple resolutions. In this work, we propose PathoLogy Universal TransfOrmer (PLUTO): a light-weight pathology FM that is pre-trained on a diverse dataset of 195 million image tiles collected from multiple sites and extracts meaningful representations across multiple WSI scales that enable a large variety of downstream pathology tasks. In particular, we design task-specific adaptation heads that utilize PLUTO's output embeddings for tasks which span pathology scales ranging from subcellular to slide-scale, including instance segmentation, tile classification, and slide-level prediction. We compare PLUTO's performance to other state-of-the-art methods on a diverse set of external and internal benchmarks covering multiple biologically relevant tasks, tissue types, resolutions, stains, and scanners. We find that PLUTO matches or outperforms existing task-specific baselines and pathology-specific foundation models, some of which use orders-of-magnitude larger datasets and model sizes when compared to PLUTO. Our findings present a path towards a universal embedding to power pathology image analysis, and motivate further exploration around pathology foundation models in terms of data diversity, architectural improvements, sample efficiency, and practical deployability in real-world applications.

  • 33 authors
·
May 13, 2024

AirCast-SR: A Foundation Model for Kilometer-Scale Atmospheric Super-Resolution via Latent Consistency Diffusion

Operational weather prediction at kilometer scales remains computationally prohibitive for traditional numerical weather prediction (NWP) models, limiting forecast access for applications in energy, agriculture, and disaster management that require fine-grained spatiotemporal detail. Here we introduce AirCast-SR, a foundation model for atmospheric super-resolution that downscales global AI weather forecasts from 0.25 degree (~28 km) to 1 km horizontal resolution at hourly temporal resolution, producing 67-hour forecasts of eight coupled surface variables simultaneously. AirCast-SR employs a three-dimensional U-Net conditioned within a Latent Consistency Model (LCM) diffusion framework, trained on patch-based samples over the contiguous United States (CONUS) using GraphCast forecasts as input and NOAA's Analysis of Record for Calibration (AORC) as the target. The model achieves near-zero bias across all variables and lead times, and its radial power spectral density analysis demonstrates preservation of fine-scale atmospheric structure at wavelengths of 10 km to 100 km where coarser models lose spectral power. We validate AirCast-SR across three CONUS case studies spanning winter, summer, and spring seasons, and demonstrate zero-shot global transferability over India and Germany using independent surface station observations without any retraining or fine-tuning. As an open-weights foundation model, AirCast-SR establishes a new paradigm for kilometer-scale AI weather prediction and provides a platform for regional fine-tuning, distillation, and downstream applications in climate services and hazard forecasting.

  • 14 authors
·
May 19

A Review of Deep Learning with Special Emphasis on Architectures, Applications and Recent Trends

Deep learning has solved a problem that as little as five years ago was thought by many to be intractable - the automatic recognition of patterns in data; and it can do so with accuracy that often surpasses human beings. It has solved problems beyond the realm of traditional, hand-crafted machine learning algorithms and captured the imagination of practitioners trying to make sense out of the flood of data that now inundates our society. As public awareness of the efficacy of DL increases so does the desire to make use of it. But even for highly trained professionals it can be daunting to approach the rapidly increasing body of knowledge produced by experts in the field. Where does one start? How does one determine if a particular model is applicable to their problem? How does one train and deploy such a network? A primer on the subject can be a good place to start. With that in mind, we present an overview of some of the key multilayer ANNs that comprise DL. We also discuss some new automatic architecture optimization protocols that use multi-agent approaches. Further, since guaranteeing system uptime is becoming critical to many computer applications, we include a section on using neural networks for fault detection and subsequent mitigation. This is followed by an exploratory survey of several application areas where DL has emerged as a game-changing technology: anomalous behavior detection in financial applications or in financial time-series forecasting, predictive and prescriptive analytics, medical image processing and analysis and power systems research. The thrust of this review is to outline emerging areas of application-oriented research within the DL community as well as to provide a reference to researchers seeking to use it in their work for what it does best: statistical pattern recognition with unparalleled learning capacity with the ability to scale with information.

  • 8 authors
·
May 30, 2019

Unbiased analysis of primordial non-Gaussianity: the multipoles of the full relativistic power spectrum

A major goal of ongoing and future cosmological surveys of the large-scale structure is to measure local type primordial non-Gaussianity in the galaxy power spectrum through the scale-dependent bias. General relativistic effects have been shown to be degenerate with this measurement, therefore requiring a non-Newtonian approach. In this work, we develop a consistent framework to compute integrated effects, including lensing convergence, time delay, and integrated Sachs-Wolfe, along with the local relativistic projection and wide-separation corrections in the multipoles of the power spectrum. We show that, for a Euclid-like Hα-line galaxy survey and a MegaMapper-like Lyman-break galaxy survey, ignoring these effects leads to a bias on the best fit measurement of the amplitude of primordial non-Gaussianity, $f_{\rm NL}$, of around $3\,\sigma$ and $20\,\sigma$, respectively. When we include these corrections, the uncertainty in our knowledge of the luminosity function leads to further uncertainty in our measurement of $f_{\rm NL}$. In this work, we show that this degeneracy can be partly mitigated by using a bright-faint multi-tracer analysis, where the observed galaxy sample is subdivided into two separate populations based on luminosity, which provides a 15-20% improvement on the forecasted constraints of local type $f_{\rm NL}$. In addition, we present a novel calculation of the full multi-tracer covariance with the inclusion of wide-separation corrections. All of these results are implemented in the Python code CosmoWAP.

  • 8 authors
·
Jun 17

Convergence of Iterative Water-Filling in Multi-User Non-Cooperative Power Control: A Comprehensive Analysis for Sequential, Simultaneous, and Asynchronous Schemes

Non-cooperative game theory provides a robust framework for analyzing distributed resource allocation in multi-user wireless networks, with Iterative Water-Filling (IWF) emerging as a canonical solution for power control problems. Although classical fixed-point theorems guarantee the existence of a Nash Equilibrium (NE) under mild concavity and compactness conditions, the convergence of practical iterative algorithms to that equilibrium remains a challenging endeavor. This challenge intensifies under varying update schedules, interference regimes, and imperfections such as channel estimation errors or feedback delay. In this paper, we present an in-depth examination of IWF in multi-user systems under three different update schemes: (1) synchronous sequential updates, (2) synchronous simultaneous updates, and (3) totally asynchronous updates. We first formulate the water-filling operator in a multi-carrier environment, then recast the iterative process as a fixed-point problem. Using contraction mapping principles, we demonstrate sufficient conditions under which IWF converges to a unique NE and highlight how spectral radius constraints, diagonal dominance, and careful step-size selection are pivotal for guaranteeing convergence. We further discuss robustness to measurement noise, partial updates, and network scaling to emphasize the practical viability of these schemes. This comprehensive analysis unifies diverse threads in the literature while offering novel insights into asynchronous implementations. Our findings enable network designers to ascertain system parameters that foster both stable convergence and efficient spectrum usage.
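The sequential update scheme mentioned in this abstract can be sketched in a few lines. The code below is a minimal illustration under stated assumptions, not the paper's algorithm or notation. Each user in turn treats the others' current powers as noise and water-fills its own power budget across carriers. The gain layout `gains[j][i][k]` (transmitter j to receiver i on carrier k), the example numbers, and the function names are all hypothetical. The cross gains are deliberately weak so the iteration settles quickly.

```python
def water_fill(levels, budget):
    """Single-user water-filling: p_k = max(0, mu - levels[k]) with
    sum(p) == budget, where levels[k] is (noise + interference) / gain."""
    order = sorted(levels)
    mu = order[0]
    # Find the largest set of lowest-level carriers that all sit under water.
    for m in range(len(order), 0, -1):
        mu = (budget + sum(order[:m])) / m
        if mu > order[m - 1]:
            break
    return [max(0.0, mu - lv) for lv in levels]


def sequential_iwf(gains, noise, budgets, iters=100, tol=1e-9):
    """Sequential iterative water-filling: users update one after another,
    each using the most recent powers of all other users."""
    n_users, n_car = len(budgets), len(noise[0])
    p = [[0.0] * n_car for _ in range(n_users)]
    for _ in range(iters):
        max_change = 0.0
        for i in range(n_users):
            levels = []
            for k in range(n_car):
                interference = sum(gains[j][i][k] * p[j][k]
                                   for j in range(n_users) if j != i)
                levels.append((noise[i][k] + interference) / gains[i][i][k])
            new_p = water_fill(levels, budgets[i])
            max_change = max(max_change,
                             max(abs(a - b) for a, b in zip(new_p, p[i])))
            p[i] = new_p
        if max_change < tol:  # no user wants to change: a fixed point
            break
    return p


# Hypothetical two-user, two-carrier example with weak cross gains.
GAINS = [[[1.0, 1.0], [0.1, 0.1]],
         [[0.1, 0.1], [1.0, 1.0]]]
NOISE = [[0.1, 0.2], [0.2, 0.1]]
BUDGETS = [1.0, 1.0]

if __name__ == "__main__":
    print(sequential_iwf(GAINS, NOISE, BUDGETS))
```

A simultaneous variant would compute every user's `new_p` from the same snapshot of `p` before applying any of them. The sketch does not check the convergence conditions discussed in the abstract, so stronger cross gains may fail to converge.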

  • 1 author
·
Feb 17, 2025

Analysis and Applications of Deep Learning with Finite Samples in Full Life-Cycle Intelligence of Nuclear Power Generation

The advent of Industry 4.0 has precipitated the incorporation of Artificial Intelligence (AI) methods within industrial contexts, aiming to realize intelligent manufacturing, operation as well as maintenance, also known as industrial intelligence. However, intricate industrial milieus, particularly those relating to energy exploration and production, frequently encompass data characterized by long-tailed class distribution, sample imbalance, and domain shift. These attributes pose noteworthy challenges to data-centric Deep Learning (DL) techniques, crucial for the realization of industrial intelligence. The present study centers on the intricate and distinctive industrial scenarios of Nuclear Power Generation (NPG), meticulously scrutinizing the application of DL techniques under the constraints of finite data samples. Initially, the paper expounds on potential employment scenarios for AI across the full life-cycle of NPG. Subsequently, we delve into an evaluative exposition of DL's advancement, grounded in the finite sample perspective. This encompasses aspects such as small-sample learning, few-shot learning, zero-shot learning, and open-set recognition, also referring to the unique data characteristics of NPG. The paper then proceeds to present two specific case studies. The first revolves around the automatic recognition of zirconium alloy metallography, while the second pertains to open-set recognition for signal diagnosis of machinery sensors. These cases, spanning the entirety of NPG's life-cycle, are accompanied by constructive outcomes and insightful deliberations. By exploring and applying DL methodologies within the constraints of finite sample availability, this paper not only furnishes a robust technical foundation but also introduces a fresh perspective toward the secure and efficient advancement and exploitation of this advanced energy source.

  • 11 authors
·
Nov 7, 2023

Proximity QA: Unleashing the Power of Multi-Modal Large Language Models for Spatial Proximity Analysis

Multi-modal large language models (MLLMs) have demonstrated remarkable vision-language capabilities, primarily due to the exceptional in-context understanding and multi-task learning strengths of large language models (LLMs). The advent of visual instruction tuning has further enhanced MLLMs' performance in vision-language understanding. However, while existing MLLMs adeptly recognize what objects are in an image, they still face challenges in effectively discerning where these objects are, particularly along the distance (scene depth) axis. To overcome this limitation in MLLMs, we introduce Proximity Question Answering (Proximity QA), a novel framework designed to enable MLLMs to infer the proximity relationship between objects in images. The framework operates in two phases: the first phase focuses on guiding the models to understand the relative depth of objects, and the second phase further encourages the models to infer the proximity relationships between objects based on their depth perceptions. We also propose a VQA dataset called Proximity-110K, containing additional instructions that incorporate depth information and the proximity relationships of objects. We have conducted extensive experiments to validate Proximity QA's superior ability in depth perception and proximity analysis, outperforming other state-of-the-art MLLMs. Code and dataset will be released at https://github.com/NorthSummer/ProximityQA.git.

  • 5 authors
·
Jan 31, 2024

LiveFMBench: Unveiling the Power and Limits of Agentic Workflows in Specification Generation

Formal specification is essential for rigorous program verification, yet writing correct specifications remains costly and difficult to automate. Although large language models (LLMs) and agents have shown promising progress, their true capabilities and failure modes remain unclear. We present the first systematic and contamination-aware study of LLM- and agent-based formal specification generation for C programs. We introduce LiveFMBench, a continuously evolving benchmark of 630 ACSL (ANSI/ISO C Specification Language)-annotated C programs, including 360 newly collected cases designed to mitigate data leakage. Using this benchmark, we evaluate direct prompting with different sampling sizes, reasoning-enabled (thinking mode) inference, the agentic pipeline, and perform a fine-grained failure analysis. Experimental results reveal that naive evaluation substantially overestimates performance because models under direct prompting may exhibit unfaithful behaviors, such as deceiving automated provers or ignoring code-context constraints; after excluding such cases, the true specification generation accuracy drops by approximately 20%. We further find that both increased sampling and thinking mode significantly improve success rates, with smaller models benefiting more from thinking mode. Agentic pipelines are particularly effective under low sampling budgets and on harder datasets. Failure analysis further shows that incorrect loop invariants are the dominant error type, while agentic pipelines notably reduce assertion errors. These results expose fundamental limitations in current LLM-based approaches and suggest they remain far from replacing human-authored formal specifications. We release LiveFMBench at https://huggingface.co/datasets/fm-universe/Live-FM-Bench and all evaluation artifacts to support future research.

  • 12 authors
·
May 1

LongCaptioning: Unlocking the Power of Long Video Caption Generation in Large Multimodal Models

Large Multimodal Models (LMMs) have demonstrated exceptional performance in video captioning tasks, particularly for short videos. However, as the length of the video increases, generating long, detailed captions becomes a significant challenge. In this paper, we investigate the limitations of LMMs in generating long captions for long videos. Our analysis reveals that open-source LMMs struggle to consistently produce outputs exceeding 300 words, leading to incomplete or overly concise descriptions of the visual content. This limitation hinders the ability of LMMs to provide comprehensive and detailed captions for long videos, ultimately missing important visual information. Through controlled experiments, we find that the scarcity of paired examples with long captions during training is the primary factor limiting the model's output length. However, manually annotating long-caption examples for long-form videos is time-consuming and expensive. To overcome the annotation bottleneck, we propose the LongCaption-Agent, a framework that synthesizes long caption data by hierarchical semantic aggregation. Using LongCaption-Agent, we curated a new long-caption dataset, LongCaption-10K. We also develop LongCaption-Bench, a benchmark designed to comprehensively evaluate the quality of long captions generated by LMMs. By incorporating LongCaption-10K into training, we enable LMMs to generate captions exceeding 1,000 words for long-form videos, while maintaining high output quality. In LongCaption-Bench, our model achieved state-of-the-art performance, even surpassing larger proprietary models like GPT4o.

  • 5 authors
·
Feb 21, 2025

Chain-of-Experts: Unlocking the Communication Power of Mixture-of-Experts Models

We propose Chain-of-Experts (CoE), a new Mixture-of-Experts (MoE) architecture that introduces sequential expert communication within each layer. Unlike traditional MoE models, where experts operate independently in parallel, CoE processes tokens iteratively across a chain of experts inside a layer. To support dynamic expert selection across iterations, CoE employs a dedicated router at each iteration step within a layer. This design allows tokens to re-evaluate and select different experts during each iteration, rather than being statically assigned. As a result, CoE introduces a flexible routing mechanism that increases the diversity of expert combinations and enriches the model's representational capacity. CoE demonstrates improved performance under fixed compute: on math reasoning tasks, it reduces validation loss from 1.20 to 1.12 compared to a standard MoE. Beyond performance, CoE offers a new scaling axis: depth through expert iteration, which complements conventional width/depth scaling. For example, using 2x iterations matches the performance of 3x expert selections (in width), while reducing memory usage by 17.6-42% relative to other scaling strategies. Our analysis reveals that CoE's benefits stem from its iterative residual structure and enhanced expert specialization empowered by iterative routing, which together unlock more expressive representations. Code is available at https://github.com/ZihanWang314/coe.

  • 10 authors
·
Jun 22, 2025 1
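
The iterative routing described in the Chain-of-Experts abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the tanh experts, the random weights, the `top_k=2` selection, and the choice to share one expert pool across iterations are all assumptions made for illustration. What it does take from the abstract is one dedicated router per iteration and a residual update between iterations.

```python
import math
import random

def softmax(xs):
    m = max(xs)
    es = [math.exp(x - m) for x in xs]
    s = sum(es)
    return [e / s for e in es]

def matvec(W, x):
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]

def make_matrix(rng, rows, cols, scale=0.1):
    return [[rng.gauss(0.0, scale) for _ in range(cols)] for _ in range(rows)]

def coe_layer(x, experts, routers, top_k=2):
    """Pass one token vector through a chain of expert iterations.

    experts: list of (d x d) matrices, shared by all iterations (assumption).
    routers: one (num_experts x d) routing matrix per iteration.
    """
    chosen_per_iter = []
    for router in routers:                    # a dedicated router per iteration
        scores = softmax(matvec(router, x))   # routing sees the current state
        top = sorted(range(len(experts)),
                     key=lambda e: scores[e], reverse=True)[:top_k]
        norm = sum(scores[e] for e in top)
        update = [0.0] * len(x)
        for e in top:
            out = [math.tanh(v) for v in matvec(experts[e], x)]
            update = [u + (scores[e] / norm) * o for u, o in zip(update, out)]
        x = [xi + ui for xi, ui in zip(x, update)]  # residual between iterations
        chosen_per_iter.append(sorted(top))
    return x, chosen_per_iter

rng = random.Random(0)
d, num_experts, iterations = 8, 4, 2          # hypothetical sizes
experts = [make_matrix(rng, d, d) for _ in range(num_experts)]
routers = [make_matrix(rng, num_experts, d, scale=1.0) for _ in range(iterations)]
token = [rng.gauss(0.0, 1.0) for _ in range(d)]
out, chosen = coe_layer(token, experts, routers)
print(chosen)  # expert indices selected at each iteration
```

Because each router scores the token after the previous iteration has updated it, the second iteration is free to pick a different expert set than the first, which is the re-selection behavior the abstract describes.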

Sparse-vDiT: Unleashing the Power of Sparse Attention to Accelerate Video Diffusion Transformers

While Diffusion Transformers (DiTs) have achieved breakthroughs in video generation, this long sequence generation task remains constrained by the quadratic complexity of attention mechanisms, resulting in significant inference latency. Through detailed analysis of attention maps in Video Diffusion Transformer (vDiT), we identify three recurring sparsity patterns: diagonal, multi-diagonal, and vertical-stripe structures. Moreover, 3-6% of attention heads can be skipped entirely. Crucially, these patterns exhibit strong layer-depth and head-position correlations but show limited dependence on the input content. Leveraging these findings, we propose Sparse-vDiT, a sparsity acceleration framework for vDiT comprising: 1) Pattern-optimized sparse kernels that replace dense attention with computationally efficient implementations for each identified sparsity pattern. 2) An offline sparse diffusion search algorithm that selects the optimal sparse computation strategy per layer and head via hardware-aware cost modeling. After determining the optimal configuration, we fuse heads within the same layer that share the same attention strategy, enhancing inference efficiency. Integrated into state-of-the-art vDiT models (CogVideoX1.5, HunyuanVideo, and Wan2.1), Sparse-vDiT achieves 2.09×, 2.38×, and 1.67× theoretical FLOP reduction, and actual inference speedups of 1.76×, 1.85×, and 1.58×, respectively, while maintaining high visual fidelity, with PSNR values reaching 24.13, 27.09, and 22.59. Our work demonstrates that latent structural sparsity in vDiTs can be systematically exploited for long video synthesis.

  • 8 authors
·
Jun 3, 2025 2
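
To make the three sparsity patterns named in the Sparse-vDiT abstract concrete, here is a minimal sketch that builds each one as a boolean attention mask. The parameters `width`, `stride`, and `columns` are hypothetical, as is the reading of the multi-diagonal pattern as bands repeating at a fixed token offset; the paper's actual kernels and pattern definitions may differ.

```python
def diagonal_mask(n, width):
    """Keep entries within `width` positions of the main diagonal."""
    return [[abs(i - j) <= width for j in range(n)] for i in range(n)]

def multi_diagonal_mask(n, width, stride):
    """Keep bands centred on offsets that are multiples of `stride`."""
    def dist_to_band(i, j):
        # Distance from (i - j) to the nearest multiple of stride.
        return min((i - j) % stride, (j - i) % stride)
    return [[dist_to_band(i, j) <= width for j in range(n)] for i in range(n)]

def vertical_stripe_mask(n, columns):
    """Every query attends only to a fixed set of key columns."""
    cols = set(columns)
    return [[j in cols for j in range(n)] for _ in range(n)]

def density(mask):
    """Fraction of attention entries that are still computed."""
    total = sum(len(row) for row in mask)
    return sum(sum(row) for row in mask) / total

n = 6
print(density(diagonal_mask(n, 0)))           # only the main diagonal
print(density(multi_diagonal_mask(n, 0, 3)))  # diagonals at offsets 0 and ±3
print(density(vertical_stripe_mask(n, [0])))  # a single key column
```

The density of a mask gives a rough upper bound on the FLOP saving from skipping masked entries, which is separate from the measured speedups the abstract reports.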

Economies of Open Intelligence: Tracing Power & Participation in the Model Ecosystem

Since 2019, the Hugging Face Model Hub has been the primary global platform for sharing open weight AI models. By releasing a dataset of the complete history of weekly model downloads (June 2020-August 2025) alongside model metadata, we provide the most rigorous examination to-date of concentration dynamics and evolving characteristics in the open model economy. Our analysis spans 851,000 models, over 200 aggregated attributes per model, and 2.2B downloads. We document a fundamental rebalancing of economic power: US open-weight industry dominance by Google, Meta, and OpenAI has declined sharply in favor of unaffiliated developers, community organizations, and, as of 2025, Chinese industry, with DeepSeek and Qwen models potentially heralding a new consolidation of market power. We identify statistically significant shifts in model properties, a 17X increase in average model size, rapid growth in multimodal generation (3.4X), quantization (5X), and mixture-of-experts architectures (7X), alongside concerning declines in data transparency, with open weights models surpassing truly open source models for the first time in 2025. We expose a new layer of developer intermediaries that has emerged, focused on quantizing and adapting base models for both efficiency and artistic expression. To enable continued research and oversight, we release the complete dataset with an interactive dashboard for real-time monitoring of concentration dynamics and evolving properties in the open model economy.

economies-open-ai Economies
·
Nov 27, 2025 2

Divide-and-Conquer Meets Consensus: Unleashing the Power of Functions in Code Generation

Despite recent progress made by large language models in code generation, they still struggle with programs that meet complex requirements. Recent work utilizes plan-and-solve decomposition to decrease the complexity and leverage self-tests to refine the generated program. Yet, planning deep-inside requirements in advance can be challenging, and the tests need to be accurate to accomplish self-improvement. To this end, we propose FunCoder, a code generation framework incorporating the divide-and-conquer strategy with functional consensus. Specifically, FunCoder recursively branches off sub-functions as smaller goals during code generation, represented by a tree hierarchy. These sub-functions are then composited to attain more complex objectives. Additionally, we designate functions via a consensus formed by identifying similarities in program behavior, mitigating error propagation. FunCoder outperforms state-of-the-art methods by +9.8% on average in HumanEval, MBPP, xCodeEval and MATH with GPT-3.5 and GPT-4. Moreover, our method demonstrates superiority on smaller models: With FunCoder, StableCode-3b surpasses GPT-3.5 by +18.6% and achieves 97.7% of GPT-4's performance on HumanEval. Further analysis reveals that our proposed dynamic function decomposition is capable of handling complex requirements, and the functional consensus prevails over self-testing in correctness evaluation.

  • 7 authors
·
May 30, 2024
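
The functional consensus idea in the FunCoder abstract, choosing among candidate functions by similarity of program behavior, can be sketched as follows. This is a simplified illustration under stated assumptions, not FunCoder's implementation: the helper name `select_by_consensus`, the exact-match similarity measure, and the toy candidates are all hypothetical.

```python
def select_by_consensus(candidates, inputs):
    """Return the candidate whose outputs agree most with the other candidates."""
    def behavior(fn):
        outs = []
        for args in inputs:
            try:
                outs.append(repr(fn(*args)))
            except Exception as exc:
                # Crashes are recorded as behavior too, so two candidates
                # that fail the same way count as agreeing on that input.
                outs.append("error:" + type(exc).__name__)
        return outs

    behaviors = [behavior(fn) for fn in candidates]

    def similarity(a, b):
        # Number of sampled inputs on which two candidates agree.
        return sum(x == y for x, y in zip(a, b))

    scores = [
        sum(similarity(b, other) for k, other in enumerate(behaviors) if k != i)
        for i, b in enumerate(behaviors)
    ]
    best = max(range(len(candidates)), key=lambda i: scores[i])
    return candidates[best], scores

# Three hypothetical candidate implementations of absolute value.
candidates = [
    lambda x: x if x >= 0 else -x,
    lambda x: abs(x),
    lambda x: x,                    # buggy for negative inputs
]
inputs = [(-2,), (0,), (3,)]
chosen, scores = select_by_consensus(candidates, inputs)
print(scores)
```

No reference tests are needed here: the buggy candidate is outvoted because it disagrees with the majority on the negative input. This only helps when most candidates are correct, since a shared mistake would win the vote.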

Analysis of Two Models for the Angular Structure of the Outflows Producing the Swift/XRT "Larger-Angle Emission" of Gamma-Ray Bursts

The instantaneous emission from a relativistic surface endowed with a Lorentz factor Γ that decreases away from the outflow symmetry axis can naturally explain the three phases observed by Swift/XRT in GRBs and their afterglows (GRB tail, afterglow plateau and post-plateau). We expand the analytical formalism of the "Larger-Angle Emission" model previously developed for "Power-Law" outflows to "n-Exponential" outflows (e.g. exponential with n=1 and Gaussian with n=2) and compare their abilities to account for the X-ray emission of XRT afterglows. We assume power-law Γ-dependences of two spectral characteristics (peak-energy and peak intensity) and find that, unlike Power-Law outflows, n-Exponential outflows cannot account for plateaus with a temporal dynamical range larger than 100. To include all information existing in the Swift/XRT measurements of X-ray afterglows (0.3-10 keV unabsorbed flux and effective spectral slope), we calculate 0.3 keV and 10 keV light-curves using a broken power-law emission spectrum of peak-energy and low- and high-energy slopes that are derived from the effective slope measured by XRT. This economical peak-energy determination is found to be consistent with more expensive spectral fits. The angular distributions of the Lorentz factor, comoving frame peak-energy, and peak-intensity (Γ(θ), E'_p(θ), i'_p(θ)) constrain the (yet-to-be determined) convolution of various features of the production of relativistic jets by solar-mass black-holes and of their propagation through the progenitor/circumburst medium, while the E'_p(Γ) and i'_p(Γ) dependences may constrain the GRB dissipation mechanism and the GRB emission process.

  • 1 authors
·
May 9, 2025

Unlock the Power: Competitive Distillation for Multi-Modal Large Language Models

Recently, multi-modal content generation has attracted considerable attention from researchers investigating visual instruction tuning based on large language models (LLMs). To enhance the performance and generalization ability of such LLMs, the practice of distilling knowledge from pretrained multi-modal models (a.k.a. teachers) to more compact multi-modal LLMs (students) has gained considerable interest. However, the prevailing paradigm of instruction tuning in multi-modal LLM knowledge distillation is resource-intensive and unidirectional, neglecting the potential for mutual feedback between the student and teacher models. Thus, we propose an innovative Competitive Multi-modal Distillation framework (CoMD), which captures bidirectional feedback between teacher and student models and continually updates the multi-modal capabilities that the student model has learned. It comprises two stages: multi-modal pre-training and multi-modal competitive distillation. The first stage pre-trains the student model on a large number of filtered multi-modal datasets. The second stage facilitates a bidirectional knowledge transfer between the student and teacher models. Our experimental analysis of diverse datasets shows that our knowledge transfer method consistently improves the capabilities of the student model. Finally, the 7B-sized student model after four distillations surpasses the current state-of-the-art model LLaVA-13B on the ScienceQA and LLaVA Test datasets, and also outperforms other strong baselines in the zero-shot setting.

  • 4 authors
·
Nov 14, 2023

POLCA: Power Oversubscription in LLM Cloud Providers

Recent innovations in large language models (LLMs) and their myriad use cases have rapidly driven up the compute capacity demand for datacenter GPUs. Several cloud providers and other enterprises have made substantial plans to grow their datacenters to support these new workloads. One of the key bottleneck resources in datacenters is power, and given the increasing model sizes of LLMs, they are becoming increasingly power intensive. In this paper, we show that there is a significant opportunity to oversubscribe power in LLM clusters. Power oversubscription improves the power efficiency of these datacenters, allowing more deployable servers per datacenter, and reduces the deployment time, since building new datacenters is slow. We extensively characterize the power consumption patterns of a variety of LLMs and their configurations. We identify the differences between the inference and training power consumption patterns. Based on our analysis of these LLMs, we claim that the average and peak power utilization in LLM clusters for inference should not be very high. Our deductions align with the data from production LLM clusters, revealing that inference workloads offer substantial headroom for power oversubscription. However, the stringent set of telemetry and controls that GPUs offer in a virtualized environment makes it challenging to have a reliable and robust power oversubscription mechanism. We propose POLCA, our framework for power oversubscription that is robust, reliable, and readily deployable for GPU clusters. Using open-source models to replicate the power patterns observed in production, we simulate POLCA and demonstrate that we can deploy 30% more servers in the same GPU cluster for inference, with minimal performance loss.

  • 7 authors
·
Aug 24, 2023

EXAdam: The Power of Adaptive Cross-Moments

This paper introduces EXAdam (EXtended Adam), a novel optimization algorithm that builds upon the widely-used Adam optimizer. EXAdam incorporates three key enhancements: (1) new debiasing terms for improved moment estimation, (2) a gradient-based acceleration mechanism for increased responsiveness to the current loss landscape, and (3) a dynamic step size formula that allows for continuous growth of the learning rate throughout training. These innovations work synergistically to address limitations of the original Adam algorithm, potentially offering improved convergence properties, enhanced ability to escape saddle points, and greater robustness to hyperparameter choices. I provide a theoretical analysis of EXAdam's components and their interactions, highlighting the algorithm's potential advantages in navigating complex optimization landscapes. Empirical evaluations demonstrate EXAdam's superiority over Adam, achieving 48.07% faster convergence and yielding improvements of 4.6%, 4.13%, and 2.39% in training, validation, and testing accuracies, respectively, when applied to a CNN trained on the CIFAR-10 dataset. While these results are promising, further empirical validation across diverse tasks is essential to fully gauge EXAdam's efficacy. Nevertheless, EXAdam represents a significant advancement in adaptive optimization techniques, with promising implications for a wide range of machine learning applications. This work aims to contribute to the ongoing development of more efficient, adaptive, and universally applicable optimization methods in the field of machine learning and artificial intelligence.

  • 1 authors
·
Dec 28, 2024

ZeroTuning: Unlocking the Initial Token's Power to Enhance Large Language Models Without Training

Recently, training-free methods for improving large language models (LLMs) have attracted growing interest, with token-level attention tuning emerging as a promising and interpretable direction. However, existing methods typically rely on auxiliary mechanisms to identify important or irrelevant task-specific tokens, introducing potential bias and limiting applicability. In this paper, we uncover a surprising and elegant alternative: the semantically empty initial token is a powerful and underexplored control point for optimizing model behavior. Through theoretical analysis, we show that tuning the initial token's attention sharpens or flattens the attention distribution over subsequent tokens, and its role as an attention sink amplifies this effect. Empirically, we find that: (1) tuning its attention improves LLM performance more effectively than tuning other task-specific tokens; (2) the effect follows a consistent trend across layers, with earlier layers having greater impact, but varies across attention heads, with different heads showing distinct preferences in how they attend to this token. Based on these findings, we propose ZeroTuning, a training-free approach that improves LLM performance by applying head-specific attention adjustments to this special token. Despite tuning only one token, ZeroTuning achieves higher performance on text classification, multiple-choice, and multi-turn conversation tasks across models such as Llama, Qwen, and DeepSeek. For example, ZeroTuning improves Llama-3.1-8B by 11.71% on classification, 2.64% on QA tasks, and raises its multi-turn score from 7.804 to 7.966. The method is also robust to limited resources, few-shot settings, long contexts, quantization, decoding strategies, and prompt variations. Our work sheds light on a previously overlooked control point in LLMs, offering new insights into both inference-time tuning and model interpretability.

  • 4 authors
·
May 16, 2025
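
The ZeroTuning abstract says that tuning the initial token's attention sharpens or flattens the distribution over later tokens. A small worked example shows one way this can happen. It assumes the adjustment multiplies the initial token's post-softmax weight by a factor and then renormalizes; the function name and the factor `gamma` are hypothetical, and the paper's exact procedure may differ.

```python
def scale_initial_token(attn, gamma):
    """Scale the first token's attention weight by gamma, then renormalize."""
    scaled = [attn[0] * gamma] + list(attn[1:])
    total = sum(scaled)
    return [a / total for a in scaled]

# One attention row: the initial token acts as a sink holding most of the mass.
attn = [0.6, 0.25, 0.15]

sharp = scale_initial_token(attn, 0.5)   # less weight on the initial token
flat = scale_initial_token(attn, 2.0)    # more weight on the initial token

# Gap between the two later tokens, before and after the adjustment.
base_gap = attn[1] - attn[2]
sharp_gap = sharp[1] - sharp[2]
flat_gap = flat[1] - flat[2]
print(base_gap, sharp_gap, flat_gap)
```

Under this assumption the ratio between later tokens never changes, but the absolute gap between them grows when the sink is down-weighted and shrinks when it is up-weighted. The effect is larger the more mass the initial token holds, which is consistent with the abstract's remark that its role as an attention sink amplifies the effect.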

Transient Stability Analysis with Physics-Informed Neural Networks

We explore the possibility to use physics-informed neural networks to drastically accelerate the solution of ordinary differential-algebraic equations that govern the power system dynamics. When it comes to transient stability assessment, the traditionally applied methods either carry a significant computational burden, require model simplifications, or use overly conservative surrogate models. Conventional neural networks can circumvent these limitations but are faced with high demand of high-quality training datasets, while they ignore the underlying governing equations. Physics-informed neural networks are different: they incorporate the power system differential algebraic equations directly into the neural network training and drastically reduce the need for training data. This paper takes a deep dive into the performance of physics-informed neural networks for power system transient stability assessment. Introducing a new neural network training procedure to facilitate a thorough comparison, we explore how physics-informed neural networks compare with conventional differential-algebraic solvers and classical neural networks in terms of computation time, requirements in data, and prediction accuracy. We illustrate the findings on the Kundur two-area system, and assess the opportunities and challenges of physics-informed neural networks to serve as a transient stability analysis tool, highlighting possible pathways to further develop this method.

  • 3 authors
·
Mar 14, 2023

The Hidden Power of Scaling Factor in LoRA Optimization

In Low-Rank Adaptation (LoRA), the scaling factor α is often treated as a mere complement to the learning rate, yet its role in optimization remains poorly understood. In this paper, we reveal that the scaling factor α and the learning rate function differently, with α emerging as the dominant driver of effective optimization, delivering gains that cannot be replicated by learning rate scaling alone. Combining extensive empirical analysis with a theoretical Signal-Drift framework, we report three findings about LoRA's scaling mechanism. First, LoRA's spectral suppression smooths the optimization landscape, rendering standard hyperparameters overly conservative and creating an optimization gap. Second, when leveraging this smoothness to accelerate convergence, α outperforms the learning rate by amplifying the task signal without increasing the drift ratio. Third, the optimal scaling factor follows a sublinear relationship with the rank, well characterized by a square-root law with an unexpectedly large coefficient, revealing the insufficient scaling of existing rank-tied heuristics. Based on these insights, we propose LoRA-α, a minimalist framework that restores α to its principled regime, making LoRA compatible with standard small learning rates. Extensive evaluations across diverse tasks demonstrate that LoRA-α consistently improves performance while streamlining hyperparameter search, unleashing the learning potential of LoRA.

  • 13 authors
·
Jun 10 2

Leveraging Data-Driven Models for Accurate Analysis of Grid-Tied Smart Inverters Dynamics

The integration of power electronic converters (PECs) and distributed energy resources (DERs) in modern power systems has introduced dynamism and complexity. Accurate simulation becomes essential to comprehend the influence of converter domination on the power grid. This study addresses the fast-switching and stochastic behaviors exhibited by inverter-based resources in converter-dominated power systems, highlighting the necessity for precise analytical models. In the realm of modeling real-world systems, multiple methodologies exist. Notably, black-box and data-driven system identification techniques are employed to construct PEC models using experimental data, without relying on a priori knowledge of the internal system physics. This approach entails a systematic process of model class selection, parameter estimation, and model validation. While a range of linear and nonlinear model structures and estimation algorithms are at our disposal, it remains imperative to harness creativity and a profound understanding of the physical system to craft data-driven models that align seamlessly with their intended applications. These applications may encompass simulation, prediction, control, or fault detection. This report offers valuable insights into the collection of datasets from commercial off-the-shelf inverters, along with the presentation of intricate simulation models.

  • 9 authors
·
Oct 2, 2023

SE#PCFG: Semantically Enhanced PCFG for Password Analysis and Cracking

Much research has been done on user-generated textual passwords. Surprisingly, semantic information in such passwords remains underinvestigated, with passwords created by English- and/or Chinese-speaking users being more studied with limited semantics. This paper fills this gap by proposing a general framework based on semantically enhanced PCFG (probabilistic context-free grammars) named SE#PCFG. It allowed us to consider 43 types of semantic information, the richest set considered so far, for semantic password analysis. Applying SE#PCFG to 17 large leaked password databases of users speaking four languages (English, Chinese, German and French), we demonstrate its usefulness and report a wide range of new insights about password semantics at different levels such as cross-website password correlations. Furthermore, based on SE#PCFG and a new systematic smoothing method, we proposed the Semantically Enhanced Password Cracking Architecture (SEPCA). We compare the performance of SEPCA against three state-of-the-art (SOTA) benchmarks in terms of the password coverage rate: two other PCFG variants and FLA. Our experimental results showed that SEPCA outperformed all three benchmarks consistently and significantly across 52 test cases, by up to 21.53%, 52.55% and 7.86%, respectively, at the user level (with duplicate passwords). At the level of unique passwords, SEPCA also beats the three benchmarks by up to 33.32%, 86.19% and 10.46%, respectively. The results demonstrated the power of SEPCA as a new password cracking framework.

  • 5 authors
·
Jun 11, 2023

Understanding Chain-of-Thought in Large Language Models via Topological Data Analysis

With the development of large language models (LLMs), particularly with the introduction of the long reasoning chain technique, the reasoning ability of LLMs in complex problem-solving has been significantly enhanced. While acknowledging the power of long reasoning chains, we cannot help but wonder: Why do different reasoning chains perform differently in reasoning? What components of the reasoning chains play a key role? Existing studies mainly focus on evaluating reasoning chains from a functional perspective, with little attention paid to their structural mechanisms. To address this gap, this work is the first to analyze and evaluate the quality of the reasoning chain from a structural perspective. We apply persistent homology from Topological Data Analysis (TDA) to map reasoning steps into semantic space, extract topological features, and analyze structural changes. These changes reveal semantic coherence, logical redundancy, and identify logical breaks and gaps. By calculating homology groups, we assess connectivity and redundancy at various scales, using barcode and persistence diagrams to quantify stability and consistency. Our results show that the topological structural complexity of reasoning chains correlates positively with accuracy. More complex chains identify correct answers sooner, while successful reasoning exhibits simpler topologies, reducing redundancy and cycles, enhancing efficiency and interpretability. This work provides a new perspective on reasoning chain quality assessment and offers guidance for future optimization.

  • 13 authors
·
Dec 22, 2025

DeepSolarEye: Power Loss Prediction and Weakly Supervised Soiling Localization via Fully Convolutional Networks for Solar Panels

The impact of soiling on solar panels is an important and well-studied problem in the renewable energy sector. In this paper, we present the first convolutional neural network (CNN) based approach for solar panel soiling and defect analysis. Our approach takes an RGB image of a solar panel and environmental factors as inputs to predict power loss, soiling localization, and soiling type. In computer vision, localization is a complex task which typically requires manually labeled training data such as bounding boxes or segmentation masks. Our proposed approach consists of four specialized stages that completely avoid localization ground truth and only need panel images with power loss labels for training. The regions of impact obtained from the predicted localization masks are classified into soiling types using webly supervised learning. To improve the localization capabilities of CNNs, we introduce a novel bi-directional input-aware fusion (BiDIAF) block that reinforces the input at different levels of the CNN to learn input-specific feature maps. Our empirical study shows that BiDIAF improves the power loss prediction accuracy by about 3% and localization accuracy by about 4%. Our end-to-end model yields further improvement of about 24% on localization when learned in a weakly supervised manner. Our approach is generalizable and showed promising results on web-crawled solar panel images. Our system has a frame rate of 22 fps (including all steps) on an NVIDIA TitanX GPU. Additionally, we collected a first-of-its-kind dataset for solar panel image analysis consisting of 45,000+ images.

  • 5 authors
·
Oct 10, 2017

Building Power Grid Models from Open Data: A Complete Pipeline from OpenStreetMap to Optimal Power Flow

Access to realistic transmission grid models is essential for power systems research, yet detailed network data in the United States remains restricted under critical-infrastructure regulations. We present a pipeline that constructs complete, OPF-solvable transmission network models entirely from publicly available data. The five-stage pipeline (1) extracts power infrastructure from OpenStreetMap via a local Overpass API instance, (2) reconstructs bus-branch topology through voltage inference, line merging, and transformer detection, (3) estimates electrical parameters using voltage-class lookup tables calibrated with U.S. Energy Information Administration (EIA) plant-level data, (4) allocates hourly demand from EIA-930 to individual buses using US Census population as a spatial proxy, and (5) solves both DC and AC optimal power flow using PowerModels.jl with a progressive relaxation strategy that automatically loosens constraints on imprecise models. We validate the pipeline on all 48 contiguous US states and six multi-state regions, including the full Western (5,076 buses) and Eastern (21,697 buses) Interconnections. Of the 48 single-state models, 42 (88%) converge at the strictest relaxation level for AC-OPF at peak hour and 44 (92%) off-peak. Dispatch costs (median $22/MWh) and system losses (median 1.0%) are consistent with real wholesale-market outcomes. The pipeline relies exclusively on open data sources, enabling reproducible grid analysis without proprietary data. All 54 models (48 single-state and 6 multi-state) are publicly released at https://github.com/microsoft/GridSFM.

  • 6 authors
·
May 4

Experimental and Computational Analysis of the Hydrodynamics of Droplet Generation in a Cylindrical Microfluidic Device

This study investigates the hydrodynamics of droplet formation in a T-shaped cylindrical microfluidic device using micro-PIV experiments and CFD simulations. Devices of 150 μm internal diameter were fabricated from PDMS via a cost-effective embedded templating method. Flow visualization was conducted using immiscible silicone oil and deionized water, forming water-in-oil droplets. A mathematical model coupling the Navier-Stokes and conservative level-set equations was solved using the finite element method. Detailed flow fields (velocity, pressure, and phase distribution) were obtained over a wide range of flow-rate ratios (0.1-10) and capillary numbers (0.001-0.1) to characterize droplet formation mechanisms. Phase evolution revealed distinct breakup stages (lag, filling, necking, and pinch-off) and multiple regimes (squeezing, dripping, sausage flow, and parallel flow with tip streaming). A regime map delineating droplet and non-droplet regions was developed. Droplet size, curvature, and internal flow profiles exhibited strong dependence on Ca and Qr. Scaling analysis showed linear dependence of droplet size on Qr in the squeezing regime, with curvature nearly independent of Qr. In contrast, both size and curvature followed power-law dependence on Ca and Qr in the dripping regime. Velocity fields inside droplets were laminar and parabolic in the core. Fully developed plug-like profiles appeared in squeezing, whereas front and rear regions remained developing in dripping. Correlations for droplet length, curvature, and film thickness, including a novel thin-film model incorporating visco-inertial and capillary effects, enable predictive design within the studied range. These findings advance fundamental understanding of confined droplet dynamics and provide quantitative guidelines for optimizing droplet-based microfluidic systems.

  • 3 authors
·
Mar 3

PFΔ: A Benchmark Dataset for Power Flow under Load, Generation, and Topology Variations

Power flow (PF) calculations are the backbone of real-time grid operations, across workflows such as contingency analysis (where repeated PF evaluations assess grid security under outages) and topology optimization (which involves PF-based searches over combinatorially large action spaces). Running these calculations at operational timescales or across large evaluation spaces remains a major computational bottleneck. Additionally, growing uncertainty in power system operations from the integration of renewables and climate-induced extreme weather also calls for tools that can accurately and efficiently simulate a wide range of scenarios and operating conditions. Machine learning methods offer a potential speedup over traditional solvers, but their performance has not been systematically assessed on benchmarks that capture real-world variability. This paper introduces PFΔ, a benchmark dataset for power flow that captures diverse variations in load, generation, and topology. PFΔ contains 859,800 solved power flow instances spanning six different bus system sizes, capturing three types of contingency scenarios (N, N-1, and N-2), and including close-to-infeasible cases near steady-state voltage stability limits. We evaluate traditional solvers and GNN-based methods, highlighting key areas where existing approaches struggle, and identifying open problems for future research. Our dataset is available at https://huggingface.co/datasets/pfdelta/pfdelta/tree/main and our code with data generation scripts and model implementations is at https://github.com/MOSSLab-MIT/pfdelta.

  • 4 authors
·
Jan 25

Reasoning with Latent Thoughts: On the Power of Looped Transformers

Large language models have shown remarkable reasoning abilities and scaling laws suggest that large parameter count, especially along the depth axis, is the primary driver. In this work, we make a stronger claim -- many reasoning problems require a large depth but not necessarily many parameters. This unlocks a novel application of looped models for reasoning. Firstly, we show that for many synthetic reasoning problems like addition, p-hop induction, and math problems, a k-layer transformer looped L times nearly matches the performance of a kL-layer non-looped model, and is significantly better than a k-layer model. This is further corroborated by theoretical results showing that many such reasoning problems can be solved via iterative algorithms, and thus, can be solved effectively using looped models with nearly optimal depth. Perhaps surprisingly, these benefits also translate to practical settings of language modeling -- on many downstream reasoning tasks, a language model with k-layers looped L times can be competitive to, if not better than, a kL-layer language model. In fact, our empirical analysis reveals an intriguing phenomenon: looped and non-looped models exhibit scaling behavior that depends on their effective depth, akin to the inference-time scaling of chain-of-thought (CoT) reasoning. We further elucidate the connection to CoT reasoning by proving that looped models implicitly generate latent thoughts and can simulate T steps of CoT with T loops. Inspired by these findings, we also present an interesting dichotomy between reasoning and memorization, and design a looping-based regularization that is effective on both fronts.

  • 5 authors
·
Feb 24, 2025
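
The comparison in the looped-transformers abstract, a k-layer block looped L times versus a kL-layer non-looped model, can be made concrete with a toy sketch. The residual tanh layers below are hypothetical stand-ins for transformer layers; the point is only the parameter count and the effective depth, not model quality.

```python
import math
import random

def make_layer(rng, d):
    """A toy layer: one (d x d) weight matrix."""
    return [[rng.gauss(0.0, 0.3) for _ in range(d)] for _ in range(d)]

def apply_layer(W, x):
    """Residual update x + tanh(W x)."""
    h = [math.tanh(sum(w * xi for w, xi in zip(row, x))) for row in W]
    return [xi + hi for xi, hi in zip(x, h)]

def run_stack(layers, x, loops=1):
    """Apply the whole stack `loops` times, reusing the same weights each pass."""
    for _ in range(loops):
        for W in layers:
            x = apply_layer(W, x)
    return x

def param_count(layers):
    return sum(len(W) * len(W[0]) for W in layers)

rng = random.Random(0)
d, k, L = 4, 2, 3                                  # hypothetical sizes
block = [make_layer(rng, d) for _ in range(k)]     # k layers, to be looped L times
deep = [make_layer(rng, d) for _ in range(k * L)]  # kL independent layers

x = [1.0, -0.5, 0.25, 0.0]
looped_out = run_stack(block, x, loops=L)
unrolled_out = run_stack(block * L, x)             # same weights, written out kL deep

print(param_count(block), param_count(deep))       # 32 96
print(looped_out == unrolled_out)                  # True
```

The looped model is exactly a kL-layer model with weights tied across passes, so it has the effective depth of the deep model with 1/L of its parameters. Whether tied weights can match untied ones on a given task is the empirical question the abstract addresses.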

Signatures of the Shock Interaction as an Additional Power Source in the Nebular Spectra of SN 2023ixf

Red supergiants may lose significant mass through steady winds and episodic eruptions in the final 100-1000 years before the core collapses, shaping their circumstellar environment. Interaction between supernova (SN) ejecta and distant circumstellar material (CSM) can generate shocks, which can energize the ejecta and serve as a key power source during the nebular phase of the SN. In the present work, we investigate the nebular spectrum of SN 2023ixf, observed one year post-explosion (at +363 d) with the recently commissioned WEAVE instrument on the 4.2m William Herschel Telescope. This marks the first supernova spectrum captured with WEAVE. In this spectrum, Hα exhibits a peculiar evolution, flanked by blueward and redward broad components centred at ∼±5650 km s⁻¹ from the rest velocity of Hα, which are seen for only a few SNe to date. These features indicate energy deposition from shocks generated by the interaction of ejecta with a CSM expelled nearly 350-640 years pre-explosion. Comparisons of the +363 d spectrum with model spectra from the literature that include varying shock powers suggest a shock power of at least ∼5 × 10⁴⁰ erg s⁻¹ at this epoch. Additionally, analysis of the [O I] doublet, along with other prominent emission lines, provides evidence for clumpiness, dust formation, and asymmetry within the ejecta and/or the surrounding CSM. These emission lines also helped to constrain the oxygen mass (≈0.19 (+0.08/−0.04) M⊙), He-core mass (<3 M⊙) and the zero-age main sequence mass (≲12 M⊙) of the progenitor of SN 2023ixf. The comparison with other Type II SNe highlights SN 2023ixf's unique shock interaction signatures and evidence of dust formation, setting it apart in terms of evolution and dynamics.

  • 5 authors
·
Dec 4, 2024

MedTsLLM: Leveraging LLMs for Multimodal Medical Time Series Analysis

The complexity and heterogeneity of data in many real-world applications pose significant challenges for traditional machine learning and signal processing techniques. For instance, in medicine, effective analysis of diverse physiological signals is crucial for patient monitoring and clinical decision-making and yet highly challenging. We introduce MedTsLLM, a general multimodal large language model (LLM) framework that effectively integrates time series data and rich contextual information in the form of text to analyze physiological signals, performing three tasks with clinical relevance: semantic segmentation, boundary detection, and anomaly detection in time series. These critical tasks enable deeper analysis of physiological signals and can provide actionable insights for clinicians. We utilize a reprogramming layer to align embeddings of time series patches with a pretrained LLM's embedding space and make effective use of raw time series, in conjunction with textual context. Given the multivariate nature of medical datasets, we develop methods to handle multiple covariates. We additionally tailor the text prompt to include patient-specific information. Our model outperforms state-of-the-art baselines, including deep learning models, other LLMs, and clinical methods across multiple medical domains, specifically electrocardiograms and respiratory waveforms. MedTsLLM presents a promising step towards harnessing the power of LLMs for medical time series analysis that can elevate data-driven tools for clinicians and improve patient outcomes.

  • 7 authors
·
Aug 13, 2024

LPViT: Low-Power Semi-structured Pruning for Vision Transformers

Vision transformers have emerged as a promising alternative to convolutional neural networks for various image analysis tasks, offering comparable or superior performance. However, one significant drawback of ViTs is their resource-intensive nature, leading to increased memory footprint, computational complexity, and power consumption. To democratize this high-performance technology and make it more environmentally friendly, it is essential to compress ViT models, reducing their resource requirements while maintaining high performance. In this paper, we introduce a new block-structured pruning to address the resource-intensive issue for ViTs, offering a balanced trade-off between accuracy and hardware acceleration. Unlike unstructured pruning or channel-wise structured pruning, block pruning leverages the block-wise structure of linear layers, resulting in more efficient matrix multiplications. To optimize this pruning scheme, our paper proposes a novel hardware-aware learning objective that simultaneously maximizes speedup and minimizes power consumption during inference, tailored to the block sparsity structure. This objective eliminates the need for empirical look-up tables and focuses solely on reducing parametrized layer connections. Moreover, our paper provides a lightweight algorithm to achieve post-training pruning for ViTs, utilizing second-order Taylor approximation and empirical optimization to solve the proposed hardware-aware objective. Extensive experiments on ImageNet are conducted across various ViT architectures, including DeiT-B and DeiT-S, demonstrating competitive performance with other pruning methods and achieving a remarkable balance between accuracy preservation and power savings. In particular, we achieve up to 3.93x and 1.79x speedups on dedicated hardware and GPUs respectively for DeiT-B, and also observe an inference power reduction by 1.4x on real-world GPUs.
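
To make the notion of block-structured pruning concrete, here is a minimal sketch that zeroes whole square blocks of a weight matrix. It deliberately swaps in a plain squared-magnitude score instead of the second-order Taylor and hardware-aware objective described in the abstract; the function name, block size, and keep ratio are all assumptions for illustration.

```python
def block_prune(weights, block_size, keep_ratio):
    """Zero out the lowest-scoring (block_size x block_size) blocks of a 2-D matrix.

    Blocks are scored by their sum of squared weights. This magnitude score is a
    simplification, not the scoring rule of the paper.
    """
    rows, cols = len(weights), len(weights[0])
    assert rows % block_size == 0 and cols % block_size == 0

    # Score every block by its squared Frobenius norm.
    scores = []
    for bi in range(0, rows, block_size):
        for bj in range(0, cols, block_size):
            score = sum(weights[i][j] ** 2
                        for i in range(bi, bi + block_size)
                        for j in range(bj, bj + block_size))
            scores.append((score, bi, bj))

    # Keep the highest-scoring fraction of blocks, zero the rest.
    scores.sort(reverse=True)
    num_keep = max(1, round(keep_ratio * len(scores)))
    kept = {(bi, bj) for _, bi, bj in scores[:num_keep]}

    pruned = [row[:] for row in weights]
    for _, bi, bj in scores[num_keep:]:
        for i in range(bi, bi + block_size):
            for j in range(bj, bj + block_size):
                pruned[i][j] = 0.0
    return pruned, kept

# Hypothetical 4x4 weight matrix with two strong diagonal blocks.
W = [[1.0, 2.0, 0.1, 0.0],
     [3.0, 4.0, 0.0, 0.1],
     [0.2, 0.1, 5.0, 6.0],
     [0.0, 0.1, 7.0, 8.0]]
pruned_W, kept_blocks = block_prune(W, block_size=2, keep_ratio=0.5)
```

Because entire blocks become zero, a block-sparse matrix multiply can skip them wholesale, which is where the hardware speedup of this sparsity pattern comes from.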

  • 9 authors
·
Jul 2, 2024

Challenges in Deploying Long-Context Transformers: A Theoretical Peak Performance Analysis

Transformer-based long-context generative models power emerging AI applications like hour-long video understanding and project-level coding agents. Deploying long-context transformers (e.g., 100K to 10M tokens) is prohibitively expensive compared to short-context (e.g., 4K tokens) model variants. Reducing the cost of long-context transformers has become a pressing research and engineering challenge starting in 2024. This work describes a concurrent programming framework for quantitatively analyzing the efficiency challenges in serving multiple long-context requests under a limited GPU high-bandwidth memory (HBM) regime. We give a detailed analysis of how all additional computational costs, compared to 4K context, trace back to one single source: the large size of the KV cache. We use a 34B GPT-3.5 level model of 50K context on A100 NVLink as a running example, and describe how its large KV cache causes four types of deployment challenges: (1) prefilling long inputs takes much longer compute time and GPU memory than short inputs; (2) after prefilling, the large KV cache residing on the GPU HBM substantially restricts the number of concurrent users being served; (3) during decoding, repeatedly reading the KV cache from HBM to SM largely increases latency; (4) when KV cache memory overflows, swapping it from HBM to DDR causes significant context switching latency. We use this framework to analyze existing works and identify possibilities of combining them to build end-to-end systems. Overall, this work offers a foundational framework for analyzing long-context transformer deployment and identifies directions towards reducing the inference cost of 1M context to be as cheap as 4K.
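
The link between context length, KV cache size, and concurrency can be sketched with back-of-the-envelope arithmetic. The layer count, KV head count, head dimension, and HBM budget below are hypothetical placeholders, not the configuration analyzed in the paper; only the standard "keys plus values, per layer, per token" accounting is assumed.

```python
def kv_cache_bytes(num_layers, num_kv_heads, head_dim, seq_len, bytes_per_value=2):
    """Size of the KV cache for one request.

    The leading factor 2 accounts for storing both a key and a value vector
    per layer, per KV head, per token; bytes_per_value=2 corresponds to fp16/bf16.
    """
    return 2 * num_layers * num_kv_heads * head_dim * seq_len * bytes_per_value

def max_concurrent_requests(hbm_bytes, weight_bytes, per_request_cache_bytes):
    """How many KV caches fit in the HBM left over after loading the weights."""
    free = hbm_bytes - weight_bytes
    return max(0, free // per_request_cache_bytes)

# Hypothetical model shape (assumption, for illustration only).
cfg = dict(num_layers=60, num_kv_heads=8, head_dim=128)

cache_50k = kv_cache_bytes(seq_len=50_000, **cfg)
cache_4k = kv_cache_bytes(seq_len=4_000, **cfg)

# Hypothetical budget: 2 x 80 GB of HBM, 34B parameters stored in 2 bytes each.
hbm = 2 * 80_000_000_000
weights = 34_000_000_000 * 2

users_50k = max_concurrent_requests(hbm, weights, cache_50k)
users_4k = max_concurrent_requests(hbm, weights, cache_4k)
```

Under these assumed numbers the cache grows linearly with context length, so the number of requests that fit in memory drops by roughly the same factor, which is the effect the abstract lists as challenge (2).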

  • 1 author
·
May 14, 2024

The Fine Line: Navigating Large Language Model Pretraining with Down-streaming Capability Analysis

Uncovering early-stage metrics that reflect final model performance is one core principle for large-scale pretraining. The existing scaling law demonstrates the power-law correlation between pretraining loss and training FLOPs, which serves as an important indicator of the current training state for large language models. However, this principle only focuses on the model's compression properties on the training data, resulting in an inconsistency with the ability improvements on the downstream tasks. Some follow-up works attempted to extend the scaling law to more complex metrics (such as hyperparameters), but still lacked a comprehensive analysis of the dynamic differences among various capabilities during pretraining. To address the aforementioned limitations, this paper undertakes a comprehensive comparison of model capabilities at various pretraining intermediate checkpoints. Through this analysis, we confirm that specific downstream metrics exhibit similar training dynamics across models of different sizes, up to 67 billion parameters. In addition to our core findings, we've reproduced Amber and OpenLLaMA, releasing their intermediate checkpoints. This initiative offers valuable resources to the research community and facilitates the verification and exploration of LLM pretraining by open-source researchers. In addition, we provide empirical summaries, including performance comparisons of different models and capabilities, and tuition of key metrics for different training phases. Based on these findings, we provide a more user-friendly strategy for evaluating the optimization state, offering guidance for establishing a stable pretraining process.

  • 16 authors
·
Apr 1, 2024

A systematic analysis of the radio properties of 22 X-ray selected tidal disruption event candidates with the Australia Telescope Compact Array

We present a systematic analysis of the radio properties of an X-ray selected sample of tidal disruption event (TDE) candidates discovered by the eROSITA telescope. We find radio sources coincident with half of the transient events (11 TDEs), with 8 radio sources showing statistically significant variability over a 6-month period. We model the radio spectra of 6 sources with sufficiently bright radio emission and find the sources show radio spectra consistent with optically thin synchrotron emission and radio outflow minimum radii of $10^{16}$--$10^{17}$ cm, velocities 0.01--0.05 c, and energies $10^{48}$--$10^{51}$ erg. Comparing with the radio properties of an optically-selected TDE sample at similar late times, we find no significant difference in the radio luminosity range or radio detection rate. We find a tentative positive trend between peak radio and X-ray luminosity, but require further observations to determine whether this is real or an observational bias caused by the large range in distances of the events. Interestingly, none of the X-ray selected events show late rising radio emission, compared to 45% of radio-detected sources of an optically-selected sample that showed late rising radio emission. We propose that this may indicate that many TDEs launch radio outflows at or near peak X-ray luminosity, which can be significantly delayed from peak optical luminosity. This study presents the first systematic analysis of the radio properties of an X-ray selected sample of TDEs, and gives insight into the possible link between the physical processes that power X-ray and radio emission in TDEs.

  • 10 authors
·
Apr 11, 2025

Analysis of the JWST spectra of the kilonova AT 2023vfi accompanying GRB 230307A

Kilonovae are key to advancing our understanding of r-process nucleosynthesis. To date, only two kilonovae have been spectroscopically observed, AT 2017gfo and AT 2023vfi. Here, we present an analysis of the James Webb Space Telescope (JWST) spectra obtained +29 and +61 days post-merger for AT 2023vfi (the kilonova associated with GRB 230307A). After re-reducing and photometrically flux-calibrating the data, we empirically model the observed X-ray to mid-infrared continua with a power law and a blackbody, to replicate the non-thermal afterglow and apparent thermal continuum $\gtrsim 2\,\mu$m. We fit Gaussians to the apparent emission features, obtaining line centroids of $20218_{-38}^{+37}$, $21874 \pm 89$ and $44168_{-152}^{+153}$ Å, and velocity widths spanning 0.057-0.110 c. These line centroid constraints facilitated a detailed forbidden line identification search, from which we shortlist a number of r-process species spanning all three r-process peaks. We rule out Ba II and Ra II as candidates and propose Te I-III, Er I-III and W III as the most promising ions for further investigation, as they plausibly produce multiple emission features from one (W III) or multiple (Te I-III, Er I-III) ion stages. We compare to the spectra of AT 2017gfo, which also exhibit prominent emission at $\sim 2.1\,\mu$m, and conclude that [Te III] $\lambda$21050 remains the most plausible cause of the observed $\sim 2.1\,\mu$m emission in both kilonovae. However, the observed line centroids are not consistent between both objects, and they are significantly offset from [Te III] $\lambda$21050. The next strongest [Te III] transition at 29290 Å is not observed, and we quantify its detectability. Further study is required, with particular emphasis on expanding the available atomic data to enable quantitative non-LTE spectral modelling.

  • 2 authors
·
Aug 20, 2024

Biases in Edge Language Models: Detection, Analysis, and Mitigation

The integration of large language models (LLMs) on low-power edge devices such as Raspberry Pi, known as edge language models (ELMs), has introduced opportunities for more personalized, secure, and low-latency language intelligence that is accessible to all. However, the resource constraints inherent in edge devices and the lack of robust ethical safeguards in language models raise significant concerns about fairness, accountability, and transparency in model output generation. This paper conducts a comparative analysis of text-based bias across language model deployments on edge, cloud, and desktop environments, aiming to evaluate how deployment settings influence model fairness. Specifically, we examined an optimized Llama-2 model running on a Raspberry Pi 4; GPT 4o-mini, Gemini-1.5-flash, and Grok-beta models running on cloud servers; and Gemma2 and Mistral models running on a MacOS desktop machine. Our results demonstrate that Llama-2 running on Raspberry Pi 4 is 43.23% and 21.89% more prone to showing bias over time compared to models running on the desktop and cloud-based environments, respectively. We also propose the implementation of a feedback loop, a mechanism that iteratively adjusts model behavior based on previous outputs, where predefined constraint weights are applied layer-by-layer during inference, allowing the model to correct bias patterns, resulting in a 79.28% reduction in model bias.

  • 3 authors
·
Feb 16, 2025

A Markov-Chain-Monte-Carlo-based Hybrid Noise Inference for Continuous Wavelet Power Spectra: with Applications to Solar and Stellar Oscillatory Signals

Detecting oscillations in solar and stellar time series is complicated by non-stationary red noise and evolving background emission. Methods based on detrending and AR(1)-based wavelet analysis can introduce spurious periodicities and do not adequately describe time-dependent backgrounds. We develop a Bayesian approach that combines the continuous wavelet transform with MCMC sampling to infer a time-dependent background spectrum. The background is represented by a power-law plus white-noise component, with parameters allowed to vary smoothly in time, so that significance levels can be evaluated locally without explicit detrending. Tests with synthetic data show that injected oscillations are recovered reliably, while false detections are suppressed in pure-noise cases. Using a frequency-domain signal-to-noise ratio (S/N), we find that oscillations can be identified robustly when the S/N is greater than or equal to 2 under mixed noise conditions. The detectable period range is limited by wavelet resolution, from about 3-4 sampling intervals up to roughly one-quarter of the total duration. Application to GOES soft X-ray flare observations shows that the method isolates quasi-periodic oscillations with improved temporal localization compared to standard wavelet and Fourier-based approaches. This behavior is consistent across a range of noise conditions and signal morphologies.
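
Two quantities from the abstract above lend themselves to a short sketch: the "power-law plus white-noise" background and the rule of thumb for the detectable period range. The parameter names, the exact functional form, and the choice of 4 sampling intervals as the lower bound are assumptions; the abstract does not give the parameterization used in the paper, and the smooth time dependence of the parameters is omitted here.

```python
def background_power(freq, amplitude, index, white_level):
    """Assumed background model: amplitude * freq**(-index) + white_level.

    In the paper the parameters vary smoothly in time; this sketch evaluates
    the spectrum for one fixed set of parameters.
    """
    return amplitude * freq ** (-index) + white_level

def detectable_period_range(sampling_interval, num_samples, min_intervals=4):
    """Approximate period limits quoted in the abstract.

    Lower limit: a few (here 4, an assumption) sampling intervals.
    Upper limit: roughly one quarter of the total duration.
    """
    duration = sampling_interval * num_samples
    return (min_intervals * sampling_interval, duration / 4.0)

# Hypothetical example: 1000 samples at a 2-second cadence.
p_min, p_max = detectable_period_range(sampling_interval=2.0, num_samples=1000)
bg = background_power(freq=2.0, amplitude=8.0, index=2.0, white_level=0.5)
```

A peak in the wavelet power would then be judged against a local background of this form rather than against a globally detrended series.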

  • 3 authors
·
May 21

Open-source implementation of distribution network reconfiguration methods: Analysis and comparison

This paper presents a critical and practical approach to the evolution of distribution network reconfiguration algorithms, tracing their development from foundational heuristic methods introduced in 1975 to contemporary state-of-the-art techniques. The article systematically reviews seven different methodologies, including classical heuristic algorithms (Merlin, Baran, and others), advanced meta-heuristic methodologies (particle swarm optimization (PSO) and genetic algorithms), and purely mathematical approaches (MILP-based), analyzing their theoretical foundations, implementation strategies, computational complexity, and performance metrics based on extensive literature review and our own empirical testing. Each methodology is assessed through standardized test systems, considering multiple objectives such as power loss minimization and voltage profile improvement. The comparative analysis reveals the strengths and limitations of each approach under various network conditions and operational constraints. Furthermore, this work provides significant value to the research community by offering an open-source repository containing documented implementations of all reviewed algorithms. This resource facilitates accessibility for newcomers to the field, promotes reproducible research, and accelerates the development of next-generation distribution network optimization solutions. The repository includes comprehensive documentation, test cases, and performance benchmarks.

  • 3 authors
·
Nov 28, 2025

DNN is not all you need: Parallelizing Non-Neural ML Algorithms on Ultra-Low-Power IoT Processors

Machine Learning (ML) functions are becoming ubiquitous in latency- and privacy-sensitive IoT applications, prompting a shift toward near-sensor processing at the extreme edge and the consequent increasing adoption of Parallel Ultra-Low Power (PULP) IoT processors. These compute- and memory-constrained parallel architectures need to efficiently run a wide range of algorithms, including key Non-Neural ML kernels that compete favorably with Deep Neural Networks (DNNs) in terms of accuracy under severe resource constraints. In this paper, we focus on enabling efficient parallel execution of Non-Neural ML algorithms on two RISCV-based PULP platforms, namely GAP8, a commercial chip, and PULP-OPEN, a research platform running on an FPGA emulator. We optimized the parallel algorithms through a fine-grained analysis and intensive optimization to maximize the speedup, considering two alternative Floating-Point (FP) emulation libraries on GAP8 and the native FPU support on PULP-OPEN. Experimental results show that a target-optimized emulation library can lead to an average 1.61x runtime improvement and 37% energy reduction compared to a standard emulation library, while the native FPU support reaches up to 32.09x and 99%, respectively. In terms of parallel speedup, our design improves the sequential execution by 7.04x on average on the targeted octa-core platforms, leading to energy and latency decreases of up to 87%. Lastly, we present a comparison with the ARM Cortex-M4 microcontroller (MCU), a widely adopted commercial solution for edge deployments, which is 12.87x slower and 98% less energy-efficient than PULP-OPEN.

  • 3 authors
·
Jul 16, 2021

Automatic Construction of a Legal Citation Graph from 100 Million Ukrainian Court Decisions: Large-Scale Extraction, Topological Analysis, and Ontology-Driven Clustering

Half a billion citation edges extracted from 100.7 million Ukrainian court decisions reveal that judicial citation structure encodes legal domain boundaries without supervision and predicts future legislative importance with near-perfect accuracy. We construct the first large-scale citation graph from the complete EDRSR registry (99.5 million full texts, 1.1 TB), extracting 502 million citation links across six types via regex on commodity hardware in approximately 5 hours, with precision of 1.00 on a 200-decision validation sample (95% Wilson CI: [0.982, 1.000]). Three principal findings emerge. (1) The degree distribution follows a power law (alpha = 1.57 +/- 0.008), placing the Ukrainian court network near the EU Court of Justice and below the US Supreme Court, with hub articles cited by millions of decisions. (2) Louvain community detection on the co-citation projection recovers legal domain boundaries (civil, criminal, administrative, commercial) with modularity Q = 0.44-0.55 and temporal stability (NMI = 0.83-0.86 across periods), constituting an automatically constructed legal ontology grounded in judicial practice. (3) Citation features predict top-1000 articles with AUC = 0.9984, substantially outperforming a naive frequency baseline (P@1000 = 0.655); temporal dynamics detect legislative regime changes as phase transitions and the 2022 invasion as a citation entropy spike (H: 11.02 -> 13.49) with emergent wartime legislation nodes. The citation-derived ontology is operationalized as the domain layer of a workflow memory system for LLM-assisted legal analysis, connecting to the ontology-controlled paradigm. The extraction pipeline, analysis code, and aggregated statistics are released as open data.

  • 1 author
·
May 13

Contributions to Robust and Efficient Methods for Analysis of High Dimensional Data

A ubiquitous feature of data of our era is their extra-large sizes and dimensions. Analyzing such high-dimensional data poses significant challenges, since the feature dimension is often much larger than the sample size. This thesis introduces robust and computationally efficient methods to address several common challenges associated with high-dimensional data. In my first manuscript, I propose a coherent approach to variable screening that accommodates nonlinear associations. I develop a novel variable screening method that transcends traditional linear assumptions by leveraging mutual information, with an intended application in neuroimaging data. This approach allows for accurate identification of important variables by capturing nonlinear as well as linear relationships between the outcome and covariates. Building on this foundation, I develop new optimization methods for sparse estimation using nonconvex penalties in my second manuscript. These methods address notable challenges in current statistical computing practices, facilitating computationally efficient and robust analyses of complex datasets. The proposed method can be applied to a general class of optimization problems. In my third manuscript, I contribute to robust modeling of high-dimensional correlated observations by developing a mixed-effects model based on Tsallis power-law entropy maximization and discussing the theoretical properties of this distribution. This model surpasses the constraints of conventional Gaussian models by accommodating a broader class of distributions with enhanced robustness to outliers. Additionally, I develop a proximal nonlinear conjugate gradient algorithm that accelerates convergence while maintaining numerical stability, along with rigorous statistical properties for the proposed framework.

  • 1 author
·
Sep 9, 2025

FEMBA on the Edge: Physiologically-Aware Pre-Training, Quantization, and Deployment of a Bidirectional Mamba EEG Foundation Model on an Ultra-low Power Microcontroller

Objective: To enable continuous, long-term neuro-monitoring on wearable devices by overcoming the computational bottlenecks of Transformer-based Electroencephalography (EEG) foundation models and the quantization challenges inherent to State-Space Models (SSMs). Methods: We present FEMBA, a bidirectional Mamba architecture pre-trained on over 21,000 hours of EEG. We introduce a novel Physiologically-Aware pre-training objective, consisting of a reconstruction with low-pass filtering, to prioritize neural oscillations over high-frequency artifacts. To address the activation outliers common in SSMs, we employ Quantization-Aware Training (QAT) to compress the model to 2-bit weights. The framework is deployed on a parallel ultra-low-power RISC-V microcontroller (GAP9) using a custom double-buffered memory streaming scheme. Results: The proposed low-pass pre-training improves downstream AUROC on TUAB from 0.863 to 0.893 and AUPR from 0.862 to 0.898 compared to the best contrastive baseline. QAT successfully compresses weights with negligible performance loss, whereas standard post-training quantization degrades accuracy by approximately 30%. The embedded implementation achieves deterministic real-time inference (1.70 s per 5 s window) and reduces the memory footprint by 74% (to approximately 2 MB), achieving competitive accuracy with up to 27x fewer FLOPs than Transformer benchmarks. Conclusion: FEMBA demonstrates that Mamba-based foundation models can be effectively quantized and deployed on extreme-edge hardware without sacrificing the representation quality required for robust clinical analysis. Significance: This work establishes the first full-stack framework for deploying large-scale EEG foundation models on ultra-low-power wearables, facilitating continuous, SSM-based monitoring for epilepsy and sleep disorders.

  • 6 authors
·
Mar 17

Diagnosing FP4 inference: a layer-wise and block-wise sensitivity analysis of NVFP4 and MXFP4

Quantization addresses the high resource demand of large language models (LLMs) by alleviating memory pressure and bandwidth congestion and providing significantly scaled compute power with a tolerable impact on accuracy. Four-bit floating point (FP4), the lowest-precision format that preserves essential numerical properties such as exponent and sign, has begun to be adopted in cutting-edge architectures, including Blackwell and AMD CDNA, to support LLM quantization and reduce deployment costs. Although aggressive quantization can yield efficiency gains, the quantization sensitivity of individual layers within the transformer, and whether these sensitivities generalize across existing FP4 formats and model scales, remain underexplored. To elucidate quantization sensitivity, this study conducts a systematic analysis of two FP4 formats, MXFP4 and NVFP4, across three Qwen2.5 model scales (0.5B, 7B, and 14B), using controlled component-wise and block-wise isolation methodologies. We observe that MLP up- and down-projection layers consistently dominate in terms of sensitivity, while gate and attention projections are moderately and substantially less sensitive to FP4 quantization, respectively. We further find that sensitivity does not universally localize to the final blocks; early blocks can also be highly sensitive, particularly under MXFP4. Our results provide a diagnostic characterization of the inference behavior of FP4 across components, depths, and FP4 formats.
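
As background for the formats named above, the following sketch shows what rounding a block of values to a 4-bit E2M1 grid with one shared scale looks like. It is a simplified illustration, not the MXFP4 or NVFP4 specification and not the paper's evaluation code: the power-of-two scale rule and the function name are assumptions, and the real formats additionally fix the block size and the encoding of the scale.

```python
import math

# Magnitudes representable by a 4-bit E2M1 element (1 sign, 2 exponent, 1 mantissa bit).
E2M1_MAGNITUDES = (0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0)

def quantize_block_fp4(block):
    """Round a block to the E2M1 grid using one shared power-of-two scale.

    Simplifying assumption: the scale is the smallest power of two that maps
    the block's largest magnitude to at most 6.0 (the largest E2M1 value).
    Returns the dequantized values and the scale.
    """
    amax = max(abs(v) for v in block)
    if amax == 0.0:
        return [0.0] * len(block), 1.0
    scale = 2.0 ** math.ceil(math.log2(amax / 6.0))
    out = []
    for v in block:
        # Nearest representable magnitude after scaling, sign restored afterwards.
        mag = min(E2M1_MAGNITUDES, key=lambda g: abs(abs(v) / scale - g))
        out.append(math.copysign(mag * scale, v))
    return out, scale

deq_a, scale_a = quantize_block_fp4([0.1, -0.5, 1.0, 6.0])
deq_b, scale_b = quantize_block_fp4([12.0, 3.0, -0.7])
```

With only eight magnitudes per sign, a single large value in a block forces a coarse grid on its neighbours (0.1 collapses to 0 in the first block), which is one intuition for why some layers are more sensitive than others.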

  • 3 authors
·
Mar 4

Physics-Informed Neural Networks: a Plug and Play Integration into Power System Dynamic Simulations

Time-domain simulations are crucial for ensuring power system stability and avoiding critical scenarios that could lead to blackouts. The next-generation power systems require a significant increase in the computational cost and complexity of these simulations due to additional degrees of uncertainty, non-linearity and states. Physics-Informed Neural Networks (PINNs) have been shown to accelerate single-component simulations by several orders of magnitude. However, their application to current time-domain simulation solvers has been particularly challenging since the system's dynamics depend on multiple components. Using a new training formulation, this paper introduces the first natural step to integrate PINNs into multi-component time-domain simulations. We propose PINNs as an alternative to other classical numerical methods for individual components. Once trained, these neural networks approximate component dynamics more accurately for longer time steps. Formulated as an implicit method that is consistent with the transient simulation workflow, PINNs speed up simulation time by significantly increasing the time steps used. For clarity, we demonstrate the training, integration, and simulation framework for several combinations of PINNs and numerical solution methods using the IEEE 9-bus system, although the method applies equally well to any power system size.

  • 3 authors
·
Jun 23, 2025

Towards Efficient Multi-LLM Inference: Characterization and Analysis of LLM Routing and Hierarchical Techniques

Recent progress in Language Models (LMs) has dramatically advanced the field of natural language processing (NLP), excelling at tasks like text generation, summarization, and question answering. However, their inference remains computationally expensive and energy intensive, especially in settings with limited hardware, power, or bandwidth. This makes it difficult to deploy LMs in mobile, edge, or cost-sensitive environments. To address these challenges, recent approaches have introduced multi-LLM intelligent model selection strategies that dynamically allocate computational resources based on query complexity -- using lightweight models for simpler queries and escalating to larger models only when necessary. This survey explores two complementary strategies for efficient LLM inference: (i) routing, which selects the most suitable model based on the query, and (ii) cascading or hierarchical inference (HI), which escalates queries through a sequence of models until a confident response is found. Both approaches aim to reduce computation by using lightweight models for simpler tasks while offloading only when needed. We provide a comparative analysis of these techniques across key performance metrics, discuss benchmarking efforts, and outline open challenges. Finally, we outline future research directions to enable faster response times, adaptive model selection based on task complexity, and scalable deployment across heterogeneous environments, making LLM-based systems more efficient and accessible for real-world applications.
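
The two strategies named in the abstract, routing and cascading, differ in when the model choice is made. The sketch below contrasts them with stand-in "models" that return an answer and a confidence score; the toy models, the word-count complexity estimate, and the threshold value are all hypothetical and are not taken from the survey.

```python
def route(query, tiers, estimate_complexity):
    """Routing: choose a single model up front from the estimated complexity.

    tiers is a list of (max_complexity, model) pairs ordered from cheapest up.
    """
    complexity = estimate_complexity(query)
    for max_complexity, model in tiers:
        if complexity <= max_complexity:
            return model(query)
    return tiers[-1][1](query)  # fall back to the largest model

def cascade(query, models, threshold):
    """Cascading: try models from cheapest to largest until one is confident."""
    if not models:
        raise ValueError("cascade needs at least one model")
    calls = 0
    for model in models:
        answer, confidence = model(query)
        calls += 1
        if confidence >= threshold:
            break
    return answer, calls

# Hypothetical stand-ins for a lightweight and a large model.
def small_model(query):
    confident = len(query.split()) <= 3
    return "small:" + query, (0.9 if confident else 0.4)

def large_model(query):
    return "large:" + query, 0.95

easy = cascade("hi there", [small_model, large_model], threshold=0.8)
hard = cascade("explain the proof of this theorem",
               [small_model, large_model], threshold=0.8)
routed = route("hi there",
               [(3, small_model), (float("inf"), large_model)],
               estimate_complexity=lambda q: len(q.split()))
```

Routing pays for exactly one model call but depends on a good complexity estimate; a cascade needs no such estimate but may pay for several calls on hard queries, as the `calls` counter shows.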

  • 5 authors
·
Jun 5, 2025

AirMorph: Topology-Preserving Deep Learning for Pulmonary Airway Analysis

Accurate anatomical labeling and analysis of the pulmonary structure and its surrounding anatomy from thoracic CT is increasingly important for understanding the etiology of abnormalities or supporting targeted therapy and early interventions. Whilst lung and airway cell atlases have been attempted, there is a lack of fine-grained morphological atlases that are clinically deployable. In this work, we introduce AirMorph, a robust, end-to-end deep learning pipeline enabling fully automatic and comprehensive airway anatomical labeling at lobar, segmental, and subsegmental resolutions that can be used to create digital atlases of the lung. Evaluated across large-scale multi-center datasets comprising diverse pulmonary conditions, AirMorph consistently outperformed existing segmentation and labeling methods in terms of accuracy, topological consistency, and completeness. To simplify clinical interpretation, we further introduce a compact anatomical signature quantifying critical morphological airway features, including stenosis, ectasia, tortuosity, divergence, length, and complexity. When applied to various pulmonary diseases such as pulmonary fibrosis, emphysema, atelectasis, consolidation, and reticular opacities, it demonstrates strong discriminative power, revealing disease-specific morphological patterns with high interpretability and explainability. Additionally, AirMorph supports efficient automated branching pattern analysis, potentially enhancing bronchoscopic navigation planning and procedural safety, offering a valuable clinical tool for improved diagnosis, targeted treatment, and personalized patient care.

  • 11 authors
·
Dec 14, 2024

Exploring HOD-dependent systematics for the DESI 2024 Full-Shape galaxy clustering analysis

We analyse the robustness of the DESI 2024 cosmological inference from fits to the full shape of the galaxy power spectrum to uncertainties in the Halo Occupation Distribution (HOD) model of the galaxy-halo connection and the choice of priors on nuisance parameters. We assess variations in the recovered cosmological parameters across a range of mocks populated with different HOD models and find that shifts are often greater than 20% of the expected statistical uncertainties from the DESI data. We encapsulate the effect of such shifts in terms of a systematic covariance term, $C_{\rm HOD}$, and an additional diagonal contribution quantifying the impact of our choice of nuisance parameter priors on the ability of the effective field theory (EFT) model to correctly recover the cosmological parameters of the simulations. These two covariance contributions are designed to be added to the usual covariance term, $C_{\rm stat}$, describing the statistical uncertainty in the power spectrum measurement, in order to fairly represent these sources of systematic uncertainty. This approach is more general and robust to choices of model free parameters or additional external datasets used in cosmological fits than the alternative approach of adding systematic uncertainties at the level of the recovered marginalised parameter posteriors. We compare the approaches within the context of a fixed $\Lambda$CDM model and demonstrate that our method gives conservative estimates of the systematic uncertainty that nevertheless have little impact on the final posteriors obtained from DESI data.
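
Written out, the combination described in the abstract amounts to summing the covariance contributions. The symbols for the total covariance and for the diagonal prior term are assumptions introduced here for illustration; the abstract names only the HOD and statistical terms.

```latex
% Assumed notation: C_tot is the total covariance used in the fit,
% C_prior is the diagonal term from the nuisance-parameter prior choice.
\[
  C_{\rm tot} \;=\; C_{\rm stat} \;+\; C_{\rm HOD} \;+\; C_{\rm prior},
  \qquad
  C_{\rm prior} = \mathrm{diag}\!\left(\sigma_{{\rm prior},1}^{2}, \ldots, \sigma_{{\rm prior},n}^{2}\right).
\]
```

Adding the systematic terms at the level of the data covariance, rather than inflating the parameter posteriors afterwards, is what lets the same correction carry over when the model or the external datasets change.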

  • 42 authors
·
Nov 18, 2024

A Flexible Parametric Modelling Framework for Survival Analysis

We introduce a general, flexible, parametric survival modelling framework which encompasses key shapes of hazard function (constant, increasing, decreasing, up-then-down, down-then-up), various common survival distributions (log-logistic, Burr type XII, Weibull, Gompertz), and includes defective distributions (i.e., cure models). This generality is achieved using four basic distributional parameters: two scale-type parameters and two shape parameters. Generalising to covariate dependence, the scale-type regression components correspond to accelerated failure time (AFT) and proportional hazards (PH) models. Therefore, this general formulation unifies the most popular survival models, which allows us to consider the practical value of possible modelling choices for survival data. Furthermore, in line with our proposed flexible baseline distribution, we advocate the use of multi-parameter regression in which more than one distributional parameter depends on covariates - rather than the usual convention of having a single covariate-dependent (scale) parameter. While many choices are available, we suggest introducing covariates through just one or the other of the two scale parameters, which covers AFT and PH models, in combination with a 'power' shape parameter, which allows for more complex non-AFT/non-PH effects, while the other shape parameter remains covariate-independent, and handles automatic selection of the baseline distribution. We explore inferential issues in simulations, both with and without a covariate, with particular focus on evidence concerning the need, or otherwise, to include both AFT and PH parameters. We illustrate the efficacy of our modelling framework by investigating differences between treatment groups using data from a lung cancer study and a melanoma study. Censoring is accommodated throughout.

  • 3 authors
·
Jan 10, 2019

Exploring a Physics-Informed Decision Transformer for Distribution System Restoration: Methodology and Performance Analysis

Driven by advancements in sensing and computing, deep reinforcement learning (DRL)-based methods have demonstrated significant potential in effectively tackling distribution system restoration (DSR) challenges under uncertain operational scenarios. However, the data-intensive nature of DRL poses obstacles in achieving satisfactory DSR solutions for large-scale, complex distribution systems. Inspired by the transformative impact of emerging foundation models, including large language models (LLMs), across various domains, this paper explores an innovative approach harnessing LLMs' powerful computing capabilities to address scalability challenges inherent in conventional DRL methods for solving DSR. To our knowledge, this study represents the first exploration of foundation models, including LLMs, in revolutionizing conventional DRL applications in power system operations. Our contributions are twofold: 1) introducing a novel LLM-powered Physics-Informed Decision Transformer (PIDT) framework that leverages LLMs to transform conventional DRL methods for DSR operations, and 2) conducting comparative studies to assess the performance of the proposed LLM-powered PIDT framework at its initial development stage for solving DSR problems. While our primary focus in this paper is on DSR operations, the proposed PIDT framework can be generalized to optimize sequential decision-making across various power system operations.

  • 4 authors
·
Jun 30, 2024

A Novel Bifurcation Method for Observation Perturbation Attacks on Reinforcement Learning Agents: Load Altering Attacks on a Cyber Physical Power System

Components of cyber physical systems, which affect real-world processes, are often exposed to the internet. Replacing conventional control methods with Deep Reinforcement Learning (DRL) in energy systems is an active area of research, as these systems become increasingly complex with the advent of renewable energy sources and the desire to improve their efficiency. Artificial Neural Networks (ANN) are vulnerable to specific perturbations of their inputs or features, called adversarial examples. These perturbations are difficult to detect when properly regularized, but have significant effects on the ANN's output. Because DRL uses ANN to map optimal actions to observations, they are similarly vulnerable to adversarial examples. This work proposes a novel attack technique for continuous control using Group Difference Logits loss with a bifurcation layer. By combining aspects of targeted and untargeted attacks, the attack significantly increases the impact compared to an untargeted attack, with drastically smaller distortions than an optimally targeted attack. We demonstrate the impacts of powerful gradient-based attacks in a realistic smart energy environment, show how the impacts change with different DRL agents and training procedures, and use statistical and time-series analysis to evaluate attacks' stealth. The results show that adversarial attacks can have significant impacts on DRL controllers, and constraining an attack's perturbations makes it difficult to detect. However, certain DRL architectures are far more robust, and robust training methods can further reduce the impact.

  • 3 authors
·
Jul 6, 2024

Planck 2018 results. V. CMB power spectra and likelihoods

This paper describes the 2018 Planck CMB likelihoods, following a hybrid approach similar to the 2015 one, with different approximations at low and high multipoles, and implementing several methodological and analysis refinements. With more realistic simulations, and better correction and modelling of systematics, we can now make full use of the High Frequency Instrument polarization data. The low-multipole 100x143 GHz EE cross-spectrum constrains the reionization optical-depth parameter τ to better than 15% (in combination with the other low- and high-ℓ likelihoods). We also update the 2015 baseline low-ℓ joint TEB likelihood based on the Low Frequency Instrument data, which provides a weaker τ constraint. At high multipoles, a better model of the temperature-to-polarization leakage and corrections for the effective calibrations of the polarization channels (polarization efficiency or PE) allow us to fully use the polarization spectra, improving the constraints on the ΛCDM parameters by 20 to 30% compared to TT-only constraints. Tests on the modelling of the polarization demonstrate good consistency, with some residual modelling uncertainties, the accuracy of the PE modelling being the main limitation. Using our various tests, simulations, and comparison between different high-ℓ implementations, we estimate the consistency of the results to be better than the 0.5σ level. Minor curiosities already present before (differences between ℓ<800 and ℓ>800 parameters or the preference for more smoothing of the C_ℓ peaks) are shown to be driven by the TT power spectrum and are not significantly modified by the inclusion of polarization. Overall, the legacy Planck CMB likelihoods provide a robust tool for constraining the cosmological model and represent a reference for future CMB observations. (Abridged)

  • 168 authors
·
Jul 30, 2019

Convolutional Neural Networks on the HEALPix sphere: a pixel-based algorithm and its application to CMB data analysis

We describe a novel method for the application of Convolutional Neural Networks (CNNs) to fields defined on the sphere, using the HEALPix tessellation scheme. Specifically, we have developed a pixel-based approach to implement convolutional layers on the spherical surface, similarly to what is commonly done for CNNs in Euclidean space. The algorithm is fully integrable with existing libraries for NNs (e.g., PyTorch or TensorFlow). We present two applications: (i) recognition of handwritten digits projected on the sphere; (ii) estimation of cosmological parameters from simulated Cosmic Microwave Background (CMB) maps. We have built a simple NN architecture, consisting of four convolutional+pooling layers, and have used it for all the applications explored herein. For the handwritten digits, our CNN reaches an accuracy of about 95%, comparable with other existing spherical CNNs. For CMB applications, we have tested the CNN on the estimation of a "mock" parameter, defining the angular scale at which the power spectrum of a Gaussian field projected on the sphere peaks. We have estimated this parameter directly from maps, in several cases: temperature and polarization, presence of noise and partial sky coverage. In all cases, the NN performance is comparable with that of standard spectrum-based Bayesian methods. We demonstrate, for the first time, the capability of CNNs to extract information from polarization fields and to distinguish between E and B-modes. Lastly, we have applied our CNN to the estimation of the Thomson scattering optical depth at reionization (τ) from simulated CMB maps. Even without any specific optimization of the NN architecture, we reach an accuracy comparable with standard Bayesian methods. This work represents a first step towards the exploitation of NNs in CMB parameter estimation and demonstrates the feasibility of our approach.

  • 2 authors
·
Jul 14, 2019

Potential and Limitation of High-Frequency Cores and Caches

This paper explores the potential of cryogenic semiconductor computing and superconductor electronics as promising alternatives to traditional semiconductor devices. As semiconductor devices face challenges such as increased leakage currents and reduced performance at higher temperatures, these novel technologies offer high performance and low power computation. Conventional semiconductor electronics operating at cryogenic temperatures (below -150 °C or 123.15 K) can benefit from reduced leakage currents and improved electron mobility. On the other hand, superconductor electronics, operating below 10 K, allow electrons to flow without resistance, offering the potential for ultra-low-power, high-speed computation. This study presents a comprehensive performance modeling and analysis of these technologies and provides insights into their potential benefits and limitations. We implement models of in-order and out-of-order cores operating at high clock frequencies associated with superconductor electronics and cryogenic semiconductor computing in gem5. We evaluate the performance of these components using workloads representative of real-world applications like NPB, SPEC CPU2006, and GAPBS. Our results show the potential speedups achievable by these components and the limitations posed by cache bandwidth. This work provides valuable insights into the performance implications and design trade-offs associated with cryogenic and superconductor technologies, laying the foundation for future research in this field using gem5.

  • 3 authors
·
Aug 6, 2024

When Less Is More: Simplicity Beats Complexity for Physics-Constrained InSAR Phase Unwrapping

Operational phase unwrapping is the primary computational bottleneck in InSAR-based volcanic and seismic monitoring. We challenge the industry trend of adopting high-complexity computer vision architectures, such as attention mechanisms, without validating their suitability for physics-constrained geophysical regression. We present the first large-scale architectural ablation study on a global LiCSAR benchmark (20 frames, 39,724 patches, 651M pixels). Our results reveal a significant "complexity penalty": a vanilla U-Net (7.76M parameters) achieves R^2 = 0.834 and RMSE = 1.01 cm, outperforming 11.37M-parameter attention-based models by 34% in R^2 and 51% in RMSE. Power Spectral Density (PSD) analysis provides the physical justification: while attention excels at capturing sharp semantic edges in natural images, it injects unphysical high-frequency artifacts (>0.3 cycles/pixel) into geophysical fields, violating the fundamental smoothness constraints of elastic surface deformation. With a 2.92 ms inference latency (a 2.5× speedup), the vanilla U-Net is the only candidate to comfortably meet the sub-100 ms requirement for operational early-warning systems. This work bridges the "publication-to-practice" gap by proving that convolutional locality outperforms modern complexity for smooth-field regression, advocating for physics-informed simplicity in ML4RS. Code available at https://github.com/prabhjotschugh/When-Less-is-More-InSAR-Phase-Unwrapping

  • 2 authors
·
Apr 27
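
The PSD argument in the InSAR entry above can be made concrete with a small check of how much spectral power lies above 0.3 cycles/pixel. The sketch below is not the authors' code (that is at the linked repository); it is a minimal pure-Python illustration on a synthetic 1D profile, and the function name, the test signals, and the choice to exclude the DC term are all assumptions.

```python
import cmath
import math


def high_freq_power_fraction(signal, cutoff=0.3):
    """Fraction of non-DC spectral power above `cutoff` (in cycles/pixel)."""
    n = len(signal)
    total = 0.0
    high = 0.0
    # Bin k of the DFT corresponds to k/n cycles per pixel (at most 0.5).
    for k in range(1, n // 2 + 1):
        coeff = sum(x * cmath.exp(-2j * math.pi * k * t / n)
                    for t, x in enumerate(signal))
        power = abs(coeff) ** 2
        total += power
        if k / n > cutoff:
            high += power
    return high / total if total > 0 else 0.0


n = 64
# Smooth "deformation-like" profile at 2/64, about 0.03 cycles/pixel.
smooth = [math.sin(2 * math.pi * 2 * t / n) for t in range(n)]
# Synthetic artifact at 26/64, about 0.41 cycles/pixel, amplitude 0.3.
artifact = [0.3 * math.sin(2 * math.pi * 26 * t / n) for t in range(n)]
contaminated = [s + a for s, a in zip(smooth, artifact)]

frac_smooth = high_freq_power_fraction(smooth)
frac_contaminated = high_freq_power_fraction(contaminated)
print(frac_smooth, frac_contaminated)
```

A smooth field should give a fraction near zero, while a prediction carrying high-frequency artifacts gives a clearly non-zero value; a real analysis would use a 2D FFT over image patches rather than this O(n^2) loop.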

Phonon vibrational and transport properties of SnSe/SnS superlattice at finite temperatures

The structural stability and phonon properties of SnSe/SnS superlattices at finite temperatures have been studied using machine learning force field molecular dynamics and the anharmonic phonon approach. The vertical SnSe/SnS superlattice undergoes a phase transition from the Pnma phase to a novel P4/nmm phase at finite temperatures, which is different from the high-temperature Cmcm phase of the SnSe and SnS systems. The stability of the P4/nmm phase is determined by molecular dynamics trajectories and anharmonic phonon dispersion relations. The imaginary TO modes at the q=M(1/2,1/2,0) point of the P4/nmm phase in the harmonic approximation become rigid at elevated temperatures. An analysis of phonon power spectra as a function of temperature also confirms the dynamic stabilization. The P4/nmm phase has higher symmetry than the Pnma phase, and the phase transition between them is accompanied by competition between the Jahn-Teller effect and phonon anharmonicity. Unlike the anisotropic distribution of Sn-Se/S bonds in the Pnma phase, the P4/nmm phase forms chemical bonds with similar bond lengths both in-plane and interlayer, and their resonance effect can significantly enhance phonon scattering. The calculated phonon density of states and lifetimes are strongly temperature dependent, demonstrating the heavy anharmonicity in the SnSe/SnS system. The P4/nmm phase has an extremely low lattice thermal conductivity, close to the experimental values of SnSe and SnS. Moreover, with the reduction of the band gap and the enhancement of band degeneracy near the Fermi level, the P4/nmm phase exhibits superior electronic transport properties and a significantly enhanced response to infrared and visible light. These properties give it great potential for thermoelectric and photovoltaic applications.

  • 4 authors
·
Feb 11, 2025

Frontier-Eng: Benchmarking Self-Evolving Agents on Real-World Engineering Tasks with Generative Optimization

Current LLM agent benchmarks, which predominantly focus on binary pass/fail tasks such as code generation or search-based question answering, often neglect the value of real-world engineering that is often captured through the iterative optimization of feasible designs. To this end, we introduce Frontier-Eng, a human-verified benchmark for generative optimization -- an iterative propose-execute-evaluate loop in which an agent generates candidate artifacts, receives executable verifier feedback, and revises them under a fixed interaction budget -- spanning 47 tasks across five broad engineering categories. Unlike previous suites, Frontier-Eng tasks are grounded in industrial-grade simulators and verifiers that provide continuous reward signals and enforce hard feasibility constraints under constrained budgets. We evaluate eight frontier language models using representative search frameworks, finding that while Claude 4.6 Opus achieves the most robust performance, the benchmark remains challenging for all models. Our analysis suggests a dual power-law decay in improvement frequency (∼ 1/iteration) and magnitude (∼ 1/improvement count). We further show that although width improves parallelism and diversity, depth remains crucial for hard-won improvements under a fixed budget. Frontier-Eng establishes a new standard for assessing the capacity of AI agents to integrate domain knowledge with executable feedback to solve complex, open-ended engineering problems.

  • 21 authors
·
Apr 13
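
The propose-execute-evaluate loop described in the Frontier-Eng entry above can be sketched in a few lines. This is a toy under stated assumptions, not the benchmark's harness: the verifier, its feasibility bound, and the objective are invented, and a seeded random perturbation stands in for the LLM agent that would normally propose candidates.

```python
import random


def verifier(x):
    """Toy executable verifier (hypothetical): returns (feasible, score)."""
    feasible = 0.0 <= x <= 2.5       # hard feasibility constraint
    score = -(x - 3.0) ** 2          # continuous reward, higher is better
    return feasible, score


def generative_optimization(budget, seed=0):
    """Run a fixed number of propose-execute-evaluate iterations."""
    rng = random.Random(seed)
    best_x = 0.0
    best_score = verifier(best_x)[1]
    improvements = []                # iterations at which the incumbent improved
    for iteration in range(1, budget + 1):
        candidate = best_x + rng.gauss(0.0, 0.5)   # propose (stand-in for an agent)
        feasible, score = verifier(candidate)      # execute and evaluate
        if feasible and score > best_score:        # keep only feasible gains
            best_x, best_score = candidate, score
            improvements.append(iteration)
    return best_x, best_score, improvements


best_x, best_score, improvements = generative_optimization(budget=200)
print(best_x, best_score, improvements)
```

Recording the iterations at which the incumbent improves is what would let one examine how improvement frequency changes over the budget.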

Probing X-ray Timing and Spectral Variability in the Blazar PKS 2155-304 Over a Decade of XMM-Newton Observations

Blazars, a class of active galactic nuclei (AGN) powered by supermassive black holes, are known for their remarkable variability across multiple timescales and wavelengths. With advancements in both ground- and space-based telescopes, our understanding of AGN central engines has significantly improved. However, the mechanisms driving this variability remain elusive, and continue to fascinate both theorists and observers alike. The primary objective of this study is to constrain the X-ray variability properties of the TeV blazar PKS 2155-304. We conduct a comprehensive X-ray spectral and timing analysis, focusing on both long-term and intra-day variability. This analysis uses data from 22 epochs of XMM-Newton EPIC-pn observations, collected over 15 years (2000-2014). To investigate the variability of the source, we applied both timing and spectral analyses. For the timing analysis, we estimated fractional variability, variability amplitude, minimum variability timescales, flux distribution, and power spectral density (PSD). In the spectral analysis, we fitted the X-ray spectra using power-law, log-parabola, and broken power-law (BPL) models to determine the best-fitting parameters. Additionally, we studied the hardness ratio (HR). We observed moderate intra-day variability in most of the light curves. Seven out of the twenty-two observations showed a clear bimodal flux distribution, indicating the presence of two distinct flux states. Our analysis revealed a variable power-law PSD slope. Most HR plots did not show significant variation with flux, except for one observation (OBSID 0124930501), where HR increased with flux (Count/s). The fitted X-ray spectra favored the BPL model for the majority of observations. The findings of this work shed light on the intraday variability of blazars, providing insights into the non-thermal jet processes that drive the observed flux variations.

  • 8 authors
·
Oct 2, 2024
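
The fractional variability mentioned in the PKS 2155-304 entry above is commonly computed from the excess variance of a light curve. The abstract does not spell out its formula, so the sketch below assumes the widely used definition F_var = sqrt((S^2 - <err^2>) / mean^2), where S^2 is the sample variance and <err^2> is the mean squared measurement error. The function name, the sample fluxes, and the convention of returning 0.0 when noise dominates are all illustrative assumptions.

```python
import math


def fractional_variability(flux, err):
    """Excess-variance-based fractional variability of a light curve."""
    n = len(flux)
    mean = sum(flux) / n
    sample_var = sum((f - mean) ** 2 for f in flux) / (n - 1)
    mean_sq_err = sum(e * e for e in err) / n
    excess = sample_var - mean_sq_err
    if excess <= 0:
        # Measurement noise accounts for all observed scatter (assumed convention).
        return 0.0
    return math.sqrt(excess) / mean


# Made-up count rates and uncertainties, for illustration only.
flux = [10.0, 12.0, 9.0, 11.0, 13.0]
err = [0.5, 0.5, 0.5, 0.5, 0.5]

f_var = fractional_variability(flux, err)
print(f_var)
```

Subtracting the mean squared error is what separates intrinsic source variability from scatter caused by measurement noise.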