Ongoing Research
AI Practices Benchmark
AI Engineering Practices Benchmark
Assess your organization's AI usage in software engineering and compare it against your industry.
AI Impact
Impact of AI on Engineering Productivity
Understand how AI tools like GitHub Copilot affect developer productivity and code quality.
Publications
1. Predicting Expert Evaluations in Software Code Reviews
Manual code reviews are an essential but time-consuming part of software development, often leading
reviewers to prioritize technical issues while skipping valuable assessments. This paper presents
an algorithmic model that automates aspects of code review typically avoided due to their
complexity or subjectivity, such as assessing coding time, implementation time, and code
complexity. Instead of replacing manual reviews, our model adds insights that help reviewers focus
on more impactful tasks. Calibrated using expert evaluations, the model predicts key metrics from
code commits with strong correlations to human judgments (r = 0.82 for coding time, r = 0.86 for
implementation time). By automating these assessments, we reduce the burden on human reviewers and
ensure consistent analysis of time-consuming areas, offering a scalable solution alongside manual
reviews. This research shows how automated tools can enhance code reviews by addressing overlooked
tasks, supporting data-driven decisions and improving the review process.
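
The correlations quoted above (r = 0.82 and r = 0.86) are reported as agreement between model predictions and expert judgments. As a minimal sketch of how such a figure can be computed, assuming it is a Pearson correlation (the abstract does not name the statistic), the example below compares hypothetical expert estimates with hypothetical model estimates. The function name `pearson_r` and all numbers are invented for illustration and are not taken from the paper.

```python
import math


def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    # Sum of co-deviations and of squared deviations from each mean.
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return cov / math.sqrt(var_x * var_y)


# Hypothetical data: coding time in hours for six commits,
# as judged by experts and as predicted by a model.
expert_hours = [1.0, 2.5, 4.0, 0.5, 3.0, 6.0]
model_hours = [1.2, 2.0, 4.5, 0.8, 2.5, 5.5]

r = pearson_r(expert_hours, model_hours)
print(f"r = {r:.2f}")
```

A value near 1 means the model ranks and scales commits much as the experts do; it does not by itself show that the absolute estimates match.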
2. Measuring Determinism in Large Language Models for Software Code Review
Large Language Models (LLMs) promise to streamline software code reviews, but their ability to produce
consistent assessments remains an open question. In this study, we tested four leading LLMs --
GPT-4o mini, GPT-4o, Claude 3.5 Sonnet, and LLaMA 3.2 90B Vision -- on 70 Java commits from both
private and public repositories. By setting each model's temperature to zero, clearing context,
and repeating the exact same prompts five times, we measured how consistently each model generated
code-review assessments. Our results reveal that even with temperature minimized, LLM responses
varied to different degrees. These findings highlight the inherently limited consistency
(test-retest reliability) of LLMs, even when the temperature is set to zero, and the need for
caution when using LLM-generated code reviews to make real-world decisions.
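
The repeated-prompt setup described above can be summarized with simple agreement measures. The sketch below is one possible way to do that, not the paper's metric (the abstract does not state which measure was used). It assumes each commit has a list of responses from identical repeated prompts; the function names, commit labels, and ratings are all hypothetical.

```python
from collections import Counter


def modal_agreement(responses):
    """Fraction of repeated responses that match the most common response."""
    if not responses:
        raise ValueError("need at least one response")
    most_common_count = Counter(responses).most_common(1)[0][1]
    return most_common_count / len(responses)


def fully_consistent_share(runs_by_commit):
    """Share of commits whose repeated runs all produced the same response."""
    if not runs_by_commit:
        raise ValueError("need at least one commit")
    consistent = sum(1 for runs in runs_by_commit.values() if len(set(runs)) == 1)
    return consistent / len(runs_by_commit)


# Hypothetical complexity ratings from five identical prompts per commit.
runs_by_commit = {
    "commit_a": ["low", "low", "low", "low", "low"],
    "commit_b": ["low", "medium", "low", "low", "medium"],
    "commit_c": ["high", "high", "high", "medium", "high"],
}

for commit, runs in runs_by_commit.items():
    print(commit, modal_agreement(runs))
print("fully consistent:", fully_consistent_share(runs_by_commit))
```

Exact-match agreement suits categorical ratings; for numeric assessments, a spread measure such as the standard deviation across runs would be a more natural choice.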
3. Position: Machine Learning Conferences Should Establish a "Refutations and Critiques" Track
Science progresses by iteratively advancing and correcting humanity's understanding of the world. In
machine learning (ML) research, rapid advancements have led to an explosion of publications, but
have also led to misleading, incorrect, flawed or perhaps even fraudulent studies being accepted
and sometimes highlighted at ML conferences due to the fallibility of peer review. While such
mistakes are understandable, ML conferences do not offer robust processes to help the field
systematically correct such errors when they occur. This position paper argues that ML conferences
should establish a dedicated "Refutations and Critiques" (R&C) Track. This R&C Track would
provide a high-profile, reputable platform to support vital research that critically challenges
prior research, thereby fostering a dynamic self-correcting research ecosystem. We discuss key
considerations including track design, review principles, and potential pitfalls, and provide an
illustrative example submission concerning a recent ICLR 2025 Oral. We conclude that ML
conferences should create official, reputable mechanisms to help ML research self-correct.
4. Min-p, Max Exaggeration: A Critical Analysis of Min-p Sampling in Language Models
Sampling from language models impacts the quality and diversity of outputs, affecting both research and
real-world applications. Recently, Nguyen et al. (2024), in "Turning Up the Heat: Min-p Sampling for
Creative and Coherent LLM Outputs", introduced a new sampler called min-p, claiming it achieves
superior quality and diversity over established samplers such as basic, top-k, and top-p sampling.
The significance of these claims was underscored by the paper's recognition as the 18th
highest-scoring submission to ICLR 2025 and selection for an Oral presentation. This paper
conducts a comprehensive re-examination of the evidence supporting min-p and reaches different
conclusions from the original paper's four lines of evidence. First, the original paper's human
evaluations omitted data, conducted statistical tests incorrectly, and described qualitative
feedback inaccurately; our reanalysis demonstrates min-p did not outperform baselines in quality,
diversity, or a trade-off between quality and diversity; in response to our findings, the authors
of the original paper conducted a new human evaluation using a different implementation, task, and
rubric that nevertheless provides further evidence min-p does not improve over baselines. Second,
comprehensively sweeping the original paper's NLP benchmarks reveals min-p does not surpass
baselines when controlling for the number of hyperparameters. Third, the original paper's
LLM-as-a-Judge evaluations lack methodological clarity and appear inconsistently reported. Fourth,
community adoption claims (49k GitHub repositories, 1.1M GitHub stars) were found to be
unsubstantiated, leading to their removal; the revised adoption claim remains misleading. We
conclude that evidence presented in the original paper fails to support claims that min-p improves
quality, diversity, or a trade-off between quality and diversity.
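
For readers unfamiliar with the samplers compared above, the sketch below shows how min-p, top-k, and top-p each truncate a next-token distribution before sampling. It follows the commonly cited definitions (min-p keeps tokens whose probability is at least a base fraction of the top token's probability) and is not taken from either paper's code. The function names, the parameter name `p_base`, and the toy distribution are illustrative assumptions; temperature scaling and the final random draw are left out.

```python
def _renormalize(probs, keep):
    """Zero out tokens outside `keep` and rescale the rest to sum to 1."""
    total = sum(probs[i] for i in keep)
    return [probs[i] / total if i in keep else 0.0 for i in range(len(probs))]


def min_p_filter(probs, p_base):
    """Keep tokens with probability >= p_base * max(probs)."""
    threshold = p_base * max(probs)
    keep = {i for i, p in enumerate(probs) if p >= threshold}
    return _renormalize(probs, keep)


def top_k_filter(probs, k):
    """Keep the k most probable tokens."""
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    return _renormalize(probs, set(order[:k]))


def top_p_filter(probs, p):
    """Keep the smallest set of top tokens whose cumulative mass reaches p."""
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    keep, cumulative = set(), 0.0
    for i in order:
        keep.add(i)
        cumulative += probs[i]
        if cumulative >= p:
            break
    return _renormalize(probs, keep)


# Toy next-token distribution over four tokens (hypothetical values).
probs = [0.5, 0.3, 0.15, 0.05]

print(min_p_filter(probs, p_base=0.2))  # cutoff is 0.2 * 0.5 = 0.1
print(top_k_filter(probs, k=2))
print(top_p_filter(probs, p=0.9))
```

Each sampler here exposes one truncation hyperparameter, which is why a comparison that sweeps those values evenly across samplers matters for the benchmark claims discussed above.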