Ongoing Research

Participate in our Software Engineering Productivity Research

Get data-driven insights on the productivity of your software engineering organization

Research Problem

Traditional metrics (lines of code, story points, commit counts, DORA) don't accurately measure engineering productivity.

Our Method

A machine learning model that replicates a panel of experts evaluating every code commit written by your engineers.

Since 2022, we've worked with 600+ organizations and 120K+ engineers.

Criteria for Participation

We work with companies and organizations (not individuals)

Any Geography & Industry

Minimum Company Size: 50+ Software Engineers

Git Only: GitHub, GitLab, Bitbucket, or Azure DevOps

Receive Insights in 3 Steps

Integrate Repository

⏱ ~5 min

Connect your Git repository

Provide Metadata

⏱ ~15-90 min

Share non-confidential organizational data

Receive Results

Get comprehensive productivity insights

Deployment Options

Cloud

Code processed in our secure cloud environment

On-Prem (Private Cloud)

Code never leaves your environment

Benefits of Participation

Quantify AI Impact

Measure the impact of AI on your engineering productivity

Optimize Teams

Optimize outsourcing vendors and team composition

Boost Productivity

Improve the productivity of your software engineering team

Real-Time Insights

Receive transparency into the performance of every team

30+ Supported Languages / Frameworks

Frontend Languages

JavaScript, TypeScript, React, HTML, CSS, Vue, Angular, Sass, Less

Backend Languages

Python, Java, PHP, Laravel, C#, Ruby, Ruby on Rails, Go, Kotlin, Rust, Scala

Mobile

Kotlin, Java, Swift, Objective-C

Systems & Low-Level

C, C++, Rust, Go

Other

SQL, Shell, Solidity, Lua

Other Ongoing Research

AI Practices Benchmark

AI Engineering Practices Benchmark

Assess your organization's AI usage in software engineering and compare it against your industry.

AI Impact

Impact of AI on Engineering Productivity

Understand how AI tools like GitHub Copilot affect developer productivity and code quality.

Contact Us

Want to get in touch with the research team?

Publications

Predicting Expert Evaluations in Software Code Reviews

Manual code reviews are an essential but time-consuming part of software development, often leading reviewers to prioritize technical issues while skipping valuable assessments. This paper presents an algorithmic model that automates aspects of code review typically avoided due to their complexity or subjectivity, such as assessing coding time, implementation time, and code complexity. Instead of replacing manual reviews, our model adds insights that help reviewers focus on more impactful tasks. Calibrated using expert evaluations, the model predicts key metrics from code commits with strong correlations to human judgments (r = 0.82 for coding time, r = 0.86 for implementation time). By automating these assessments, we reduce the burden on human reviewers and ensure consistent analysis of time-consuming areas, offering a scalable solution alongside manual reviews. This research shows how automated tools can enhance code reviews by addressing overlooked tasks, supporting data-driven decisions and improving the review process.
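For readers unfamiliar with the r values quoted in this abstract, the following is a minimal sketch of how a Pearson correlation between expert ratings and model predictions could be computed. The function name and the sample numbers are hypothetical and are not taken from the paper; the paper's own evaluation procedure may differ.

```python
import math

def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sum of co-deviations and sums of squared deviations from the means.
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for constant input")
    return cov / math.sqrt(var_x * var_y)

# Hypothetical data: expert-estimated vs. model-predicted coding time (hours).
expert_hours = [1.0, 2.5, 4.0, 0.5, 3.0]
model_hours = [1.2, 2.0, 4.5, 0.7, 2.6]
r = pearson_r(expert_hours, model_hours)
print(f"r = {r:.2f}")
```

A value near 1 means the predictions rise and fall with the expert ratings; it does not by itself show that the predicted values match the ratings in absolute terms.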

Measuring Determinism in Large Language Models for Software Code Review

Large Language Models (LLMs) promise to streamline software code reviews, but their ability to produce consistent assessments remains an open question. In this study, we tested four leading LLMs -- GPT-4o mini, GPT-4o, Claude 3.5 Sonnet, and LLaMA 3.2 90B Vision -- on 70 Java commits from both private and public repositories. By setting each model's temperature to zero, clearing context, and repeating the exact same prompts five times, we measured how consistently each model generated code-review assessments. Our results reveal that even with temperature minimized, LLM responses varied to different degrees. These findings highlight a consideration about the inherently limited consistency (test-retest reliability) of LLMs -- even when the temperature is set to zero -- and the need for caution when using LLM-generated code reviews to make real-world decisions.
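The repeated-prompt setup described in this abstract can be illustrated with a small sketch that summarizes run-to-run variation. It assumes each run yields one numeric score per commit; the function name, the metrics, and the sample scores are hypothetical, and the consistency measure used in the paper itself is not stated on this page.

```python
from statistics import pstdev

def consistency_report(runs_per_commit):
    """Summarize variation across repeated, identical prompts.

    runs_per_commit maps a commit id to the list of numeric scores
    one model returned for that commit over several runs.
    """
    if not runs_per_commit:
        raise ValueError("no commits to summarize")
    # Population standard deviation of the scores for each commit.
    spreads = {cid: pstdev(scores) for cid, scores in runs_per_commit.items()}
    identical = sum(1 for s in spreads.values() if s == 0)
    return {
        "share_identical": identical / len(spreads),
        "mean_spread": sum(spreads.values()) / len(spreads),
        "max_spread": max(spreads.values()),
    }

# Hypothetical scores from five repeated runs per commit.
scores = {
    "commit_a": [3, 3, 3, 3, 3],  # fully consistent
    "commit_b": [2, 3, 2, 2, 3],  # varies between runs
}
report = consistency_report(scores)
print(report)
```

A perfectly deterministic model would give a share_identical of 1.0 and a spread of zero for every commit.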

Position: Machine Learning Conferences Should Establish a "Refutations and Critiques" Track

Science progresses by iteratively advancing and correcting humanity's understanding of the world. In machine learning (ML) research, rapid advancements have led to an explosion of publications, but have also led to misleading, incorrect, flawed or perhaps even fraudulent studies being accepted and sometimes highlighted at ML conferences due to the fallibility of peer review. While such mistakes are understandable, ML conferences do not offer robust processes to help the field systematically correct when such errors are made. This position paper argues that ML conferences should establish a dedicated "Refutations and Critiques" (R&C) Track. This R&C Track would provide a high-profile, reputable platform to support vital research that critically challenges prior research, thereby fostering a dynamic self-correcting research ecosystem. We discuss key considerations including track design, review principles, potential pitfalls, and provide an illustrative example submission concerning a recent ICLR 2025 Oral. We conclude that ML conferences should create official, reputable mechanisms to help ML research self-correct.

Min-p, Max Exaggeration: A Critical Analysis of Min-p Sampling in Language Models

Sampling from language models impacts the quality and diversity of outputs, affecting both research and real-world applications. Recently, Nguyen et al. 2024's "Turning Up the Heat: Min-p Sampling for Creative and Coherent LLM Outputs" introduced a new sampler called min-p, claiming it achieves superior quality and diversity over established samplers such as basic, top-k, and top-p sampling. The significance of these claims was underscored by the paper's recognition as the 18th highest-scoring submission to ICLR 2025 and selection for an Oral presentation. This paper conducts a comprehensive re-examination of the evidence supporting min-p and reaches different conclusions from the original paper's four lines of evidence. First, the original paper's human evaluations omitted data, conducted statistical tests incorrectly, and described qualitative feedback inaccurately; our reanalysis demonstrates min-p did not outperform baselines in quality, diversity, or a trade-off between quality and diversity; in response to our findings, the authors of the original paper conducted a new human evaluation using a different implementation, task, and rubric that nevertheless provides further evidence min-p does not improve over baselines. Second, comprehensively sweeping the original paper's NLP benchmarks reveals min-p does not surpass baselines when controlling for the number of hyperparameters. Third, the original paper's LLM-as-a-Judge evaluations lack methodological clarity and appear inconsistently reported. Fourth, community adoption claims (49k GitHub repositories, 1.1M GitHub stars) were found to be unsubstantiated, leading to their removal; the revised adoption claim remains misleading. We conclude that evidence presented in the original paper fails to support claims that min-p improves quality, diversity, or a trade-off between quality and diversity.
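As background on the sampler under discussion, here is a minimal sketch of min-p filtering as it is commonly described: tokens whose probability falls below a base fraction of the top token's probability are discarded, and the rest are renormalized before sampling. The function names, the parameter name p_base, and the example distribution are illustrative assumptions; this is not the implementation evaluated in either paper, and it omits temperature scaling.

```python
import random

def min_p_filter(probs, p_base):
    """Zero out tokens below p_base * max(probs), then renormalize."""
    if not probs:
        raise ValueError("probs must be non-empty")
    if not 0.0 < p_base <= 1.0:
        raise ValueError("p_base must be in (0, 1]")
    threshold = p_base * max(probs)
    kept = [p if p >= threshold else 0.0 for p in probs]
    total = sum(kept)  # always > 0: the top token passes the threshold
    return [p / total for p in kept]

def sample_min_p(probs, p_base, rng=random):
    """Draw one token index from the min-p-filtered distribution."""
    filtered = min_p_filter(probs, p_base)
    return rng.choices(range(len(filtered)), weights=filtered, k=1)[0]

# Hypothetical next-token distribution over four tokens.
probs = [0.5, 0.3, 0.15, 0.05]
# Threshold is 0.2 * 0.5 = 0.1, so only the last token is dropped.
filtered = min_p_filter(probs, p_base=0.2)
token = sample_min_p(probs, p_base=0.2, rng=random.Random(0))
```

Because the threshold scales with the top token's probability, the number of surviving tokens shrinks when the model is confident and grows when the distribution is flat.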