nervico-team · technical-leadership · 12 min read

Engineering Metrics That Matter: DORA and Beyond

DORA metrics, the SPACE framework, and what to measure (and what not to) to improve your development team's real productivity. With 2026 benchmarks and common implementation mistakes.

“What gets measured gets managed” is one of those aphorisms that sounds great but in practice does as much harm as good. The obsession with measuring has led many development teams to optimize for the wrong metrics, with results ranging from team demotivation to active degradation of software quality.

The right question is not “what metrics should we measure.” It is “what decisions do we need to make and what data would help us make them better.” Metrics are decision-making tools, not goals in themselves.

This article covers the metrics that actually matter for evaluating and improving engineering team performance: DORA metrics as the foundation, the SPACE framework for a more complete picture, and the mistakes that destroy the usefulness of any metrics programme.

The Four DORA Metrics (and the Fifth)

The DORA (DevOps Research and Assessment) metrics are the industry standard for measuring software delivery performance. Developed by the research team led by Nicole Forsgren, they are the result of years of research across thousands of teams.

1. Deployment Frequency

What it measures: How often your team deploys changes to production.

Why it matters: Deployment frequency is a proxy for the team’s ability to deliver value continuously. Teams that deploy frequently tend to make smaller changes that are easier to debug and lower in risk.

2026 Benchmarks:

  • Elite: Multiple deploys per day
  • High: Between once per day and once per week
  • Medium: Between once per week and once per month
  • Low: Less than once per month

How to measure it: Count successful production deployments per period. Most CI/CD tools (GitHub Actions, GitLab CI, Jenkins) record this automatically.
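
As a rough illustration, the count can be derived from a list of deployment dates exported from your CI/CD tool. This is a minimal sketch, not a prescribed method: the sample dates and the function name deployments_per_week are hypothetical, and grouping by ISO week is an assumption about the reporting period.

```python
from collections import Counter
from datetime import date


def deployments_per_week(deploy_dates):
    """Count successful production deployments per (ISO year, ISO week)."""
    counts = Counter()
    for d in deploy_dates:
        year, week, _ = d.isocalendar()
        counts[(year, week)] += 1
    return dict(counts)


# Hypothetical export: one date per successful production deployment.
deploys = [
    date(2026, 1, 5),
    date(2026, 1, 6),
    date(2026, 1, 6),
    date(2026, 1, 14),
]

weekly = deployments_per_week(deploys)
for (year, week), count in sorted(weekly.items()):
    print(f"{year}-W{week:02d}: {count} deployments")
```

Counting per week rather than per day smooths out quiet days and makes the number easier to compare with the benchmark bands.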

Common trap: Measuring deployments without measuring what they contain. A team that deploys 10 times a day but ships only config changes is not more productive than one that deploys complete features weekly.

2. Lead Time for Changes

What it measures: The time from when a commit is integrated into the trunk until it is in production.

Why it matters: A short lead time means the team can respond quickly to business needs. A long lead time indicates bottlenecks in the delivery process.

2026 Benchmarks:

  • Elite: Less than one hour
  • High: Between one day and one week
  • Medium: Between one week and one month
  • Low: More than one month

How to measure it: Record the timestamp of the merge to trunk and the timestamp of the production deployment. The difference is your lead time.
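
A minimal sketch of that calculation, assuming you can export a pair of timestamps (merge, production deployment) for each change. The sample data and the function name lead_times_hours are hypothetical, and reporting the median rather than the mean is a choice made here so that one unusually slow change does not dominate the figure.

```python
from datetime import datetime
from statistics import median


def lead_times_hours(changes):
    """Return the lead time in hours for each (merged_at, deployed_at) pair."""
    return [
        (deployed_at - merged_at).total_seconds() / 3600
        for merged_at, deployed_at in changes
    ]


# Hypothetical changes: (merge to trunk, production deployment).
changes = [
    (datetime(2026, 3, 2, 9, 0), datetime(2026, 3, 2, 15, 0)),
    (datetime(2026, 3, 3, 10, 0), datetime(2026, 3, 4, 10, 0)),
    (datetime(2026, 3, 5, 8, 0), datetime(2026, 3, 5, 10, 0)),
]

hours = lead_times_hours(changes)
median_hours = median(hours)
print(f"Median lead time: {median_hours:.1f} h")
```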

Important nuance: The aggregate number hides where the time goes: waiting for QA, waiting for deployment windows and, if you start the clock at the first commit rather than at the merge, waiting for code review. Breaking down where time is spent is more useful than the aggregate number.

3. Change Failure Rate

What it measures: The percentage of deployments that cause a failure in production requiring intervention (rollback, hotfix, fix-forward).

Why it matters: It balances deployment frequency. Deploying 50 times a day is useless if 30% of deployments break something. The best organizations combine high frequency with low failure rates.

2026 Benchmarks:

  • Elite: 0-5%
  • High: 5-10%
  • Medium: 10-15%
  • Low: More than 15%

How to measure it: Number of deployments causing incidents / total number of deployments. Requires a clear, consistent definition of “incident.”
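
The ratio can be sketched as follows, assuming each incident record carries the identifier of the deployment that caused it. The identifiers and the function name change_failure_rate are hypothetical; the point is that a deployment counts as failed once, however many incidents it triggers.

```python
def change_failure_rate(deployment_ids, incident_deployment_ids):
    """Percentage of deployments linked to at least one production incident."""
    unique_deployments = set(deployment_ids)
    if not unique_deployments:
        return None  # no deployments in the period, so the rate is undefined
    failed = unique_deployments & set(incident_deployment_ids)
    return 100 * len(failed) / len(unique_deployments)


# Hypothetical period: 20 deployments, 3 incidents traced to 2 of them.
deployments = [f"d{i}" for i in range(1, 21)]
incidents = ["d4", "d4", "d17"]

rate = change_failure_rate(deployments, incidents)
print(f"Change failure rate: {rate:.1f}%")
```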

Current debate: The 2025 DORA report noted that teams adopting AI intensively show higher throughput but also higher instability. Change failure rate may increase temporarily during AI coding tool adoption.

4. Failed Deployment Recovery Time

What it measures: How long the team takes to restore service after a failed deployment.

Why it matters: Failures are inevitable. What differentiates elite teams is not that they do not fail, but that they recover quickly. A low recovery time indicates mature incident response processes.

Note: In earlier reports this metric was called Mean Time to Recovery (MTTR). DORA renamed it to Failed Deployment Recovery Time to distinguish it from broader availability metrics.

2026 Benchmarks:

  • Elite: Less than one hour
  • High: Less than one day
  • Medium: Between one day and one week
  • Low: More than one week

How to measure it: From when the incident is detected until service is restored. Requires clear timestamps for the start and end of each incident.
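
A minimal sketch, assuming each incident is recorded as a (detected, restored) timestamp pair. The sample incidents and the function name mean_recovery_time are hypothetical, and averaging is only one way to summarize the data; a median or a percentile may suit a small sample better.

```python
from datetime import datetime, timedelta


def mean_recovery_time(incidents):
    """Average time between detection and restoration across incidents."""
    if not incidents:
        return None  # nothing failed in the period
    total = sum(
        (restored - detected for detected, restored in incidents),
        timedelta(),
    )
    return total / len(incidents)


# Hypothetical incidents caused by failed deployments.
incidents = [
    (datetime(2026, 2, 3, 14, 0), datetime(2026, 2, 3, 14, 30)),
    (datetime(2026, 2, 10, 9, 0), datetime(2026, 2, 10, 10, 30)),
]

mean_recovery = mean_recovery_time(incidents)
print(f"Mean recovery time: {mean_recovery}")
```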

5. Rework Rate (the New Addition)

What it measures: The percentage of deployments that are unplanned fixes for user-facing bugs.

Why it matters: DORA officially incorporated rework rate as its fifth metric. It captures work that does not generate new value but corrects errors from previous work. A high rework rate indicates systemic quality problems.

How to measure it: Tag deployments that are reactive bugfixes (not planned improvements) and calculate the percentage of the total.
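
One possible way to do this, assuming every deployment carries a single tag describing its main purpose. The tag names and the function name rework_rate are hypothetical; use whatever labels your release process already has.

```python
def rework_rate(deployment_tags, rework_tag="bugfix"):
    """Percentage of deployments tagged as unplanned, reactive fixes."""
    if not deployment_tags:
        return None  # no deployments in the period
    rework = sum(1 for tag in deployment_tags if tag == rework_tag)
    return 100 * rework / len(deployment_tags)


# Hypothetical tags, one per deployment in the period.
tags = [
    "feature", "bugfix", "feature", "improvement",
    "bugfix", "feature", "feature", "feature",
]

rate = rework_rate(tags)
print(f"Rework rate: {rate:.1f}%")
```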

Beyond DORA: The SPACE Framework

DORA metrics measure software delivery system performance. But engineering team productivity is broader than that. The SPACE framework, developed by Nicole Forsgren (the same researcher behind DORA) along with teams from GitHub and Microsoft Research, expands the perspective.

SPACE proposes five dimensions of developer productivity:

Satisfaction and Well-Being

What it captures: How the team feels about their work, tools, and environment.

Why it matters: Satisfaction is a leading indicator of turnover and a direct factor in quality. A dissatisfied team produces worse code, collaborates less, and eventually leaves.

How to measure it:

  • Anonymous quarterly surveys (eNPS, Developer Experience Survey)
  • Regular one-on-one interviews
  • Turnover analysis and exit reasons

Concrete metrics:

  • Employee Net Promoter Score (eNPS) for the engineering team
  • Satisfaction score with tools and processes
  • Voluntary turnover rate
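
For reference, eNPS is usually computed from a 0-10 "would you recommend working here" question as the percentage of promoters (9-10) minus the percentage of detractors (0-6). The sketch below assumes that convention; the survey answers and the function name enps are hypothetical.

```python
def enps(scores):
    """eNPS on a 0-10 scale: % promoters (9-10) minus % detractors (0-6)."""
    if not scores:
        return None  # no responses
    promoters = sum(1 for s in scores if s >= 9)
    detractors = sum(1 for s in scores if s <= 6)
    return 100 * (promoters - detractors) / len(scores)


# Hypothetical anonymous survey responses.
scores = [10, 9, 8, 7, 6, 3, 9, 10, 5, 8]

score = enps(scores)
print(f"eNPS: {score:+.0f}")
```

The result ranges from -100 to +100. With a small team, a single response moves the score a lot, so the trend across quarters says more than any one value.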

Performance

What it captures: The outcomes of work, not the output. Not how much code is produced, but what impact it has.

How to measure it:

  • Impact on product metrics (engagement, conversion, retention)
  • Quality measured by production defects
  • System reliability (uptime, latency)

Activity

What it captures: The volume of observable work: commits, PRs, code reviews, documentation.

Fundamental caution: Activity is the most dangerous dimension to measure in isolation. Measuring commits or PRs as a productivity indicator leads directly to metric inflation: smaller, more frequent commits that do not deliver more value.

Correct use: As context for the other dimensions, never as a primary metric.

Communication and Collaboration

What it captures: How the team works together. Code review quality, pair programming frequency, participation in technical decisions.

How to measure it:

  • Average response time to code reviews
  • Knowledge distribution (how many people can work on each component)
  • Technical documentation quality
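
Knowledge distribution can be approximated from version control history. The sketch below assumes you can extract (author, component) pairs from recent commits. The names, the component mapping, and the threshold of two contributors are all hypothetical, and having committed to a component is only a rough proxy for being able to work on it.

```python
from collections import defaultdict


def contributors_per_component(commits):
    """Count distinct authors who changed each component."""
    authors = defaultdict(set)
    for author, component in commits:
        authors[component].add(author)
    return {component: len(names) for component, names in authors.items()}


# Hypothetical (author, component) pairs from recent commits.
commits = [
    ("ana", "billing"), ("ben", "billing"),
    ("ana", "auth"), ("ana", "auth"),
    ("cy", "search"), ("ben", "search"),
]

counts = contributors_per_component(commits)
# Components that only one person has touched recently.
at_risk = sorted(c for c, n in counts.items() if n < 2)
print(f"Single-contributor components: {at_risk}")
```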

Efficiency and Flow

What it captures: The ability to complete work without unnecessary interruptions. Time in flow state, ratio of planned work vs interruptions.

How to measure it:

  • Percentage of planned vs reactive work
  • Number of context switches per day/week
  • Average time to complete a user story

What NOT to Measure (and Why)

Lines of Code

It is the most intuitive metric and the most destructive. A developer who fixes a bug by removing 500 lines produces more value than one who adds 2,000 lines of mediocre code. Measuring lines of code incentivizes verbosity, not quality.

Number of Commits

Similar to lines of code. Measuring commits incentivizes smaller, more frequent commits, which has no correlation with real productivity. Moreover, with AI tools generating code, the number of commits is even less meaningful.

Hours Worked

Sitting in front of a computer for 12 hours does not indicate productivity. It indicates presence. A developer who works 6 focused hours and produces quality code is more productive than one who works 10 hours with constant interruptions.

Sprint Velocity in Isolation

Story points are an internal team planning tool, not a productivity metric. Comparing velocity between teams is particularly absurd: each team calibrates story points differently.

Individual Activity Metrics

Creating rankings of “who makes the most commits” or “who closes the most tickets” destroys collaboration and favors individual work at the expense of teamwork.

Practical Implementation: Where to Start

Phase 1: Establish the Baseline (2-4 Weeks)

Before trying to improve anything, you need to know where you are.

Actions:

  1. Implement tracking for the four core DORA metrics. Most CI/CD platforms already collect the necessary data.
  2. Conduct a team satisfaction survey (it can be as simple as 5 questions in a form).
  3. Document your current delivery process without trying to change it.

Result: A snapshot of your current performance that will serve as reference.

Phase 2: Identify Bottlenecks (2-4 Weeks)

With the baseline established, analyze where the problems are.

Key questions:

  • Is your lead time long because code review takes too long, because tests are slow, or because you only deploy in specific windows?
  • Is your change failure rate high because there are no tests, because the tests are bad, or because deployments are too large?
  • Is your deployment frequency low due to technical limitations or culture (“we only deploy on Thursdays”)?

Phase 3: Implement Focused Improvements (Quarterly)

Each quarter, choose 1-2 metrics to improve. Do not try to improve everything at once.

Example:

  • Problem: Lead time of 2 weeks. The breakdown shows 8 days are lost waiting for code review.
  • Action: Implement a “code review in less than 4 business hours” policy with reviewer rotation.
  • Expected result: Reduce lead time to 1 week.
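
A breakdown like the one in this example can be sketched in a few lines. Only the 2-week total and the 8 days of review waiting come from the example; the other stage names and durations are hypothetical, and the total assumes 14 calendar days.

```python
# Days spent per stage for a typical change.
stage_days = {
    "waiting for code review": 8,        # from the example above
    "review and rework": 3,              # hypothetical
    "QA": 2,                             # hypothetical
    "waiting for deployment window": 1,  # hypothetical
}

total_days = sum(stage_days.values())
bottleneck = max(stage_days, key=stage_days.get)
share = 100 * stage_days[bottleneck] / total_days

# Print the stages from largest to smallest share of lead time.
for stage, days in sorted(stage_days.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{stage:<30}{days:>3} d {100 * days / total_days:5.1f}%")
print(f"Bottleneck: {bottleneck} ({share:.1f}% of {total_days} days)")
```

Sorting stages by share makes the largest wait obvious at a glance, and that stage is where the quarter's single improvement should go.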

Phase 4: Review and Adjust (Ongoing)

Review metrics monthly. The goal is not to reach “elite” across the board, but to continuously improve from where you are.

Important: DORA metrics are correlated. Improving one usually improves the others. But forcing a single metric in isolation can degrade the rest. If you reduce lead time by eliminating code reviews, your change failure rate will increase.

The Metrics Trap: Goodhart’s Law

“When a measure becomes a target, it ceases to be a good measure.”

This is the biggest risk of any metrics programme. The moment developers know they are evaluated on deployment frequency, they will optimize for deploying more often, not for delivering more value.

How to avoid it:

  • Never use individual metrics for performance evaluation. DORA metrics are team and system metrics, not individual metrics.
  • Measure constellations, not individual metrics. DORA metrics work because they balance each other. Measuring only deployment frequency without change failure rate incentivizes reckless deployments.
  • Use metrics to understand, not to judge. Metrics are conversation starters, not verdicts. “Our lead time increased 30% this month: what happened?” is a useful question. “Your lead time is high: improve it” is a useless order.

How to Present Metrics to Non-Technical Stakeholders

For the CEO

The CEO needs to know whether the investment in engineering is producing results. They do not need to understand what lead time is.

Translate DORA metrics into business language:

  • “Deployment frequency” becomes “ability to respond quickly to market needs”
  • “Lead time” becomes “time from when we decide something to when users have it”
  • “Change failure rate” becomes “product reliability”
  • “Recovery time” becomes “how long users are without service when there is a problem”

Recommended format: A monthly dashboard with 4-5 metrics that includes trends (improving, stable, worsening) and the 2-3 actions the team is taking to improve.
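
The trend label for each metric can be derived mechanically from two consecutive periods. This is a minimal sketch under stated assumptions: the function name trend, the 5% tolerance band, and the sample values are hypothetical, and the previous value is assumed to be greater than zero.

```python
def trend(previous, current, lower_is_better=True, tolerance=0.05):
    """Label a metric as improving, stable, or worsening between two periods.

    Assumes previous > 0. Changes within +/- tolerance count as stable.
    """
    change = (current - previous) / previous
    if abs(change) <= tolerance:
        return "stable"
    went_down = change < 0
    return "improving" if went_down == lower_is_better else "worsening"


# Hypothetical monthly values: (name, last month, this month, lower is better).
metrics = [
    ("Lead time (days)", 10, 7, True),
    ("Change failure rate (%)", 4.0, 4.1, True),
    ("Deployments per week", 20, 12, False),
]

for name, previous, current, lower_is_better in metrics:
    print(f"{name}: {trend(previous, current, lower_is_better)}")
```

The lower_is_better flag matters because a falling lead time is good news while a falling deployment frequency is not.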

For the Board and Investors

Investors want to know whether the technical team is competent and whether the technology investment is well spent.

Metrics to present quarterly:

  • Delivery velocity (features delivered vs planned)
  • System reliability (uptime, major incidents)
  • Team efficiency (cost per feature delivered, trend)
  • Team health (retention, satisfaction)

Do not present: Raw DORA metrics. The board does not know what “lead time for changes of 3 days” means and has no context to evaluate it. Always translate to business impact.

For the Technical Team

The technical team needs actionable metrics that help them improve their daily work.

Recommended format: Real-time dashboard accessible to the entire team. No individual rankings. Historical trends so the team can see their progress. Monthly review in a team meeting where bottlenecks are discussed and concrete actions are decided.

Metrics Specific to Teams Using AI

The 2025 DORA report dedicated its entire analysis to AI adoption in software development. The main findings:

  • Teams using AI tools show higher throughput (more code produced, shorter lead times).
  • But they also show higher instability (higher change failure rates).
  • Perceived quality of AI-generated code varies significantly depending on how it is used.

Additional metrics for teams using AI:

  • AI-assisted change failure rate: Failure rate specific to AI-generated or AI-assisted changes vs purely human changes
  • Review effectiveness on AI code: How many bugs are detected in code review of AI-generated code
  • Rework rate on AI-generated code: How often AI-generated code needs to be rewritten
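
The first of these can be sketched as a failure rate split by origin. The sketch assumes each change is already labelled as AI-assisted or human-written, which is the hard part in practice and is not solved here. The labels, the sample data, and the function name failure_rate_by_origin are hypothetical.

```python
from collections import defaultdict


def failure_rate_by_origin(changes):
    """Percentage of failed changes per origin label."""
    totals = defaultdict(int)
    failures = defaultdict(int)
    for origin, failed in changes:
        totals[origin] += 1
        if failed:
            failures[origin] += 1
    return {origin: 100 * failures[origin] / totals[origin] for origin in totals}


# Hypothetical changes: (origin label, caused a production failure?).
changes = [
    ("ai_assisted", False), ("ai_assisted", True),
    ("ai_assisted", False), ("ai_assisted", False),
    ("human", False), ("human", False), ("human", True),
    ("human", False), ("human", False),
]

rates = failure_rate_by_origin(changes)
for origin, rate in sorted(rates.items()):
    print(f"{origin}: {rate:.1f}%")
```

With samples this small the difference between the two groups is noise. Compare the rates only once each group contains enough changes to be meaningful.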

Common Pitfalls When Starting a Metrics Programme

Starting with Too Many Metrics

The temptation is to measure everything from day one. Resist it. Start with the four core DORA metrics. Add SPACE dimensions one at a time after the DORA baseline is established. A team drowning in metrics dashboards will ignore all of them.

Not Having a Baseline Before Setting Targets

Setting improvement targets before you know where you stand is guessing, not planning. Spend the first month just measuring without trying to improve. The baseline will likely reveal surprises that change your priorities.

Measuring Without Acting

Dashboards are not a strategy. If you collect metrics but never discuss them, never identify bottlenecks, and never implement improvements, the metrics programme is overhead without value. Every measurement should connect to a decision or action.

Changing Multiple Variables at Once

If you implement code review automation, reduce deployment batch sizes, and add more tests in the same quarter, you will not know which change improved (or degraded) your metrics. Change one thing at a time so you can attribute cause and effect.

Giving Up Too Early

Meaningful improvements in DORA metrics typically take 2-3 quarters to materialize. If you expect dramatic changes in the first month, you will be disappointed and may abandon the programme before it delivers value.

Conclusion

Engineering metrics are not an end in themselves. They are a tool for having informed conversations about how to improve the way your team delivers software.

Start with the four DORA metrics as your foundation. Add dimensions from the SPACE framework for a more complete picture. And above all, resist the temptation to use metrics as a tool for individual control. Metrics that destroy team trust destroy more value than any optimization can create.

The goal is not to have the best metrics. It is to have a team that delivers quality software predictably and sustainably. Metrics are the map, not the destination.


Want to implement engineering metrics in your team without falling into the usual traps?

In a free 45-minute audit we can help you:

  • Evaluate your current software delivery performance
  • Identify the main bottlenecks in your process
  • Design a metrics programme adapted to your team and stage
  • Set realistic quarterly improvement targets
