Recovery-Oriented Computing

Someone on Bluesky pointed me to recovery-oriented computing while we were discussing Erlang.

So I decided to read up on it. It’s fascinating!

Based on the paper, Recovery-Oriented Computing (ROC) represents a fundamental shift in how we think about the reliability of computer systems. Here’s the core idea:

Instead of trying to prevent all failures, ROC accepts that hardware faults, software bugs, and human errors are inevitable facts of life, not problems we can completely solve. The focus shifts to recovering quickly when things go wrong, rather than trying to build systems that never fail.

The key principles are:

  1. Recovery Time Matters Most

    • Traditional approaches focus on Mean Time To Failure (MTTF)
    • ROC focuses on Mean Time To Recovery (MTTR)
    • Reducing recovery time by 10x improves availability as much as making failures 10x less frequent
  2. Human Error is Central

    • Operators cause 50-60% of system outages
    • Systems need to help humans recover from mistakes
    • Features like “undo” for system administrators are crucial
  3. Testing Recovery is Essential

    • Recovery mechanisms need testing just like regular features
    • Systems should allow controlled fault injection
    • Operators need practice recovering from failures
  4. Design for Recovery

    • Break systems into isolated parts that can fail independently
    • Build in ways to diagnose problems quickly
    • Use storage systems that preserve old states for recovery
    • Add safety margins to handle unexpected issues
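
The 10x claim in the first principle is easy to check yourself. Here is a minimal sketch, assuming the usual steady-state definition of availability as MTTF / (MTTF + MTTR). The baseline numbers (a failure every 1000 hours, 10 hours to recover) are made up for illustration and do not come from the paper.

```python
def availability(mttf_hours: float, mttr_hours: float) -> float:
    """Fraction of time the system is up: MTTF / (MTTF + MTTR)."""
    return mttf_hours / (mttf_hours + mttr_hours)


# Hypothetical baseline: one failure every 1000 hours, 10 hours to recover.
baseline = availability(mttf_hours=1000, mttr_hours=10)

# Option A: recover 10x faster (MTTR 10 h -> 1 h), same failure rate.
faster_recovery = availability(mttf_hours=1000, mttr_hours=1)

# Option B: fail 10x less often (MTTF 1000 h -> 10000 h), same recovery time.
rarer_failures = availability(mttf_hours=10000, mttr_hours=10)

print(f"baseline:        {baseline:.4%}")
print(f"faster recovery: {faster_recovery:.4%}")
print(f"rarer failures:  {rarer_failures:.4%}")
```

Both options land on the same availability, because dividing MTTR by 10 and multiplying MTTF by 10 give the same ratio. ROC's bet is that, in practice, shrinking recovery time is often the more tractable of the two.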

The main insight is: “If a problem has no solution, it may not be a problem, but a fact, not to be solved, but to be coped with over time.”

Source: Recovery-Oriented Computing