A New Tool to Prevent Catastrophic Deletions Like GitLab's Database Incident

As a full-stack developer who has spent years in the trenches of production systems, I know firsthand the stomach-drop feeling of realizing you've accidentally deleted something you shouldn't have. That sinking sense of dread, followed by frantic scrambling to restore from backups.

Sadly, accidental deletions are all too common. In 2017, GitLab suffered a major outage when an engineer, intending to clean up a secondary database server, accidentally ran rm -rf against the primary production database. Around 300 GB of data was wiped, the service was down for roughly 18 hours, and because the most recent working backup was about six hours old, several hours of issues, merge requests, and comments were lost for good.

GitLab is certainly not alone. A 2019 survey by security firm Tripwire found that 59% of organizations had experienced downtime caused by human error, with data loss among the top consequences. And the cost adds up fast: a widely cited Gartner estimate puts average downtime at $5,600 per minute, or roughly $300,000 per hour.

As our systems become ever more complex and mission-critical, it's crucial that we have robust safeguards in place to protect against devastating human mistakes. That's why I decided to create rm-protection, an open source tool to help catch accidental deletions of key files and directories before they happen.

How rm-protection Works

The core idea behind rm-protection is simple: before deleting a protected file or directory, the user must correctly answer a safety question. This way, deletions can only proceed if the user demonstrates clear intent.

Under the hood, rm-protection is implemented as a Python script that wraps the standard rm command. When invoked on a target, it checks for a companion protection file in the same directory, named after the target with a .rm-protection suffix (so db is guarded by .db.rm-protection). If one is found, it prompts the user with the question specified in the protection file. If the user's answer matches the expected answer, the deletion command is passed through to the real rm. If not, the deletion is blocked and an error is displayed.

Here's the basic flow:

graph TD
A([rm-protection invoked on path])
A --> B{.rm-protection file exists?}
B --> |No| C[Pass args to real rm command]
C --> D[Done]
B --> |Yes| E[Prompt user with question from .rm-protection]
E --> F{User enters correct answer?}
F --> |Yes| C
F --> |No| G[Print error and exit]
G --> D
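That flow can be sketched in a few dozen lines of Python. To be clear, this is an illustrative reimplementation, not the actual rm-protection source; the real tool also handles things like protected files inside directories being removed recursively:

```python
"""Illustrative sketch of the rm-protection flow (not the real tool's source)."""
import os
import subprocess
import sys

def protection_file_for(path):
    # A target like /data/prod/db is guarded by /data/prod/.db.rm-protection
    parent, name = os.path.split(path.rstrip("/"))
    return os.path.join(parent, "." + name + ".rm-protection")

def allowed_to_delete(path):
    """Prompt for the safety question if the path is protected."""
    prot = protection_file_for(path)
    if not os.path.exists(prot):
        return True  # unprotected: pass straight through to rm
    with open(prot) as f:
        question = f.readline().removeprefix("question:").strip()
        answer = f.readline().removeprefix("answer:").strip()
    print('rm-protection: "%s" is protected' % path)
    return input(question + " ").strip() == answer

def main(argv):
    flags = [a for a in argv if a.startswith("-")]
    paths = [a for a in argv if not a.startswith("-")]
    for p in paths:
        if not allowed_to_delete(p):
            print("rm-protection: Wrong answer! Aborting deletion.")
            return 1
    # Hand everything to the real rm so flags like -r and -f work normally
    return subprocess.call(["rm"] + flags + paths)

# Entry point would be: sys.exit(main(sys.argv[1:]))
```

Answering wrong for any one protected path aborts the whole invocation, which errs on the safe side.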

To protect a path, you create a .rm-protection file next to it. For example, to protect the database directory /data/prod/db, create a file /data/prod/.db.rm-protection with content like:

question: What environment is this database for?
answer: prod

When rm-protection is invoked on /data/prod/db or any of its subdirectories, it will prompt the user with the specified question:

$ rm-protection -rf /data/prod/db
rm-protection: "/data/prod/db" is protected
What environment is this database for? test
rm-protection: Wrong answer! Aborting deletion.

If the user enters the correct answer, the underlying rm command is invoked to perform the actual deletion:

$ rm-protection -rf /data/prod/db
rm-protection: "/data/prod/db" is protected
What environment is this database for? prod
rm-protection: Correct answer. Proceeding with deletion...

The actual deletion is performed using the real rm command with all original arguments, so options like -r for recursive deletion are fully supported. And the protection files are just plain text, so they can be easily created and managed in version control alongside your application code or configuration-as-code.

Comparison to Other Approaches

There are some other existing tools and techniques for protecting against accidental deletions on Unix-like systems. How does rm-protection compare?

One classic approach is the -i flag for rm, which enables interactive mode and prompts the user to confirm each deletion. However, this quickly becomes tedious, and most users develop a habit of typing "y" reflexively or reaching for rm -f, which suppresses the prompts entirely.

Another tool is safe-rm, a wrapper around rm that checks its arguments against a configurable list of protected paths and refuses to delete any that match, printing a warning instead. It works well, but the protected paths live in a central configuration file rather than next to the data they guard, and there is no way to confirm intent on the spot; the only way to delete a protected path is to edit that configuration first.

Linux also supports immutable/read-only filesystems and extended file attributes that can be used to prevent modifications. For example, you can set the immutable flag on key files and directories using the chattr command:

$ sudo chattr +i /data/prod/db

With this flag set, the file cannot be deleted or modified, even by root, until the flag is removed with chattr -i. This provides very strong protection, but it also requires remembering to unset the flag before performing legitimate modifications. It's a rather blunt instrument.

In my experience, the prompting approach of rm-protection strikes a good balance between safety and usability for protecting key paths. It allows deletions to proceed normally for unprotected paths, while forcing the user to slow down and confirm their intent for protected paths. And by making the user answer a specific question, it ensures they have situational awareness of what they're deleting.

Integration with DevOps Workflows

One of the key benefits of rm-protection is how easily it integrates with modern DevOps workflows and infrastructure-as-code practices. The protection files are just plain text, so they can be managed in version control right alongside your application code, configuration files, and deployment scripts.

For example, you might have a repository structure like:

myapp/
  src/
    app.js
    config.yaml
  deploy/
    prod/
      Dockerfile
      docker-compose.yaml
      .db.rm-protection

With the .db.rm-protection file checked into version control, it will be included in your deployment artifacts and copied to production servers. This way, the protection travels with the code and is always in place in each environment.

You can even automatically create or update protection files as part of your continuous deployment pipeline. For example, you might have a script that generates .rm-protection files based on your environment-specific configuration files:

#!/bin/bash

# Generate DB protection file
cat > deploy/prod/.db.rm-protection <<EOL
question: What environment is this database for?
answer: $(grep 'env:' deploy/prod/config.yaml | cut -d' ' -f2)
EOL

# Rest of your deployment script...

This way, the protection files always stay in sync with your actual deployed configuration.
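One caveat with generating protection files: if the grep in the script above matches nothing, you silently ship a protection file with an empty answer. A small validation step in CI can catch that before deploy. Here is a minimal sketch, assuming the question:/answer: format shown earlier (the function name is my own invention):

```python
"""Validate generated .rm-protection files in CI (illustrative sketch)."""

def validate_protection_file(path):
    """Return a list of problems; an empty list means the file looks sane."""
    try:
        with open(path) as f:
            lines = [line.strip() for line in f if line.strip()]
    except OSError as exc:
        return ["cannot read %s: %s" % (path, exc)]
    problems = []
    if len(lines) < 2:
        problems.append("expected a question line and an answer line")
        return problems
    if not lines[0].startswith("question:"):
        problems.append("first line should start with 'question:'")
    if not lines[1].startswith("answer:") or lines[1] == "answer:":
        problems.append("missing or empty answer (did the generator's grep fail?)")
    return problems
```

Running this over every generated protection file and failing the build on any non-empty result keeps a broken generator from quietly disabling your safety net.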

Best Practices for Using rm-protection

While rm-protection is a valuable tool, it's important to use it thoughtfully as part of a defense-in-depth strategy. Some best practices to keep in mind:

  • Protect judiciously. Focus on your most critical paths like databases, configuration files, and source code. Over-protecting can lead to prompt fatigue.
  • Keep it simple. Prefer short, specific questions and answers that are easy to remember and unambiguous. Avoid vague questions like "Are you sure?".
  • Use version control. Check your .rm-protection files into version control so they're always in sync with your code and configuration.
  • Promote awareness. Make sure your team is aware of which paths are protected and how rm-protection works. Include it in your onboarding and documentation.
  • Least privilege. Use access controls and principle of least privilege to limit access to production systems. The fewer people with access, the lower the risk of accidents.
  • Secure your protection files. Ensure that your .rm-protection files are owned and only writable by trusted users. A malicious user shouldn't be able to modify them to subvert protection.
  • Defense in depth. Use rm-protection in combination with other techniques like regular backups, immutable infrastructure, and auditing. No single tool can prevent all accidents.
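On the "secure your protection files" point, a periodic audit can catch permission drift. As a sketch (the function name and policy of flagging group/other write bits are my own choices):

```python
"""Audit .rm-protection files for unsafe permissions (illustrative sketch)."""
import os
import stat

def find_tamperable_protection_files(root):
    """Yield protection files that group or other users could modify."""
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            if name.endswith(".rm-protection"):
                full = os.path.join(dirpath, name)
                mode = os.stat(full).st_mode
                if mode & (stat.S_IWGRP | stat.S_IWOTH):
                    yield full
```

Wired into a cron job or monitoring check, any hit becomes an alert that someone could rewrite a safety question out from under you.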

Future Enhancements

There are a number of potential enhancements that could make rm-protection even more powerful:

  • Centralized policy management. Allow .rm-protection files to be stored in a central repository and automatically distributed to multiple servers. This would make it easier to consistently apply policies across large fleets.
  • Logging and auditing. Log all rm-protection invocations to a central system for auditing and alerting. This would help quickly identify any suspicious deletion attempts.
  • Integration with orchestration tools. Allow rm-protection policies to be defined and enforced using popular orchestration tools like Ansible or Puppet. This would make it easier to manage protection files as part of your existing configuration management workflows.
  • Scriptable policies. Allow protection files to specify scripts or commands to be run for additional validation. For example, requiring a second person to approve, or checking an external system for permission.
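None of these features exist today, but to make the last idea concrete: a scriptable policy could boil down to running an external check command and only proceeding on a zero exit status. A purely hypothetical sketch:

```python
"""Hypothetical scriptable policy hook (not an existing rm-protection feature)."""
import subprocess

def policy_allows_deletion(check_command):
    """Run an external validation command; permit deletion only if it exits 0.

    The command could ping an approval service, verify an open change
    ticket, or require a second person's sign-off.
    """
    result = subprocess.run(check_command, shell=True)
    return result.returncode == 0
```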

Of course, each additional feature also introduces complexity and new potential failure modes. As with any tool, it's important to balance power with maintainability and understandability.

Conclusion

Accidental deletions are a major risk for any organization. With ever more complex systems and a high velocity of change, it's critical that we put safeguards in place to prevent devastating human mistakes. That's where rm-protection comes in.

By prompting users to answer a specific question before deleting protected paths, rm-protection forces them to slow down and confirm intent. It's a simple but powerful technique to catch risky deletions before they happen.

Of course, no tool is a silver bullet. True production safety requires a holistic, defense-in-depth approach spanning training, process, and tooling. But I believe rm-protection can be a valuable addition to any DevOps toolbox.

Mistakes will inevitably happen. We're only human, after all. What matters is how we learn, adapt, and improve. By building safety nets to catch us when we stumble, we can keep moving fast while minimizing costly accidents.

So give rm-protection a try, and see if it helps bring a little more peace of mind to your production systems. Here's to fewer pager alerts and more restful nights!
