We are building neetoDeploy, an alternative to Heroku for easily deploying and managing applications in the cloud. It is built with a mix of Kubernetes and Rails.
We switched our pull-request review apps from Heroku to neetoDeploy a couple of months ago, and it has been doing well. As the next step, we are building features to target staging and production. We had a basic staging setup in March, and by the end of the month, we had migrated the staging deployments of all of Neeto from Heroku to neetoDeploy. Everything was working fine for a week until I made a grave mistake.
Whenever a PR is opened or a new commit is pushed, neetoDeploy receives a webhook call from GitHub. Review and staging apps are created or updated on neetoDeploy in response to these webhook calls.
On April 5, 2023, a misconfiguration caused neetoDeploy's webhook handler to malfunction for an hour. As a result, some review apps were not deleted even after the corresponding PRs were closed or merged. The solution was to cross-check the review apps against the open PRs and delete the unwanted apps. Using the Rails console, this could be done live on the server.
Here is what that solution looked like:
```ruby
GithubRepository.find_each do |github_repository|
  access_token = github_repository.github_integration.access_token
  github_client = Octokit::Client.new(access_token:)

  open_pr_numbers = github_client
    .pull_requests(github_repository.name, state: :open)
    .pluck(:number)

  github_repository.project.apps.find_each do |app|
    next if open_pr_numbers.include?(app.pr_number)
    Apps::DestroyService.new(app).process!
  end
end
```
But there is a terrible mistake in the above code. See if you can spot it.
I'm going to wait...
A bit more waiting... Enough waiting; here's the mistake:
There is no filter on the apps that were picked to be destroyed. This snippet was written at a time when we only had review apps, so github_repository.project.apps was expected to return only review apps. But we now also had staging apps in the database, and those staging apps weren't filtered out here. After running the snippet and noticing that it was taking longer than expected, I realized the mistake and instantly pressed CTRL + C. Of course, it was taking time since it was deleting all the staging app databases and dynos 🤦.
In the end, out of 33 staging apps, only five remained. And thus began the procedure to restore all of them.
neetoDeploy already had a feature for manual DB exports, but exports weren't being done routinely. Until then, we had only hosted review apps (whose data need not be persisted reliably), and staging had started just a week before.
We had database backups from a week before (when we finally migrated the staging apps off Heroku), and one by one, our small team of 4 brought back all the apps in 2 days.
The next step was to make sure this doesn't happen again and, if it ever does, to have a contingency plan in place. We thought of two types of contingency plans:
- Automatic scheduled backups
- Disk snapshots of the DB
Automatic scheduled backups
The idea is that the database would be exported at a particular time every day. Backups older than a month would be deleted automatically to save space.
We implemented this in a week. Every day at 12 AM UTC, all staging and production databases are exported and uploaded to an S3 bucket.
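The retention rule described above can be sketched in plain Ruby. This is only an illustration, not neetoDeploy's implementation: the `Backup` struct, the `expired_backups` helper, and the choice of 30 days as "a month" are all assumptions made for this example.

```ruby
# Hypothetical backup record; neetoDeploy's real model is not shown in this post.
Backup = Struct.new(:app_name, :created_at)

RETENTION_DAYS = 30 # assumption: "older than a month" approximated as 30 days
SECONDS_PER_DAY = 24 * 60 * 60

# Returns the backups that are older than the retention window.
def expired_backups(backups, now: Time.now.utc)
  cutoff = now - (RETENTION_DAYS * SECONDS_PER_DAY)
  backups.select { |backup| backup.created_at < cutoff }
end

now = Time.utc(2023, 5, 15)
backups = [
  Backup.new("app-staging", Time.utc(2023, 4, 10)), # 35 days old
  Backup.new("app-staging", Time.utc(2023, 5, 14))  # 1 day old
]

expired = expired_backups(backups, now: now)
puts expired.map { |backup| backup.created_at.strftime("%Y-%m-%d") }.inspect
```

Passing `now` explicitly keeps the rule easy to test; a daily job would call it with the current time and delete whatever it returns.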
While this feature was being implemented, I used the Rails console to manually export the databases of all the apps. The exported file URL of each DB was manually copied to a text file. aria2c was then used to download them in parallel to a local folder:
```shell
aria2c -c --input-file export_urls.txt
```
aria2c is a smart downloader. It resumes interrupted downloads, avoids duplicate downloads, and downloads files in parallel.
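The input file for aria2c is plain text with one URL per line, so it could also be generated instead of copied by hand. Here is a minimal sketch; the `aria2_input` helper and the example.com URLs are placeholders invented for illustration.

```ruby
require "tmpdir"

# Builds the contents of an aria2c input file: one URL per line.
# Surrounding whitespace is stripped; blank entries and duplicates are dropped.
def aria2_input(urls)
  urls.map(&:strip).reject(&:empty?).uniq.join("\n") + "\n"
end

# Placeholder URLs, for illustration only.
export_urls = [
  "https://example.com/exports/app-one.dump",
  "https://example.com/exports/app-two.dump ",
  "https://example.com/exports/app-one.dump"
]

path = File.join(Dir.tmpdir, "export_urls.txt")
File.write(path, aria2_input(export_urls))
puts File.read(path)
```

Keeping each URL on its own line matters because aria2c treats multiple TAB-separated URLs on a single line as mirrors of the same file.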
Disk snapshots of the DB
The other contingency method is to do periodic snapshots of the volume holding the DB. We are working on this.
You can refer to this blog post from GitLab to learn about the recovery procedures they followed when they faced a significant data loss in 2017.
The core lesson here is to call destructive methods very carefully. Instead of calling the DestroyService instantly, there could have been an intermediate human check:
```ruby
apps = []
GithubRepository.find_each do |github_repository|
  access_token = github_repository.github_integration.access_token
  github_client = Octokit::Client.new(access_token:)

  open_pr_numbers = github_client
    .pull_requests(github_repository.name, state: :open)
    .pluck(:number)

  # Only consider review apps; staging apps are never candidates.
  github_repository.project.apps.review.find_each do |app|
    next if open_pr_numbers.include?(app.pr_number)
    apps.append(app)
  end
end
```
This would populate the apps variable with the list of apps to delete. The list can be displayed and verified, and then we can destroy the apps individually:
```ruby
apps.each do |app|
  Apps::DestroyService.new(app).process!
end
```
The other takeaway here is to have proper recovery mechanisms in place. Human and system errors are possible; we should be prepared when they happen.
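The same human check can be built into the destructive step itself. The sketch below is one possible shape, written in plain Ruby so it runs outside Rails: the `App` struct, the `destroy_apps` helper, and its `allowed_kind` and `confirm` parameters are all hypothetical, and the block stands in for a call such as Apps::DestroyService.new(app).process!.

```ruby
# Hypothetical stand-in for an app record; only the fields needed here.
App = Struct.new(:name, :kind)

# Refuses to run if any app is not of the allowed kind. Without confirm: true
# it only reports what would be destroyed; with it, each app is yielded to
# the block that performs the actual destruction.
def destroy_apps(apps, allowed_kind: :review, confirm: false)
  unexpected = apps.reject { |app| app.kind == allowed_kind }
  unless unexpected.empty?
    raise ArgumentError, "refusing to destroy: #{unexpected.map(&:name).join(', ')}"
  end

  apps.each do |app|
    if confirm
      yield app
    else
      puts "[dry run] would destroy #{app.name}"
    end
  end

  confirm ? apps.map(&:name) : []
end

destroyed = []
review_apps = [App.new("web-pr-12", :review), App.new("web-pr-15", :review)]

destroy_apps(review_apps) { |app| destroyed << app.name }                # dry run, nothing destroyed
destroy_apps(review_apps, confirm: true) { |app| destroyed << app.name } # actually runs the block
```

Making the dry run the default means a forgotten argument results in a printed list rather than deleted apps, and the kind check catches a staging app that slips into the list.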