Alright .. finally .. first of all, this outage wasn't caused by anything external, no DDoS, no hacking.
What happened was:
- I wanted to run a database query on our banner impressions logs.
- That table contains A LOT of rows, one for each banner impression shown over roughly a year
- So I wanted to reduce the working set to just August by copying that month into a separate table: INSERT INTO .. (SELECT .. FROM .. WHERE timestamp > "2024-08-01-01") (see the sketch below)
- The query still took forever to run. I worked on something else in the meantime (Zen 5 memory scaling, SSD review, GPU review), but after about an hour I got impatient and decided to solve it differently, so I used KILL to kill the query (see below)
- As soon as the query was killed, MySQL started executing a rollback to undo the rows it had inserted into the new table (I probably somehow thought I was running a SELECT, not an INSERT, so I wasn't expecting a rollback)
- At this point I realized that I had mistyped the timestamp (2x "-01"), so it was actually copying ~70 GB of small rows into the new table, and was now rolling them back one by one
- We're running a 3-node Galera cluster, so this caused extra load across the whole cluster: network, disk, CPU
- At some point one of the DB nodes crashed.. 2 out of 3 is still a good cluster size
- The crashed node got auto-restarted, but was unable to rejoin the cluster, because the other nodes were still busy doing the rollback
- I also saw log messages related to DDL statements, which acquire an exclusive lock on the whole cluster. New plan: "have you tried turning it off and on again?"
- I took down the whole DB cluster and tried to manually bring up a single node as primary, to add the other nodes afterwards
- When I did that, MySQL insisted that it had to finish rolling back the transactions first, so I let it .. it took like an hour, and I still wasn't sure if this would solve the problem
- At this point I started digging up our DB backups and thought about options to restore in case of total failure
- For the past months I've been working on a migration to Kubernetes and MySQL with Group Replication (no more Galera)
- I had a 3-node Group Replication cluster running in production with a subset of our database, so I started restoring the backup to that cluster
- On the main DB cluster, things were still moving slooooowly ..
- Now that I had a rough ETA, I updated our 502 "Servers Down" message, so that people would stop trying to reach out to me with "hey wizz, are you aware that TPU is down?"
- Once the rollback completed, I still couldn't get the single MySQL Galera node into primary mode, it always came up read-only (the usual bootstrap commands are below)
- I tried everything, no go, so I decided to focus on restoring from backup
- This went mostly smoothly, except for some minor issues because Oracle MySQL 8.x isn't exactly 100% compatible with MariaDB 10.x (see below for an example of the kind of mismatch)
- Fixed them all, site is back up
- Ads are still disabled because the backup for that huge ads log table is still restoring
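
For anyone curious, here's roughly what the copy query should have looked like with a proper date range. Table and column names are just placeholders, not our actual schema:

  -- copy only August into a separate table (names are placeholders)
  INSERT INTO banner_impressions_aug
    SELECT * FROM banner_impressions
    WHERE timestamp >= '2024-08-01' AND timestamp < '2024-09-01';

Spelling out both bounds makes it explicit that only August gets copied.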
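
The KILL part itself is simple enough (the connection ID is made up for the example):

  SHOW FULL PROCESSLIST;   -- find the Id of the long-running statement
  KILL QUERY 12345;        -- aborts just that statement, keeps the connection
  -- KILL 12345;           -- this would drop the whole connection instead

The catch, as I relearned the hard way, is that killing a half-finished INSERT .. SELECT on InnoDB doesn't just stop it, it rolls back every row already written, which can easily take as long as the insert itself did.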
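
For reference, this is the usual way to force a single surviving Galera node back into primary, which is what I was attempting (it refused to work in my case, the node stayed read-only):

  SHOW STATUS LIKE 'wsrep_cluster_status';                 -- 'non-Primary' = node refuses writes
  SET GLOBAL wsrep_provider_options = 'pc.bootstrap=YES';  -- declare this node a new primary component
  SHOW STATUS LIKE 'wsrep_cluster_status';                 -- should now report 'Primary'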
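
And one example of the kind of MariaDB-vs-MySQL mismatch I mean (illustrative only, not necessarily the exact issues I hit): MySQL 8 defaults new tables to the utf8mb4_0900_ai_ci collation, which MariaDB doesn't use, so joins between freshly created tables and restored utf8mb4_general_ci tables can fail with "Illegal mix of collations". Forcing one collation per table clears it up:

  ALTER TABLE example_table CONVERT TO CHARACTER SET utf8mb4 COLLATE utf8mb4_general_ci;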
Our download servers are spread around the globe, but the main infrastructure (that creates the pages in your browser) is in NYC, because that's geographically closest to the average location of our audience. Backups are multi-site, multi-continent.