
Technical Issues - TPU Main Site & Forum (2024)

Will share more content and information about today's downtime later, once things have recovered enough
 
Damn @W1zzard. Hope it wasn't malicious.
 
Uh oh, stinky!

Will share more content and information about today's downtime later, once things have recovered enough
Seems like you had to do a partial rollback/restore?
 
I have never seen downtime like this one before. It started as a 502 Bad Gateway, then the message was updated to say several servers had gone down with an ETA of at least 11:00 UTC, and then updated again to 13:00 UTC.


502 Bad Gateway


This will take a while, we lost several servers. At least until 11:00 UTC
 
Last time I saw TPU down this long was the time TPU got taken down by the FBI.
 
I blame ARF for posting about runny toothpaste in the science subforum
 
WTH????

When was this?

Quoting from www.fbi.gov/contact-us/international-offices#:~:text=Authorities%20and%20Jurisdiction,invited%20by%20the%20host%20country.

“Does the FBI have jurisdiction outside the US?


Authorities and Jurisdiction

A number of U.S. federal laws give the FBI authority to investigate extraterritorial criminal and terrorist activity. The FBI, however, conducts investigations abroad only when invited by the host country.”

So it was not the FBI, but on behalf of a European country… IMO
 
What was the reason, do you know?

April fools.

I have no memory of that at all.

Apparently people got so mad at w1zzard for doing it that he never did the same prank again.

So if you were there, you witnessed a great slice of TPU history.


::EDIT::

I'm surprised you guys got so deep into it, as if TPU was a front for running cocaine into Colombia and W1zzard was the criminal mastermind behind it or something.
 
Is it accessible via the Wayback Machine or something?
Go on there and find out; I'm sure you will find a calendar date they backed up the info

So it was not the FBI, but on behalf of a European country… IMO
TPU's servers are in the US, last I recall.
 
April fools.
I don’t remember the FBI one but I do remember them posting an April fools about selling TPU. It was wild.
 
Is it accessible via the Wayback Machine or something?

Don't know. I can't even remember which year it was, but it was 100% April Fools (1st of April), so check every 1st of April going back the last 5-6 years

TPU's servers are in the US, last I recall.

I was told they are spread out all over the globe. I've been told some were in Hong Kong too, but that might have changed by now. W1zzard told me this years ago when I was being curious.

I think I was being curious because I either looked up TPU's DNS or IP address and that led me to some server hosting originating in Hong Kong. I can't remember it all that clearly, but it was a real long time ago.
 
I don’t remember the FBI one but I do remember them posting an April fools about selling TPU. It was wild.
There may come a day he might do that...
 
Alright .. finally .. first of all, this outage wasn't caused by any external/DDOS/hacking.

What happened was:
  • I wanted to run a database query on our banner impressions logs.
  • That table contains A LOT of rows, one for each banner impression shown in like a year
  • So I wanted to reduce the working set to just August by copying that month into a separate table: INSERT INTO .. (SELECT .. FROM .. WHERE timestamp > "2024-08-01-01") (see the SQL sketch after this list)
  • The query still took forever to run. I worked on something else (Zen 5 memory scaling, SSD review, GPU review), but after like an hour I got impatient and decided to solve it differently, so I used KILL to kill the query
  • As soon as the query was killed, MySQL started executing a rollback to undo the rows it had inserted into the new table (I probably somehow thought I was running a SELECT, not an INSERT, so no rollback expected)
  • At this point I realized that I had mistyped the timestamp (2x "-01"), so it was actually copying ~70 GB of small rows into a temp table, and was now rolling them back one by one
  • We're running a 3-node Galera cluster, so this caused extra load across the cluster, network, disk, CPU
  • At some point one of the DB nodes crashed.. 2 out of 3 is still a good cluster size
  • The crashed node got auto-restarted, but was unable to rejoin the cluster, because the other nodes were still busy doing the rollback
  • I also saw log messages related to DDL statements, which acquire an exclusive lock on the cluster, so new plan: "have you tried turning it off and on again?"
  • I took down the whole DB cluster and tried to manually bring up a single node as primary, to add the other nodes afterwards
  • When I did that, MySQL insisted that it had to finish rolling back the transactions, so I let it .. took like an hour, and I still wasn't sure if this would solve the problem
  • At this point I started digging up our DB backups and thought about options to restore in case of total failure
  • For the past months I've been working on a migration to Kubernetes and MySQL Cluster with Group Replication (no more Galera)
  • I had a 3-node Group Replication cluster running in production with a subset of our database, so I started restoring the backup to that cluster
  • On the main DB cluster, things were still moving slooooowly ..
    (screenshot: rollback progress)
  • Now that I had some rough ETA I updated our 502 Servers Down message, so that people would stop trying to reach out to me "hey wizz, are you aware that TPU is down?"
  • Once the rollback completed, I still couldn't get the single MySQL Galera node into primary mode; it was always read-only (see the Galera sketch at the end of this post)
  • I tried everything, no go, so I decided to focus on restoring from backup
  • This went mostly smoothly, except for some minor issues because Oracle MySQL 8.x isn't exactly 100% compatible with MariaDB 10.x
  • Fixed them all, site is back up
  • Ads are still disabled because the backup for that huge ads log table is still restoring
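
For anyone curious what the copy step looks like in practice, here is a rough SQL sketch of it plus the kill that triggered the rollback. The table names (ad_impressions, ad_impressions_aug) and the connection id are made up for illustration; only the mistyped timestamp literal comes from the description above.

-- Intended: copy just August into a separate working table (hypothetical names)
CREATE TABLE ad_impressions_aug LIKE ad_impressions;

INSERT INTO ad_impressions_aug
SELECT * FROM ad_impressions
WHERE timestamp >= '2024-08-01' AND timestamp < '2024-09-01';

-- What was actually run: the doubled "-01" makes the literal malformed, so the
-- filter did not behave like a clean August cutoff and far more rows were
-- copied than intended (~70 GB in this case)
INSERT INTO ad_impressions_aug
SELECT * FROM ad_impressions
WHERE timestamp > "2024-08-01-01";

-- Killing the runaway INSERT from another session is what started the rollback:
-- every row already copied has to be undone again
SHOW PROCESSLIST;   -- find the id of the long-running connection
KILL 12345;         -- hypothetical connection id

Copying in bounded batches (for example one day at a time) would keep each transaction small, so an abort only rolls back the current batch instead of the whole transfer.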

I was told they are spread out all over the globe
Our download servers are spread around the globe; the main infrastructure (that creates the pages in your browser) is in NYC, because that's geographically closest to the average of our audience. Backups are multi-site, multi-continent.
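
On the Galera side, for anyone wondering why the lone node refused writes: a node that is not part of a primary component stays effectively read-only no matter what. Below is a minimal sketch of the usual status checks and the bootstrap knob, using standard Galera status variables; the exact recovery steps depend on the cluster state, so treat it as an illustration rather than what was actually run here.

-- Why the node won't take writes: it has no quorum / primary component
SHOW GLOBAL STATUS LIKE 'wsrep_cluster_status';      -- 'non-Primary' = no quorum
SHOW GLOBAL STATUS LIKE 'wsrep_ready';               -- 'OFF' = node rejects queries
SHOW GLOBAL STATUS LIKE 'wsrep_local_state_comment'; -- e.g. 'Initialized' instead of 'Synced'

-- Tell this one node to form a new primary component on its own
-- (only safe on the node with the most recent data)
SET GLOBAL wsrep_provider_options = 'pc.bootstrap=YES';

-- The remaining nodes can then be restarted normally and rejoin via IST/SST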
 
Thank you for the transparency @W1zzard
 
I have no idea what that means, but well done, you fixed it!
 
So the biggest threat to TPU is not TomsHardware but W1zzard himself. :toast:

 