Last week, I was sitting on my couch, binging on some new episodes on NetFlix.
A text message? This late? Can't be good.
Grudgingly, I pause NetFlix after a few seconds ("How it's made" is super addicting) to check out the messages…
Fuck! An alert telling me that my site is down. EXCUSE ME! I'm over here trying to learn how Nail Clippers are made. Damnit. Off to the computer.
I sit down and try to load up api.myjob.com and… nothing. Chrome just sits there, the condescending little blue circle thing spinning into oblivion.
My site is DOWN, and it's down hard. Requests are timing out.
What tactics would you use? Some good answers might be…
I logged into one of my PHP servers, grabbed one of the PIDs for a PHP-FPM worker, and ran strace -p
.
What?? Isn't strace a tool for hardcore unix neckbeards?
HELL NO!
If you're not using strace on a reocurring basis, you're doing yourself a MAJOR disservice. It's perfect for quick and dirty debugging. Sure, in an IDEAL world, you'd get a notification telling you exactly which service was down, but I'm a realist and I know that doesn't always happen.
When your site is down, wishing you had better alerting isn't the answer. You need to get your API back up ASAP. Everything else comes second.
Strace is a free unix tool that comes with almost every linux distro out there. You give Strace a PID (in my case, the PID of a PHP-FPM worker) and it instantly hooks into the running process and tells you what kernel system calls it's making.
A system call is like a low-level function that your PHP code makes to the operating system. Things like— "read a file" and "connect to this mysql server" use system calls.
It so happens that when your app is blocking and dead-in-the-water, it's because your PHP code is waiting for a system call to respond. Viola!
So, I log into my webserver over SSH and run ps
to grab a random PHP-FPM process id, like this:
$ ps aux | grep php-fpm
myapp 128131 2.9 0.0 324132 57396 ? R 10:54 11:50 php-fpm: pool www
...
See 128131
? That's a process id I can get a sneak peak into using strace.
Next, I attach strace to the process and see what's up…
$ sudo strace -p 128131
gettimeofday({1399247077, 563898}, NULL) = 0
sendto(17, "SELECT * FROM users WHERE id =", 254, MSG_DONTWAIT, NULL, 0) = 254
I see it waiting on the last line for several seconds. Just, like that, I can immediately see that it's sending a SQL query to the database server and is waiting for the response.. looks like my database server is foobared again!
And just like magic, I'm immediately able to diagnose the downtime. Strace gives me the power to address the database server faster, fix the problem, and get back to watching "How It's Made".
He's the kind of guy that can not only program well, but can also talk for hours about hardcore linux internals. I asked him—
"If you could teach everyone just ONE thing about scaling… what would it be?"
His response? Strace. 1000 times strace. Nothing else can give you so much insight and information about your stack in a single command. You can have all of the monitoring IN THE WORLD, and strace will STILL give you new insights into your application.