When Technology Stumbles: A Closer Look at Last Week's Unplanned Outage

Postmortem

In software engineering, system failures are part of building robust systems and a core source of learning. This blog post outlines a system failure I recently experienced and the steps taken to fix it.

Issue Summary

On Tuesday of last week, from 2:22 PM to 3:12 PM WAT, the front end of all our web application interfaces was down and unreachable to clients around the globe. The incident exposed a critical gap in our deployment flow. The root cause was an incorrect WordPress settings file.

Timeline (all in WAT)

  • 2:22 PM - Code deployed to production

  • 2:25 PM - Web servers begin returning 500 errors

  • 2:26 PM - PagerDuty alerts the on-call team member

  • 2:30 PM - Rollback begins in order to isolate the error

  • 2:40 PM - Rollback fails

  • 3:00 PM - Configuration successfully updated

  • 3:12 PM - Web servers back online

Root Cause

At 2:22 PM, the WordPress settings file was changed in order to make it compatible with different locales, but the change introduced an error that caused every request to our website to return a 500 (Internal Server Error). The settings file referenced a WordPress file with the wrong extension. This prevented WordPress from loading the rest of its files and generating pages, so the server returned a 500 error.
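To give a rough picture of the failure mode, the snippet below is a hypothetical sketch of that kind of mistake in a wp-settings.php-style file; the file name and path are placeholders, not our actual configuration.

    <?php
    // Hypothetical excerpt from the locale-related change (names are
    // placeholders, not our real settings file). ABSPATH and WPINC are the
    // usual WordPress constants for the install root and wp-includes.
    require ABSPATH . WPINC . '/l10n.phpp'; // typo: should be '/l10n.php'

    // Because the referenced file does not exist, require raises a fatal
    // error, WordPress never finishes bootstrapping, and every request to
    // the site comes back as HTTP 500.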

Resolution and Recovery

At 2:26 PM, our monitoring service alerted our on-call engineer, who escalated the error and brought the engineering team together to solve the problem. We first attempted a rollback to the state before the error, but due to the complexity of the configuration file this wasn't possible.

Our engineers used a Linux tool called strace to monitor the system calls WordPress made while trying to generate the index page. From the trace we discovered that a crucial file referenced in the settings file could not be found because of a typo in its extension.
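For readers unfamiliar with the tool, the commands below are a rough sketch of one way such a trace could be run from the command line; the paths and the example output are illustrative, not taken from our servers.

    # Trace only file-related system calls while PHP renders the front page,
    # sending strace's output (which goes to stderr) to trace.log.
    strace -f -e trace=file php index.php 2> trace.log

    # Failed lookups show up as open()/openat() calls returning ENOENT.
    grep ENOENT trace.log
    # e.g. openat(AT_FDCWD, ".../wp-includes/l10n.phpp", O_RDONLY) = -1 ENOENT (No such file or directory)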

We corrected the typo, which fixed the 500 error and brought WordPress back online at 3:12 PM.

Corrective and Preventive Measures

After the system failure, our team reflected on how to prevent this type of error from recurring. Below are some of the corrective and preventive measures we agreed on.

  • Write tests that must pass before any code is released to the production environment.

  • Each pull request must be reviewed and approved by another team member before it is merged to the main branch.

  • Make the rollback process faster and more efficient.

  • Develop a mechanism for faster notification delivery.

  • Set alert severity levels on the monitoring system based on the gravity of the incident.

We are committed to improving our products and services so as to give you the best possible experience. We really appreciate your patience with us, and we promise to prevent this kind of outage in the future.

Sincerely,

somzzy