Service Outage: Power Loss at Primary Data Center Due to Backup Power Failure
- July 20, 2015
This morning the Softlayer datacenter where DPD’s primary servers are located experienced power loss.
The Softlayer datacenter has both battery and generator backup power, but the automatic transfer switch for the backup generators failed, so DPD’s servers lost power once the backup batteries were exhausted.
This resulted in an unclean shutdown of DPD’s primary database server. As a result, the database was corrupted on disk. DPD maintains regular backups of database data, and we immediately began restoring the latest backup, which was taken approximately 39 minutes before the power loss event.
DPD is now up and fully operational.
We are working to restore the data recorded between the last backup at 6:25 AM Eastern and the power loss event at 7:04 AM Eastern. We hope to restore this data within the next 24-48 hours, but we’ll need to extract it from the corrupt database and insert it into the newly restored database, so it will take some time to complete.
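For readers curious what that extract-and-insert step can look like, here is a minimal sketch. It is not our actual recovery procedure: the database engine (SQLite, chosen only so the example runs anywhere), the `orders` table, its columns, and the `copy_missing_rows` helper are all hypothetical. It assumes the rows from the gap can be read out of the damaged copy and that each row has a unique ID and a creation timestamp.

```python
import sqlite3

# Hypothetical schema, used only for this illustration.
SCHEMA = (
    "CREATE TABLE orders ("
    "id INTEGER PRIMARY KEY, created_at TEXT NOT NULL, payload TEXT)"
)


def copy_missing_rows(salvaged, restored, start, end):
    """Copy rows created between start and end (inclusive) from the
    salvaged database into the restored one. Returns rows inserted."""
    rows = salvaged.execute(
        "SELECT id, created_at, payload FROM orders "
        "WHERE created_at >= ? AND created_at <= ? ORDER BY id",
        (start, end),
    ).fetchall()
    before = restored.total_changes
    # OR IGNORE skips any row whose id already exists in the restored copy,
    # so running the copy twice does not create duplicates.
    restored.executemany(
        "INSERT OR IGNORE INTO orders (id, created_at, payload) "
        "VALUES (?, ?, ?)",
        rows,
    )
    restored.commit()
    return restored.total_changes - before


# Two in-memory databases stand in for the salvaged and restored copies.
salvaged = sqlite3.connect(":memory:")
restored = sqlite3.connect(":memory:")
for db in (salvaged, restored):
    db.execute(SCHEMA)

salvaged.executemany(
    "INSERT INTO orders VALUES (?, ?, ?)",
    [
        (1, "2015-07-20 06:10:00", "already in the backup"),
        (2, "2015-07-20 06:30:00", "created after the backup"),
        (3, "2015-07-20 07:00:00", "created after the backup"),
    ],
)
restored.execute(
    "INSERT INTO orders VALUES (?, ?, ?)",
    (1, "2015-07-20 06:10:00", "already in the backup"),
)
restored.commit()

inserted = copy_missing_rows(
    salvaged, restored, "2015-07-20 06:25:00", "2015-07-20 07:04:00"
)
restored_count = restored.execute("SELECT COUNT(*) FROM orders").fetchone()[0]
print(f"Inserted {inserted} rows; restored copy now has {restored_count} rows.")
```

Filtering on the backup-to-outage window and ignoring IDs that already exist keeps the copy safe to rerun, which matters when rows have to be pulled out of a damaged database in several passes.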
Power Outage Event Timeline:
All times Eastern
- 6:25 : Last regular database backup.
- 7:04 : Our datacenter experienced a power outage. Redundant power failed to take over before battery backups were exhausted. The pod our servers are located in then lost power.
- 9:40 : Power was restored.
- 9:41 : We discovered an anomaly with the database. Further investigation showed on-disk corruption from the power outage.
- 10:10 : We determined that the database could not be recovered in a reasonable time and decided to restore from backup.
- 10:45 : Recovery complete. Approximately 39 minutes of data was lost in the recovery.
- Ongoing: We’re working to restore the missing data between the last backup and the outage. Updates will be posted as they happen.
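The durations implied by the timeline can be checked with a few lines of arithmetic. This is only a sketch: the timestamps come from the timeline above, while the variable names and the `at` and `minutes` helpers are made up for illustration.

```python
from datetime import datetime

DAY = "2015-07-20"  # date of the outage; all times Eastern


def at(hhmm):
    """Build a datetime on the outage day from an HH:MM string."""
    return datetime.strptime(f"{DAY} {hhmm}", "%Y-%m-%d %H:%M")


def minutes(start, end):
    """Whole minutes elapsed between two timeline entries."""
    return int((end - start).total_seconds() // 60)


last_backup = at("06:25")
power_lost = at("07:04")
power_restored = at("09:40")
recovery_complete = at("10:45")

data_loss_window = minutes(last_backup, power_lost)
power_outage = minutes(power_lost, power_restored)
total_downtime = minutes(power_lost, recovery_complete)

print(f"Data loss window: {data_loss_window} minutes")
print(f"Power outage:     {power_outage} minutes")
print(f"Total downtime:   {total_downtime} minutes")
```

With the times above, the data loss window works out to 39 minutes, the power outage to 156 minutes (2 hours 36 minutes), and the total downtime from power loss to completed recovery to 221 minutes (3 hours 41 minutes).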
We responded to support requests through our help desk, email, Twitter, and Facebook during the outage. We are looking into better ways of relaying system status to you in the future, including a third-party status monitor that everyone can check.