Here's an overview of the setup:
- PostgreSQL 9.0.1 hosted on a cc1.4xlarge Amazon EC2 instance running CentOS 5.6
- 8-disk RAID-0 array of EBS volumes used for primary data storage
- 4-disk RAID-0 array of EBS volumes used for transaction logs
- Root partition is ext3
- RAID arrays are xfs
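For reference, this is roughly how such arrays are typically assembled (a sketch only; the device names, mount points, and options below are illustrative, not our exact configuration):

  # Assemble 8 EBS volumes into a RAID-0 array for primary data
  # (device names are placeholders)
  mdadm --create /dev/md0 --level=0 --raid-devices=8 \
      /dev/sdf /dev/sdg /dev/sdh /dev/sdi /dev/sdj /dev/sdk /dev/sdl /dev/sdm
  mkfs.xfs /dev/md0
  mount -t xfs -o noatime /dev/md0 /data

  # Same idea for the 4-volume transaction log array
  mdadm --create /dev/md1 --level=0 --raid-devices=4 \
      /dev/sdn /dev/sdo /dev/sdp /dev/sdq
  mkfs.xfs /dev/md1
  mount -t xfs -o noatime /dev/md1 /wal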
Backups are taken using a script that runs through the following workflow (a rough sketch of the script appears after the list):
- Tell Postgres to start a backup: SELECT pg_start_backup('RAID backup');
- Run "xfs_freeze" on the primary RAID array
- Tell Amazon to take snapshots of each of the EBS volumes
- Run "xfs_freeze -u" to thaw the primary RAID array
- Run "xfs_freeze" on the transaction log RAID array
- Tell Amazon to take snapshots of each of the EBS volumes
- Run "xfs_freeze -u" to thaw the�transaction log�RAID array
- Tell Postgres the backup is finished:�SELECT pg_stop_backup();
- Remove old WAL files
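Stripped down, the script looks roughly like this (a sketch only; the mount points, volume IDs, and the use of ec2-create-snapshot from the ec2-api-tools are simplifications and placeholders, not the exact commands we run):

  #!/bin/bash
  # Simplified sketch of the backup workflow described above.
  set -e

  DATA_VOLS="vol-11111111 vol-22222222"   # really 8 volumes
  WAL_VOLS="vol-33333333 vol-44444444"    # really 4 volumes

  # Tell Postgres a base backup is starting
  psql -U postgres -c "SELECT pg_start_backup('RAID backup');"

  # Freeze the primary data array, snapshot its volumes, then thaw it
  xfs_freeze -f /data
  for vol in $DATA_VOLS; do
      ec2-create-snapshot "$vol"
  done
  xfs_freeze -u /data

  # Same sequence for the transaction log array
  xfs_freeze -f /wal
  for vol in $WAL_VOLS; do
      ec2-create-snapshot "$vol"
  done
  xfs_freeze -u /wal

  # Tell Postgres the backup is finished
  psql -U postgres -c "SELECT pg_stop_backup();"

  # Remove WAL files that are no longer needed (details omitted here)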
The whole process takes roughly 7 seconds on average. The RAID arrays are frozen for roughly 2 seconds on average.
Within a few seconds of a backup, our application servers start throwing exceptions indicating that the database connection was closed. Meanwhile, Postgres still shows the connections, and we start seeing an unusually high (for us) number of locks in the database. The application servers never recover and have to be killed and restarted. Once they're killed off, the connections finally go away and the locks disappear.
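For reference, the connections and locks can be observed through the standard catalog views, e.g.:

  psql -U postgres -c "SELECT count(*) FROM pg_stat_activity;"
  psql -U postgres -c "SELECT mode, count(*) FROM pg_locks GROUP BY mode ORDER BY count(*) DESC;"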
What's particularly weird is that this doesn't happen all the time. The backups were running every hour, but we have only seen the app servers crash 5-10 times over the course of a month.
Has anyone encountered anything like this? Do any of these steps have ramifications that I'm not considering? Especially something that might explain the app server failure?
Sean Laurent
Director of Operations
StudyBlue, Inc.