Postgres 9.01, Amazon EC2/EBS, XFS, JDBC and lost connections

Sean Laurent <sean@xxxxxxxxxxxxx> · Thu, 6 Oct 2011 12:21:52 -0500

We've been running into a particularly strange problem that I'm trying to better understand. The super short version is that our application servers lose their connection to the database when I run a backup during periods of higher load and fail to reconnect.

Here's an overview of the setup:

- PostgreSQL 9.0.1 hosted on a cc1.4xlarge Amazon EC2 instance running CentOS 5.6
- 8 disk RAID-0 array of EBS volumes used for primary data storage
- 4 disk RAID-0 array of EBS volumes used for transaction logs
- Root partition is ext3
- RAID arrays are xfs

Backups are taken using a script that runs the following workflow:

- Tell Postgres to start a backup: SELECT pg_start_backup('RAID backup');
- Run "xfs_freeze" on the primary RAID array
- Tell Amazon to take snapshots of each of the EBS volumes
- Run "xfs_freeze -u" to thaw the primary RAID array
- Run "xfs_freeze" on the transaction log RAID array

- Tell Amazon to take snapshots of each of the EBS volumes
- Run "xfs_freeze -u" to thaw the�transaction log�RAID array
- Tell Postgres the backup is finished:�SELECT pg_stop_backup();
- Remove old WAL files

The whole process takes roughly 7 seconds on average. The RAID arrays are frozen for roughly 2 seconds on average.

Within a few seconds of the backup, our application servers start throwing exceptions that indicate the database connection was closed. Meanwhile, Postgres still shows the connections and we start seeing a really high number (for us) of locks in the database. The application servers refuse to recover and must be killed and restarted. Once they're killed off, the connections actually go away and the locks disappear.

What's particularly weird is that this doesn't happen all the time. The backups were running every hour, but we have only seen the app servers crash 5-10 times over the course of a month.

Has anyone encountered anything like this? Do any of these steps have ramifications that I'm not considering? Especially something that might explain the app server failure?

Thanks.

Sean Laurent
Director of Operations
StudyBlue, Inc.