Sean Laurent <sean@xxxxxxxxxxxxx> writes:
> We've been running into a particularly strange problem that I'm trying to
> better understand. The super short version is that our application servers
> lose their connection to the database when I run a backup during periods of
> higher load and fail to reconnect.
>
> Here's an overview of the setup:
> - PostgreSQL 9.0.1 hosted on a cc1.4xlarge Amazon EC2 instance running
>   CentOS 5.6
> - 8 disk RAID-0 array of EBS volumes used for primary data storage
> - 4 disk RAID-0 array of EBS volumes used for transaction logs
> - Root partition is ext3
> - RAID arrays are xfs
>
> Backups are taken using a script that runs the following workflow:
> - Tell Postgres to start a backup: SELECT pg_start_backup('RAID backup');
> - Run "xfs_freeze" on the primary RAID array
> - Tell Amazon to take snapshots of each of the EBS volumes
> - Run "xfs_freeze -u" to thaw the primary RAID array
> - Run "xfs_freeze" on the transaction log RAID array
> - Tell Amazon to take snapshots of each of the EBS volumes
> - Run "xfs_freeze -u" to thaw the transaction log RAID array
> - Tell Postgres the backup is finished: SELECT pg_stop_backup();
> - Remove old WAL files
>
> The whole process takes roughly 7 seconds on average. The RAID arrays are
> frozen for roughly 2 seconds on average.
>
> Within a few seconds of the backup, our application servers start throwing
> exceptions that indicate the database connection was closed. Meanwhile,
> Postgres still shows the connections and we start seeing a really high
> number (for us) of locks in the database. The application servers refuse to
> recover and must be killed and restarted. Once they're killed off, the
> connections actually go away and the locks disappear.

That's just weird. It sounds like the "xfs_freeze" operation, or the
snapshotting operation, is somehow interrupting network traffic. I'd not
expect such a thing on a normal server, but who knows what's connected to
what in an Amazon EC2 instance?

Anyway, I'd suggest trying to instrument something to prove or disprove
that there's a networking failure involved. It might be as simple as
watching "ping" behavior ...

			regards, tom lane

-- 
Sent via pgsql-general mailing list (pgsql-general@xxxxxxxxxxxxxx)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-general
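
[Editorial sketch, not part of the original message.] One possible way to
do the kind of instrumentation suggested above is to probe the database
host at a short interval while the backup script runs, and then look for
stretches of failed or slow probes that line up with the freeze and
snapshot steps. The sketch below is only an illustration under stated
assumptions: it uses a TCP connect to the server port rather than ICMP
"ping" (so it runs without special privileges), and the function names,
the 0.2 second interval, and the 0.5 second "slow" threshold are all
hypothetical choices, not anything from the thread.

```python
import socket
import time


def probe_once(host, port, timeout=1.0):
    """Try one TCP connect; return elapsed seconds, or None on failure."""
    start = time.monotonic()
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return time.monotonic() - start
    except OSError:
        # Covers timeouts, refused connections, and name-resolution errors.
        return None


def find_outages(samples, slow=0.5):
    """Group consecutive failed or slow probes into (first_ts, last_ts) spans.

    samples is a list of (timestamp, latency_or_None) tuples in time order.
    """
    outages = []
    current = None
    for ts, latency in samples:
        bad = latency is None or latency > slow
        if bad:
            # Extend the open span, or start a new one at this timestamp.
            current = (current[0], ts) if current else (ts, ts)
        elif current:
            outages.append(current)
            current = None
    if current:
        outages.append(current)
    return outages


def watch(host, port, duration=60.0, interval=0.2):
    """Probe repeatedly for `duration` seconds and return the raw samples."""
    samples = []
    deadline = time.monotonic() + duration
    while time.monotonic() < deadline:
        # Wall-clock timestamps so they can be compared with backup logs.
        samples.append((time.time(), probe_once(host, port)))
        time.sleep(interval)
    return samples
```

To use it, one would start something like watch("db-host", 5432) from an
application server just before the backup begins ("db-host" is a
placeholder; 5432 is the default PostgreSQL port), pass the result to
find_outages(), and compare the reported spans with the times at which
the script ran "xfs_freeze" and requested the snapshots. If no spans
show up while the application servers still lose their connections, that
would argue against a plain network interruption.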