On 2018-08-04 13:34:04 -0600, Brent Kerby wrote: > On Postgres 10.3 (on AWS RDS), I am running logical decoding using the > test_decoding output plugin, and every few minutes I am seeing pauses in > the stream, unrelated to any large transactions. About once every hour or > two, the pause is long enough that the database disconnects my client due > to exceeding wal_sender_timeout (30 seconds -- the RDS default value); > after reconnecting it is able to make progress again. My client is using > the streaming replication protocol via pgjdbc (with a status interval of 1 > second). What I'm seeing is that during such a pause, the server is not > sending any data to the client: > > - pg_stat_replication.sent_lsn stops advancing > - My client is blocking in a call to PGReplicationStream.read() > - pg_stat_activity shows that the walsender process has a wait_event of > 'SnapbuildSync'. > > In this scenario, it makes sense that the client would be timed out: pgjdbc > only sends feedback to the server at the beginning of a call to > PGReplicationStream.read(), so if a single call blocks a long time, never > receiving any data from the server, then the client would stop sending > feedback to the server, causing timeout. > > My question is why might the server be spending so much time waiting on > SnapbuildSync? The docs describe this event as follows: > > "IO / SnapbuildSync / Waiting for a serialized historical catalog snapshot > to reach stable storage." > I gather that this is related to statistics collection, but I'm not > understanding why a walsender process would wait on such an event nor why > it would take such a long time. Any ideas? It's *not* related to statistics collection. It's fsyncing a few *small* files to disk. On a first blush this seems to indicate that your storage system is quite overloaded? The relevant piece of code is: /* * fsync the file before renaming so that even if we crash after this we * have either a fully valid file or nothing. * * TODO: Do the fsync() via checkpoints/restartpoints, doing it here has * some noticeable overhead since it's performed synchronously during * decoding? */ pgstat_report_wait_start(WAIT_EVENT_SNAPBUILD_SYNC); if (pg_fsync(fd) != 0) { int save_errno = errno; CloseTransientFile(fd); errno = save_errno; ereport(ERROR, (errcode_for_file_access(), errmsg("could not fsync file \"%s\": %m", tmppath))); } pgstat_report_wait_end(); CloseTransientFile(fd); Greetings, Andres Freund