> First I was also thinking about vacuum. But removing a replication
> slot should have no effect on vacuum on the primary (AFAIK). Please
> correct me if I'm wrong.
>
Yeah, it depends. There are two things involved:
* one process generating the WAL, maybe a VACUUM
* an inactive slot holding on to the WAL
For instance, if a standby is not reachable, WAL will accumulate
behind the slot until the standby is reachable again.
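
A rough way to see how much WAL a slot is holding back is to subtract the slot's restart_lsn (from pg_replication_slots) from the current WAL position (pg_current_wal_lsn()). On the server, pg_wal_lsn_diff() does this subtraction directly. Below is a minimal sketch of the same arithmetic in Python. The LSN values are made up for illustration and the helper names are hypothetical, not part of any library:

```python
def parse_lsn(lsn: str) -> int:
    """Convert a textual LSN such as '16/B374D848' to a byte position.

    An LSN is printed as two hexadecimal numbers separated by a slash:
    the high 32 bits and the low 32 bits of a 64-bit WAL position.
    """
    high, low = lsn.split("/")
    return (int(high, 16) << 32) | int(low, 16)


def retained_wal_bytes(current_lsn: str, restart_lsn: str) -> int:
    """Bytes of WAL between a slot's restart_lsn and the current position."""
    return parse_lsn(current_lsn) - parse_lsn(restart_lsn)


if __name__ == "__main__":
    # Made-up example values, not taken from the affected server.
    current = "17/00000000"
    slot_restart = "16/00000000"
    held = retained_wal_bytes(current, slot_restart)
    print(f"slot is holding back {held / 2**30:.1f} GiB of WAL")
```

Sampling this value every few seconds for the inactive slot would show whether the retained WAL is really growing at the rate the disk is filling.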
I understand that an unreachable standby can cause WAL files to accumulate in the pg_wal directory. This has happened before, and it is expected. What I don't get is the amount and the speed. Write speed went up from the normal 5MB/sec to 1500MB/sec within a minute. When the slot was removed, it went back down to normal. We could easily have handled a disconnected standby, because free disk space is monitored. But in this case there was not enough time to react. PostgreSQL filled up the remaining 40% of free disk space in a matter of minutes. By the time we got the alert message and logged into the server, it was already too late: the disk was full.
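
To put those numbers in perspective, here is a back-of-the-envelope sketch in Python. Only the 40% free space and the two write rates come from the description above. The volume size is an assumption (1 TB, purely hypothetical, since the real size is not stated here), and MB is taken as 10^6 bytes:

```python
MB = 10**6  # assumption: decimal megabytes


def seconds_until_full(free_bytes: float, write_rate_bytes_per_sec: float) -> float:
    """Time until the free space is used up at a constant write rate."""
    if write_rate_bytes_per_sec <= 0:
        raise ValueError("write rate must be positive")
    return free_bytes / write_rate_bytes_per_sec


if __name__ == "__main__":
    assumed_volume_bytes = 10**12                 # hypothetical 1 TB volume
    free_bytes = assumed_volume_bytes * 40 // 100  # 40% free, as reported

    normal = seconds_until_full(free_bytes, 5 * MB)
    runaway = seconds_until_full(free_bytes, 1500 * MB)

    print(f"at    5 MB/s: {normal / 3600:.1f} hours until full")
    print(f"at 1500 MB/s: {runaway / 60:.1f} minutes until full")
```

Under that assumed volume size, the normal rate leaves about 22 hours to react, while the runaway rate leaves under five minutes, which is consistent with the alert arriving too late.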
There is a strong correlation between the speed/amount of data written and the existence of that replication slot. If we drop the slot, write speed goes down immediately. If we add the slot again, then after some time the problem comes back. (All I can say is that it happened three times.) Interestingly, it does not happen with the other standby: that one is still connected and works flawlessly. I don't know of any normal PostgreSQL mechanism that could cause this behaviour. We already ruled out client applications: all client apps were shut down, the volume size was increased, and PostgreSQL was restarted, but that did not solve the problem.
Laszlo