On Fri, Jun 29, 2018 at 1:14 PM, Devin Christensen <quixoten@xxxxxxxxx> wrote:
>> From your description it sounds like it's happening in the middle of
>> streaming, right?
>
> Correct. None of the instances in the chain experience a crash. Most of the
> time I see the "incorrect resource manager data checksum in record" error,
> but I've also seen it manifested as:
>
> invalid magic number 8813 in log segment 000000030000AEC20000009C, offset
> 15335424

I note that that isn't at a segment boundary.  Is that also the case
for the other error?

One theory would be that there is a subtle FS cache coherency problem
between writes and reads of a file from different processes
(causality), on that particular stack.  Maybe not too many programs
pass data through files with IPC to signal progress in this kinda
funky way, but that'd certainly be a violation of POSIX if it didn't
work correctly and I think people would know about that so I feel a
bit silly suggesting it.  To follow that hypothesis to the next step:
I suppose it succeeds after you restart because it requests the whole
segment again and gets a coherent copy all the way down the chain.

Another idea would be that our flush pointer tracking and IPC is
somehow subtly wrong and that's exposed by different timing leading to
incoherent reads, but I feel like we would know about that by now too.

I'm not really a replication expert, so I could be missing something
simple here.  Anyone?

>> I did find this similar complaint that involves an ext4 primary and a
>> btrfs replica:
>
> It is interesting that my issue occurs on the first hop from ZFS to ext4. I
> have not seen any instances of this happening going from the ext4 primary to
> the first ZFS replica.

I happen to have a little office server that uses ZFS so I left it
chugging through a massive pgbench session with a chain of 3 replicas
while I worked on other stuff, and didn't see any problems (no ext4
involved though, this is a FreeBSD box).
I also tried --wal-segsize=1MB (a feature of 11) to get some more
frequent recycling happening just in case it was relevant.

>> We did have a report recently of ZFS recycling WAL files very slowly
>
> Do you know what version of ZFS that effected? We're currently on 0.6.5.6,
> but could upgrade to 0.7.5 on Ubuntu 18.04

I think that issue is fundamental/all versions, and has something to
do with the record size (if you have 128KB ZFS records and someone
writes 8KB, it probably needs to read a whole 128KB record in, whereas
with ext4 et al you have 4KB blocks and the OS can very often skip
reading it in because it can see you're entirely overwriting blocks),
and possibly the COW design too (I dunno).  Here's the recent thread,
which points back to an older one, from some Joyent guys who I gather
are heavy ZFS users:

https://www.postgresql.org/message-id/flat/CACPQ5FpEY9CfUF6XKs5sBBuaOoGEiO8KD4SuX06wa4ATsesaqg%40mail.gmail.com

There was a ZoL bug that made headlines recently but that was in 0.7.7
so not relevant to your case.

--
Thomas Munro
http://www.enterprisedb.com
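
PS: For anyone who wants to check where the offset from one of these
errors falls, here is a minimal sketch in Python of the arithmetic
behind the segment boundary remark above.  It assumes the default 16MB
WAL segment size and 8KB WAL page size (neither is confirmed for the
cluster in question), and the helper names are made up for
illustration; they are not PostgreSQL functions.

```python
# Assumptions (not confirmed by the report): default build settings.
WAL_SEGMENT_SIZE = 16 * 1024 * 1024  # bytes per WAL segment file
XLOG_BLCKSZ = 8192                   # bytes per WAL page


def parse_wal_filename(name):
    """Split a 24-hex-digit WAL segment name into (timeline, log, seg)."""
    if len(name) != 24:
        raise ValueError("expected a 24-character WAL segment name")
    return int(name[0:8], 16), int(name[8:16], 16), int(name[16:24], 16)


def describe_offset(name, offset,
                    seg_size=WAL_SEGMENT_SIZE, page_size=XLOG_BLCKSZ):
    """Report the LSN and boundary alignment of an offset in a segment."""
    timeline, log, seg = parse_wal_filename(name)
    # Each "log" value covers 4GB of WAL, split into seg_size pieces.
    segs_per_log = 0x100000000 // seg_size
    segno = log * segs_per_log + seg
    lsn = segno * seg_size + offset
    return {
        "timeline": timeline,
        "lsn": "%X/%X" % (lsn >> 32, lsn & 0xFFFFFFFF),
        "at_segment_boundary": offset % seg_size == 0,
        "at_page_boundary": offset % page_size == 0,
    }


# Values taken from the "invalid magic number" message quoted above.
info = describe_offset("000000030000AEC20000009C", 15335424)
print(info)
```

Under those assumptions the reported offset is not at a segment
boundary, but it does land exactly on a WAL page boundary, which would
fit with the magic number being something that lives in the page
header.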