On Tue, Jun 05, 2018 at 05:52:38PM -0400, james harvey <jamespharvey20@xxxxxxxxx> wrote:
> >> This is not always reproducible, but when deleting our journal, creating log
> >> messages for a few hours and then doing the above manually has a ~50% chance of
> >> corrupting the journal.
> ...
> > My strong bet is you have a hardware issue.

Strange, what kind of hardware bug would affect multiple very different
computers in exactly the same way?

> going bad, bad cables, bad port, etc.  My strong bet is you're also
> using BTRFS mirroring.

Not sure what exactly you mean with btrfs mirroring (there are many btrfs
features this could refer to), but the closest thing to that that I use is
dup for metadata (which is always checksummed), data is always single. All
btrfs filesystems are on lvm (not mirrored), and most (but not all) are
encrypted. One affected fs is on a hardware raid controller, one is on an
ssd. I have a single btrfs fs in that box with raid1 for metadata, as an
experiment, but I haven't used it for testing yet.

> You're describing intermittent data corruption on files that I'm
> thinking all have NOCOW turned on.

The systemd journal files are nocow (I re-enabled that after I turned it
off for a while), but the rtorrent directory (and the files in it) are
not. I did experiment (a year ago) with nocow for torrent files and, more
importantly, vm images, but it didn't really solve the "millions of
fragments slow down" problem with btrfs, so I figured I can keep them cow
and regularly copy them to defragment them.

That's why I am quite sure cow was switched on long before I booted my
first 4.14 kernel (and it still is).

> it's done writing to a journal file, but in a way that guarantees it
> to fail.
> This has been reported to systemd at
> https://github.com/systemd/systemd/issues/9112 but poettering has

I am aware that systemd tries to turn on nocow, and I think this is
actually a bug, but this wouldn't have an effect on rtorrent, which has
corruption problems on a different fs. And boy would it be wonderful if
Debian switched away from systemd, I feel I personally ran into every
single bug that exists...

However, no matter how much systemd plays with btrfs flags, it shouldn't
corrupt data.

> The context I ran into this problem was with several other bugs
> interacting, that "btrfs replace" has been guaranteed to corrupt
> non-checksummed (NOCOW) compressed data, which the combination of
> those shouldn't happen, but does in some defragmentation situations
> due to another bug.  In my situation, I don't have a hardware issue.

Yeah, btrfs is full of bugs that I constantly run into, but most of them
are containable, unlike this problem, which might or might not be a btrfs
bug - especially since all your bets seem to be wrong here.

-- 
                The choice of a       Deliantra, the free code+content MORPG
      -----==-     _GNU_              http://www.deliantra.net
      ----==-- _       generation
      ---==---(_)__  __ ____  __      Marc Lehmann
      --==---/ / _ \/ // /\ \/ /      schmorp@xxxxxxxxxx
      -=====/_/_//_/\_,_/ /_/\_\