Re: [Bug 199931] New: systemd/rtorrent file data corruption when using echo 3 >/proc/sys/vm/drop_caches

james harvey <jamespharvey20@xxxxxxxxx> · Tue, 5 Jun 2018 17:52:38 -0400

On Tue, Jun 5, 2018 at 4:03 PM, Andrew Morton <akpm@xxxxxxxxxxxxxxxxxxxx> wrote:
> On Tue, 05 Jun 2018 18:01:36 +0000 bugzilla-daemon@xxxxxxxxxxxxxxxxxxx wrote:
>
>> https://bugzilla.kernel.org/show_bug.cgi?id=199931
>>
>>             Bug ID: 199931
>>            Summary: systemd/rtorrent file data corruption when using echo
>>                     3 >/proc/sys/vm/drop_caches
>
> A long tale of woe here.  Chris, do you think the pagecache corruption
> is a general thing, or is it possible that btrfs is contributing?
...
>> We found that
>>
>>    echo 3 >/proc/sys/vm/drop_caches
>>
>> causes file data corruption. We found this because we saw systemd journal
>> corruption (https://bugs.debian.org/cgi-bin/bugreport.cgi?bug=897266) and
>> tracked this to a cron job dropping caches every hour. The filesystem in use is
>> btrfs, but I don't know if it only happens with this filesystem. btrfs scrub
>> reports no problems, so this is not filesystem metdata corruption.
...
>> This is not always reproducible, but when deleting our journal, creating log
>> messages for a few hours and then doing the above manually has a ~50% chance of
>> corrupting the journal.
...

This sounds a lot related to what Qu Wenruo (as the BTRFS expert and
patch writer) and I (from a reporter and research standpoint) have
been working on, but with a different twist.

My strong bet is you have a hardware issue.  Something like a drive
going bad, bad cables, bad port, etc.  My strong bet is you're also
using BTRFS mirroring.

You're describing intermittent data corruption on files that I'm
thinking all have NOCOW turned on.  On BTRFS, journald turns on NOCOW
for its journal files.  It makes an attempt to turn COW back on when
it's done writing to a journal file, but in a way that guarantees it
to fail.  This has been reported to systemd at
https://github.com/systemd/systemd/issues/9112 but poettering has
expressed the desire to leave it the way it is rather than fix it.
(Granted the situation is going to be improved in the context of the
compression/replace bugs described below, by submitted patches, but
leaving the situation of other on-disk data corruption.)  My bet is
your torrent downloads also have NOCOW turned on.

When NOCOW is turned on, BTRFS also stops performing checksumming of
the data.  (Associated metadata is still checksummed.)

If your BTRFS volume uses mirroring, and you have corruption on one
mirror but not the other, you will get correct or corrupted data
pseudo-randomly depending on which disk is read from.

If your BTRFS volume doesn't use mirroring, then if it's a new file
still in the cache, it won't be corrupted, and after dropping the
cache and re-reading it, if you have a hardware issue, you'll be
reading a corrupted copy.  But, I suspect you are using mirroring, or
else you'd probably be getting unfixable checksum errors on COW files
as well.

Where with checksums and mirroring BTRFS would automatically recognize
a bad read, try the other mirror, and correct the bad copy, with NOCOW
on, even with mirroring, BTRFS has no way to know the data read is
corrupted.

The context I ran into this problem was with several other bugs
interacting, that "btrfs replace" has been guaranteed to corrupt
non-checksummed (NOCOW) compressed data, which the combination of
those shouldn't happen, but does in some defragmentation situations
due to another bug.  In my situation, I don't have a hardware issue.

If you're using BTRFS mirroring, there's an easy way for you to see if
I'm right.  Additions to btrfs-tools are in the works to detect this,
but you can manually do it in the meantime.

Run "filefrag -v <path-filename a file you're having intermittent
corruption on>".

This isn't the ideal tool for the job (btrfs-debug tree is) but it
will more quickly show you the starting block number and length of
blocks for each extent of your file.

For each extent line listed, run 2 commands: "btrfs-map-logical -l
<4096 * physical_offset first (starting) number> -b <4096 * length> -c
1 -o <physical_offset>.1"; and the same but ending "-c 2 -o
<physical_offset>.2".

So, if filefrag shows:
0: 0.. 23: 1201616.. 1201639: 24: last,shared,eof

You'd run (again, for each extent line, with appropriate -l and -b
values and output file name):
btrfs-map-logical -l 4921819136 -b 98304 -c 1 -o 4921819136.1
btrfs-map-logical -l 4921819136 -b 98304 -c 2 -o 4921819136.2

(If you are using BTRFS compression, and a flags column includes
"encoded", you want to use "-b 4096" because filefrag doesn't report
the proper ending physical_offset and length in this situation, and
they're always 4096 bytes.)

This will read each of the extents in your file from both mirrored
copies, and write them to separate files.

Then compare each set of <physical_offset>.1 and <physical_offset>.2 files.

They should never be different.  If they are, for one reason or
another, your mirrored copies differ, and you've found why dropping
cache causes an intermittent problem.