(switched to email. Please respond via emailed reply-to-all, not via the bugzilla web interface). On Tue, 05 Jun 2018 18:01:36 +0000 bugzilla-daemon@xxxxxxxxxxxxxxxxxxx wrote: > https://bugzilla.kernel.org/show_bug.cgi?id=199931 > > Bug ID: 199931 > Summary: systemd/rtorrent file data corruption when using echo > 3 >/proc/sys/vm/drop_caches A long tale of woe here. Chris, do you think the pagecache corruption is a general thing, or is it possible that btrfs is contributing? Also, that 4.4 oom-killer regression sounds very serious. > Product: Memory Management > Version: 2.5 > Kernel Version: 4.14.33 > Hardware: All > OS: Linux > Tree: Mainline > Status: NEW > Severity: normal > Priority: P1 > Component: Other > Assignee: akpm@xxxxxxxxxxxxxxxxxxxx > Reporter: bugzilla.kernel.org@xxxxxxxx > Regression: No > > We found that > > echo 3 >/proc/sys/vm/drop_caches > > causes file data corruption. We found this because we saw systemd journal > corruption (https://bugs.debian.org/cgi-bin/bugreport.cgi?bug=897266) and > tracked this to a cron job dropping caches every hour. The filesystem in use is > btrfs, but I don't know if it only happens with this filesystem. btrfs scrub > reports no problems, so this is not filesystem metdata corruption. > > Basically: > > # journalctl --verify > [everything fine at this point] > # echo 3 >/proc/sys/vm/drop_caches > # journalctl --verify > [journalctl now reporting corruption problems] > > This is not always reproducible, but when deleting our journal, creating log > messages for a few hours and then doing the above manually has a ~50% chance of > corrupting the journal. > > After investigating we found that rtorrent also suffers from corrupted > downloads when using the above echo - basically, downloading torrents is fine, > except when executing the above echo a few times during a download, after which > rtorrent very likely reports a failed hash check. > > All of this is reproducible on two different boxes, so is unlikely to be a > hardware issue. > > On one affected server we have over 50TB of files, many that have been created > with the cronjob in place, and none of them are corrupted (we have md5sums of > everything), so it seems to be related to something that systemd and rtorrent > do, rather than a generic file corruption issue. > > I also was able to "cmp -l" two corrupted files with their correct version, and > the corruption manifests itself as streaks of ~100-3000 zero bytes instead of > the real data. The start offset sems random, but the end offset seems to be > always aligned to a 4K offset - speculating without the hindrance of knowledge > this feels like a race somewhere between writing to a mmapped area and freeing > it, or so. > > Here is the output of cmp -l between a working and a corrupted file, for two > files: > > http://data.plan9.de/01.cmp.txt > http://data.plan9.de/02.cmp.txt > > We also have a mysql database with hundreds of gigabytes of writes per day on > one server which also does not seem to suffer from any corruption. > > As for why we would do something silly as dropping the caches every hour (in a > cronjob), we started doing this recently because after kernel 4.4, we got > frequent OOM kills despite having gigabytes of available memory (e.g. 12GB in > use, 20GB page cache and 16GB empty swap and bang, mysql gets killed). We found > that that the debian 4.9 kernel is unusable, and 4.14 works, *iff* we use the > above as an hourly cron job, so we did that, and afterwards run into > rtorrent/journald corruption issues. Without the echo in place, mysql usually > gets oom-killed after a few days of uptime. > > -- > You are receiving this mail because: > You are the assignee for the bug.