On Wed, Jun 6, 2018 at 8:18 AM, Chris Mason <clm@xxxxxx> wrote:
>
>
> On 5 Jun 2018, at 16:03, Andrew Morton wrote:
>
>> (switched to email. Please respond via emailed reply-to-all, not via the
>> bugzilla web interface).
>>
>> On Tue, 05 Jun 2018 18:01:36 +0000 bugzilla-daemon@xxxxxxxxxxxxxxxxxxx
>> wrote:
>>
>>> https://bugzilla.kernel.org/show_bug.cgi?id=199931
>>>
>>>             Bug ID: 199931
>>>            Summary: systemd/rtorrent file data corruption when using echo
>>>                     3 >/proc/sys/vm/drop_caches
>>
>> A long tale of woe here. Chris, do you think the pagecache corruption
>> is a general thing, or is it possible that btrfs is contributing?
>>
>> Also, that 4.4 oom-killer regression sounds very serious.
>
> This week I found a bug in btrfs file write with how we handle stable pages.
> Basically it works like this:
>
> write(fd, some bytes less than a page)
> write(fd, some bytes into the same page)
>         btrfs prefaults the userland page
>         lock_and_cleanup_extent_if_need()  <- stable pages
>                 wait for writeback()
>                 clear_page_dirty_for_io()
>
> At this point we have a page that was dirty and is now clean. That's
> normally fine, unless our prefaulted page isn't in RAM anymore.
>
>         iov_iter_copy_from_user_atomic()  <--- uh oh
>
> If the copy_from_user fails, we drop all our locks and retry. But along the
> way, we completely lost the dirty bit on the page. If the page is dropped
> by drop_caches, the writes are lost. We'll just read back the stale
> contents of that page during the retry loop. This won't result in crc
> errors because the bytes we lost were never crc'd.

So we're going to carefully redirty the page under the page lock, right?

> It could result in zeros in the file because we're basically reading a hole,
> and those zeros could move around in the page depending on which part of the
> page was dirty when the writes were lost.

I have a question: while re-reading this page, wouldn't it read the old/stale
on-disk data?

thanks,
liubo

> I spent a morning trying to trigger this with drop_caches and couldn't make
> it happen, even with schedule_timeout()s inserted and other tricks. But I
> was able to get corruptions if I manually invalidated pages in the critical
> section.
>
> I'm working on a patch, and I'll check and see if any of the other recent
> fixes Dave integrated may have a less exotic explanation.
>
> -chris
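
For reference, below is a minimal userspace sketch of the access pattern Chris
describes: two sub-page writes into the same page, a drop_caches, then a
read-back check of the second write. It is only an illustration of the
user-visible pattern, not a reproducer -- the corruption needs the prefaulted
user page to be evicted in the middle of the write path, and Chris notes he
could not trigger it with drop_caches alone. The file name, offsets, and
buffer sizes are arbitrary choices, not taken from the bug report.

#define _XOPEN_SOURCE 700
#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>

int main(void)
{
	char buf[256];
	int fd = open("testfile", O_CREAT | O_RDWR | O_TRUNC, 0644);
	if (fd < 0) { perror("open"); return 1; }

	/* write(fd, some bytes less than a page) */
	memset(buf, 'A', sizeof(buf));
	if (pwrite(fd, buf, sizeof(buf), 0) != (ssize_t)sizeof(buf)) {
		perror("pwrite");
		return 1;
	}

	/* write(fd, some bytes into the same page) */
	memset(buf, 'B', sizeof(buf));
	if (pwrite(fd, buf, sizeof(buf), 256) != (ssize_t)sizeof(buf)) {
		perror("pwrite");
		return 1;
	}

	/* echo 3 > /proc/sys/vm/drop_caches (needs root; skipped otherwise) */
	int dc = open("/proc/sys/vm/drop_caches", O_WRONLY);
	if (dc >= 0) {
		if (write(dc, "3\n", 2) != 2)
			perror("drop_caches write");
		close(dc);
	}

	/* read back the second write and check that it survived */
	if (pread(fd, buf, sizeof(buf), 256) != (ssize_t)sizeof(buf)) {
		perror("pread");
		return 1;
	}
	for (size_t i = 0; i < sizeof(buf); i++) {
		if (buf[i] != 'B') {
			fprintf(stderr, "unexpected byte 0x%02x at offset %zu\n",
				(unsigned char)buf[i], 256 + i);
			return 1;
		}
	}
	printf("data intact\n");
	close(fd);
	return 0;
}

In the good case this always prints "data intact", because a dirty page is
not dropped by drop_caches; the lost-dirty-bit window described above is what
would make the second write disappear.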