Re: [PATCH] Revert "block: Do not discard buffers under a mounted filesystem"

Damien Le Moal <Damien.LeMoal@xxxxxxx> · Thu, 18 Feb 2021 22:35:41 +0000

On 2021/02/18 23:07, Jan Kara wrote:
> On Tue 16-02-21 23:05:57, Damien Le Moal wrote:
>> On 2021/02/17 2:51, Keith Busch wrote:
>>> On Tue, Feb 16, 2021 at 04:36:06PM +0000, Christoph Hellwig wrote:
>>>> On Tue, Feb 16, 2021 at 02:38:49PM +0100, Jan Kara wrote:
>>>>> Apparently there are several userspace programs that depend on being
>>>>> able to call BLKDISCARD ioctl without the ability to grab bdev
>>>>> exclusively - namely FUSE filesystems have the device open without
>>>>> O_EXCL (the kernel has the bdev open with O_EXCL) so the commit breaks
>>>>> fstrim(8) for such filesystems. Also LVM when shrinking LV opens PV and
>>>>> discards ranges released from LV but that PV may be already open
>>>>> exclusively by someone else (see bugzilla link below for more details).
>>>>>
>>>>> This reverts commit 384d87ef2c954fc58e6c5fd8253e4a1984f5fe02.
>>>>
>>>> I think that is a bad idea. We fixed the problem for a reason.
>>>> I think the right fix is to just do nothing if the device hasn't been
>>>> opened with O_EXCL and can't be reopened with it, just don't do anything
>>>> but also don't return an error.  After all discard and thus
>>>> BLKDISCARD is purely advisory.
>>>
>>> A discard is advisory, but BLKZEROOUT is not, so something different
>>> should happen there. We were also planning to send a patch using this
>>> same pattern for Zone Reset to fix stale page cache issues after the
>>> reset, but we'll wait to see how this settles before sending that.
>>
>> There is also another problem: the truncate_bdev & operation following it
>> (discard, zeroout or zone reset) are not atomic vs read/write operations to the
>> bdev. Without mutual exclusion, that page invalidation is best effort only since
>> reads can snick in between the truncate and discard (or zeroout or zone reset).
>> With our zone reset stale page problem case, it is reads from udevd that we see
>> snicking in between the truncate bdev and zone reset and so we still get stale
>> pages after the zone reset is finished. No solution to propose for solving that,
>> yet...
> 
> Well, at least blkdev_fallocate() does:
> 
> 	truncate_bdev_range();
> 	blkdev_issue_zeroout();
> 	invalidate_inode_pages2_range();
> 
> so racing reads should not result in stale page cache contents AFAICT.

Yes, but concurrent writes can then get in between the blkdev_issue_zeroout()
and invalidate_inode_pages2_range() and data discarded before hitting the
drive... Not very nice either. Granted, that would mean that userland has 2
concurrent writers that are not synchronized. So weird results are to be
expected. I guess it is probably safe to ignore that case ?

I guess the same pattern as above for zeroout would work for reset zone too.
Will try to see if that solves our test problem.

-- 
Damien Le Moal
Western Digital Research