On Tue 16-02-21 23:05:57, Damien Le Moal wrote:
> On 2021/02/17 2:51, Keith Busch wrote:
> > On Tue, Feb 16, 2021 at 04:36:06PM +0000, Christoph Hellwig wrote:
> >> On Tue, Feb 16, 2021 at 02:38:49PM +0100, Jan Kara wrote:
> >>> Apparently there are several userspace programs that depend on being
> >>> able to call BLKDISCARD ioctl without the ability to grab bdev
> >>> exclusively - namely FUSE filesystems have the device open without
> >>> O_EXCL (the kernel has the bdev open with O_EXCL) so the commit breaks
> >>> fstrim(8) for such filesystems. Also LVM when shrinking LV opens PV and
> >>> discards ranges released from LV but that PV may be already open
> >>> exclusively by someone else (see bugzilla link below for more details).
> >>>
> >>> This reverts commit 384d87ef2c954fc58e6c5fd8253e4a1984f5fe02.
> >>
> >> I think that is a bad idea. We fixed the problem for a reason.
> >> I think the right fix is to just do nothing if the device hasn't been
> >> opened with O_EXCL and can't be reopened with it, just don't do anything
> >> but also don't return an error. After all discard and thus
> >> BLKDISCARD is purely advisory.
> >
> > A discard is advisory, but BLKZEROOUT is not, so something different
> > should happen there. We were also planning to send a patch using this
> > same pattern for Zone Reset to fix stale page cache issues after the
> > reset, but we'll wait to see how this settles before sending that.
>
> There is also another problem: the truncate_bdev & operation following it
> (discard, zeroout or zone reset) are not atomic vs read/write operations to the
> bdev. Without mutual exclusion, that page invalidation is best effort only since
> reads can sneak in between the truncate and discard (or zeroout or zone reset).
> With our zone reset stale page problem case, it is reads from udevd that we see
> sneaking in between the truncate bdev and zone reset and so we still get stale
> pages after the zone reset is finished. No solution to propose for solving that,
> yet...

Well, at least blkdev_fallocate() does:

	truncate_bdev_range();
	blkdev_issue_zeroout();
	invalidate_inode_pages2_range();

so racing reads should not result in stale page cache contents AFAICT.

								Honza
-- 
Jan Kara <jack@xxxxxxxx>
SUSE Labs, CR
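
To illustrate why the trailing invalidation matters, here is a minimal userspace sketch of the ordering above. It is a toy model, not kernel code: struct toy_bdev and all toy_*() helpers are hypothetical names invented for this example, the "page cache" is one byte per block, and the racing reader is simply a read placed between the first invalidation and the zeroout. It only shows the effect of the ordering and says nothing about locking or reads that are still in flight in the real kernel.

```c
#include <assert.h>
#include <stdbool.h>
#include <string.h>

#define NBLOCKS 4

/* Toy model (hypothetical): one byte per block plus a per-block cache slot. */
struct toy_bdev {
	unsigned char media[NBLOCKS];	/* what is stored on the device */
	unsigned char cache[NBLOCKS];	/* cached copy of each block */
	bool cached[NBLOCKS];		/* is the cache slot valid? */
};

/* Buffered read: fill the cache from media on a miss, then serve from cache. */
static unsigned char toy_read(struct toy_bdev *d, int blk)
{
	assert(blk >= 0 && blk < NBLOCKS);
	if (!d->cached[blk]) {
		d->cache[blk] = d->media[blk];
		d->cached[blk] = true;
	}
	return d->cache[blk];
}

/*
 * Drop all cached blocks. Stands in for both truncate_bdev_range() and
 * invalidate_inode_pages2_range() in this simplified model.
 */
static void toy_invalidate(struct toy_bdev *d)
{
	memset(d->cached, 0, sizeof(d->cached));
}

/* Zero the media directly, bypassing the cache, like blkdev_issue_zeroout(). */
static void toy_zeroout(struct toy_bdev *d)
{
	memset(d->media, 0, sizeof(d->media));
}

/*
 * Run the sequence with a reader sneaking in after the first invalidation.
 * Returns what a later buffered read of block 0 observes.
 */
static unsigned char toy_zero_with_racing_read(bool invalidate_after)
{
	struct toy_bdev d;

	memset(&d, 0, sizeof(d));
	memset(d.media, 0xAA, sizeof(d.media));	/* old data on the device */

	(void)toy_read(&d, 0);		/* cache holds old data */
	toy_invalidate(&d);		/* step 1: drop cached pages */
	(void)toy_read(&d, 0);		/* racing reader re-caches old data */
	toy_zeroout(&d);		/* step 2: zero the media */
	if (invalidate_after)
		toy_invalidate(&d);	/* step 3: drop pages cached by the racer */

	return toy_read(&d, 0);
}
```

In this model, toy_zero_with_racing_read(false) still returns the old 0xAA byte from the cache, while toy_zero_with_racing_read(true) returns 0 because the second invalidation throws away whatever the racing reader cached.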