Re: [RFC v2 00/10] bdev: LBS devices support to coexist with buffer-heads

On Sun, Sep 17, 2023 at 06:13:40PM -0700, Luis Chamberlain wrote:
> On Mon, Sep 18, 2023 at 10:59:07AM +1000, Dave Chinner wrote:
> > On Mon, Sep 18, 2023 at 12:14:48AM +0100, Matthew Wilcox wrote:
> > > On Mon, Sep 18, 2023 at 08:38:37AM +1000, Dave Chinner wrote:
> > > > On Fri, Sep 15, 2023 at 02:32:44PM -0700, Luis Chamberlain wrote:
> > > > > LBS devices. This in turn allows filesystems which support bs > 4k to be
> > > > > enabled on a 4k PAGE_SIZE world on LBS block devices. This then allows
> > > > > LBS devices to take advantage of the recently posted work to enable
> > > > > LBS support for filesystems [0].
> > > > 
> > > > Why do we need LBS devices to support bs > ps in XFS?
> > > 
> > > It's the other way round -- we need the support in the page cache to
> > > reject sub-block-size folios (which is in the other patches) before we
> > > can sensibly talk about enabling any filesystems on top of LBS devices.
> > >
> > > Even XFS, or for that matter ext2 which support 16k block sizes on
> > > CONFIG_PAGE_SIZE_16K (or 64K) kernels need that support first.
> > 
> > Well, yes, I know that. But the statement above implies that we
> > can't use bs > ps filesystems without LBS support on 4kB PAGE_SIZE
> > systems. If it's meant to mean the exact opposite, then it is
> > extremely poorly worded....
> 
> Let's ignore the above statement for a second just to clarify the goal here.
> The patches posted by Pankaj enable bs > ps even on 4k sector drives.

Yes. They also enable XFS to support bs > ps on devices up to 32kB
sector sizes, too. All the sector size does is define the minimum
filesystem block size that can be supported by the filesystem on
that device and apart from that we just don't care what the sector
size on the underlying device is.

> This patch series by definition is suggesting that an LBS device is one
> where the minimum sector size is > ps; in practice I'm talking about
> devices where the logical block size is > 4k, or where the physical
> block size can be > 4k.

Which XFS with bs > ps just doesn't care about. As long as the
logical sector size is a power of 2 between 512 bytes and 32kB, it
will just work.
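
Roughly speaking, the constraint amounts to no more than this (a
hand-wavy sketch to illustrate the point, not the actual validation
code, and sectsize_ok() is just a made up helper):

static bool sectsize_ok(unsigned int sectsize, unsigned int blocksize)
{
	/* sector size must be a power of two */
	if (sectsize == 0 || (sectsize & (sectsize - 1)))
		return false;
	/* sector sizes from 512 bytes to 32kB are supported */
	if (sectsize < 512 || sectsize > 32768)
		return false;
	/* the filesystem block size can't be smaller than a sector */
	return blocksize >= sectsize;
}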

> There are two situations in the NVMe world where
> this can happen. One is where the LBA format used is > 4k, the other is
> where the npwg & awupf | nawupf is > 4k. The first makes the logical

FWIW, I have no idea what these acronyms actually mean....

> block size > 4k, the latter allows the physical block size to be > 4k.
> The first *needs* an aops which groks > ps on the block device cache.
> The second lets you remain backward compatible with 4k sector size, but
> if you want to use a larger sector size you can too, but that also
> requires a block device cache which groks > ps. Using > ps for the
> logical block size or the physical block size is what I am suggesting we
> should call LBS devices.

Sure. LBS means sector size > page size for the block device. That
much has always been clear.

But telling me that again doesn't explain what LBS support has to do
with the filesystem implementation. mkfs.xfs doesn't require LBS
support to make a new XFS filesystem with a 32kB sector size and
64kB filesystem block size.  For the mounted filesystem that
supports bs > ps, it also doesn't care that the device sector size
is smaller than what mkfs told it to use. e.g. this is how I run 4kB
sector size filesystem testing on 512 byte sector devices today....
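
In other words, at mount time the device's logical block size only
matters as a lower bound. Something along these lines (again a
hand-wavy sketch, bdev_sectsize_compatible() is a made up name, not
the real code):

static int bdev_sectsize_compatible(struct block_device *bdev,
				    unsigned int fs_sectsize)
{
	/* the device must be able to do IO at the fs sector size */
	if (bdev_logical_block_size(bdev) > fs_sectsize)
		return -EINVAL;
	/*
	 * A smaller device sector size (e.g. 512 bytes under a 4kB
	 * sector filesystem) is fine; the filesystem simply never
	 * issues IO smaller than its own sector size.
	 */
	return 0;
}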

What I'm asking is why LBS support even mentions filesystems? It's
not necessary for filesystems to support bs > ps for LBS to be
implemented, and it's not necessary for LBS to be supported for
filesystems to implement bs > ps. Conflating them seems a recipe
for confusion....

> > > > > There might be a better way to do this than do deal with the switching
> > > > > of the aops dynamically, ideas welcomed!
> > > > 
> > > > Is it even safe to switch aops dynamically? We know there are
> > > > inherent race conditions in doing this w.r.t. mmap and page faults,
> > > > as the write fault part of the processing is directly dependent
> > > > on the page being correctly initialised during the initial
> > > > population of the page data (the "read fault" side of the write
> > > > fault).
> > > > 
> > > > Hence it's not generally considered safe to change aops from one
> > > > mechanism to another dynamically. Block devices can be mmap()d, but
> > > > I don't see anything in this patch set that ensures there are no
> > > > other users of the block device when the swaps are done. What am I
> > > > missing?
> > > 
> > > We need to evict all pages from the page cache before switching aops to
> > > prevent misinterpretation of folio->private. 
> > 
> > Yes, but if the device is mapped, even after an invalidation, we can
> > still race with a new fault instantiating a page whilst the aops are
> > being swapped, right? That was the problem that sunk dynamic
> > swapping of the aops when turning DAX on and off on an inode, right?
> 
> I was not aware of that, thanks!
> 
> > > If switching aops is even
> > > the right thing to do.  I don't see the problem with allowing buffer heads
> > > on block devices, but I haven't been involved with the discussion here.
> > 
> > iomap supports bufferheads as a transitional thing (e.g. for gfs2).
> 
> But not for the block device cache.

That's why I'm suggesting that you implement support for bufferheads
through the existing iomap infrastructure instead of trying to
dynamically switch aops structures....

> > Hence I suspect that a better solution is to always use iomap and
> > the same aops, but just switch from iomap page state to buffer heads
> > in the bdev mapping 
> 
> Not sure what this means, any chance I can trouble you to clarify a bit more?

bdev_use_buggerheads(struct block_device *bdev, bool turn_on_bufferheads)
{
	/*
	 * Set the bufferhead bit atomically with the invalidation that
	 * empties the page cache to prevent repopulation races.
	 */
	filemap_invalidate_lock(bdev->bd_inode->i_mapping);
	invalidate_bdev(bdev);
	if (turn_on_bufferheads)
		set_bit(BDEV_USE_BUFFERHEADS, &bdev->state);
	else
		clear_bit(BDEV_USE_BUFFERHEADS, &bdev->state);
	filemap_invalidate_unlock(bdev->bd_inode->i_mapping);
}

bdev_iomap_begin()
{
	.....
	if (test_bit(BDEV_USE_BUFFERHEADS, &bdev->state))
		iomap->flags |= IOMAP_F_BUFFER_HEAD;
}

/*
 * If an indexing switch happened while processing the iomap, make
 * sure to get the iomap marked stale to force a new mapping to be
 * looked up.
 */
bdev_iomap_valid()
{
	bool need_bhs = iomap->flags & IOMAP_F_BUFFER_HEAD;
	bool use_bhs = test_bit(BDEV_USE_BUFFERHEADS, &bdev->state);

	return need_bhs == use_bhs;
}

-Dave.
-- 
Dave Chinner
david@xxxxxxxxxxxxx


