Re: [RFC v2 00/10] bdev: LBS devices support to coexist with buffer-heads

Luis Chamberlain <mcgrof@xxxxxxxxxx> · Sun, 17 Sep 2023 18:13:40 -0700

On Mon, Sep 18, 2023 at 10:59:07AM +1000, Dave Chinner wrote:
> On Mon, Sep 18, 2023 at 12:14:48AM +0100, Matthew Wilcox wrote:
> > On Mon, Sep 18, 2023 at 08:38:37AM +1000, Dave Chinner wrote:
> > > On Fri, Sep 15, 2023 at 02:32:44PM -0700, Luis Chamberlain wrote:
> > > > LBS devices. This in turn allows filesystems which support bs > 4k to be
> > > > enabled on a 4k PAGE_SIZE world on LBS block devices. This alows LBS
> > > > device then to take advantage of the recenlty posted work today to enable
> > > > LBS support for filesystems [0].
> > > 
> > > Why do we need LBS devices to support bs > ps in XFS?
> > 
> > It's the other way round -- we need the support in the page cache to
> > reject sub-block-size folios (which is in the other patches) before we
> > can sensibly talk about enabling any filesystems on top of LBS devices.
> >
> > Even XFS, or for that matter ext2 which support 16k block sizes on
> > CONFIG_PAGE_SIZE_16K (or 64K) kernels need that support first.
> 
> Well, yes, I know that. But the statement above implies that we
> can't use bs > ps filesytems without LBS support on 4kB PAGE_SIZE
> systems. If it's meant to mean the exact opposite, then it is
> extremely poorly worded....

Let's ignore the above statement of a second just to clarify the goal here.
The patches posted by Pankaj enable bs > ps even on 4k sector drives.

This patch series by definition is suggesting that an LBS device is one
where the minimum sector size is > ps, in practice I'm talking about
devices where the logical block size is > 4k, or where the physical
block size can be > 4k. There are two situations in the NVMe world where
this can happen. One is where the LBA format used is > 4k, the other is
where the npwg & awupf | nawupf is > 4k. The first makes the logical
block size > 4k, the later allows the physical block size to be > 4k.
The first *needs* an aops which groks > ps on the block device cache.
The second let's you remain backward compatible with 4k sector size, but
if you want to use a larger sector size you can too, but that also
requires a block device cache which groks > ps. When using > ps for
logical block size of physical block size is what I am suggesting we
should call LBS devices.

I'm happy for us to use other names but it just seemed like a natural
preogression to me.

> > > > There might be a better way to do this than do deal with the switching
> > > > of the aops dynamically, ideas welcomed!
> > > 
> > > Is it even safe to switch aops dynamically? We know there are
> > > inherent race conditions in doing this w.r.t. mmap and page faults,
> > > as the write fault part of the processing is directly dependent
> > > on the page being correctly initialised during the initial
> > > population of the page data (the "read fault" side of the write
> > > fault).
> > > 
> > > Hence it's not generally considered safe to change aops from one
> > > mechanism to another dynamically. Block devices can be mmap()d, but
> > > I don't see anything in this patch set that ensures there are no
> > > other users of the block device when the swaps are done. What am I
> > > missing?
> > 
> > We need to evict all pages from the page cache before switching aops to
> > prevent misinterpretation of folio->private. 
> 
> Yes, but if the device is mapped, even after an invalidation, we can
> still race with a new fault instantiating a page whilst the aops are
> being swapped, right? That was the problem that sunk dynamic
> swapping of the aops when turning DAX on and off on an inode, right?

I was not aware of that, thanks!

> > If switching aops is even
> > the right thing to do.  I don't see the problem with allowing buffer heads
> > on block devices, but I haven't been involved with the discussion here.
> 
> iomap supports bufferheads as a transitional thing (e.g. for gfs2).

But not for the block device cache.

> Hence I suspect that a better solution is to always use iomap and
> the same aops, but just switch from iomap page state to buffer heads
> in the bdev mapping 

Not sure this means, any chance I can trouble you to clarify a bit more?

> interface via a synchronised invalidation +
> setting/clearing IOMAP_F_BUFFER_HEAD in all new mapping requests...

I'm happy we use whatever makes more sense and is safer, just keep in
mind that we can't default to the iomap aops otherwise buffer-head
based filesystems won't work. I tried that. The first aops which the
block device cache needs if we want co-existence is buffer-heads.

If buffer-heads get large folio support then this is a non-issue as
far as I can tell but only because XFS does not use buffer-heads for
meta data. Once and when we do have a filesystem which does use
buffer-heads for metadata to support LBS it could be a problem I think.

  Luis