Re: [LSF/MM/BPF TOPIC] Large block for I/O

Dave Chinner <david@xxxxxxxxxxxxx> · Mon, 26 Feb 2024 10:09:08 +1100

On Thu, Feb 22, 2024 at 10:45:25AM -0800, Luis Chamberlain wrote:
> On Mon, Jan 08, 2024 at 07:35:17PM +0000, Matthew Wilcox wrote:
> > On Mon, Jan 08, 2024 at 11:30:10AM -0800, Bart Van Assche wrote:
> > > On 12/21/23 21:37, Christoph Hellwig wrote:
> > > > On Fri, Dec 22, 2023 at 05:13:43AM +0000, Matthew Wilcox wrote:
> > > > > It clearly solves a problem (and the one I think it's solving is the
> > > > > size of the FTL map).  But I can't see why we should stop working on it,
> > > > > just because not all drive manufacturers want to support it.
> > > > 
> > > > I don't think it is drive vendors.  It is the SSD divisions which
> > > > all pretty much love it (for certain use cases) vs the UFS/eMMC
> > > > divisions which often tend to be fearful and less knowledgeable (to
> > > > say it nicely) no matter what vendor you're talking to.
> > > 
> > > Hi Christoph,
> > > 
> > > If there is a significant number of 4 KiB writes in a workload (e.g.
> > > filesystem metadata writes), and the logical block size is increased from
> > > 4 KiB to 16 KiB, this will increase write amplification no matter how the
> > > SSD storage controller has been designed, won't it? Is there perhaps
> > > something that I'm misunderstanding?
> > 
> > You're misunderstanding that it's the _drive_ which gets to decide the
> > logical block size. Filesystems literally can't do 4kB writes to these
> > drives; you can't do a write smaller than a block.  If your clients
> > don't think it's a good tradeoff for them, they won't tell Linux that
> > the minimum IO size is 16kB.
> 
> Yes, but it's perhaps good to review how flexible this might be or not.
> I can at least mention what I know of for NVMe. Getting a lay of the
> land of this for other storage media would be good.
> 
> Some of the large capacity NVMe drives have NPWG as 16k, that just means
> the Indirection Unit is 16k, the mapping table, so the drive is hinting
> *we prefer 16k* but you can still do 4k writes, it just means for all
> these drives that a 4k write will be a RMW.

That's just a 4kB logical sector, 16kB physical sector block device,
yes?

Maybe I'm missing something, but we already handle cases like that
just fine thanks to all the work that went into supporting 512e
devices...
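
To make the terminology concrete, here is a minimal sketch of how a
logical/physical sector size pair maps onto the "512e"/"4096e" labels
used in this thread. It is an illustration only, not kernel or xfsprogs
code; the function name and the label strings are made up for this
example.

```python
def describe_sector_layout(logical: int, physical: int) -> str:
    """Label a block device by its logical/physical sector size pair.

    Hypothetical helper: sizes are in bytes, as a block device would
    report them.
    """
    if logical <= 0 or physical <= 0:
        raise ValueError("sector sizes must be positive")
    if physical % logical != 0:
        # Also rejects physical < logical, which is not a valid layout.
        raise ValueError("physical size must be a multiple of logical size")
    if logical == physical:
        # No emulation: the addressable unit is the physical unit.
        return f"{logical}-byte native"
    # Small logical sectors emulated on top of larger physical sectors.
    return f"{logical}e on {physical}-byte physical sectors"
```

Under this model a 4kB logical / 16kB physical drive is labelled the
same way as a classic 512e drive, just with larger numbers.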

> Users who *want* to help avoid RMWs on these drives and want to increase the
> writes to be at least 16k can enable a 16k or larger block size so as to
> align the writes. The experimentation we have done using Daniel Gomez's
> eBPF blkalgn tool [0] reveals (as discussed at last year's Plumbers) that
> there were still some 4k writes; this was in turn determined to be due
> to XFS's buffer cache usage for metadata.

As I've explained several times, XFS AG headers are sector sized
metadata. If you are exposing a 4kB logical sector size on a 16kB
physical sector device, this is what you'll get. It's been that way
with 512e devices for a long time, yes?

Also, direct IO will allow sector sized user data IOs, too, so it's
not just XFS metadata that will be issuing 4kB IO in this case...
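
A simplified model of why that is, assuming only that IO must be a
multiple of the logical sector size and that anything not covering
whole physical sectors costs the drive a read-modify-write. Real direct
IO has further constraints (memory buffer alignment, filesystem
specifics) that this sketch ignores; the function name and return
strings are hypothetical.

```python
def classify_write(offset: int, length: int, logical: int, physical: int) -> str:
    """Classify a write against a device's sector sizes (all in bytes).

    Returns "rejected" if the IO is not logical-sector aligned,
    "rmw" if it is valid but only partially covers a physical sector,
    and "aligned" if it covers whole physical sectors.
    """
    if offset < 0 or length <= 0:
        raise ValueError("offset must be >= 0 and length must be > 0")
    if offset % logical or length % logical:
        # Smaller than, or misaligned to, the addressable unit.
        return "rejected"
    if offset % physical or length % physical:
        # Legal IO, but the device has to read-modify-write.
        return "rmw"
    return "aligned"
```

With logical=4096 and physical=16384, a 4kB write at offset 0 is legal
and classified as "rmw", no matter whether a filesystem or a userspace
application issued it.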

> Dave recently posted patches to allow
> the use of large folios in the xfs buffer cache [1],

This has nothing to do with supporting large block sizes - it's
purely an internal optimisation to reduce the amount of vmap
(vmalloc) work we have to do for buffers that are larger than
PAGE_SIZE on 4kB block size filesystems.

> For large capacity NVMe drives with large atomics (NAUWPF), the
> nvme block driver will allow for the physical block size to be 16k too,
> thus allowing the sector size to be set to 16k when creating the
> filesystem, that would *optionally* allow for users to force the
> filesystem to not allow *any* writes to the device to be 4k.

Just present it as a 16kB logical/physical sector block device. Then
userspace and the filesystem will magically just do the right thing.

We've already solved these problems, yes?

> Note
> then that there are two ways to be able to use a sector size of 16k
> for NVMe today then, one is if your drive supported 16k LBA format and
> another is with these two parameters set to 16k. The latter allows you
> to stick with 512 byte or 4k LBA format and still use a 16k sector size.
> That allows you to remain backward compatible.

Yes, that's an emulated small logical sector size block device.
We've been supporting this for years - how are these NVMe drives in
any way different? Configure the drive this way, it presents as a
512e or 4096e device, not a 16kB sector size device, yes?

> Jan Kara's patches "block: Add config option to not allow writing to
> mounted devices" [2] should allow us to remove the set_blocksize() call
> in xfs_setsize_buftarg() since XFS does not use the block device cache
> at all, and his patches ensure once a filesystem is mounted userspace
> won't muck with the block device directly.

That patch is completely irrelevant to how the block device presents
sector sizes to userspace and the filesystem. It's also completely
irrelevant to large block size support in filesystems. Why do you
think it is relevant at all?

<snip irrelevant WAF stuff that we already know all about>

I'm not sure exactly what is being argued about here, but if the
large sector size support requires filesystem utilities to treat
4096e NVMe devices differently to existing 512e devices then the
large sector size support stuff has gone completely off the rails.

We already have all the mechanisms needed for optimising layouts for
large physical sector sizes w/ small emulated sector sizes and we
have widespread userspace support for that. If this new large block
device sector stuff doesn't work the same way, then you need to go
back to the drawing board and make it work transparently with all
the existing userspace infrastructure....
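
For anyone wanting to check what a given drive actually presents, the
kernel exports the block device topology through the sysfs queue
attributes. A minimal sketch of reading them follows; the attribute
names are the standard ones, while the function name and the
sysfs_root parameter (there so the code can be pointed at a test
directory) are made up for this example.

```python
from pathlib import Path

# Standard block queue attributes exported under /sys/block/<dev>/queue/.
QUEUE_ATTRS = (
    "logical_block_size",
    "physical_block_size",
    "minimum_io_size",
    "optimal_io_size",
)


def read_topology(dev: str, sysfs_root: str = "/sys/block") -> dict:
    """Return the sector/IO size topology of a block device, in bytes.

    Hypothetical helper: 'dev' is a whole-disk name as it appears
    under sysfs_root.
    """
    queue = Path(sysfs_root) / dev / "queue"
    return {name: int((queue / name).read_text().strip()) for name in QUEUE_ATTRS}
```

Calling read_topology() with a device name such as "nvme0n1" (an
example name, substitute your own) on an emulated device should show
physical_block_size larger than logical_block_size, exactly as it does
on a 512e disk.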

-Dave.
-- 
Dave Chinner
david@xxxxxxxxxxxxx