Re: [LSF/MM/BPF TOPIC] Cloud storage optimizations

Javier González <javier.gonz@xxxxxxxxxxx> · Fri, 10 Mar 2023 08:59:28 +0100

On 09.03.2023 08:11, James Bottomley wrote:
On Thu, 2023-03-09 at 09:04 +0100, Javier González wrote:
On 08.03.2023 13:13, James Bottomley wrote:
> On Wed, 2023-03-08 at 17:53 +0000, Matthew Wilcox wrote:
> > On Mon, Mar 06, 2023 at 11:12:14AM -0500, Theodore Ts'o wrote:
> > > What HDD vendors want is to be able to have 32k or even 64k
> > > *physical* sector sizes.  This allows for much more efficient
> > > erasure codes, so it will increase their byte capacity now that
> > > it's no longer easier to get capacity boosts by squeezing the
> > > tracks closer and closer, and there have been various
> > > engineering tradeoffs with SMR, HAMR, and MAMR.  HDD vendors
> > > have been asking for this at LSF/MM, and in other venues for
> > > ***years***.
> >
> > I've been reminded by a friend who works on the drive side that a
> > motivation for the SSD vendors is (essentially) the size of
> > sector_t. Once the drive needs to support more than 2/4 billion
> > sectors, they need to move to a 64-bit sector number, so the amount
> > of memory consumed by the FTL doubles, the CPU data cache becomes
> > half as effective, etc. That significantly increases the BOM for
> > the drive, and so they have to charge more.  With a 512-byte LBA,
> > that's 2TB; with a 4096-byte LBA, it's at 16TB and with a 64k
> > LBA, they can keep using 32-bit LBA numbers all the way up to
> > 256TB.
>
> I thought the FTL operated on physical sectors and the logical to
> physical was done as a RMW through the FTL?  In which case sector_t
> shouldn't matter to the SSD vendors for FTL management because they
> can keep the logical sector size while increasing the physical one.
> Obviously if physical size goes above the FS block size, the drives
> will behave suboptimally with RMWs, which is why 4k physical is the
> max currently.
>

FTL designs are complex. We have ways to keep sector addresses under
64 bits, but this is a common industry problem.
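
As a rough check of the 2TB / 16TB / 256TB figures quoted above, the
sketch below computes how much capacity a fixed-width LBA number can
address for a given LBA size. The helper names are illustrative
assumptions, and the results are in binary units (TiB), which is what
the quoted "TB" values correspond to.

```python
def max_addressable_bytes(lba_size_bytes: int, lba_bits: int = 32) -> int:
    """Capacity reachable with lba_bits-wide LBA numbers (hypothetical helper)."""
    return (1 << lba_bits) * lba_size_bytes


def to_tib(num_bytes: int) -> float:
    """Convert a byte count to TiB (2**40 bytes)."""
    return num_bytes / (1 << 40)


# LBA sizes discussed in the thread: 512 bytes, 4096 bytes and 64k.
for lba_size in (512, 4096, 64 * 1024):
    tib = to_tib(max_addressable_bytes(lba_size))
    print(f"{lba_size:>6}-byte LBA, 32-bit LBA number: {tib:g} TiB")
```

With 32-bit LBA numbers this gives 2, 16 and 256 TiB respectively.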

The media itself does not normally operate at 4K. Page sizes can be
16K, 32K, etc.
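
To illustrate why writes smaller than the media unit are costly, here
is a deliberately simplified model of read-modify-write: any unit that
is only partially covered by a write has to be read before it is
rewritten in full. The function name and the model itself are
assumptions for illustration only; real FTLs buffer and coalesce
writes, so actual behaviour will differ.

```python
def rmw_cost(offset: int, length: int, unit: int) -> dict:
    """Estimate media traffic for one host write (simplified, hypothetical model).

    offset and length describe the host write in bytes; unit is the
    media write granularity in bytes (for example 16K).
    """
    if length <= 0 or unit <= 0 or offset < 0:
        raise ValueError("offset must be >= 0, length and unit must be > 0")

    first = offset // unit
    last = (offset + length - 1) // unit
    units = last - first + 1

    head_partial = offset % unit != 0
    tail_partial = (offset + length) % unit != 0

    if units == 1:
        # One unit: a single read suffices if it is partially covered.
        units_read = 1 if (head_partial or tail_partial) else 0
    else:
        # Only the first and last units can be partially covered.
        units_read = int(head_partial) + int(tail_partial)

    written = units * unit
    return {
        "bytes_read": units_read * unit,
        "bytes_written": written,
        "write_amplification": written / length,
    }


print(rmw_cost(offset=0, length=4096, unit=16384))   # 4K write, 16K unit
print(rmw_cost(offset=0, length=16384, unit=16384))  # aligned 16K write
```

In this model a 4K write to a 16K unit reads 16K and writes 16K, while
an aligned 16K write needs no read at all.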

Right, and we've always said if we knew what this size was we could
make better block write decisions.  However, today if you look at what
most NVMe devices are reporting, it's a bit sub-optimal:

jejb@lingrow:/sys/block/nvme1n1/queue> cat logical_block_size
512
jejb@lingrow:/sys/block/nvme1n1/queue> cat physical_block_size
512
jejb@lingrow:/sys/block/nvme1n1/queue> cat optimal_io_size
0

If we do get Linux to support large block sizes, are we actually going
to get better information out of the devices?

We already have this through the NVMe Optimal Performance parameters
(see Dan's response for this). Note that these values are already
implemented in the kernel. If I recall properly, Bart was the one doing
this work.
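
For anyone who wants to check what their own devices report, the
limits the block layer exports can be collected with a few lines of
Python. This is a minimal sketch: the sysfs attribute names are the
ones shown in James's output plus minimum_io_size, while the helper
names and the sysfs_root parameter (there only so the code can be
pointed at a test directory) are assumptions for illustration. How a
given device's Optimal Performance fields end up in these attributes
depends on the kernel version and is not shown here.

```python
from pathlib import Path

# Queue attributes exported under /sys/block/<dev>/queue/.
QUEUE_ATTRS = (
    "logical_block_size",
    "physical_block_size",
    "minimum_io_size",
    "optimal_io_size",
)


def read_queue_limits(dev: str, sysfs_root: str = "/sys/block") -> dict:
    """Return the queue limits for dev as integers (None if unreadable)."""
    queue = Path(sysfs_root) / dev / "queue"
    limits = {}
    for attr in QUEUE_ATTRS:
        try:
            limits[attr] = int((queue / attr).read_text().strip())
        except (OSError, ValueError):
            limits[attr] = None
    return limits


def advertises_larger_granularity(limits: dict) -> bool:
    """True if the device reports anything beyond its logical block size."""
    logical = limits.get("logical_block_size") or 0
    physical = limits.get("physical_block_size") or 0
    optimal = limits.get("optimal_io_size") or 0
    return physical > logical or optimal > 0


if __name__ == "__main__":
    root = Path("/sys/block")
    devices = sorted(p.name for p in root.iterdir()) if root.is_dir() else []
    for dev in devices:
        limits = read_queue_limits(dev)
        print(dev, limits, advertises_larger_granularity(limits))
```

A device reporting 512 / 512 / 0, as in the output above, would be
flagged as not advertising any larger granularity.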

Moreover, from the vendor side, it is a challenge to expose larger LBAs
without wide support in OSs. I am confident that if we are pushing for
this work and we see it fits existing FSs, we will see vendors exposing
new LBA formats in the beginning (same as we have 512b and 4K in the
same drive), and eventually focusing only on larger LBA sizes.

Increasing the block size would allow for better host/device
cooperation. As Ted mentions, this has been a requirement for HDD and
SSD vendors for years. It seems to us that the time is right now and
that we have mechanisms in Linux to do the plumbing. Folios are
obviously a big part of this.

Well a decade ago we did a lot of work to support 4k sector devices.
Ultimately the industry went with 512 logical/4k physical devices
because of problems with non-Linux proprietary OSs but you could still
use 4k today if you wanted (I've actually still got a working 4k SCSI
drive), so why is no NVMe device doing that?

Most NVMe devices report 4K today. Actually 512b is mostly an
optimization targeted at read-heavy workloads.

This is not to say I think larger block sizes are in any way a bad idea
... I just think that given the history, it will be driven by
application needs rather than what the manufacturers tell us.

I see more and more that this deserves a session at LSF/MM.