Re: [PATCH v7 0/3] FDP and per-io hints

Javier Gonzalez <javier.gonz@xxxxxxxxxxx> · Thu, 10 Oct 2024 14:27:33 +0200

On 10.10.2024 12:46, Hans Holmberg wrote:
On Thu, Oct 10, 2024 at 9:13 AM Javier Gonzalez <javier.gonz@xxxxxxxxxxx> wrote:

On 10.10.2024 08:40, Hans Holmberg wrote:
>On Wed, Oct 9, 2024 at 4:36 PM Javier Gonzalez <javier.gonz@xxxxxxxxxxx> wrote:
>>
>>
>>
>> > -----Original Message-----
>> > From: Hans Holmberg <hans@xxxxxxxxxxxxx>
>> > Sent: Tuesday, October 8, 2024 12:07 PM
>> > To: Javier Gonzalez <javier.gonz@xxxxxxxxxxx>
>> > Cc: Christoph Hellwig <hch@xxxxxx>; Jens Axboe <axboe@xxxxxxxxx>; Martin K.
>> > Petersen <martin.petersen@xxxxxxxxxx>; Keith Busch <kbusch@xxxxxxxxxx>;
>> > Kanchan Joshi <joshi.k@xxxxxxxxxxx>; hare@xxxxxxx; sagi@xxxxxxxxxxx;
>> > brauner@xxxxxxxxxx; viro@xxxxxxxxxxxxxxxxxx; jack@xxxxxxx; jaegeuk@xxxxxxxxxx;
>> > bcrl@xxxxxxxxx; dhowells@xxxxxxxxxx; bvanassche@xxxxxxx;
>> > asml.silence@xxxxxxxxx; linux-nvme@xxxxxxxxxxxxxxxxxxx; linux-
>> > fsdevel@xxxxxxxxxxxxxxx; io-uring@xxxxxxxxxxxxxxx; linux-block@xxxxxxxxxxxxxxx;
>> > linux-aio@xxxxxxxxx; gost.dev@xxxxxxxxxxx; vishak.g@xxxxxxxxxxx
>> > Subject: Re: [PATCH v7 0/3] FDP and per-io hints
>> >
>> > On Mon, Oct 7, 2024 at 12:10 PM Javier González <javier.gonz@xxxxxxxxxxx>
>> > wrote:
>> > >
>> > > On 04.10.2024 14:30, Christoph Hellwig wrote:
>> > > >On Fri, Oct 04, 2024 at 08:52:33AM +0200, Javier González wrote:
>> > > >> So, considering that file systems _are_ able to use temperature hints and
>> > > >> actually make them work, why don't we support FDP the same way we are
>> > > >> supporting zones so that people can use it in production?
>> > > >
>> > > >Because apparently no one has tried it.  It should be possible in theory,
>> > > >but for example unless you have a power of 2 reclaim unit size it
>> > > >won't work very well with XFS where the AGs/RTGs must be power of two
>> > > >aligned in the LBA space, except by overprovisioning the LBA space vs
>> > > >the capacity actually used.
>> > >
>> > > This is good. I think we should have at least a FS POC with data
>> > > placement support to be able to drive conclusions on how the interface
>> > > and requirements should be. Until we have that, we can support the
>> > > use-cases that we know customers are asking for, i.e., block-level hints
>> > > through the existing temperature API.
>> > >
>> > > >
>> > > >> I agree that down the road, an interface that allows hints (many more
>> > > >> than 5!) is needed. And in my opinion, this interface should not have
>> > > >> semantics attached to it, just a hint ID, #hints, and enough space to
>> > > >> put 100s of them to support storage node deployments. But this needs to
>> > > >> come from the users of the hints / zones / streams / etc,  not from
>> > > >> us vendors. We have neither details on how they deploy these
>> > > >> features at scale, nor the workloads to validate the results. Anything
>> > > >> else will probably just continue polluting the storage stack with more
>> > > >> interfaces that are not used and add to the problem of data placement
>> > > >> fragmentation.
>> > > >
>> > > >Please always mention what layer you are talking about.  At the syscall
>> > > >level the temperature hints are doing quite ok.  A full stream separation
>> > > >would obviously be a lot better, as would be communicating the zone /
>> > > >reclaim unit / etc size.
>> > >
>> > > I mean at the syscall level. But as mentioned above, we need to be very
>> > > sure that we have a clear use-case for that. If we continue seeing hints
>> > > being used in NVMe or other protocols, and the number increases
>> > > significantly, we can deal with it later on.
>> > >
>> > > >
>> > > >As an interface to a driver that doesn't natively speak temperature
>> > > >hint on the other hand it doesn't work at all.
>> > > >
>> > > >> The issue is that the first series of this patch, which is as simple as
>> > > >> it gets, hit the list in May. Since then we have been down paths that lead
>> > > >> nowhere. So the line between real technical feedback that leads to
>> > > >> a feature being merged, and technical misleading to make people be a
>> > > >> busy bee becomes very thin. In the whole data placement effort, we have
>> > > >> been down this path many times, unfortunately...
>> > > >
>> > > >Well, the previous round was the first one actually trying to address the
>> > > >fundamental issue after 4 months.  And then after a first round of feedback
>> > > >it gets shut down somehow out of nowhere.  As a maintainer and reviewer that
>> > > >is the kind of contributor I have a hard time taking seriously.
>> > >
>> > > I am not sure I understand what you mean in the last sentence, so I will
>> > > not respond filling blanks with a bad interpretation.
>> > >
>> > > In summary, what we are asking for is to take the patches that cover the
>> > > current use-case, and work together on what might be needed for better
>> > > FS support. For this, it seems you and Hans have a good idea of what you
>> > > want to have based on XFS. We can help review or do part of the work,
>> > > but trying to guess our way will only delay existing customers using
>> > > existing HW.
>> >
>> > After reading the whole thread, I end up wondering why we need to rush the
>> > support for a single use case through instead of putting the architecture
>> > in place for properly supporting this new type of hardware from the start
>> > throughout the stack.
>>
>> This is not a rush. We have been supporting this use case through passthru for
>> over half a year with code already upstream in Cachelib. This is mature enough
>> to move into the block layer, which is what the end user wants to do either way.
>>
>> This is a very good point, though. This is why we upstreamed passthru at the
>> time; so people can experiment, validate, and upstream only when there is a
>> clear path.
>>
>> >
>> > Even for user space consumers of raw block devices, is the last version
>> > of the patch set good enough?
>> >
>> > * It severely cripples the data separation capabilities as only a handful of
>> >   data placement buckets are supported
>>
>> I could understand from your presentation at LPC, and from later looking at the
>> code that is available, that you have been successful at getting good results
>> with the existing interface in XFS. The mapping from the temperature semantics
>> to zones (no semantics) is exactly the same as what we are doing with FDP. Not
>> having to change either in-kernel or user-space structures is great.
>
>No, we don't map data directly to zones using lifetime hints. In fact,
>lifetime hints contribute only a relatively small part (~10% extra write
>amp reduction, see the rocksdb benchmark results). Segregating data by
>file is the most important part of the data placement heuristic, at
>least for this type of workload.

Is this because RocksDB already does segregation per file itself? Are
you doing something specific on XFS or using your knowledge of RocksDB to
map files with an "unwritten" protocol in the middle?

Data placement by-file is based on the fact that the lifetimes of a file's
data blocks are strongly correlated. When a file is deleted, all its blocks
will be reclaimable at that point. This requires knowledge about the
data placement buckets and works really well without any hints
provided.

But we need hints to put files together. I believe you do this already,
as no placement protocol gives you unlimited separation.
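
To make the by-file argument above concrete, here is a toy model, not
taken from the XFS work or from this patch series. It compares appending
blocks from interleaved writers into one shared open unit against giving
each file its own unit, then counts how many still-valid blocks would have
to be copied to reclaim the units touched by deleting one file. The
function names, the unit size and the round-robin write pattern are all
assumptions made for illustration.

```python
def place(writes, unit_size, by_file):
    """Lay out a stream of block writes into fixed-size reclaim units.

    writes is a sequence of file ids, one entry per block written.
    Returns a list of units; each unit is a list of file ids.
    """
    units = []
    open_unit = {}  # placement key -> index of the unit being filled
    for file_id in writes:
        # One open unit per file, or a single shared one for everybody.
        key = file_id if by_file else "shared"
        idx = open_unit.get(key)
        if idx is None or len(units[idx]) == unit_size:
            units.append([])
            idx = len(units) - 1
            open_unit[key] = idx
        units[idx].append(file_id)
    return units


def blocks_to_relocate(units, deleted):
    """Valid blocks that must be copied to reclaim every unit that
    holds at least one block of a deleted file."""
    moved = 0
    for unit in units:
        if any(f in deleted for f in unit):
            moved += sum(1 for f in unit if f not in deleted)
    return moved


if __name__ == "__main__":
    # Four files of eight blocks each, written round-robin.
    writes = [0, 1, 2, 3] * 8
    print(blocks_to_relocate(place(writes, 8, by_file=False), {0}))  # 24
    print(blocks_to_relocate(place(writes, 8, by_file=True), {0}))   # 0
```

The by-file case only comes out at zero because the model allows one open
unit per file. With fewer open units than concurrently written files,
something has to decide which files share a unit, which is where hints
come in.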

The life-time hint heuristic I added on top is based on rocksdb
statistics, but designed to be generic enough to work for a wider
range of workloads (still need to validate this though - more work to
be done).

Maybe you can post some patches on the parts dedicated to the VFS level
and user-space API (syscall or uring)?

Following on the comment to Christoph, it would be good to have
something tangible to work together on for the next stage of this
support.

    In this context, we have collected data both using FDP natively in
    RocksDB and using the temperatures. Both look very good, because both
    are initiated by RocksDB, and the FS just passes the hints directly
    to the driver.

I ask this to understand if this is the FS responsibility or the
application's one. Our work points more to letting applications use the
hints (as the use-cases are power users, like RocksDB). I agree with you
that a FS could potentially make an improvement for legacy applications
- we have not focused much on these though, so I trust your insights on
it.

The big problem as I see it is that if applications are going to work
well together on the same media we need a common placement
implementation somewhere, and it seems pretty natural to make it part
of filesystems to me.

For FS users, this makes a lot of sense. But we still need to cover
applications using raw block.
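
For reference, this is roughly what the existing temperature API looks
like from user space on a raw block device. It is a minimal sketch using
fcntl(F_SET_RW_HINT), which sets the hint on the open file's inode rather
than per I/O. The numeric constants are copied from
include/uapi/linux/fcntl.h because Python's fcntl module may not export
them, the device path is hypothetical, and whether the hint reaches the
device at all depends on the kernel and driver in use.

```python
import fcntl
import os
import struct

# Values from include/uapi/linux/fcntl.h (F_LINUX_SPECIFIC_BASE is 1024).
F_GET_RW_HINT = 1024 + 11
F_SET_RW_HINT = 1024 + 12

RWH_WRITE_LIFE_NOT_SET = 0
RWH_WRITE_LIFE_NONE = 1
RWH_WRITE_LIFE_SHORT = 2
RWH_WRITE_LIFE_MEDIUM = 3
RWH_WRITE_LIFE_LONG = 4
RWH_WRITE_LIFE_EXTREME = 5


def encode_hint(hint):
    """The kernel reads the hint through a pointer to a 64-bit integer."""
    return struct.pack("=Q", hint)


def set_write_hint(fd, hint):
    """Set the write lifetime hint for the file behind fd."""
    fcntl.fcntl(fd, F_SET_RW_HINT, encode_hint(hint))


def get_write_hint(fd):
    """Read back the current write lifetime hint."""
    raw = fcntl.fcntl(fd, F_GET_RW_HINT, bytes(8))
    return struct.unpack("=Q", raw)[0]


if __name__ == "__main__":
    # Hypothetical device path; opening it for writing needs privileges.
    fd = os.open("/dev/nvme0n1", os.O_WRONLY)
    try:
        set_write_hint(fd, RWH_WRITE_LIFE_SHORT)
        print(get_write_hint(fd))
    finally:
        os.close(fd)
```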