On Thu, Oct 10, 2024 at 9:13 AM Javier Gonzalez <javier.gonz@xxxxxxxxxxx> wrote:
>
> On 10.10.2024 08:40, Hans Holmberg wrote:
> >On Wed, Oct 9, 2024 at 4:36 PM Javier Gonzalez <javier.gonz@xxxxxxxxxxx> wrote:
> >>
> >>
> >>
> >> > -----Original Message-----
> >> > From: Hans Holmberg <hans@xxxxxxxxxxxxx>
> >> > Sent: Tuesday, October 8, 2024 12:07 PM
> >> > To: Javier Gonzalez <javier.gonz@xxxxxxxxxxx>
> >> > Cc: Christoph Hellwig <hch@xxxxxx>; Jens Axboe <axboe@xxxxxxxxx>; Martin K.
> >> > Petersen <martin.petersen@xxxxxxxxxx>; Keith Busch <kbusch@xxxxxxxxxx>;
> >> > Kanchan Joshi <joshi.k@xxxxxxxxxxx>; hare@xxxxxxx; sagi@xxxxxxxxxxx;
> >> > brauner@xxxxxxxxxx; viro@xxxxxxxxxxxxxxxxxx; jack@xxxxxxx; jaegeuk@xxxxxxxxxx;
> >> > bcrl@xxxxxxxxx; dhowells@xxxxxxxxxx; bvanassche@xxxxxxx;
> >> > asml.silence@xxxxxxxxx; linux-nvme@xxxxxxxxxxxxxxxxxxx;
> >> > linux-fsdevel@xxxxxxxxxxxxxxx; io-uring@xxxxxxxxxxxxxxx; linux-block@xxxxxxxxxxxxxxx;
> >> > linux-aio@xxxxxxxxx; gost.dev@xxxxxxxxxxx; vishak.g@xxxxxxxxxxx
> >> > Subject: Re: [PATCH v7 0/3] FDP and per-io hints
> >> >
> >> > On Mon, Oct 7, 2024 at 12:10 PM Javier González <javier.gonz@xxxxxxxxxxx>
> >> > wrote:
> >> > >
> >> > > On 04.10.2024 14:30, Christoph Hellwig wrote:
> >> > > >On Fri, Oct 04, 2024 at 08:52:33AM +0200, Javier González wrote:
> >> > > >> So, considering that file systems _are_ able to use temperature hints and
> >> > > >> actually make them work, why don't we support FDP the same way we are
> >> > > >> supporting zones so that people can use it in production?
> >> > > >
> >> > > >Because apparently no one has tried it. It should be possible in theory,
> >> > > >but for example unless you have a power of 2 reclaim unit size it
> >> > > >won't work very well with XFS where the AGs/RTGs must be power of two
> >> > > >aligned in the LBA space, except by overprovisioning the LBA space vs
> >> > > >the capacity actually used.
> >> > >
> >> > > This is good.
> >> > > I think we should have at least a FS POC with data
> >> > > placement support to be able to draw conclusions on how the interface
> >> > > and requirements should be. Until we have that, we can support the
> >> > > use-cases that we know customers are asking for, i.e., block-level hints
> >> > > through the existing temperature API.
> >> > >
> >> > > >
> >> > > >> I agree that down the road, an interface that allows hints (many more
> >> > > >> than 5!) is needed. And in my opinion, this interface should not have
> >> > > >> semantics attached to it, just a hint ID, #hints, and enough space to
> >> > > >> put 100s of them to support storage node deployments. But this needs to
> >> > > >> come from the users of the hints / zones / streams / etc, not from
> >> > > >> us vendors. We have neither details on how they deploy these
> >> > > >> features at scale, nor the workloads to validate the results. Anything
> >> > > >> else will probably just continue polluting the storage stack with more
> >> > > >> interfaces that are not used and add to the problem of data placement
> >> > > >> fragmentation.
> >> > > >
> >> > > >Please always mention what layer you are talking about. At the syscall
> >> > > >level the temperature hints are doing quite ok. A full stream separation
> >> > > >would obviously be a lot better, as would be communicating the zone /
> >> > > >reclaim unit / etc size.
> >> > >
> >> > > I mean at the syscall level. But as mentioned above, we need to be very
> >> > > sure that we have a clear use-case for that. If we continue seeing hints
> >> > > being used in NVMe or other protocols, and the number increases
> >> > > significantly, we can deal with it later on.
> >> > >
> >> > > >
> >> > > >As an interface to a driver that doesn't natively speak temperature
> >> > > >hint on the other hand it doesn't work at all.
> >> > > >
> >> > > >> The issue is that the first series of this patch, which is as simple as
> >> > > >> it gets, hit the list in May.
> >> > > >> Since then we are down paths that lead
> >> > > >> nowhere. So the line between real technical feedback that leads to
> >> > > >> a feature being merged, and technical misleading to make people be a
> >> > > >> busy bee becomes very thin. In the whole data placement effort, we have
> >> > > >> been down this path many times, unfortunately...
> >> > > >
> >> > > >Well, the previous round was the first one actually trying to address the
> >> > > >fundamental issue after 4 months. And then after a first round of feedback
> >> > > >it gets shut down somehow out of nowhere. As a maintainer and reviewer that
> >> > > >is the kind of contributor I have a hard time taking seriously.
> >> > >
> >> > > I am not sure I understand what you mean in the last sentence, so I will
> >> > > not respond filling blanks with a bad interpretation.
> >> > >
> >> > > In summary, what we are asking for is to take the patches that cover the
> >> > > current use-case, and work together on what might be needed for better
> >> > > FS support. For this, it seems you and Hans have a good idea of what you
> >> > > want to have based on XFS. We can help review or do part of the work,
> >> > > but trying to guess our way will only delay existing customers using
> >> > > existing HW.
> >> >
> >> > After reading the whole thread, I end up wondering why we need to rush the
> >> > support for a single use case through instead of putting the architecture
> >> > in place for properly supporting this new type of hardware from the start
> >> > throughout the stack.
> >>
> >> This is not a rush. We have been supporting this use case through passthru for
> >> over 1/2 year with code already upstream in Cachelib. This is mature enough as
> >> to move into the block layer, which is what the end user wants to do either way.
> >>
> >> This is though a very good point. This is why we upstreamed passthru at the
> >> time; so people can experiment, validate, and upstream only when there is a
> >> clear path.
> >>
> >> >
> >> > Even for user space consumers of raw block devices, is the last version
> >> > of the patch set good enough?
> >> >
> >> > * It severely cripples the data separation capabilities as only a handful of
> >> > data placement buckets are supported
> >>
> >> I could understand from your presentation at LPC, and later looking at the code that
> >> is available that you have been successful at getting good results with the existing
> >> interface in XFS. The mapping from the temperature semantics to zones (no semantics)
> >> is the exact same as we are doing with FDP. Not having to change either in-kernel or user-space
> >> structures is great.
> >
> >No, we don't map data directly to zones using lifetime hints. In fact,
> >lifetime hints contribute only a relatively small part (~10% extra
> >write amp reduction, see the rocksdb benchmark results).
> >Segregating data by file is the most important part of the data
> >placement heuristic, at least for this type of workload.
>
> Is this because RocksDB already does segregation per file itself? Are
> you doing something specific on XFS or using your knowledge on RocksDB to
> map files with an "unwritten" protocol in the middle?

Data placement by file is based on the observation that the lifetimes
of a file's data blocks are strongly correlated. When a file is deleted,
all its blocks will be reclaimable at that point. This requires
knowledge about the data placement buckets and works really well
without any hints provided.

The lifetime hint heuristic I added on top is based on rocksdb
statistics, but designed to be generic enough to work for a wider range
of workloads (still need to validate this though - more work to be
done).

>
> In this context, we have collected data both using FDP natively in
> RocksDB and using the temperatures. Both look very good, because both
> are initiated by RocksDB, and the FS just passes the hints directly
> to the driver.
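
To make the by-file argument above concrete, here is a toy model, not the XFS allocator and not FDP itself. It assumes fixed-size reclaim units, four files written concurrently, and a made-up metric: the number of still-valid blocks that share a unit with deleted data and would have to be copied out before that unit can be reclaimed. The names `place` and `blocks_to_relocate` and the workload are hypothetical.

```python
from collections import defaultdict


def place(writes, unit_size, by_file):
    """Assign each written block to a reclaim unit.

    writes    -- sequence of file ids, one entry per block, in arrival order
    unit_size -- blocks per reclaim unit (hypothetical, fixed)
    by_file   -- True: one open unit per file; False: one shared open unit
    Returns {unit_id: [file_id, ...]}.
    """
    units = defaultdict(list)
    open_unit = {}   # stream key -> id of the unit currently being filled
    next_unit = 0
    for file_id in writes:
        key = file_id if by_file else "shared"
        unit = open_unit.get(key)
        if unit is None or len(units[unit]) == unit_size:
            # Open a fresh unit when the stream has none or its unit is full.
            unit = next_unit
            next_unit += 1
            open_unit[key] = unit
        units[unit].append(file_id)
    return dict(units)


def blocks_to_relocate(units, deleted):
    """Count valid blocks living in units that also hold deleted data."""
    moved = 0
    for blocks in units.values():
        dead = sum(1 for f in blocks if f in deleted)
        if dead:
            moved += len(blocks) - dead
    return moved


# Four files of eight blocks each, written round-robin (i.e. concurrently).
writes = [f for _ in range(8) for f in range(4)]
mixed = blocks_to_relocate(place(writes, 8, by_file=False), deleted={0, 1})
separated = blocks_to_relocate(place(writes, 8, by_file=True), deleted={0, 1})
print("mixed:", mixed, "by file:", separated)
```

In this contrived case the shared stream leaves half of every unit valid after two files are deleted, while per-file placement leaves the affected units entirely dead. Real workloads will not line up this neatly with the unit size.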
>
> I ask this to understand if this is the FS's responsibility or the
> application's one. Our work points more to letting applications use the
> hints (as the use-cases are power users, like RocksDB). I agree with you
> that a FS could potentially make an improvement for legacy applications
> - we have not focused much on these though, so I trust your insights on
> it.

The big problem as I see it is that if applications are going to work
well together on the same media we need a common placement
implementation somewhere, and it seems pretty natural to me to make it
part of filesystems.

>
> >>
> >> >
> >> > * It just won't work if there is more than one user application per storage
> >> > device as different applications data streams will be mixed at the nvme
> >> > driver level.
> >>
> >> For now this use-case is not clear. Folks working on it are using passthru. When we
> >> have a more clear understanding of what is needed, we might need changes in the kernel.
> >>
> >> If you see a need for this on the work that you are doing, by all means, please send patches.
> >> As I said at LPC, we can work together on this.
> >>
> >> >
> >> > While Christoph has already outlined what would be desirable from a
> >> > file system point of view, I don't have the answer to what would be the overall
> >> > best design for FDP. I would like to say that it looks to me like we need to
> >> > consider more than the early adoption use cases and make sure we
> >> > make the most of the hardware capabilities via logical abstractions that
> >> > would be compatible with a wider range of storage devices.
> >> >
> >> > Figuring the right way forward is tricky, but why not just let it take the time
> >> > that is needed to sort this out while early users explore how to use FDP
> >> > drives and share the results?
> >>
> >> I agree that we might need a new interface to support more hints, beyond the temperatures.
> >> Or maybe not.
> >> We would not know until someone comes with a use case. We have made the
> >> mistake in the past of treating internal research as upstreamable work. I now can see that
> >> this simply complicates the in-kernel and user-space APIs.
> >>
> >> The existing API is usable and requires no changes. There is hardware. There are customers.
> >> There are applications with upstream support which have been tested with passthru (the
> >> early results you mention). And the wiring to NVMe is _very_ simple. There is no reason
> >> not to take this in, and then we will see what new interfaces we might need in the future.
> >>
> >> I would much rather spend time in discussing ideas with you and others on a potential
> >> future API than arguing about the validity of an _existing_ one.
> >>
> >
> >Yes, but while FDP support could be improved later on (happy to help if
> >that'll be the case), I'm just afraid that less work now defining the
> >way data placement is exposed is going to result in a bigger mess later
> >when more use cases will be considered.
>
> Please, see the message I responded on the other thread. I hope it is a
> way to move forward and actually work together on this.
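
For readers less familiar with the interface being argued about: assuming the "existing temperature API" in this thread means the Linux write-lifetime hints set with fcntl(2), a minimal userspace sketch could look like the following. The numeric values are the ones in the UAPI header `<linux/fcntl.h>`. The helper names `encode_hint` and `set_write_hint` are made up for illustration, and whether a hint ever reaches the device depends on the kernel, filesystem and driver in use.

```python
import fcntl
import struct

# Command numbers from <linux/fcntl.h> (F_LINUX_SPECIFIC_BASE + 11 / + 12);
# used as fallbacks if this Python's fcntl module does not export them.
F_GET_RW_HINT = getattr(fcntl, "F_GET_RW_HINT", 1035)
F_SET_RW_HINT = getattr(fcntl, "F_SET_RW_HINT", 1036)

# The write-lifetime ("temperature") values the kernel defines today.
RWH_WRITE_LIFE = {
    "not_set": 0,
    "none": 1,
    "short": 2,
    "medium": 3,
    "long": 4,
    "extreme": 5,
}


def encode_hint(name):
    """Pack a hint as the 64-bit integer that fcntl expects a pointer to."""
    return struct.pack("=Q", RWH_WRITE_LIFE[name])


def set_write_hint(fd, name):
    """Try to set the lifetime hint on the inode behind fd.

    Returns True on success and False if the kernel or filesystem
    rejects the request (for example on a non-Linux system).
    """
    try:
        fcntl.fcntl(fd, F_SET_RW_HINT, encode_hint(name))
        return True
    except OSError:
        return False
```

A caller would open a file and run `set_write_hint(f.fileno(), "short")` before writing short-lived data. The small, fixed set of values is what the "many more than 5" remark earlier in the thread refers to.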