On Wed, Nov 13, 2024 at 05:47:36AM +0100, Christoph Hellwig wrote:
> On Tue, Nov 12, 2024 at 06:18:21PM +0000, Pierre Labat wrote:
> > About 2)
> > Provide a simple way for the user to decide which layer generates
> > write hints.
> > As an example, as some of you pointed out, what if the filesystem
> > wants to generate write hints to optimize its [own] data handling by
> > the storage, and at the same time the application using the FS
> > understands the storage and also wants to optimize using write hints.
> > Both use cases are legit, I think.
> > To handle that in a simple way, why not have a filesystem mount
> > parameter enabling/disabling the use of write hints by the FS?
>
> The file system is, and always has been, the entity in charge of
> resource allocation of the underlying device. Bypassing it will get
> you in trouble, and a simple mount option isn't really changing that
> (it's also not exactly a scalable interface).
>
> If an application wants to micro-manage placement decisions it should
> not use a file system, or at least not a normal one with Posix
> semantics. That being said, we'd demonstrated that applications using
> proper grouping of data by file and the simple temperature hints can
> get very good results from file systems that can interpret them,
> without a lot of work in the file system. I suspect for most
> applications that actually want files that is actually going to give
> better results than trying to do the micro-management that tries to
> bypass the file system.

This. The most important thing that filesystems do behind the scenes is
manage -data locality-.

XFS has thousands of lines of code to manage and control data locality -
the allocation policy API itself has *dozens* of control parameters. We
have 2 separate allocation architectures (one btree based, one bitmap
based) and multiple locality policy algorithms. These juggle physical
alignment, size granularity, size limits, the type of data being
allocated for, desired locality targets, different search algorithms
(e.g. first fit, best fit, exact fit by size or location, etc.), and
multiple fallback strategies when the initial target cannot be met.

Allocation policy management is the core of every block based filesystem
that has ever been written.

Specifically to this "stream hint" discussion: go look at the XFS
filestreams allocator. SGI wrote an entirely new allocator for XFS whose
only purpose in life is to automatically separate individual streams of
user data into physically separate regions of LBA space. This was
written to optimise realtime ingest and playback of multiple
uncompressed 4k and 8k video data streams from big isochronous SAN
storage arrays back in ~2005. Each stream could be up to 1.2GB/s of
data. If the data for each IO was not placed exactly in alignment with
the storage array stripe cache granularity (2MB, IIRC), then a cache
miss would occur, the IO latency would be too high, and frames of data
would be missed/dropped.

IOWs, we have an allocator in XFS that is specifically designed to
separate independent streams of data into independent regions of the
filesystem LBA space to efficiently support data IO rates on the order
of tens of GB/s.

What are we talking about now? Storage hardware that might be able to do
10-15GB/s of IO that needs stream separation for efficient management of
the internal storage resources.
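As an aside, for anyone who hasn't used the "simple temperature hints"
Christoph refers to above: they are the existing per-file write life
hints set through fcntl(). A minimal sketch of the application side -
the filename is invented for illustration, and it assumes Linux 4.13+
with a libc that exposes F_SET_RW_HINT and the RWH_* constants
(otherwise they come from <linux/fcntl.h>):

#define _GNU_SOURCE		/* F_SET_RW_HINT / RWH_* via <fcntl.h> */
#include <fcntl.h>
#include <stdint.h>
#include <stdio.h>
#include <unistd.h>

int main(void)
{
	/* Mark this file's data as short-lived (e.g. a log that is
	 * rewritten frequently). The hint is advisory only. */
	uint64_t hint = RWH_WRITE_LIFE_SHORT;
	int fd = open("wal.log", O_WRONLY | O_CREAT | O_APPEND, 0644);

	if (fd < 0 || fcntl(fd, F_SET_RW_HINT, &hint) < 0) {
		perror("set write hint");
		return 1;
	}
	/* normal writes follow; data grouping and placement remain
	 * entirely up to the filesystem's allocation policy */
	close(fd);
	return 0;
}

The application only says "this file's data is short-lived"; where that
data physically ends up remains the filesystem's decision.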
The fact we have previously solved this class of stream separation
problem at the filesystem level *without needing a user-controlled API
at all* is probably the most relevant fact missing from this discussion.

As to the concern about stream/temp/hint translation consistency across
different hardware: the filesystem is the perfect place to provide this
abstraction to users. The block device can expose what it supports, the
user API can be fixed, and the filesystem can provide the mapping
between the two that won't change for the life of the filesystem...

Long story short: Christoph is right. The OS hints/streams API needs to
be aligned to the capabilities that filesystems already provide *as a
primary design goal*. What the new hardware might support is a secondary
concern. i.e. hardware driven software design is almost always a
mistake: define the user API and abstractions first, then the OS can
reduce it sanely down to what the specific hardware present is capable
of supporting.

-Dave.
-- 
Dave Chinner
david@xxxxxxxxxxxxx
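PS: to make the "fixed user API, per-device mapping" point above
concrete, here is a purely illustrative sketch of the shape of that
translation. None of these structures, names or numbers exist in any
kernel interface; the only point is that the user-visible hint space
stays fixed while the filesystem collapses it onto however many
placement streams the underlying device actually advertises.

#include <stdint.h>

enum write_hint {			/* fixed, user-visible hint space */
	WH_NOT_SET, WH_NONE, WH_SHORT, WH_MEDIUM, WH_LONG, WH_EXTREME,
};

struct blockdev_caps {
	unsigned int nr_streams;	/* what this device can actually do */
};

/*
 * Pick a device placement stream for a given user hint. The user API
 * never changes; only this mapping does, and it is chosen once per
 * filesystem/device pairing so it stays stable for the life of the fs.
 */
static unsigned int hint_to_stream(const struct blockdev_caps *caps,
				   enum write_hint hint)
{
	if (caps->nr_streams <= 1 || hint <= WH_NONE)
		return 0;		/* no separation possible or needed */

	/* collapse the fixed hint range down to the streams on offer */
	return 1 + ((unsigned int)(hint - WH_SHORT) * (caps->nr_streams - 1))
		   / (WH_EXTREME - WH_SHORT + 1);
}

Swap the device for one with a different number of streams and only this
mapping changes; what applications see does not.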