On Fri, Nov 1, 2024 at 3:49 PM Keith Busch <kbusch@xxxxxxxxxx> wrote:
>
> On Fri, Nov 01, 2024 at 08:16:30AM +0100, Hans Holmberg wrote:
> > On Thu, Oct 31, 2024 at 3:06 PM Keith Busch <kbusch@xxxxxxxxxx> wrote:
> > > On Thu, Oct 31, 2024 at 09:19:51AM +0100, Hans Holmberg wrote:
> > > > No. The meta data IO is just 0.1% of all writes, so that we use a
> > > > separate device for that in the benchmark really does not matter.
> > >
> > > It's very little spatially, but they overwrite differently than other
> > > data, creating many small holes in large erase blocks.
> >
> > I don't really get how this could influence anything significantly.(If at all).
>
> Fill your filesystem to near capacity, then continue using it for a few
> months. While the filesystem will report some available space, there
> may not be many good blocks available to erase.

Maybe. For *this* benchmark workload, the metadata IO is such a tiny
fraction that I doubt its impact on write amplification could be
measured.

I completely agree that it's a good idea to separate metadata from data
blocks in general. That is actually a good reason for letting the file
system control write stream allocation for all blocks :)

> > I believe it would be worthwhile to prototype active fdp data
> > placement in xfs and evaluate it. Happy to help out with that.
>
> When are we allowed to conclude evaluation? We have benefits my
> customers want on well tested kernels, and wish to proceed now.

Christoph has now wired up prototype support for FDP on top of the
xfs-rt-zoned work + this patch set, and I have had time to look over it
and have started doing some testing on HW.

In addition to the FDP support, metadata can also be stored on the same
block device as the data.

Now that all placement handles are available, we can use the full data
separation capabilities of the underlying storage, so that's good.
We can map the placement handles to different write streams much like
we assign open zones for zoned storage, and this opens the door to data
placement heuristics for a wider range of use cases (not just the
RocksDB use case discussed here). I've put a rough sketch of what such
a mapping could look like at the end of this mail.

The big pieces missing from the FDP plumbing, as I see it, are the
ability to read the reclaim unit size and syncing up the remaining
capacity of the placement units with the file system allocation groups,
but I guess that can be added later.

I've started benchmarking on the hardware I have at hand, iterating on
a good workload configuration. It will take some time to get to robust
write amplification measurements since the drives are very big and
require a painfully long warmup time.
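
To make the stream-to-handle mapping above a bit more concrete, here is
a minimal, purely illustrative sketch (plain C, userspace-compilable).
The write stream classes, the handle count and the mapping policy are
all made up for illustration; this is not the prototype code, just one
way the assignment could work:

/*
 * Illustrative only: one way a file system could hand out FDP
 * placement handles to its write streams, analogous to how open zones
 * are assigned for zoned storage. All names and the fixed handle count
 * are hypothetical.
 */
#include <stdint.h>
#include <stdio.h>

#define NR_PLACEMENT_HANDLES	8	/* hypothetical: taken from the device's FDP config */

enum ws_kind {				/* hypothetical write stream classes */
	WS_METADATA,
	WS_DATA_HOT,
	WS_DATA_COLD,
	WS_GC,
	NR_WS_KINDS,
};

/*
 * Map a write stream (and the allocation group it serves) to a
 * placement handle. Metadata gets a dedicated handle so its small
 * overwrites do not punch holes into reclaim units holding bulk data;
 * the remaining handles are spread over the data streams, mirroring
 * open-zone assignment on zoned devices.
 */
static uint16_t ws_to_placement_handle(enum ws_kind kind, uint32_t ag_index)
{
	if (kind == WS_METADATA)
		return 0;
	return 1 + (kind * 31 + ag_index) % (NR_PLACEMENT_HANDLES - 1);
}

int main(void)
{
	for (int kind = 0; kind < NR_WS_KINDS; kind++)
		for (uint32_t ag = 0; ag < 4; ag++)
			printf("stream %d, AG %u -> placement handle %u\n",
			       kind, ag,
			       (unsigned)ws_to_placement_handle(kind, ag));
	return 0;
}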