On Fri, Jan 31, 2025 at 03:02:09PM +0530, Kundan Kumar wrote:
> > IOWs, having too much parallelism in writeback for the underlying
> > storage and/or filesystem can be far more harmful to system
> > performance under load than having too little parallelism to drive
> > the filesystem/hardware to its maximum performance.
>
> With the increasing speed of devices we would like to improve the
> performance of buffered IO as well. This will help applications
> (DB, AI/ML) that use buffered I/O. If more parallelism is causing
> side effects, we can reduce it using some factor like:
> 1) writeback context per NUMA node.

I doubt that will create enough concurrency for a typical small
server or desktop machine that only has a single NUMA node but has
a couple of fast NVMe SSDs in it.

> 2) Fixed number of writeback contexts, say min(10, numcpu).
> 3) NUMCPU/N number of writeback contexts.

These don't take into account the concurrency available from the
underlying filesystem or storage.

That's the point I was making - CPU count has -zero- relationship
to the concurrency the filesystem and/or storage provide the
system. It is fundamentally incorrect to base decisions about IO
concurrency on the number of CPU cores in the system. Same goes for
using NUMA nodes, a fixed minimum, etc.

> 4) Writeback context based on FS geometry, like per AG for XFS, as
> per your suggestion.

Leave the concurrency and queue affinity selection up to the
filesystem. If the filesystem doesn't specify anything, then we
remain with a single queue and writeback thread.

> > i.e. with enough RAM, this random write workload using buffered IO
> > is pretty much guaranteed to outperform direct IO regardless of the
> > underlying writeback concurrency.
>
> We tested making sure RAM was available for both buffered and direct
> IO. On a system with 32GB RAM we issued 24GB of IO through 24 jobs
> on a PMEM device.
>
> fio --directory=/mnt --name=test --bs=4k --iodepth=1024 --rw=randwrite \
>     --ioengine=io_uring --time_based=1 --runtime=120 --numjobs=24 --size=1G \
>     --direct=1 --eta-interval=1 --eta-newline=1 --group_reporting
>
> The results show direct IO exceeding buffered IO by a big margin.
>
> BW (MiB/s)         buffered  dontcache  %improvement  direct  %improvement
> randwrite (bs=4k)  3393      5397       59.06%        9315    174.53%

DIO is faster on PMEM because it only copies the data once (i.e.
user to PMEM) whilst buffered IO copies the data twice (user to
page cache to PMEM). The IO has exactly the same CPU overhead
regardless of in-memory page cache aggregation for buffered IO
because IO CPU usage is dominated by the data copy overhead.
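To make the copy-once vs copy-twice point concrete, here is a
minimal userspace sketch of the two write paths being compared; the
/mnt file names are placeholders and the target filesystem/device
must support O_DIRECT:

/*
 * Sketch of a buffered write vs an O_DIRECT write of the same 4k
 * buffer.  The buffered path copies user -> page cache -> device;
 * the direct path copies user -> device once, but needs an aligned
 * buffer.
 */
#define _GNU_SOURCE
#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>

int main(void)
{
	void *buf;

	if (posix_memalign(&buf, 4096, 4096))	/* O_DIRECT needs alignment */
		return 1;
	memset(buf, 0xab, 4096);

	/* Buffered: data is copied into the page cache; writeback
	 * copies it out to the device later (two copies in total). */
	int bfd = open("/mnt/buffered.dat", O_CREAT | O_WRONLY, 0644);
	if (bfd < 0 || write(bfd, buf, 4096) != 4096 || fsync(bfd))
		perror("buffered write");
	close(bfd);

	/* Direct: the page cache is bypassed; data goes straight from
	 * the user buffer to the device (one copy). */
	int dfd = open("/mnt/direct.dat", O_CREAT | O_WRONLY | O_DIRECT, 0644);
	if (dfd < 0 || write(dfd, buf, 4096) != 4096)
		perror("direct write");
	close(dfd);

	free(buf);
	return 0;
}

The direct write moves data straight from the user buffer to the
device, while the buffered write stages it in the page cache for
writeback to copy out again later; on PMEM that second copy is pure
CPU and memory bandwidth cost, which is why buffered IO trails in
the numbers above.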
Throughput benchmarks on PMEM are -always- CPU and memory bandwidth
bound doing synchronous data copying. PMEM benchmarks may respond
to CPU count based optimisations, but that should not be taken as
"infrastructure needs to be CPU-count optimised".

This is why it is fundamentally wrong to use PMEM for IO
infrastructure optimisation: the behaviour it displays is
completely different to pretty much all other types of IO device we
support. PMEM is a niche, and it performs very poorly compared to a
handful of NVMe SSDs being driven by the same CPUs...

> > IMO, what we need writeback to be optimised for is asynchronous IO
> > dispatch through each filesystem; our writeback IOPS problems in
> > XFS largely stem from the per-IO CPU overhead of block allocation
> > in the filesystems (i.e. delayed allocation).
>
> This is a good idea, but it means we will not be able to parallelize
> within an AG.

The XFS allocation concurrency model scales out across AGs, not
within AGs. If we need writeback concurrency within an AG (e.g. for
pure overwrite concurrency) then we want async dispatch worker
threads per writeback queue, not multiple writeback queues per
AG....

> I will spend some time to build a POC with a per-AG writeback
> context, and compare it with per-cpu writeback performance and
> extent fragmentation. Other filesystems using delayed allocation
> will also need a similar scheme.

The two are not mutually exclusive like you seem to think they are.

If you want to optimise your filesystem geometry for CPU count
based concurrency, then on XFS all you need to do is this:

# mkfs.xfs -d concurrency=nr_cpus /dev/pmem0

mkfs will then create a geometry that has sufficient concurrency
for the number of CPUs in the host system.

Hence we end up with -all- in-memory and on-disk filesystem
structures on that PMEM device being CPU count optimised, not just
writeback. Writeback is just a small part of a much larger system,
and we already optimise filesystem and IO concurrency via
filesystem geometry. Writeback needs to work within those same
controls we already provide users with.

-Dave.
-- 
Dave Chinner
david@xxxxxxxxxxxxx