On Wed, Oct 13, 2021 at 07:13:10PM +0200, Holger Hoffstätte wrote:
> Hi,
>
> Based on what's going on in blk-mq & NVMe land

What's going on in this area that is any different from the past few
years?

> I thought I'd check if XFS still sorts buffers before sending them
> down the pipe, and sure enough that still happens in
> xfs_buf.c:xfs_buf_delwri_submit_buffers() (the comparison function is
> directly above). Before I make a fool of myself and try to remove
> this, do we still think this is necessary?

Yes, I do.

A thought experiment for you, which you can then back up with actual
simulation with fio:

What is more efficient and faster at the hardware level: 16 individual
sequential 4kB IOs, or one 64kB IO? Which of these uses less CPU to
dispatch and complete? Which has less IO in flight and so allows more
concurrent IO to be dispatched to the hardware at the same time?

Answering these questions will give you your answer.

So, play around with AIO to simulate xfs buffer IO - the xfs buffer
cache is really just a high concurrency async IO engine. Use fio to
submit a series of individual sequential 4kB AIO writes with a queue
depth of, say, 128 to a file and time it. Then submit the same number
of sequential 4kB AIO writes as batches of 64 IOs at a time. Which one
is faster? Why is it faster?

You'll need to play around with the fio queue batching controls to do
this, but you can simulate it quite easily. The individual sequential
IO fio simulation is the equivalent of eliding the buffer sort in
xfs_buf_delwri_submit_buffers(), whilst the batched submission is the
equivalent of what we have now.

Just because your hardware can do a million IOPS, it doesn't mean the
most efficient way to do IO is to dispatch a million IOPS....

> If there's a scheduler it will do the same thing, and SSD/NVMe might
> do the same in HW anyway or not care. The only scenario I can think
> of where this might make a difference is rotational RAID without a
> scheduler attached. Not sure.

Schedulers only have a reorder window of 100-200 individual IOs.
xfs_buf_delwri_submit_buffers() can be passed tens of thousands of
buffers in a single list for IO dispatch that need reordering. IOWs,
the sort+merge window for the number of IOs we dispatch from metadata
writeback is often orders of magnitude larger than what the block
layer scheduler can optimise effectively.

The IO stack behaviour is largely GIGO, so anything we can do at a
higher layer to make the IO submission less garbage-like results in
improved throughput through the software and hardware layers of the
storage stack.

> I'm looking forward to hearing what a foolish idea this is.

If the list_sort() is not showing up in profiles, then it is
essentially free. Last time I checked, our biggest overhead was the
CPU overhead of flushing a million inodes/s from the AIL to their
backing buffers - the list_sort() didn't even show up in the top 50
functions in the profile...

Cheers,

Dave.
--
Dave Chinner
david@xxxxxxxxxxxxx
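
P.S. Something like the following fio job file should show the batching
difference the thought experiment above is getting at. It's a rough,
untested sketch off the top of my head - the filename, file size and
batch sizes are just placeholders, so adjust them to suit your setup.
Compare the runtime of the two jobs and the average request size the
block layer ends up issuing for each:

    [global]
    ioengine=libaio
    direct=1
    rw=write
    bs=4k
    size=1g
    iodepth=128
    # placeholder - point this at your test file or device
    filename=/path/to/testfile

    # one IO per io_submit() call - the "no batching" case,
    # roughly equivalent to eliding the sort
    [individual-submit]
    iodepth_batch_submit=1
    iodepth_batch_complete=1

    # 64 IOs per io_submit() call - the batched submission case,
    # roughly equivalent to what we do now
    [batched-submit]
    stonewall
    iodepth_batch_submit=64
    iodepth_batch_complete=64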