On Fri, Jul 12, 2024 at 05:54:05AM +0200, Dragan Milivojević wrote:
> On 11/07/2024 01:12, Dave Chinner wrote:
> > Probably not a lot you can do short of reconfiguring your RAID6
> > storage devices to handle small IOs better. However, in general,
> > RAID6 /always sucks/ for small IOs, and the only way to fix this
> > problem is to use high performance SSDs to give you a massive excess
> > of write bandwidth to burn on write amplification....
>
> RAID5/6 has the same issues with NVME drives.
> Major issue is the bitmap.

That's irrelevant to the problem being discussed. The OP is
reporting stalls due to the bursty incoming workload vastly
outpacing the rate at which the storage device can drain it.

The above comment is not about how close to "raw performance" the
MD device gets on NVMe SSDs - it's about how much faster it is for
the given workload than HDDs. i.e. what matters is the relative
performance differential, and according to your numbers below, it
is at least two orders of magnitude. That would make a 100s stall
into a 1s stall, and that would largely make the OP's problems go
away....

> 5 disk NVMe RAID5, 64K chunk
>
> Test                    BW          IOPS
> bitmap internal 64M     700KiB/s    174
> bitmap internal 128M    702KiB/s    175
> bitmap internal 512M    1142KiB/s   285
> bitmap internal 1024M   40.4MiB/s   10.3k
> bitmap internal 2G      66.5MiB/s   17.0k
> bitmap external 64M     67.8MiB/s   17.3k
> bitmap external 1024M   76.5MiB/s   19.6k
> bitmap none             80.6MiB/s   20.6k
> Single disk 1K          54.1MiB/s   55.4k
> Single disk 4K          269MiB/s    68.8k
>
> Tested with fio --filename=/dev/md/raid5 --direct=1 --rw=randwrite
> --bs=4k --ioengine=libaio --iodepth=1 --runtime=60 --numjobs=1
> --group_reporting --time_based --name=Raid5

Oh, you're only testing a single depth, block aligned, async direct
IO random write to the block device.

The problem case that was reported was unaligned, synchronous
buffered IO to multiple files through the filesystem page cache
(i.e. RMW at the page cache level as well as at the MD device) at
IO depths of up to 64, with periodic fsyncs thrown into the mix.

So the OP's workload was not only doing synchronous buffered
writes, it also triggered a lot of dependent synchronous random
read IO to go with the async write IOs issued by fsyncs and page
cache writeback.

If you were to simulate all that, I would expect the difference
between HDDs and NVMe SSDs to be much greater than just 2 orders of
magnitude.

-Dave.
--
Dave Chinner
david@xxxxxxxxxxxxx
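
[A fio invocation along the following lines would come much closer to
the workload described above than the direct IO test quoted in the
email: buffered, unaligned, synchronous writes spread across multiple
files and ~64 concurrent writers, with periodic fsyncs. This is an
illustrative sketch only - the mount point /mnt/scratch, the 1536 byte
block size and the fsync interval are assumptions, not parameters taken
from the original report - and the filesystem would need to sit on top
of the MD device being tested.]

# buffered (direct=0), unaligned 1536 byte random writes, 64 sync
# writers, 16 files per job, fsync issued every 32 writes
fio --directory=/mnt/scratch --name=buffered-rmw --rw=randwrite \
    --bs=1536 --direct=0 --ioengine=psync --numjobs=64 --nrfiles=16 \
    --fsync=32 --size=1g --runtime=60 --time_based --group_reporting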