On Mon, Jan 25, 2021 at 12:23:57PM -0500, Rick Warner wrote:
> Hi all,
>
> We're working with NVME storage systems and are seeing a significant
> reduction in write speed with an XFS filesystem vs direct access to the
> block device.
>
> Using a 5 disk software RAID5, we're able to get ~16GB/s write speed direct
> to the device. If we put an XFS filesystem on the software RAID and run the
> same fio command (except --directory /xfs instead of --filename /dev/md11)
> we only get ~2.5GB/s write speed.

What kernel? What's the RAID configuration? What's the mkfs.xfs output?

https://xfs.org/index.php/XFS_FAQ#Q:_What_information_should_I_include_when_reporting_a_problem.3F

> Are there any tunables that could improve this? Is performance degradation
> this big considered a bug?

No evidence that there's a bug yet. You're not testing apples to apples.

>
> The fio runs showing this are below:
>
> *******Direct to /dev/md11 block device
> [root@flashstore ~]# fio --filename=/dev/md11 --rw=write --numjobs=32
> --size=12G --bs=1M --name=1m --group_reporting
> 1m: (g=0): rw=write, bs=(R) 1024KiB-1024KiB, (W) 1024KiB-1024KiB, (T)
> 1024KiB-1024KiB, ioengine=psync, iodepth=1

This is writing 32 concurrent jobs all running from offset 0 in the
block device doing single, synchronous IOs. IOWs, 32 IOs from offset 0,
32 IOs from offset 1MB, etc all the way through the block device.

It's not the same thing as doing 32 concurrent IOs to 32 different
files, all of which will be located at different physical offsets in
the RAID device. You've got a 32x increase in dirty LBAs for the
filesystem test, and no overlap between any of the jobs. So the RAID
device and all the storage devices are doing 32x as much work in the
filesystem test as they do for the block device job...

To replicate the same IO on the block device, each job writing to the
block device needs to be allocated its own 12GB region of the block
device (i.e. job 0 at offset 0, job 1 at offset 12GB, etc) so there is
no overlapping, concurrent IO to the same device offset....
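i.e. something like the following (an untested sketch - it assumes
fio's offset_increment option, which offsets each job by its job number
multiplied by the given increment) so the block device test dirties the
same 384GiB of unique LBAs as the 32-file filesystem test:

  fio --filename=/dev/md11 --rw=write --numjobs=32 --size=12G --bs=1M \
      --offset_increment=12G --name=1m --group_reporting

With that, job N writes the 12GB region starting at offset N * 12GB, so
no two jobs ever touch the same LBA.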
Also, synchronous IO is the key here: individual job throughput is IO
latency bound for synchronous IO, and device queue depths are greatly
limited. Hence any submission or completion delay is going to greatly
impact test throughput. Nobody uses synchronous IO for high throughput
applications these days...

> Starting 32 processes
> Jobs: 26 (f=26): [f(1),_(1),f(16),_(1),f(3),_(2),f(3),_(1),f(2),_(1),f(1)][100.0%][r=0KiB/s,w=0KiB/s][r=0,w=0 IOPS][eta 00m:00s]
> 1m: (groupid=0, jobs=32): err= 0: pid=74592: Mon Jan 25 12:13:28 2021
>   write: IOPS=15.4k, BW=15.0GiB/s (16.1GB/s)(384GiB/25551msec)
>     clat (usec): min=230, max=31691, avg=2044.43, stdev=778.36
>      lat (usec): min=245, max=31710, avg=2067.97, stdev=783.21
>     clat percentiles (usec):
>      |  1.00th=[  420],  5.00th=[ 1745], 10.00th=[ 1811], 20.00th=[ 1860],
>      | 30.00th=[ 1893], 40.00th=[ 1926], 50.00th=[ 1942], 60.00th=[ 1975],
>      | 70.00th=[ 2024], 80.00th=[ 2089], 90.00th=[ 2180], 95.00th=[ 2900],
>      | 99.00th=[ 4490], 99.50th=[ 4883], 99.90th=[13829], 99.95th=[14746],
>      | 99.99th=[20841]

Note the IO latencies....

>   bw (  KiB/s): min=400606, max=679936, per=3.13%, avg=492489.85, stdev=53436.36, samples=1632
>   iops        : min=  391, max=  664, avg=480.90, stdev=52.18, samples=1632
>   lat (usec)  : 250=0.01%, 500=1.72%, 750=0.80%, 1000=0.09%
>   lat (msec)  : 2=62.47%, 4=32.77%, 10=1.94%, 20=0.20%, 50=0.01%
>   cpu         : usr=1.37%, sys=62.91%, ctx=38028757, majf=0, minf=60496
>   IO depths   : 1=100.0%, 2=0.0%, 4=0.0%, 8=0.0%, 16=0.0%, 32=0.0%, >=64=0.0%
>      submit   : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
>      complete : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
>      issued rwts: total=0,393216,0,0 short=0,0,0,0 dropped=0,0,0,0
>      latency  : target=0, window=0, percentile=100.00%, depth=1
>
> Run status group 0 (all jobs):
>   WRITE: bw=15.0GiB/s (16.1GB/s), 15.0GiB/s-15.0GiB/s (16.1GB/s-16.1GB/s), io=384GiB (412GB), run=25551-25551msec
>
> Disk stats (read/write):
>   md11: ios=98/2237881, merge=0/0, ticks=0/0, in_queue=0, util=0.00%, aggrios=10252/73455, aggrmerge=20863/582117, aggrticks=4425/138224, aggrin_queue=130116, aggrutil=17.71%
>   nvme2n1: ios=12427/88141, merge=25370/698549, ticks=5030/163534, in_queue=153382, util=16.71%
>   nvme3n1: ios=12210/88148, merge=24728/698544, ticks=4979/162745, in_queue=152592, util=16.84%
>   nvme4n1: ios=12246/88150, merge=24861/698524, ticks=4875/165703, in_queue=156034, util=16.81%
>   nvme5n1: ios=12289/88146, merge=25200/698533, ticks=5013/164900, in_queue=154398, util=16.96%
>   nvme6n1: ios=12343/88149, merge=25021/698553, ticks=6655/172464, in_queue=164291, util=17.71%

Note the IOs reaching the disks here - ~90k writes to each disk for the
entire test, so ~450,000 1MB IOs gives ~450GB, which roughly matches the
amount written once parity overhead is taken into account.

> nvme22n1: ios=0/0, merge=0/0, ticks=0/0, in_queue=0, util=0.00%
>
> ******* mkfs.xfs on /dev/md11 (w/ no flags) and fio run on that mount
> [root@flashstore ~]# fio --directory=/xfs --rw=write --numjobs=32 --size=12G
> --bs=1M --name=1m --group_reporting

--fallocate=none is needed to make the filesystem workload a pure
overwrite workload the same as the block device. Otherwise preallocation
is used to lay out the files, and that then requires unwritten
conversion of the IO range during IO completion. Unwritten conversion
increases IO completion latency, which reduces bandwidth for synchronous
IO submission/completion.
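i.e. add that flag to the command line you're already running -
something like this (untested, just the original invocation plus the
extra option):

  fio --directory=/xfs --rw=write --numjobs=32 --size=12G --bs=1M \
      --name=1m --group_reporting --fallocate=none

Per the above, that should turn the measured writes into pure overwrites
rather than writes into preallocated unwritten extents.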
> 1m: (g=0): rw=write, bs=(R) 1024KiB-1024KiB, (W) 1024KiB-1024KiB, (T)
> 1024KiB-1024KiB, ioengine=psync, iodepth=1
> ...
> fio-3.7
> Starting 32 processes
> 1m: Laying out IO file (1 file / 12288MiB)
.....
> 1m: Laying out IO file (1 file / 12288MiB)
> Jobs: 11 (f=11): [_(6),W(1),_(4),W(1),_(8),W(1),_(3),W(8)][99.4%][r=0KiB/s,w=1213MiB/s][r=0,w=1213 IOPS][eta 00m:01s]
> 1m: (groupid=0, jobs=32): err= 0: pid=74782: Mon Jan 25 12:20:32 2021
>   write: IOPS=2431, BW=2432MiB/s (2550MB/s)(384GiB/161704msec)
>     clat (usec): min=251, max=117777, avg=13006.54, stdev=23856.18
>      lat (usec): min=270, max=117787, avg=13027.39, stdev=23851.96
>     clat percentiles (usec):
>      |  1.00th=[  359],  5.00th=[  371], 10.00th=[  383], 20.00th=[  408],
>      | 30.00th=[  424], 40.00th=[  453], 50.00th=[  537], 60.00th=[  578],
>      | 70.00th=[  619], 80.00th=[55313], 90.00th=[58459], 95.00th=[60556],
>      | 99.00th=[63177], 99.50th=[64226], 99.90th=[66323], 99.95th=[68682],
>      | 99.99th=[80217]

Compare the IO latencies with the block device: the average is 13ms per
1MB IO for XFS vs 2ms for the block device. There's the entire
performance difference right there - 6x longer IO latency on XFS means
it will only achieve 1/6th the bandwidth of the block device, because
the IO is synchronous.

It could just be that the filesystem isn't aligned to the underlying
RAID geometry, so every 1MB IO the filesystem issues to the md device is
misaligned to the RAID stripe, MD is doing huge numbers of partial
stripe writes, and that's what is killing performance....

That said, 13ms latency for a single 1MB IO on a device that should take
2ms is potentially indicative of a scheduler bug: we've had bugs in
completion workqueue task scheduling that have caused exactly this issue
(10ms delays in IO completion task scheduling) in recent times. This
scheduler commit fixed multiple performance regressions in XFS:

52262ee567ad ("sched/fair: Allow a per-CPU kthread waking a task to
stack on the same CPU, to fix XFS performance regression")

Which brings me back to "what kernel and platform are you testing?"...

>   bw (  KiB/s): min=55296, max=1054720, per=3.15%, avg=78557.24, stdev=41896.99, samples=10233
>   iops        : min=   54, max= 1030, avg=76.67, stdev=40.92, samples=10233

Note that the per-job throughput peaked at over 1GB/s for a specific job
- that's what I'd expect to see on XFS for all jobs, not just in an
occasional sample....

>   lat (usec)  : 500=46.14%, 750=31.31%, 1000=0.78%
>   lat (msec)  : 2=0.03%, 4=0.01%, 10=0.01%, 20=0.01%, 50=0.59%
>   lat (msec)  : 100=21.13%, 250=0.01%
>   cpu         : usr=0.12%, sys=3.83%, ctx=86515, majf=0, minf=22227
>   IO depths   : 1=100.0%, 2=0.0%, 4=0.0%, 8=0.0%, 16=0.0%, 32=0.0%, >=64=0.0%
>      submit   : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
>      complete : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
>      issued rwts: total=0,393216,0,0 short=0,0,0,0 dropped=0,0,0,0
>      latency  : target=0, window=0, percentile=100.00%, depth=1
>
> Run status group 0 (all jobs):
>   WRITE: bw=2432MiB/s (2550MB/s), 2432MiB/s-2432MiB/s (2550MB/s-2550MB/s), io=384GiB (412GB), run=161704-161704msec
>
> Disk stats (read/write):
>   md11: ios=1/6097731, merge=0/0, ticks=0/0, in_queue=0, util=0.00%, aggrios=23774/2849499, aggrmerge=34232/17493878, aggrticks=28040/18125298, aggrin_queue=18363574, aggrutil=80.03%
>   nvme2n1: ios=28860/3419298, merge=41127/20992122, ticks=39496/23053174, in_queue=23421586, util=75.76%
>   nvme3n1: ios=28440/3419396, merge=41081/20992524, ticks=34881/23067448, in_queue=23411872, util=80.03%
>   nvme4n1: ios=28457/3419413, merge=41361/20992713, ticks=30990/21139316, in_queue=21420720, util=78.03%
>   nvme5n1: ios=28131/3419446, merge=40331/20992920, ticks=29288/20184431, in_queue=20418749, util=77.00%
>   nvme6n1: ios=28759/3419446, merge=41496/20992991, ticks=33587/21307424, in_queue=21508518, util=77.04%

That's not pretty. That looks like partial stripe writes - 3.4 million
write IOs per device (instead of ~90,000) and 20 million merges per
device are indicative of MD slicing and dicing the IO it receives from
the layers above into a huge number of tiny IOs. i.e. the devices are
not getting well formed/well aligned IO, so working out why this is
happening is probably the first thing to look at.

Cheers,

Dave.
-- 
Dave Chinner
david@xxxxxxxxxxxxx