On 12/21/2013 8:35 PM, Kevin Richter wrote:
> Okay, well, I've run your fio config... but with so many results and
> abbreviations I currently feel a bit overchallenged :)
> So, please don't take it badly if I just paste the whole results.
> I have tried with a stripe cache size of 1024, 2048 and 4096.

I'm glad you changed to size=1G instead of size=10G, lest the test would have read/written 80GB instead of 8GB and taken forever. I often forget the size parameter is per job (thread), not global.

The full output is good. Going from 1024 to 2048 gained you ~17MB/s of throughput with this job file. But you didn't run with the default of 256 for comparison, which would be very useful to know. That's ok, as you can do so with the next set of test runs. Given the huge latencies I'm thinking this config may be seeking the disks too much. Modify your fio job file as follows:

numjobs=4
blocksize=1536k
iodepth=4
size=1g

as this may yield somewhat better results. (A full job file with these changes folded in is sketched below.)

~190MB/s read and 75MB/s write is very low for three effective 7.2K RPM SATA spindles, especially the write throughput. Compare to a similar FIO job with 2 threads, iodepth 16, direct IO, run against a single 500GB 7.2K WD SATA drive, NCQ disabled, with the deadline elevator:

Run status group 0 (all jobs):
  READ: io=2048MB, aggrb=88663KB/s, minb=90791KB/s, maxb=90791KB/s
Run status group 1 (all jobs):
  WRITE: io=2048MB, aggrb=108089KB/s, minb=110683KB/s, maxb=110683KB/s

That's 88MB/s read and 108MB/s write. Write is faster apparently due to the small 16MB drive write cache.

With your eight 2.4GHz SandyBridge cores and md-RAID6 on much newer, larger 7.2K drives with larger caches, you should have significantly greater write speed than this old single SATA drive. Currently you're showing ~30MB/s less write throughput, some 40% slower than this single drive. With aligned writes, and thus no RMW, you should be getting at least two spindles' worth of throughput, or ~200MB/s.

So, run FIO again with the mods above, once with stripe_cache_size 256 and once with 2048. Something seems to be seriously dragging down your throughput. I'm sure you'd like to know what it is, and how to fix it.

> Btw, I also have /dev/md1, this is a 15GB unencrypted array using the
> same 5 disks as my LUKS-md2 array, so nearly the same just without
> this LUKS layer. If helpful, I can execute some fio tests on this
> filesystem, too.

This is also a RAID6 array, without dm-crypt? If so this will afford a great comparison, and may be very informative. Run FIO against this filesystem (I assume it's also XFS) as well, with stripe_cache_size 512 and 2048. That's four FIO tests total, including the two on the big array. It'll be interesting to see where this leads.

At this point the candidates I see are: a single dm-crypt thread saturating a core; dm-crypt taking aligned writes and misaligning them, causing all writes to become RMWs; or AF drives with misaligned partitions, causing every write to be an internal RMW. Or maybe something I've not considered.
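Putting those four changes together with the sample job file from my earlier mail, the whole thing would look something like the sketch below. As before, the directory line is just a placeholder for your XFS test directory:

[global]
directory=/your/XFS/test/directory
zero_buffers
numjobs=4
group_reporting
blocksize=1536k
ioengine=libaio
iodepth=4
direct=1
size=1g

[read]
rw=read
stonewall

[write]
rw=write
stonewall

Then run it once per cache setting, e.g. (with "test.fio" standing in for whatever you name the job file):

echo 256 > /sys/block/md2/md/stripe_cache_size
fio test.fio
echo 2048 > /sys/block/md2/md/stripe_cache_size
fio test.fio

And if alignment turns out to be the culprit, a rough check along these lines should expose it. This is only a sketch; it assumes the md members are partitions on the AF drives and a LUKS1 header sitting directly on /dev/md2, so adjust device names to match your setup:

# Partition start sectors on AF drives should be divisible by 8 (4KiB)
for d in sdc sdd sde sdf sdg; do parted /dev/$d unit s print; done

# md chunk size, e.g. 512K
mdadm --detail /dev/md2 | grep -i chunk

# LUKS payload offset in 512-byte sectors; for RMW-free full-stripe
# writes it should ideally be a multiple of the full stripe width
# (chunk size x 3 data spindles here)
cryptsetup luksDump /dev/md2 | grep -i 'payload offset'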
>
> So long :)
> Kevin
>
>
> $ echo 1024 > /sys/block/md2/md/stripe_cache_size
>> Jobs: 1 (f=1): [____________W___] [99.7% done] [0K/99.24M /s] [0 /193 iops] [eta 00m:06s]
>> read: (groupid=0, jobs=8): err= 0: pid=12987
>> read : io=81920MB, bw=189835KB/s, iops=370 , runt=441890msec
>> slat (usec): min=32 , max=4561 , avg=76.28, stdev=28.66
>> clat (msec): min=5 , max=1115 , avg=334.19, stdev=151.18
>> lat (msec): min=5 , max=1115 , avg=334.26, stdev=151.18
>> bw (KB/s) : min= 0, max=261120, per=12.79%, avg=24288.95, stdev=11586.29
>> cpu : usr=0.05%, sys=0.50%, ctx=157180, majf=0, minf=16982
>> IO depths : 1=0.1%, 2=0.1%, 4=0.1%, 8=0.1%, 16=99.9%, 32=0.0%, >=64=0.0%
>> submit : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
>> complete : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.1%, 32=0.0%, 64=0.0%, >=64=0.0%
>> issued r/w/d: total=163840/0/0, short=0/0/0
>>
>> lat (msec): 10=0.52%, 20=0.71%, 50=3.01%, 100=6.17%, 250=14.59%
>> lat (msec): 500=68.81%, 750=4.87%, 1000=1.08%, 2000=0.25%
>> write: (groupid=1, jobs=8): err= 0: pid=13202
>> write: io=81920MB, bw=58504KB/s, iops=114 , runt=1433851msec
>> slat (usec): min=45 , max=1729 , avg=212.20, stdev=56.68
>> clat (msec): min=14 , max=11691 , avg=1101.17, stdev=1116.82
>> lat (msec): min=14 , max=11691 , avg=1101.39, stdev=1116.82
>> bw (KB/s) : min= 0, max=106666, per=14.35%, avg=8395.94, stdev=6752.35
>> cpu : usr=0.28%, sys=0.10%, ctx=117451, majf=0, minf=3410
>> IO depths : 1=0.1%, 2=0.1%, 4=0.1%, 8=0.1%, 16=99.9%, 32=0.0%, >=64=0.0%
>> submit : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
>> complete : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.1%, 32=0.0%, 64=0.0%, >=64=0.0%
>> issued r/w/d: total=0/163840/0, short=0/0/0
>>
>> lat (msec): 20=0.01%, 50=0.17%, 100=2.03%, 250=17.02%, 500=22.20%
>> lat (msec): 750=12.22%, 1000=8.66%, 2000=20.04%, >=2000=17.64%
>>
>> Run status group 0 (all jobs):
>> READ: io=81920MB, aggrb=189834KB/s, minb=194390KB/s, maxb=194390KB/s, mint=441890msec, maxt=441890msec
>>
>> Run status group 1 (all jobs):
>> WRITE: io=81920MB, aggrb=58504KB/s, minb=59908KB/s, maxb=59908KB/s, mint=1433851msec, maxt=1433851msec
>>
>> Disk stats (read/write):
>> dm-0: ios=327681/327756, merge=0/0, ticks=78591352/353235376, in_queue=431834680, util=100.00%, aggrios=327681/327922, aggrmerge=0/0, aggrticks=0/0, aggrin_queue=0, aggrutil=0.00%
>> md2: ios=327681/327922, merge=0/0, ticks=0/0, in_queue=0, util=0.00%, aggrios=171660/222845, aggrmerge=4561629/9494657, aggrticks=16341417/4008187, aggrin_queue=20351472, aggrutil=85.78%
>> sdc: ios=181512/222455, merge=4583055/9505634, ticks=15650600/5944080, in_queue=21596560, util=85.78%
>> sdd: ios=180545/224362, merge=4526197/9587956, ticks=14356708/5542120, in_queue=19900820, util=85.42%
>> sde: ios=179853/224317, merge=4519718/9540999, ticks=13375156/5676828, in_queue=19053876, util=83.90%
>> sdf: ios=157605/222569, merge=4551205/9459549, ticks=18828608/1234632, in_queue=20065204, util=75.36%
>> sdg: ios=158787/220525, merge=4627970/9379150, ticks=19496016/1643276, in_queue=21140904, util=77.26%
>
> $ echo 2048 > /sys/block/md2/md/stripe_cache_size
>> Jobs: 1 (f=1): [_________W______] [99.6% done] [0K/92182K /s] [0 /175 iops] [eta 00m:06s]
>> read: (groupid=0, jobs=8): err= 0: pid=6392
>> read : io=81920MB, bw=185893KB/s, iops=363 , runt=451259msec
>> slat (usec): min=32 , max=524 , avg=75.08, stdev=26.00
>> clat (msec): min=8 , max=1849 , avg=335.65, stdev=149.76
>> lat (msec): min=8 , max=1849 , avg=335.72, stdev=149.76
>> bw (KB/s) : min= 0, max=105860, per=13.08%, avg=24308.99, stdev=7467.64
>> cpu : usr=0.05%, sys=0.49%, ctx=157968, majf=0, minf=17171
>> IO depths : 1=0.1%, 2=0.1%, 4=0.1%, 8=0.1%, 16=99.9%, 32=0.0%, >=64=0.0%
>> submit : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
>> complete : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.1%, 32=0.0%, 64=0.0%, >=64=0.0%
>> issued r/w/d: total=163840/0/0, short=0/0/0
>>
>> lat (msec): 10=0.01%, 20=0.06%, 50=2.32%, 100=7.44%, 250=15.36%
>> lat (msec): 500=69.09%, 750=4.50%, 1000=0.87%, 2000=0.35%
>> write: (groupid=1, jobs=8): err= 0: pid=6663
>> write: io=81920MB, bw=75377KB/s, iops=147 , runt=1112887msec
>> slat (usec): min=46 , max=6453.8K, avg=988.13, stdev=46664.22
>> clat (msec): min=10 , max=8631 , avg=854.18, stdev=683.37
>> lat (msec): min=10 , max=8631 , avg=855.17, stdev=684.82
>> bw (KB/s) : min= 0, max=100352, per=14.04%, avg=10581.97, stdev=6871.80
>> cpu : usr=0.34%, sys=0.11%, ctx=92502, majf=0, minf=1531
>> IO depths : 1=0.1%, 2=0.1%, 4=0.1%, 8=0.1%, 16=99.9%, 32=0.0%, >=64=0.0%
>> submit : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
>> complete : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.1%, 32=0.0%, 64=0.0%, >=64=0.0%
>> issued r/w/d: total=0/163840/0, short=0/0/0
>>
>> lat (msec): 20=0.01%, 50=0.09%, 100=1.14%, 250=11.19%, 500=24.64%
>> lat (msec): 750=18.78%, 1000=13.40%, 2000=24.53%, >=2000=6.23%
>>
>> Run status group 0 (all jobs):
>> READ: io=81920MB, aggrb=185893KB/s, minb=190354KB/s, maxb=190354KB/s, mint=451259msec, maxt=451259msec
>>
>> Run status group 1 (all jobs):
>> WRITE: io=81920MB, aggrb=75376KB/s, minb=77186KB/s, maxb=77186KB/s, mint=1112887msec, maxt=1112887msec
>>
>> Disk stats (read/write):
>> dm-0: ios=327701/328169, merge=0/0, ticks=79402348/261890568, in_queue=341301888, util=100.00%, aggrios=327701/328481, aggrmerge=0/0, aggrticks=0/0, aggrin_queue=0, aggrutil=0.00%
>> md2: ios=327701/328481, merge=0/0, ticks=0/0, in_queue=0, util=0.00%, aggrios=127112/146994, aggrmerge=3364061/7997232, aggrticks=18091347/5817941, aggrin_queue=23914356, aggrutil=94.20%
>> sdc: ios=135729/149357, merge=3341205/8134447, ticks=18312932/8398716, in_queue=26721324, util=92.77%
>> sdd: ios=135561/151584, merge=3312121/8238249, ticks=16877204/8190968, in_queue=25077332, util=92.44%
>> sde: ios=135741/146023, merge=3345948/7981968, ticks=17659792/8659780, in_queue=26322124, util=94.20%
>> sdf: ios=114396/143768, merge=3413295/7801050, ticks=18551976/1652924, in_queue=20207384, util=72.89%
>> sdg: ios=114134/144241, merge=3407738/7830447, ticks=19054832/2187320, in_queue=21243620, util=74.96%
>
> $ echo 4096 > /sys/block/md2/md/stripe_cache_size
>> Jobs: 1 (f=1): [________W_______] [100.0% done] [0K/95848K /s] [0 /182 iops] [eta 00m:00s]
>> read: (groupid=0, jobs=8): err= 0: pid=11787
>> read : io=81920MB, bw=189274KB/s, iops=369 , runt=443200msec
>> slat (usec): min=31 , max=4511 , avg=75.47, stdev=29.74
>> clat (msec): min=5 , max=1338 , avg=336.39, stdev=155.14
>> lat (msec): min=5 , max=1338 , avg=336.47, stdev=155.14
>> bw (KB/s) : min= 0, max=253455, per=12.77%, avg=24162.01, stdev=11368.71
>> cpu : usr=0.05%, sys=0.49%, ctx=157193, majf=0, minf=17313
>> IO depths : 1=0.1%, 2=0.1%, 4=0.1%, 8=0.1%, 16=99.9%, 32=0.0%, >=64=0.0%
>> submit : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
>> complete : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.1%, 32=0.0%, 64=0.0%, >=64=0.0%
>> issued r/w/d: total=163840/0/0, short=0/0/0
>>
>> lat (msec): 10=0.47%, 20=0.66%, 50=2.95%, 100=6.33%, 250=14.56%
>> lat (msec): 500=68.12%, 750=5.42%, 1000=1.15%, 2000=0.33%
>> write: (groupid=1, jobs=8): err= 0: pid=12060
>> write: io=81920MB, bw=64993KB/s, iops=126 , runt=1290687msec
>> slat (usec): min=61 , max=16991 , avg=197.22, stdev=110.87
>> clat (msec): min=14 , max=2820 , avg=980.92, stdev=366.56
>> lat (msec): min=14 , max=2821 , avg=981.12, stdev=366.56
>> bw (KB/s) : min= 0, max=103770, per=13.11%, avg=8517.92, stdev=3794.28
>> cpu : usr=0.28%, sys=0.08%, ctx=84352, majf=0, minf=723
>> IO depths : 1=0.1%, 2=0.1%, 4=0.1%, 8=0.1%, 16=99.9%, 32=0.0%, >=64=0.0%
>> submit : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
>> complete : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.1%, 32=0.0%, 64=0.0%, >=64=0.0%
>> issued r/w/d: total=0/163840/0, short=0/0/0
>>
>> lat (msec): 20=0.01%, 50=0.03%, 100=0.28%, 250=0.22%, 500=5.37%
>> lat (msec): 750=22.02%, 1000=31.66%, 2000=39.27%, >=2000=1.16%
>>
>> Run status group 0 (all jobs):
>> READ: io=81920MB, aggrb=189273KB/s, minb=193816KB/s, maxb=193816KB/s, mint=443200msec, maxt=443200msec
>>
>> Run status group 1 (all jobs):
>> WRITE: io=81920MB, aggrb=64993KB/s, minb=66553KB/s, maxb=66553KB/s, mint=1290687msec, maxt=1290687msec
>>
>> Disk stats (read/write):
>> dm-0: ios=327681/327629, merge=0/0, ticks=78990724/301988444, in_queue=380991692, util=100.00%, aggrios=327681/327709, aggrmerge=0/0, aggrticks=0/0, aggrin_queue=0, aggrutil=0.00%
>> md2: ios=327681/327709, merge=0/0, ticks=0/0, in_queue=0, util=0.00%, aggrios=120158/119573, aggrmerge=2858405/7663126, aggrticks=17291831/9119008, aggrin_queue=26414023, aggrutil=99.60%
>> sdc: ios=135579/119976, merge=2813832/7324879, ticks=13974928/2192484, in_queue=16167996, util=66.57%
>> sdd: ios=136115/127048, merge=2826584/7736191, ticks=12932248/2477796, in_queue=15410924, util=68.08%
>> sde: ios=136007/130908, merge=2844473/7936354, ticks=12642232/3141268, in_queue=15784336, util=71.86%
>> sdf: ios=78473/94458, merge=2882361/7865984, ticks=29053772/37421808, in_queue=66488856, util=99.60%
>> sdg: ios=114620/125479, merge=2924777/7452224, ticks=17855976/361684, in_queue=18218004, util=54.84%
>
>
> On 20.12.2013 13:36, Stan Hoeppner wrote:
>> On 12/20/2013 4:26 AM, Kevin Richter wrote:
>>> Thanks a lot for your huge replies!
>>
>> You're welcome.
>>
>>>> Oh, that's quite old. I'd suggest upgrading to a much more recent
>>>> kernel as we've fixed lots of issues in this area since then.
>>>
>>> First I have switched to the newer kernel from Ubuntu Saucy:
>>> $ uname -a
>>> Linux 3.11.0-14-generic #21~precise1-Ubuntu SMP
>>>
>>> Thus, it seems that the default scheduler has been changed to deadline.
>>> I did not change anything. After a reboot the schedulers of all disks
>>> are now deadline.
>>
>> Good move Ubuntu.
>>
>>>> Model # of the CPUs so I can look up the specs?
>>> Intel(R) Xeon(R) CPU E5-2609 0 @ 2.40GHz
>>
>> Strong CPUs.
>>
>>> I have prepared a folder with 60GB for the tests. This is nearly twice
>>> of the available memory, so the process should be forced to actually
>>> write the stuff to the disk - and not only hold in the memory.
>>>
>>>> $ echo 256 > /sys/block/md2/md/stripe_cache_size
>>>> $ time cp -a /olddisk/testfolder /6tb/foo1/
>>>> real 25m38.925s
>>>> user 0m0.595s
>>>> sys 1m23.182s
>>>>
>>>> $ echo 1024 > /sys/block/md2/md/stripe_cache_size
>>>> $ time cp -a /olddisk/testfolder /raid/foo2/
>>>> real 7m32.824s
>>>> user 0m0.438s
>>>> sys 1m6.759s
>>>>
>>>> $ echo 2048 > /sys/block/md2/md/stripe_cache_size
>>>> $ time cp -a /olddisk/testfolder /raid/foo3/
>>>> real 5m32.847s
>>>> user 0m0.418s
>>>> sys 1m5.671s
>>>>
>>>> $ echo 4096 > /sys/block/md2/md/stripe_cache_size
>>>> $ time cp -a /olddisk/testfolder /raid/foo4/
>>>> real 5m54.554s
>>>> user 0m0.437s
>>>> sys 1m6.268s
>>>
>>> The difference is really amazing! So 2048 seems to be the best choice.
>>> 60GB in 5.5 minutes is 180MB/sec. That sounds a bit high, doesn't it?
>>> The RAID only consists of 5 SATA disks with 7200rpm.
>>
>> A lot of the source data is being cached between runs so these numbers
>> aren't accurate. The throughput of this copy operation will be limited
>> by the speed of the single source disk, not the array. To make the
>> elapsed times of this copy test accurate you need to execute something
>> like these commands after each run:
>>
>> # sync
>> # echo 3 > /proc/sys/vm/drop_caches
>>
>> But this copy test will not inform you about the potential peak
>> performance of your array. That's why I suggested you test with FIO,
>> the flexible IO tester.
>>
>> # aptitude install fio
>> # man fio
>>
>> Sample job file suitable for your system:
>>
>> [global]
>> directory=/your/XFS/test/directory
>> zero_buffers
>> numjobs=8
>> group_reporting
>> blocksize=512k
>> ioengine=libaio
>> iodepth=16
>> direct=1
>> size=10g
>>
>> [read]
>> rw=read
>> stonewall
>>
>> [write]
>> rw=write
>> stonewall
>>
>> This should give you a relatively accurate picture of the actual
>> potential throughput of your array and filesystem.
>>
>>> 'top' while copying with stripe size of 2048 (the source disk is ntfs):
>>>> top - 10:48:24 up 1 day, 1:41, 2 users, load average: 5.66, 3.53, 2.17
>>>> Tasks: 210 total, 2 running, 208 sleeping, 0 stopped, 0 zombie
>>>> Cpu(s): 0.1%us, 35.8%sy, 0.0%ni, 46.0%id, 17.9%wa, 0.0%hi, 0.2%si, 0.0%st
>>>> Mem: 32913992k total, 32709208k used, 204784k free, 10770344k buffers
>>>> Swap: 7812496k total, 0k used, 7812496k free, 20866844k cached
>>>>
>>>> PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
>>>> 19524 root 20 0 0 0 0 R 93 0.0 4:00.12 kworker/3:1
>>>> 23744 root 20 0 0 0 0 S 55 0.0 0:50.84 kworker/0:1
>>>> 23738 root 20 0 0 0 0 S 29 0.0 0:56.94 kworker/4:0
>>>> 3893 root 20 0 0 0 0 S 28 0.0 36:47.50 md2_raid6
>>>> 4551 root 20 0 22060 3328 720 D 25 0.0 20:21.61 mount.ntfs
>>>> 23273 root 20 0 0 0 0 S 22 0.0 1:54.86 kworker/7:2
>>>> 23734 root 20 0 21752 1280 1040 D 21 0.0 0:49.84 cp
>>>> 84 root 20 0 0 0 0 S 7 0.0 8:19.34 kswapd1
>>>> 83 root 20 0 0 0 0 S 6 0.0 11:55.81 kswapd0
>>>> 23745 root 20 0 0 0 0 S 2 0.0 0:33.60 kworker/1:2
>>>> 21598 root 20 0 0 0 0 D 1 0.0 0:11.33 kworker/u17:1
>>
>> Hmm, what's kworker/3:1? That's not a crypto thread eating 93% of a
>> SandyBridge core at only ~180 MB/s throughput, is it?
>>
>>> And the best thing of all:
>>> During all of these tests there were no warnings/panics in the syslog.
>>>
>>> With best regards,
>>> Kevin
>>
>> Even though XFS wasn't the cause of the problem I'm glad we were able to
>> help you fix it nonetheless.
>> I'm really curious to see what kind of throughput you can achieve with
>> FIO, and whether crypto is a bottleneck at the 250-350 MB/s your array
>> should be capable of. It would be great if you would play around a bit
>> with FIO and post some numbers.

_______________________________________________
xfs mailing list
xfs@xxxxxxxxxxx
http://oss.sgi.com/mailman/listinfo/xfs