Thanks a lot for your huge replies!

> Oh, that's quite old. I'd suggest upgrading to a much more recent
> kernel as we've fixed lots of issues in this area since then.

First I have switched to the newer kernel from Ubuntu Saucy:

$ uname -a
Linux 3.11.0-14-generic #21~precise1-Ubuntu SMP

It seems the default I/O scheduler has been changed to deadline in this
kernel. I did not change anything, but after a reboot the schedulers of
all disks are now deadline.

> Model # of the CPUs so I can look up the specs?

Intel(R) Xeon(R) CPU E5-2609 0 @ 2.40GHz

I have prepared a folder with 60GB of data for the tests. This is
nearly twice the available memory, so the process should be forced to
actually write the data to disk and not just hold it in memory.

> $ echo 256 > /sys/block/md2/md/stripe_cache_size
> $ time cp -a /olddisk/testfolder /6tb/foo1/
> real    25m38.925s
> user    0m0.595s
> sys     1m23.182s
>
> $ echo 1024 > /sys/block/md2/md/stripe_cache_size
> $ time cp -a /olddisk/testfolder /raid/foo2/
> real    7m32.824s
> user    0m0.438s
> sys     1m6.759s
>
> $ echo 2048 > /sys/block/md2/md/stripe_cache_size
> $ time cp -a /olddisk/testfolder /raid/foo3/
> real    5m32.847s
> user    0m0.418s
> sys     1m5.671s
>
> $ echo 4096 > /sys/block/md2/md/stripe_cache_size
> $ time cp -a /olddisk/testfolder /raid/foo4/
> real    5m54.554s
> user    0m0.437s
> sys     1m6.268s

The difference is really amazing! So 2048 seems to be the best choice.
60GB in 5.5 minutes is roughly 180MB/s. That sounds a bit high, doesn't
it? The RAID only consists of 5 SATA disks at 7200rpm.

'top' while copying with a stripe_cache_size of 2048 (the source disk
is ntfs):

> top - 10:48:24 up 1 day,  1:41,  2 users,  load average: 5.66, 3.53, 2.17
> Tasks: 210 total,   2 running, 208 sleeping,   0 stopped,   0 zombie
> Cpu(s):  0.1%us, 35.8%sy,  0.0%ni, 46.0%id, 17.9%wa,  0.0%hi,  0.2%si,  0.0%st
> Mem:  32913992k total, 32709208k used,   204784k free, 10770344k buffers
> Swap:  7812496k total,        0k used,  7812496k free, 20866844k cached
>
>   PID USER      PR  NI  VIRT  RES  SHR S %CPU %MEM    TIME+  COMMAND
> 19524 root      20   0     0    0    0 R   93  0.0   4:00.12 kworker/3:1
> 23744 root      20   0     0    0    0 S   55  0.0   0:50.84 kworker/0:1
> 23738 root      20   0     0    0    0 S   29  0.0   0:56.94 kworker/4:0
>  3893 root      20   0     0    0    0 S   28  0.0  36:47.50 md2_raid6
>  4551 root      20   0 22060 3328  720 D   25  0.0  20:21.61 mount.ntfs
> 23273 root      20   0     0    0    0 S   22  0.0   1:54.86 kworker/7:2
> 23734 root      20   0 21752 1280 1040 D   21  0.0   0:49.84 cp
>    84 root      20   0     0    0    0 S    7  0.0   8:19.34 kswapd1
>    83 root      20   0     0    0    0 S    6  0.0  11:55.81 kswapd0
> 23745 root      20   0     0    0    0 S    2  0.0   0:33.60 kworker/1:2
> 21598 root      20   0     0    0    0 D    1  0.0   0:11.33 kworker/u17:1

And the best thing of all: during all of these tests there were no
warnings/panics in the syslog.

With best regards,
Kevin

On 19.12.2013 15:11, Stan Hoeppner wrote:
> On 12/18/2013 4:27 AM, Kevin Richter wrote:
>> Thanks for your mails!
>>
>>> This is unusual. How long have you waited?
>>
>> For the reboot? One night.
>> After the copy process hangs: several hours. But mostly it recovers
>> after several minutes.
>>
>>> 1. Switch to deadline. CFQ is not suitable for RAID storage, and not
>>> suitable for XFS. This may not be a silver bullet but it will help.
>>
>> Can I switch it while my copy process (from a separate hd to this raid)
>> is running... without data loss? Otherwise I would wait a bit, because
>> now it is actually running for 8 hours without kernel panics.
>
> Yes, you can safely change the elevator on the fly. Repeat the
> following once for each disk, replacing X appropriately.
>
> $ echo deadline > /sys/block/sdX/queue/scheduler
>
> This on the fly change is not permanent. Add "elevator=deadline" to
> your kernel command line in your bootloader config file. Or you can add
> the commands to a startup script. Or if you build your own kernels
> simply make it the default, and/or remove all the other elevators from
> your kernel config, which is what I do.
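[For reference, a minimal startup-script sketch for the "commands in a
startup script" route - untested, and it assumes the five array members
are still sd[cdefg] as listed further down:

  #!/bin/sh
  # set the deadline elevator on every md2 member disk at boot
  for d in sdc sdd sde sdf sdg; do
      echo deadline > /sys/block/$d/queue/scheduler
  done

On Ubuntu the "elevator=deadline" route means adding it to
GRUB_CMDLINE_LINUX_DEFAULT in /etc/default/grub and running update-grub.]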
>
>>> 2. Post your chunk size and RAID6 stripe_cache_size value. They may be
>>> suboptimal for your workload.
>>
>> $ cat /sys/block/md2/md/stripe_cache_size
>> 256
>
> 256 is the default and it is way too low. This is limiting your write
> throughput. Increase this to a minimum of 1024, which will give you a
> 20MB stripe cache buffer. This should become active immediately. Add
> it to a startup script to make it permanent.
>
> $ echo 1024 > /sys/block/md2/md/stripe_cache_size
>
> ((4096 * stripe_cache_size) * num_drives) = RAM consumed for stripe cache
>
> At some future point when the system is idle, play with FIO write
> testing and multiple jobs (threads) with stripe_cache_size values up to
> 2048 to see which value gives the best throughput. Test 1024, 2048,
> 4096. Going higher probably won't gain anything. And at some point as
> you go higher your throughput will decrease. When you see that, select
> the previous value and make it permanent.
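[By that formula the stripe cache is cheap even at the top end: with
five drives, 1024 is 4096 * 1024 * 5 = ~20MB, 2048 is ~40MB and 4096 is
~80MB of RAM. A rough fio sketch for such a comparison - untested; the
target directory and job count are only placeholders, and bs=1536k is
chosen to match the 1.5MB full stripe width mentioned below:

  $ mkdir /raid/fiotest
  $ echo 2048 > /sys/block/md2/md/stripe_cache_size
  $ fio --name=stripetest --directory=/raid/fiotest --rw=write \
        --bs=1536k --size=4g --numjobs=4 --end_fsync=1 --group_reporting

Repeating this for each stripe_cache_size value and comparing the
aggregate write bandwidth should show where the curve flattens or turns
down.]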
>
>> $ mdadm --detail /dev/md2 | grep Chunk
>> Chunk Size : 512K
>>
>>> 3. Post 'xfs_info /dev/mdX'
>>
>> There is a LUKS volume around /dev/md2, named '6tb'.
>
> Ok, this adds a huge fly to the ointment. LUKS is single threaded per
> device. You've encrypted the md device instead of the individual disks.
> This limits your encryption throughput to a single CPU core. You've
> not provided any data on your LUKS thread. Is it pegging a core during
> large file IO? I'd bet it is. That would explain all of this.
>
>>> $ xfs_info /dev/md2
>>> xfs_info: /dev/md2 is not a mounted XFS filesystem
>>> $ xfs_info /dev/mapper/6tb
>>> meta-data=/dev/mapper/6tb  isize=256    agcount=32, agsize=45631360 blks
>>>          =                 sectsz=512   attr=2
>>> data     =                 bsize=4096   blocks=1460203520, imaxpct=5
>>>          =                 sunit=128    swidth=384 blks
>>> naming   =version 2        bsize=4096   ascii-ci=0
>>> log      =internal         bsize=4096   blocks=521728, version=2
>>>          =                 sectsz=512   sunit=8 blks, lazy-count=1
>>> realtime =none             extsz=4096   blocks=0, rtextents=0
>
> md2 is 5 spindle RAID6, geometry 512KB * 3 = 1.5MB stripe width. The
> XFS alignment matches the md device. This may be hurting you with LUKS
> between XFS and md. I don't know how LUKS handles 1.5MB inbound writes
> and if it passes an aligned write to md after encrypting. I also don't
> know how it handles write barriers passed down the stack by XFS, or if
> it does at all.
>
>>> 4. You're getting a lot of kswapd timeouts because you have swap and
>>> the md/RAID6 array on the same disks. Relocate swap to disks that are
>>> not part of this RAID6. Small SSDs are cheap and fast. Buy one and put
>>> swap on it. Or install more RAM in the machine. Going the SSD route is
>>> better as it gives flexibility. For instance, you can also relocate
>>> your syslog files to it and anything else that does IO without eating
>>> lots of space. This decreases the IOPS load on your rust.
>>
>> No no, swap is not on any of the raid disks.
>>
>>> # cat /proc/swaps
>>> Filename     Type        Size      Used   Priority
>>> /dev/sda3    partition   7812496   0      -1
>>
>> sda is not in the raid. In the raid there are sd[cdefg].
>
> I was wrong here. After reading a little about kswapd I now know it
> doesn't simply move pages to/from disk. It frees memory in other ways
> as well. In this case it's apparently waiting to free an inode being
> reclaimed by xfs_reclaim_inode, which is stalled on lower level IO. So
> kswapd times out as a result. A slow LUKS thread would explain this as
> well.
>
>>> 5. Describe in some detail the workload(s) causing the heavy IO, and
>>> thus these timeouts.
>>
>> cd /olddharddisk
>> cp -av . /raid/
>>
>> oldhardddisk is a mounted 1tb old harddisk, /raid is the 6tb raid from
>> above.
>
> So you are simply doing a huge file copy from an apparently slower
> single disk to an apparently faster striped RAID. But you get horrible
> throughput and kernel timeout errors, and the RAID disks are apparently
> fairly idle.
>
>> Heavy workload while this copy process (2 CPUs, each 4 cores):
>
> Model # of the CPUs so I can look up the specs?
>
>>> top - 11:13:37 up 4 days, 21:32,  2 users,  load average: 12.95, 11.33, 10.32
>>> Tasks: 155 total,   2 running, 153 sleeping,   0 stopped,   0 zombie
>>> Cpu(s):  0.0%us,  5.7%sy,  0.0%ni, 82.1%id, 11.8%wa,  0.0%hi,  0.3%si,  0.0%st
>>> Mem:  32916276k total, 32750240k used,   166036k free, 10076760k buffers
>>> Swap:  7812496k total,        0k used,  7812496k free, 21221136k cached
>>>
>>>   PID USER      PR  NI  VIRT  RES  SHR S %CPU %MEM    TIME+  COMMAND
>>>   699 root      20   0     0    0    0 S   11  0.0 248:17.59 md2_raid6
>>
>> Don't know what consumes all of this 32GB RAM... 'top' sorted by memory
>> consumption does not tell me. All entries are only 0.0% and 0.1%
>
> ~30GB of your RAM is being eaten by filesystem metadata buffers (10GB)
> and page cache (20GB). Linux uses nearly all free memory for cache and
> buffers to prevent disk accesses, which speeds up access to frequently
> used data considerably. This is normal. A huge copy like this is going
> to cause a huge amount of caching.
>
> I assume this 1TB file copy has finally completed. Start another large
> test copy and paste the CPU burn for the LUKS thread so we can confirm
> whether this is the root cause of your problem.
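[A simple way to capture that CPU burn with just procps and sysstat
(the exact thread names can vary by kernel) would be to watch the
crypto threads while the next copy runs:

  $ top -H -d 2              # threads view; press '1' to also show per-core load
  $ ps -eLf | grep -i crypt  # dm-crypt workers usually appear as kcryptd/dmcrypt_write

If a single crypto thread sits near 100% of one core while 'iostat -x 2'
shows the md2 member disks mostly idle, the single-threaded LUKS layer
is the likely bottleneck.]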