Thanks a lot for your huge replies!

> Oh, that's quite old. I'd suggest upgrading to a much more recent
> kernel as we've fixed lots of issues in this area since then.

First I have switched to the newer kernel from Ubuntu Saucy:

$ uname -a
Linux 3.11.0-14-generic #21~precise1-Ubuntu SMP

It seems the default I/O scheduler has been changed to deadline in this
kernel. I did not change anything, but after a reboot the schedulers of
all disks are now deadline.

> Model # of the CPUs so I can look up the specs?

Intel(R) Xeon(R) CPU E5-2609 0 @ 2.40GHz

I have prepared a folder with 60GB of data for the tests. This is
nearly twice the available memory, so the process should be forced to
actually write the data to disk and not just hold it in memory.

> $ echo 256 > /sys/block/md2/md/stripe_cache_size
> $ time cp -a /olddisk/testfolder /6tb/foo1/
> real    25m38.925s
> user    0m0.595s
> sys     1m23.182s
>
> $ echo 1024 > /sys/block/md2/md/stripe_cache_size
> $ time cp -a /olddisk/testfolder /raid/foo2/
> real    7m32.824s
> user    0m0.438s
> sys     1m6.759s
>
> $ echo 2048 > /sys/block/md2/md/stripe_cache_size
> $ time cp -a /olddisk/testfolder /raid/foo3/
> real    5m32.847s
> user    0m0.418s
> sys     1m5.671s
>
> $ echo 4096 > /sys/block/md2/md/stripe_cache_size
> $ time cp -a /olddisk/testfolder /raid/foo4/
> real    5m54.554s
> user    0m0.437s
> sys     1m6.268s

The difference is really amazing! So 2048 seems to be the best choice.
60GB in 5.5 minutes is roughly 180MB/s. That sounds a bit high, doesn't
it? The RAID only consists of 5 SATA disks at 7200rpm.

'top' while copying with a stripe_cache_size of 2048 (the source disk
is ntfs):

> top - 10:48:24 up 1 day,  1:41,  2 users,  load average: 5.66, 3.53, 2.17
> Tasks: 210 total,   2 running, 208 sleeping,   0 stopped,   0 zombie
> Cpu(s):  0.1%us, 35.8%sy,  0.0%ni, 46.0%id, 17.9%wa,  0.0%hi,  0.2%si,  0.0%st
> Mem:  32913992k total, 32709208k used,   204784k free, 10770344k buffers
> Swap:  7812496k total,        0k used,  7812496k free, 20866844k cached
>
>   PID USER      PR  NI  VIRT  RES  SHR S %CPU %MEM    TIME+  COMMAND
> 19524 root      20   0     0    0    0 R   93  0.0   4:00.12 kworker/3:1
> 23744 root      20   0     0    0    0 S   55  0.0   0:50.84 kworker/0:1
> 23738 root      20   0     0    0    0 S   29  0.0   0:56.94 kworker/4:0
>  3893 root      20   0     0    0    0 S   28  0.0  36:47.50 md2_raid6
>  4551 root      20   0 22060 3328  720 D   25  0.0  20:21.61 mount.ntfs
> 23273 root      20   0     0    0    0 S   22  0.0   1:54.86 kworker/7:2
> 23734 root      20   0 21752 1280 1040 D   21  0.0   0:49.84 cp
>    84 root      20   0     0    0    0 S    7  0.0   8:19.34 kswapd1
>    83 root      20   0     0    0    0 S    6  0.0  11:55.81 kswapd0
> 23745 root      20   0     0    0    0 S    2  0.0   0:33.60 kworker/1:2
> 21598 root      20   0     0    0    0 D    1  0.0   0:11.33 kworker/u17:1

And the best thing of all: during all of these tests there were no
warnings/panics in the syslog.

With best regards,
Kevin

On 19.12.2013 15:11, Stan Hoeppner wrote:
> On 12/18/2013 4:27 AM, Kevin Richter wrote:
>> Thanks for your mails!
>>
>>> This is unusual. How long have you waited?
>>
>> For the reboot? One night.
>> After the copy process hangs: several hours. But mostly it recovers
>> after several minutes.
>>
>>> 1. Switch to deadline. CFQ is not suitable for RAID storage, and not
>>> suitable for XFS. This may not be a silver bullet but it will help.
>>
>> Can I switch it while my copy process (from a separate hd to this raid)
>> is running... without data loss? Otherwise I would wait a bit, because
>> now it is actually running for 8 hours without kernel panics.
>
> Yes, you can safely change the elevator on the fly. Repeat the
> following once for each disk, replacing X appropriately.
>
> $ echo deadline > /sys/block/sdX/queue/scheduler
>
> This on the fly change is not permanent. Add "elevator=deadline" to
> your kernel command line in your bootloader config file. Or you can add
> the commands to a startup script. Or if you build your own kernels
> simply make it the default, and/or remove all the other elevators from
> your kernel config, which is what I do.
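[For reference, a minimal startup-script sketch for the "commands in a
startup script" route - untested, and it assumes the five array members
are still sd[cdefg] as listed further down:

  #!/bin/sh
  # set the deadline elevator on every md2 member disk at boot
  for d in sdc sdd sde sdf sdg; do
      echo deadline > /sys/block/$d/queue/scheduler
  done

On Ubuntu the "elevator=deadline" route means adding it to
GRUB_CMDLINE_LINUX_DEFAULT in /etc/default/grub and running update-grub.]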
>
>>> 2. Post your chunk size and RAID6 stripe_cache_size value. They may be
>>> suboptimal for your workload.
>>
>> $ cat /sys/block/md2/md/stripe_cache_size
>> 256
>
> 256 is the default and it is way too low. This is limiting your write
> throughput. Increase this to a minimum of 1024, which will give you a
> 20MB stripe cache buffer. This should become active immediately. Add
> it to a startup script to make it permanent.
>
> $ echo 1024 > /sys/block/md2/md/stripe_cache_size
>
> ((4096 * stripe_cache_size) * num_drives) = RAM consumed for stripe cache
>
> At some future point when the system is idle, play with FIO write
> testing and multiple jobs (threads) with stripe_cache_size values up to
> 2048 to see which value gives the best throughput. Test 1024, 2048,
> 4096. Going higher probably won't gain anything. And at some point as
> you go higher your throughput will decrease. When you see that, select
> the previous value and make it permanent.
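[By that formula the stripe cache is cheap even at the top end: with
five drives, 1024 is 4096 * 1024 * 5 = ~20MB, 2048 is ~40MB and 4096 is
~80MB of RAM. A rough fio sketch for such a comparison - untested; the
target directory and job count are only placeholders, and bs=1536k is
chosen to match the 1.5MB full stripe width mentioned below:

  $ mkdir /raid/fiotest
  $ echo 2048 > /sys/block/md2/md/stripe_cache_size
  $ fio --name=stripetest --directory=/raid/fiotest --rw=write \
        --bs=1536k --size=4g --numjobs=4 --end_fsync=1 --group_reporting

Repeating this for each stripe_cache_size value and comparing the
aggregate write bandwidth should show where the curve flattens or turns
down.]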
>
>> $ mdadm --detail /dev/md2 | grep Chunk
>> Chunk Size : 512K
>>
>>> 3. Post 'xfs_info /dev/mdX'
>>
>> There is a LUKS volume around /dev/md2, named '6tb'.
>
> Ok, this adds a huge fly to the ointment. LUKS is single threaded per
> device. You've encrypted the md device instead of the individual disks.
> This limits your encryption throughput to a single CPU core. You've
> not provided any data on your LUKS thread. Is it pegging a core during
> large file IO? I'd bet it is. That would explain all of this.
>
>>> $ xfs_info /dev/md2
>>> xfs_info: /dev/md2 is not a mounted XFS filesystem
>>> $ xfs_info /dev/mapper/6tb
>>> meta-data=/dev/mapper/6tb  isize=256    agcount=32, agsize=45631360 blks
>>>          =                 sectsz=512   attr=2
>>> data     =                 bsize=4096   blocks=1460203520, imaxpct=5
>>>          =                 sunit=128    swidth=384 blks
>>> naming   =version 2        bsize=4096   ascii-ci=0
>>> log      =internal         bsize=4096   blocks=521728, version=2
>>>          =                 sectsz=512   sunit=8 blks, lazy-count=1
>>> realtime =none             extsz=4096   blocks=0, rtextents=0
>
> md2 is 5 spindle RAID6, geometry 512KB * 3 = 1.5MB stripe width. The
> XFS alignment matches the md device. This may be hurting you with LUKS
> between XFS and md. I don't know how LUKS handles 1.5MB inbound writes
> and if it passes an aligned write to md after encrypting. I also don't
> know how it handles write barriers passed down the stack by XFS, or if
> it does at all.
>
>>> 4. You're getting a lot of kswapd timeouts because you have swap and
>>> the md/RAID6 array on the same disks. Relocate swap to disks that are
>>> not part of this RAID6. Small SSDs are cheap and fast. Buy one and put
>>> swap on it. Or install more RAM in the machine. Going the SSD route is
>>> better as it gives flexibility. For instance, you can also relocate
>>> your syslog files to it and anything else that does IO without eating
>>> lots of space. This decreases the IOPS load on your rust.
>>
>> No no, swap is not on any of the raid disks.
>>
>>> # cat /proc/swaps
>>> Filename     Type        Size      Used   Priority
>>> /dev/sda3    partition   7812496   0      -1
>>
>> sda is not in the raid. In the raid there are sd[cdefg].
>
> I was wrong here. After reading a little about kswapd I now know it
> doesn't simply move pages to/from disk. It frees memory in other ways
> as well. In this case it's apparently waiting to free an inode being
> reclaimed by xfs_reclaim_inode, which is stalled on lower level IO. So
> kswapd times out as a result. A slow LUKS thread would explain this as
> well.
>
>>> 5. Describe in some detail the workload(s) causing the heavy IO, and
>>> thus these timeouts.
>>
>> cd /olddharddisk
>> cp -av . /raid/
>>
>> oldhardddisk is a mounted 1tb old harddisk, /raid is the 6tb raid from
>> above.
>
> So you are simply doing a huge file copy from an apparently slower
> single disk to an apparently faster striped RAID. But you get horrible
> throughput and kernel timeout errors, and the RAID disks are apparently
> fairly idle.
>
>> Heavy workload while this copy process (2 CPUs, each 4 cores):
>
> Model # of the CPUs so I can look up the specs?
>
>>> top - 11:13:37 up 4 days, 21:32,  2 users,  load average: 12.95, 11.33, 10.32
>>> Tasks: 155 total,   2 running, 153 sleeping,   0 stopped,   0 zombie
>>> Cpu(s):  0.0%us,  5.7%sy,  0.0%ni, 82.1%id, 11.8%wa,  0.0%hi,  0.3%si,  0.0%st
>>> Mem:  32916276k total, 32750240k used,   166036k free, 10076760k buffers
>>> Swap:  7812496k total,        0k used,  7812496k free, 21221136k cached
>>>
>>>   PID USER      PR  NI  VIRT  RES  SHR S %CPU %MEM    TIME+  COMMAND
>>>   699 root      20   0     0    0    0 S   11  0.0 248:17.59 md2_raid6
>>
>> Don't know what consumes all of this 32GB RAM... 'top' sorted by memory
>> consumption does not tell me. All entries are only 0.0% and 0.1%
>
> ~30GB of your RAM is being eaten by filesystem metadata buffers (10GB)
> and page cache (20GB). Linux uses nearly all free memory for cache and
> buffers to prevent disk accesses, which speeds up access to frequently
> used data considerably. This is normal. A huge copy like this is going
> to cause a huge amount of caching.
>
> I assume this 1TB file copy has finally completed. Start another large
> test copy and paste the CPU burn for the LUKS thread so we can confirm
> whether this is the root cause of your problem.
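[A simple way to capture that CPU burn with just procps and sysstat
(the exact thread names can vary by kernel) would be to watch the
crypto threads while the next copy runs:

  $ top -H -d 2              # threads view; press '1' to also show per-core load
  $ ps -eLf | grep -i crypt  # dm-crypt workers usually appear as kcryptd/dmcrypt_write

If a single crypto thread sits near 100% of one core while 'iostat -x 2'
shows the md2 member disks mostly idle, the single-threaded LUKS layer
is the likely bottleneck.]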