On 3/17/2014 12:43 AM, Adam Goryachev wrote:
> On 13/03/14 22:58, Stan Hoeppner wrote:
>> On 3/12/2014 9:49 PM, Adam Goryachev wrote:
>> ...
>>>    Number   Major   Minor   RaidDevice State
>>>       7       8       33        0      active sync   /dev/sdc1
>>>       6       8        1        1      active sync   /dev/sda1
>>>       8       8       49        2      active sync   /dev/sdd1
>>>       5       8       81        3      active sync   /dev/sdf1
>>>       9       8       65        4      active sync   /dev/sde1
>> ...
>>> /dev/sda Total_LBAs_Written 845235
>>> /dev/sdc Total_LBAs_Written 851335
>>> /dev/sdd Total_LBAs_Written 804564
>>> /dev/sde Total_LBAs_Written 719767
>>> /dev/sdf Total_LBAs_Written 719982
>> ...
>>> So the drive with the highest writes 851335 and the drive with the
>>> lowest writes 719982 show a big difference. Perhaps I have a problem
>>> with the setup/config of my array, or similar?
>>
>> This is normal for striped arrays. If we reorder your write statistics
>> table to reflect array device order, we can clearly see the effect of
>> partial stripe writes. These are new file allocations, appends, etc.
>> that are smaller than the stripe width. Totally normal. To get these
>> close to equal you'd need a chunk size of 16K or smaller.
>
> Would that have a material impact on performance?

Not with SSDs. If this was a rust array you'd probably want an 8KB or
16KB chunk to more evenly spread the small write IOs.

> While current wear stats (Media Wearout Indicator) are all 98 or
> higher, at some point, would it be reasonable to fail the drive with
> the lowest write count, and then use it to replace the drive with the
> highest write count, repeating twice, so that over the next period of
> time usage should merge toward the average? Given the current wear
> rate, I will probably replace all the drives in 5 years, which is well
> before they reach 50% wear anyway.

Given the level of production write activity on your array, doing what
you suggest above will simply cause leapfrogging, taking drives with
lesser wear on them and shooting them way out in front of the drives
with the most wear. In fact, any array operations you perform put far
more wear on the flash cells than normal operation does.

>>> So, I could simply do the following:
>>> mdadm --manage /dev/md1 --add /dev/sdb1
>>> mdadm --grow /dev/md1 --raid-devices=6
>>>
>>> Probably also need to remove the bitmap and re-add the bitmap.
>>
>> Might want to do
>>
>> ~$ echo 250000 > /proc/sys/dev/raid/speed_limit_min
>> ~$ echo 500000 > /proc/sys/dev/raid/speed_limit_max
>>
>> That'll bump min resync to 250 MB/s per drive, max 500 MB/s. IIRC the
>> defaults are 1 MB/s and 200 MB/s.
>
> Worked perfectly on one machine, the second machine hung, and basically
> crashed. Almost turned into a disaster, but thankfully having two
> copies over the two machines I managed to get everything sorted. After
> a reboot, the second machine recovered and it grew the array also.
> See: https://bugs.debian.org/cgi-bin/bugreport.cgi?bug=629442

This is the backup machine, yes? The last info I had from you said this
box was using rust, not SSD. Is that still the case? If so, you should
not have bumped the reshape speed upward, as rust can't handle it,
especially with load other than md on it.

Also, I recall you had to install a backport kernel on san1 as well as
a new iscsi-target package. What kernel and iscsi-target version is
running on each of san1 and san2? I'm guessing they're not the same.

What elevator is configured on san1 and san2? It should be noop for SSD
and deadline for rust.
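You can gather all of that in one pass on each box. A rough sketch,
assuming Debian-style packaging and the sda/sdc/sdd/sde/sdf names from
your earlier output -- adjust the device list and package name to what
each machine actually has:

~$ uname -r
~$ dpkg -l | grep -i iscsi    # or whatever your target package is called
~$ for d in sda sdc sdd sde sdf; do echo -n "$d: "; cat /sys/block/$d/queue/scheduler; done

The scheduler file shows the active elevator in brackets, e.g.
"noop [deadline] cfq". To change it on the fly, per drive:

~$ echo noop > /sys/block/sda/queue/scheduler        # SSD box
~$ echo deadline > /sys/block/sda/queue/scheduler    # rust box

That setting doesn't survive a reboot, so once you've settled on one,
set it from a boot script or pass elevator=noop (or deadline) on the
kernel command line to make it the default.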
> Some of the logs from that time:
> Mar 13 23:05:59 san2 kernel: [42511.418380] RAID conf printout:
> Mar 13 23:05:59 san2 kernel: [42511.418385]  --- level:5 rd:6 wd:6
> Mar 13 23:05:59 san2 kernel: [42511.418388]  disk 0, o:1, dev:sdc1
> Mar 13 23:05:59 san2 kernel: [42511.418390]  disk 1, o:1, dev:sde1
> Mar 13 23:05:59 san2 kernel: [42511.418392]  disk 2, o:1, dev:sdd1
> Mar 13 23:05:59 san2 kernel: [42511.418394]  disk 3, o:1, dev:sdf1
> Mar 13 23:05:59 san2 kernel: [42511.418396]  disk 4, o:1, dev:sda1
> Mar 13 23:05:59 san2 kernel: [42511.418399]  disk 5, o:1, dev:sdb1
> Mar 13 23:05:59 san2 kernel: [42511.418444] md: reshape of RAID array md1
> Mar 13 23:05:59 san2 kernel: [42511.418448] md: minimum _guaranteed_ speed: 1000 KB/sec/disk.
> Mar 13 23:05:59 san2 kernel: [42511.418451] md: using maximum available idle IO bandwidth (but not more than 200000 KB/sec) for reshape.
> Mar 13 23:05:59 san2 kernel: [42511.418493] md: using 128k window, over a total of 468847936k.
> Mar 13 23:06:00 san2 kernel: [42511.512165] md: md_do_sync() got signal ... exiting
> Mar 13 23:07:01 san2 kernel: [42573.067781] iscsi_trgt: Abort Task (01) issued on tid:9 lun:0 by sid:8162774362161664 (Function Complete)
> Mar 13 23:07:01 san2 kernel: [42573.067789] iscsi_trgt: Abort Task (01) issued on tid:11 lun:0 by sid:7318349599801856 (Function Complete)
> Mar 13 23:07:01 san2 kernel: [42573.067797] iscsi_trgt: Abort Task (01) issued on tid:12 lun:0 by sid:6473924787110400 (Function Complete)
> Mar 13 23:07:01 san2 kernel: [42573.067838] iscsi_trgt: Abort Task (01) issued on tid:14 lun:0 by sid:5348025014485504 (Function Complete)
> Mar 13 23:07:02 san2 kernel: [42573.237591] iscsi_trgt: Abort Task (01) issued on tid:8 lun:0 by sid:4503599899804160 (Function Complete)
> Mar 13 23:07:02 san2 kernel: [42573.237600] iscsi_trgt: Abort Task (01) issued on tid:2 lun:0 by sid:14918173819994624 (Function Complete)
...
> I probably hit CTRL-C causing the "got signal... exiting" because the
> system wasn't responding. There are a *lot* more iscsi errors and then
> these:
> Mar 13 23:09:09 san2 kernel: [42700.645060] INFO: task md1_raid5:314
> blocked for more than 120 seconds.

The md write thread blocked for more than 2 minutes. Often these
timeouts are due to multiple processes fighting for IO. This leads me
to believe san2 has rust based disks, and that the kernel and other
tweaks applied to san1 were not applied to san2.

...
> This did lead to another observation.... The speed of the resync
> seemed limited by something other than disk IO.

On both san1/san2 or just san1? I'm assuming for now you mean san1 only.

> It was usually around 250 to 300MB/s, the maximum achieved was around
> 420MB/s. I also noticed that idle CPU time on one of the cores was
> relatively low, though I never saw it hit 0 (minimum I saw was 12%
> idle, average around 20%).

Don't look at idle; look at what's eating the CPU. Was that 80+% being
consumed by sys, wa, or a process? Without that information it's not
possible to definitively answer your questions below.

Do note, recall that during fio testing you were hitting 1.6 GB/s write
throughput, ~4x greater than the resync throughput stated above. If one
of your cores was at greater than 80% utilization with only ~420 MB/s
of resync throughput, then something other than the md write thread was
hammering that core.
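Next time a reshape or heavy write load is running, capture where the
busy core's time actually goes instead of watching idle. A minimal
sketch, assuming the sysstat package is installed for mpstat:

~$ mpstat -P ALL 5    # per-core %usr / %sys / %iowait every 5 seconds
~$ top -H -d 5        # -H lists individual threads, so md1_raid5 shows up on its own

If the hot core is mostly %sys with md1_raid5 at the top of top -H,
then the md write thread really is the limit. If it's mostly %iowait or
some other thread (an iscsi target or writeback thread, say), the CPU
isn't your problem.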
> So, I'm wondering whether I should consider upgrading the CPU and/or
> motherboard to try and improve peak performance?

As I mentioned after walking you through all of the fio testing, you
have far more hardware than your workload needs.

> Currently I have Intel Xeon E3-1230V2/3.3GHz/8MB
> Cache/4core/8thread/5GTs, my supplier has offered a number of options:
> 1) Compatible with current motherboard
>    Intel Xeon E3-1280V2/3.6GHz/8MB Cache/4core/8thread/5GTs

This may gain you 5% peak RAID5 throughput.

> 2) Intel Xeon E5-2620V2/2.1GHz/15MB Cache/6core/12thread/5GTs
> 3) Intel Xeon E5-2630V2/2.6GHz/15MB Cache/6core/12thread/7.2GTs

Both of these will decrease your peak RAID5 throughput quite markedly.
md raid5 is clock sensitive, not cache sensitive.

> My understanding is that the RAID5 is single threaded, so will work
> best with a higher speed single core CPU compared to a larger number
> of cores at a lower speed. However, I'm not sure how much "work" is
> being done across the various models. ie, does an E5 CPU do more work
> even though it has a lower clock speed? Does this carry over to the E7
> class as well?

You're chasing a red herring. Any performance issue you currently have,
and I've seen no evidence of such to this point, is not due to the
model of CPU in the box. It's due to tuning, administration, etc.

> Currently I'm looking to replace at least the motherboard with
> http://www.supermicro.com/products/motherboard/Xeon/C202_C204/X9SCM-F.cfm
> in order to get 2 of the PCIe 2.0 8x slots (one for the existing LSI
> SATA controller and one for a dual port 10Gb ethernet card). This will
> provide a 10Gb cross-over connection between the two servers, plus
> replace the 8 x 1G ports with a single 10Gb port (solving the load
> balancing across the multiple links issue). Finally, this 28 port
> (4 x 10G + 24 x 1G) switch

Adam, if you have the budget now, I absolutely agree that 10 GbE is a
much better solution than the multi-GbE setup. But you don't need a new
motherboard. The S1200BTLR has 4 PCIe 2.0 slots: one x8 electrical in
an x16 physical slot, and three x4 electrical in x8 physical slots.
Your bandwidth per slot is:

  x8   4 GB/s unidirectional x2  <- occupied by LSI SAS HBA
  x4   2 GB/s unidirectional x2  <- occupied by quad port GbE cards

10 Gbps Ethernet has a 1 GB/s effective data rate one way. Inserting an
x8 PCIe card into an x4 electrical/x8 physical slot gives you 4 active
lanes for 2+2 GB/s bandwidth. This is an exact match for a dual port
10 GbE card. You could install up to three dual port 10 GbE cards into
these 3 slots of the S1200BTLR.
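For what it's worth, once a card is seated you can confirm the link it
actually trained to with lspci. A rough sketch -- the 01:00.0 address
below is just an example, use whatever bus address lspci reports for
the NIC on your box:

~$ lspci | grep -i ethernet
~$ lspci -vv -s 01:00.0 | grep -i -e lnkcap -e lnksta

LnkCap is what the card supports, LnkSta is what it negotiated. In one
of the x4 electrical slots you'd expect "Speed 5GT/s, Width x4", which
is the 2+2 GB/s figure above.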
> http://www.netgear.com.au/business/products/switches/stackable-smart-switches/GS728TXS.aspx#
> should allow the 2 x 10G connections to be connected through to the 8
> servers with 2 x 1G connections each, using multipath scsi to set up
> two connections (one on each 1G port) with the same destination (10G
> port).
>
> Any suggestions/comments would be welcome.

You'll want to use SFP+ NICs and passive Twin-Ax cables to avoid paying
the $2000 fiber tax, as that is what four SFP+ 10 Gbit fiber LC
transceivers cost--$500 each. The only SFP+ Intel dual port 10 GbE NIC
that ships with vacant SFP+ ports is the X520-DA2:

http://www.newegg.com/Product/Product.aspx?Item=N82E16833106044

To connect the NICs to the switch and to one another you'll need 3 or 4
SFP+ passive Twin-Ax cables of appropriate length. Three if direct
server-to-server works, four if it doesn't, in which case you connect
all 4 to the 4 SFP+ switch ports. You'll need to contact Intel and
inquire about the NIC-to-NIC functionality. I'm not using the word
cross-over because I don't believe it applies to Twin-Ax cable. But you
need to confirm their NICs will auto-negotiate the send/receive pairs.

This isn't twisted pair cable, Adam. It's a different beast entirely.
You can't run the length you want, then cut the cable and terminate it
yourself. These cables must be pre-made to length and terminated at the
factory. One look at the prices tells you that: the 1 meter Intel cable
costs more than a 500ft spool of Cat 5e. A 1 meter and a 3 meter
passive Twin-Ax cable, Intel and Netgear:

http://www.newegg.com/Product/Product.aspx?Item=N82E16812128002
http://www.newegg.com/Product/Product.aspx?Item=N82E16812638004

If the server to switch distance is much over 15ft you will need to
inquire with Intel and Netgear about the possibility of using active
Twin-Ax cables. If their products do not support active cables you'll
have to go with fiber, and spend the extra $2000 for the 4
transceivers, along with one LC-to-LC multimode fiber cable for the
server-to-server link, and two straight-through LC-LC multimode fiber
cables.

-- 
Stan
--
To unsubscribe from this list: send the line "unsubscribe linux-raid" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at  http://vger.kernel.org/majordomo-info.html