On 8/23/2016 10:54 AM, Matt Garman wrote:
> On Tue, Aug 16, 2016 at 11:36 AM, Doug Dumitru <doug@xxxxxxxxxx> wrote:
>> The RAID rebuild for a single bad drive "should" be an XOR and should run at
>> 200,000 kb/sec (the default speed_limit_max). I might be wrong on this and
>> this might still need a full RAID-6 syndrome compute, but I dont think so.
>>
>> The rebuild might not hit 200MB/sec if the drive you replaced is
>> "conditioned". Be sure to secure erase any non-new drive before you replace
>> it.
>>
>> Your read IOPS will compete with now busy drives which may increase the IO
>> latency a lot, and slow you down a lot.
>>
>> One out of 22 read OPS will be to the bad drive, so this will now take 22
>> reads to re-construct the IO. The reconstruction is XOR, so pretty cheap
>> from a CPU point of view. Regardless, your IOPS total will double.
>>
>> You can probably mitigate the amount of degradation by lowering the rebuild
>> speed, but this will make the rebuild take longer, so you are messed up
>> either way. If the server has "down time" at night, you might lower the
>> rebuild to a really small value during the day, and up it at night.
>
> OK, right now I'm looking purely at performance in a degraded state,
> no rebuild taking place.
>
> We have designed a simple read load test to simulate the actual
> production workload. (It's not perfect of course, but a reasonable
> approximation. I can share with the list if there's interest.) But
> basically it just runs multiple threads of reading random files
> continuously.
>
> When the array is in a pristine state, we can achieve read throughput
> of 8000 MB/sec (at the array level, per iostat with 5 second samples).
>
> Now I failed a single drive. Running the same test, read performance
> drops all the way down to 200 MB/sec.
>
> I understand that IOPS should double, which to me says we should
> expect a roughly 50% read performance drop (napkin math). But this is
> a drop of over 95%.
>
> Again, this is with no rebuild taking place...
>
> Thoughts?

This depends a lot on how you structured your raid array. I didn't see
your earlier emails, so I'm inferring from the "one out of 22 reads will
be to the bad drive" that you have a 24 disk raid6 array? If so, then
that's 22 data disks and 2 parity disks per stripe. I'm gonna use that
as the basis for my next statement even if it's slightly wrong.

Doug was right in that you will have to read 21 data disks and 1 parity
disk to reconstruct reads from the missing block of any given stripe.
And while he is also correct that this doubles the IO ops needed to get
your read data, it doesn't address the XOR load to get your data. With
19 data disks and 1 parity disk, and say a 64k chunk size, you have to
XOR 20 64k data blocks for 1 result. If you are getting 200MB/s, you
are actually achieving more like 390MB/s of data read: 190MB/s of it is
direct reads, and you are running XOR over 200MB/s (20 blocks per
result) in order to generate the other 10MB/s of results.

Why that performance is so bad is probably due to a couple of factors
(and I say probably because, without actually testing it, this is just
a hand-wavy explanation based upon what I've tested and found in the
past, which may not be true today):

1) 200MB/s of XOR is not insignificant. Due to our single-threaded XOR
   routines, you can actually keep a CPU pretty busy with this. Also,
   even though the XOR routines try to time their assembly 'just so' so
   that they can use the cache-avoiding instructions, this fails more
   often than not, so you end up blowing CPU caches while doing this
   work, which of course affects the overall system.
   Possible fixes for this might include:

   a) Multi-threaded XOR becoming the default (last I knew it wasn't,
      correct me if I'm wrong)
   b) Improved XOR routines that deal with cache more intelligently
   c) Creating a consolidated page cache/stripe cache (if we can read
      more of the blocks needed to get our data from cache instead of
      disk, it helps reduce that IO ops issue)
   d) Rearchitecting your arrays into raid50 instead of one big raid6
      array

2) Even though we theoretically doubled IO ops, we haven't addressed
   whether or not that doubling is done efficiently. Testing would be
   warranted here to make sure that our reads for reconstruction aren't
   negatively impacting overall disk IO op capability. We might be
   doing something that we can fix, such as interfering with merges,
   with ordering, or with latency-sensitive commands. Someone would
   need to do some deep inspection of how commands are being created
   and sent to each device in order to see whether we are keeping the
   disks busy, or whether our own latencies at the kernel level are
   leaving them idle and killing our overall throughput (or,
   conversely, whether the random head seeks have just gone so
   radically through the roof that the problem here really is the time
   it takes the heads to travel everywhere we are sending them).

-- 
Doug Ledford <dledford@xxxxxxxxxx>
    GPG Key ID: 0E572FDD
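
As a rough check on the napkin math above (200MB/s delivered, 20 blocks
XORed per reconstructed result), here is a minimal Python sketch. The
function name and its parameters are made up for illustration, and it
assumes that one request in 20 lands on the missing disk, which is a
simplification of the numbers used above and not a measurement.

```python
def degraded_read_load(delivered_mb_s, blocks_per_rebuild):
    """Split delivered throughput into direct and reconstructed reads.

    Assumption (illustrative only): one request in `blocks_per_rebuild`
    lands on the missing disk, and each such request needs
    `blocks_per_rebuild` surviving blocks (data plus one parity) to be
    read from disk and XORed together.
    """
    rebuilt = delivered_mb_s / blocks_per_rebuild   # results produced by XOR
    direct = delivered_mb_s - rebuilt               # served straight from disk
    xor_input = rebuilt * blocks_per_rebuild        # data fed through XOR
    return {
        "direct": direct,
        "rebuilt": rebuilt,
        "xor_input": xor_input,
        "total_disk_read": direct + xor_input,
    }


load = degraded_read_load(200, 20)
for name, mb_s in load.items():
    print(f"{name}: {mb_s:.0f} MB/s")
```

With these inputs the sketch gives 190MB/s of direct reads, 10MB/s of
reconstructed results, 200MB/s going through XOR, and about 390MB/s
read from the disks in total.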