Hi Dallas,

On 12/15/2015 12:30 PM, Dallas Clement wrote:
> Thanks guys for all the ideas and help.
>
> Phil,
>
>> Very interesting indeed. I wonder if the extra I/O in flight at high
>> depths is consuming all available stripe cache space, possibly not
>> consistently. I'd raise and lower that in various combinations with
>> various combinations of iodepth. Running out of stripe cache will cause
>> premature RMWs.
>
> Okay, I'll play with that today. I have to confess I'm not sure that
> I completely understand how the stripe cache works. I think the idea
> is to batch I/Os into a complete stripe if possible and write out to
> the disks all in one go to avoid RMWs. Other than alignment issues,
> I'm unclear on what triggers RMWs. It seems like as Robert mentioned
> that if the I/Os block size is stripe aligned, there should never be
> RMWs.
>
> My stripe cache is 8192 btw.
>

Stripe cache is the kernel's workspace to compute parity or to recover
data from parity. It works on 4k blocks. Per "man md", the units are
number of such blocks per device. *The blocks in each cache stripe are
separated from each other on disk by the chunk size*.

Let's examine some scenarios for your 128k chunk size, 12 devices. You
have 8192 cache stripes of 12 blocks each:

1) Random write of 16k. 4 stripes will be allocated from the cache for
*all* of the devices, and filled for the devices written. The raid5
state machine lets them sit briefly for a chance for more writes to the
other blocks in each stripe.

1a) If none come in, MD will request a read of the old data blocks and
the old parities. When those arrive, it'll compute the new parities and
write both parities and new data blocks. Total I/O: 32k read, 32k write.

1b) If other random writes come in for those stripes, chunk size spaced,
MD will wait a bit more. Then it will read in any blocks that weren't
written, compute parity, and write all the new data and parity. Total
I/O: 16k * n, possibly some reads, the rest writes.

2) Sequential write of stripe-aligned 1408k. The first 128k allocates 32
cache stripes (128k / 4k) and fills their first block. The next 128k
fills the second block of each cache stripe. And so on, filling all the
data blocks in the cache stripes. MD shortly notices a full cache stripe
write on each, so it just computes the parities and submits all of those
writes.

3) Sequential write of 256k, aligned or not. As above, but you only fill
two blocks in each cache stripe. MD then reads 1152k, computes parity,
and writes 384k.

4) Multiple back-to-back writes of 1408k aligned. First grabs 32 cache
stripes and shortly queues all of those writes. Next grabs another 32
cache stripes and queues more writes. And then another 32 cache stripes
and writes. Underlying layer, as its queue grows, notices the adjacency
of chunk writes from multiple top-level writes and starts merging.
Stripe caches are still held, though, until each write is completed. If
256 top-level writes are in flight (8192/32), you've exhausted your
stripe cache. Note that this is writes in flight in your application
*and* writes in flight from anything else. Keep in mind that merging
might actually raise the completion latency for the earlier writes.

I'm sure you can come up with more. The key is that stripe parity
calculations must be performed on blocks separated on disk by the chunk
size. Really big chunk sizes don't actually help parity raid, since
everything is broken down to 4k for the stripe cache, then re-merged
underneath it.

> I wish this were for fun! ;) Although this has been a fun discussion.
> I've learned a ton. This effort is for work though. I'd be all over
> the SSDs and caching otherwise. I'm trying to characterize and then
> squeeze all of the performance I can out of a legacy NAS product. I
> am constrained by the existing hardware. Unfortunately I do not have
> the option of using SSDs or hardware RAID controllers. I have to rely
> completely on Linux RAID.
>
> I also need to optimize for large sequential writes (streaming video,
> audio, large file transfers), iSCSI (mostly used for hosting VMs), and
> random I/O (small and big files) as you would expect with a NAS.

On spinning rust, once you introduce any random writes, you've
effectively made the entire stack a random workload. This is true for
all raid levels, but particularly true for parity raid due to the RMW
cycles. If you really need great sequential performance, you can't allow
the VMs and the databases and small files on the same disks.

That said, I recommend a parity raid chunk size of 16k or 32k for all
workloads. Greatly improves spatial locality for random writes, reduces
stripe cache hogging for sequential writes, and doesn't hurt sequential
reads too much.

Phil
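
The arithmetic behind the scenarios above can be sketched in a few lines
of Python. This is only a rough model, not how md is implemented. It
assumes single parity (raid5), 4k cache blocks, chunk-aligned writes,
and the read/write accounting exactly as the scenarios describe it. The
function names are made up for illustration, and the real driver may
choose a different strategy for any given stripe.

```python
KIB = 1024
BLOCK = 4 * KIB  # stripe cache block size, per the discussion above


def cache_stripes_per_chunk(chunk, block=BLOCK):
    """Cache stripes needed to cover one chunk on every device.

    Blocks within a cache stripe sit one chunk apart on disk, so a
    full-stripe write needs chunk / block cache stripes.
    """
    return chunk // block


def small_write_rmw_io(write_bytes, parity_disks=1):
    """Scenario 1a: read old data and old parity, write new data and parity.

    Assumes the write stays inside a single chunk.
    Returns (bytes_read, bytes_written).
    """
    total = write_bytes * (1 + parity_disks)
    return total, total


def partial_stripe_io(chunks_written, data_disks, chunk, parity_disks=1):
    """Scenario 3: read the untouched data chunks, write new data and parity.

    Returns (bytes_read, bytes_written).
    """
    read = (data_disks - chunks_written) * chunk
    written = (chunks_written + parity_disks) * chunk
    return read, written


def full_stripe_writes_until_exhausted(stripe_cache_size, chunk, block=BLOCK):
    """Scenario 4: aligned full-stripe writes in flight that fill the cache."""
    return stripe_cache_size // cache_stripes_per_chunk(chunk, block)


if __name__ == "__main__":
    # Hypothetical setup matching the thread: 12 devices (11 data + 1
    # parity), 128k chunk, stripe cache size of 8192.
    chunk, data_disks, cache = 128 * KIB, 11, 8192
    print("cache stripes per full-stripe write:", cache_stripes_per_chunk(chunk))
    print("16k random write, KiB (read, written):",
          tuple(n // KIB for n in small_write_rmw_io(16 * KIB)))
    print("256k sequential write, KiB (read, written):",
          tuple(n // KIB for n in partial_stripe_io(2, data_disks, chunk)))
    print("full-stripe writes in flight to exhaust cache:",
          full_stripe_writes_until_exhausted(cache, chunk))
```

For the 16k random write this gives 32k read and 32k written, and for
the 256k sequential write 1152k read and 384k written, matching the
totals quoted in scenarios 1a and 3. Re-running it with a 16k or 32k
chunk shows how the partial-stripe read cost shrinks with chunk size.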