On Tuesday September 6, dan.j.williams@xxxxxxxxx wrote:
> Hello,
>
> I am writing to the list to gauge interest in a modification of the
> md driver that allows it to take advantage of raid acceleration
> hardware.  I/O processors like the Intel IOP333
> (http://www.intel.com/design/iio/docs/iop333.htm) contain an xor
> engine for raid5 and raid6 calculations, but currently the md driver
> does not fully utilize these resources.
>
> Dave Jiang wrote a driver that re-routed calls to xor_block() to use
> the hardware xor engine.  However, from my understanding, he found
> that performance did not improve, due to the fact that md deals in
> PAGE_SIZE (4K) blocks.  At 4K, the overhead of setting up the engine
> destroys any performance advantage over a software xor.  The goal of
> the modification would be to enable md to understand the capabilities
> of the platform's xor resources and allow it to issue optimal block
> sizes.
>
> The first question is whether a solution along these lines would be
> valued by the community?  The effort is non-trivial.

If the effort is non-trivial, then I suggest you only do it if it has
real value to *you*.  If it does, community involvement is more likely
to provide value *to* you (such as guidance, bug-fixes, long-term
maintenance) than to get value *from* you, though hopefully it would
be a win-win situation.

I'm not surprised that simply replacing xor_block with calls into the
hardware engine didn't help much.  xor_block is currently called under
a spinlock, so the main processor will probably be completely idle
while the AA is doing the XOR calculation, so there isn't much room
for improvement.

If I were to try to implement this, here is how I would do it:

1/ Get the xor calculation out from under the spinlock.  This will
require a fairly deep understanding of the handle_stripe() function.
The 'stripe_head' works somewhat like a state machine: handle_stripe
assesses the current state and advances it one step.
Currently, if it determines that it is time to write some data, it
will:
 - copy data out of file-system buffers into its own cache;
 - perform the xor calculations in the cache, locking all blocks that
   then become dirty;
 - schedule a write on all those locked blocks.  The stripe won't be
   ready to be handled again until all the writes complete.

This should be changed so that we don't copy+xor, but instead just
lock the blocks and flag them as needing xor.  Then, after sh->lock is
dropped, you would send the copy+xor request to the AA, or do it
in-line.  Once the copy+xor is completed, the stripe needs to get
flagged for handling again.  handle_stripe will then need to notice
that parity has been calculated, so writing can commence.

2/ Then I would try to find the best internal API to provide for the
AA (Application Accelerator, for those who haven't read the spec yet).
My guess is that it should work much like the crypto API.  I'm not
up-to-date with that, so I don't know if the async-crypto-API is
complete and merged yet (the async-crypto-API is for sending data to
separate processors for crypto manipulation and being alerted
asynchronously when they complete).  If it is, definitely look into
using it.  If it isn't, certainly look into it and maybe even help its
development to make sure it can handle multiple-input xor operations.

Step 1 is probably quite useful anyway and is unlikely to slow current
performance - it is just a re-arrangement.  Once that is done,
plugging in async xor should be fairly easy whether you use the
crypto-api or not.

I don't think it is practical to use larger block sizes for the xor
operations, and I doubt it is needed.  The DMA engine in the AA has a
very nice chaining arrangement where new operations can be added to
the end of the chain at any time, and I doubt the effort of loading a
new chain descriptor would be a substantial fraction of the time it
takes to xor a 4k block.  As long as you keep everything async
(i.e. keep the main processor busy while the copy+xor is happening)
you should notice some speed-up ... or at least a drop in cpu
activity.

One last note: if you do decide to give '1/' a try, remember to keep
patches small and well defined.  handle_stripe currently does xor in
four places: two in compute_parity (one prior to a write, one for
resync) and two in compute_block (one for a degraded read, one for
recovery).  Don't try to change them all at once.  One, or at most
two, at a time makes the patches much easier to review.

Good luck,
NeilBrown

-
To unsubscribe from this list: send the line "unsubscribe linux-raid"
in the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at http://vger.kernel.org/majordomo-info.html