On Thu, 14 Aug 2014 06:33:51 +0000 Markus Stockhausen <stockhausen@xxxxxxxxxxx> wrote:

> > From: NeilBrown [neilb@xxxxxxx]
> > Sent: Thursday, 14 August 2014 06:11
> > To: Markus Stockhausen
> > Cc: shli@xxxxxxxxxx; linux-raid@xxxxxxxxxxxxxxx
> > Subject: Re: Bigger stripe size
> > ...
> > >
> > > Would it make sense to work with per-stripe sizes? E.g.
> > >
> > > User reads/writes 4K -> Work on a 4K stripe.
> > > User reads/writes 16K -> Work on a 16K stripe.
> > >
> > > Difficulties.
> > >
> > > - avoid overlapping of "small" and "big" stripes
> > > - split stripe cache into different sizes
> > > - Can we allocate multi-page memory to have continuous work areas?
> > > - ...
> > >
> > > Benefits.
> > >
> > > - Stripe handling unchanged.
> > > - parity calculation more efficient
> > > - ...
> > >
> > > Other ideas?
> >
> > I fear that we are chasing the wrong problem.
> >
> > The scheduling of stripe handling is currently very poor. If you do a large
> > sequential write which should map to multiple full-stripe writes, you still
> > get a lot of reads. This is bad.
> > The reason is that limited information is available to the raid5 driver
> > concerning what is coming next, and it often guesses wrongly.
> >
> > I suspect that it can be made a lot cleverer, but I'm not entirely sure how.
> > A first step would be to "watch" exactly what happens in terms of the way
> > that requests come down, the timing of 'unplug' events, and the actual
> > handling of stripes. 'blktrace' could provide most or all of the raw data.
> >
>
> Thanks for that info. I did not expect to find such basic challenges in the code ...
> Could you explain what you mean by unplug events? Maybe you can give me
> the function in raid5.c that would be the right place to understand better how
> changed data "leaves" the stripes and is put on the free lists again.

When data is submitted to any block device, the code normally calls
blk_start_plug(), and when it has submitted all the requests that it wants
to submit it calls blk_finish_plug(). If any code ever needs to schedule(),
e.g. to wait for memory to be freed, the equivalent of blk_finish_plug() is
called so that any pending requests are sent on their way.

md/raid5 checks whether a plug is currently in force using blk_check_plugged().
If it is, new requests are queued internally and are not released until
raid5_unplug() is called. The net result of this is to gather multiple small
requests together. It helps with scheduling, but not completely.

There are two important parts to understand in raid5.

make_request() is how a request (struct bio) is given to raid5. It finds which
stripe_heads to attach the bio to and does so using add_stripe_bio(). When
each stripe_head is released (release_stripe()) it is put on a queue (if it
is otherwise idle).

The second part is handle_stripe(). This is called as needed by raid5d. It
plucks a stripe_head off the list, figures out what to do with it, and does
it. Once the data has been written, return_io() is called on all the bios
that are finished with, and their owner (e.g. the filesystem) is told that
the write (or read) is complete.

Each stripe_head represents a 4K strip across all devices. So for an array
with 64K chunks, a "full stripe write" requires 16 different stripe_heads to
be assembled and worked on. This currently all happens one stripe_head at a
time.
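To make the plugging behaviour concrete, here is a minimal sketch of the
submitter side, using the block-layer API of the kernels this thread is about.
blk_start_plug(), blk_finish_plug() and generic_make_request() are real
interfaces; write_many() and its arguments are made up for illustration and
are not code from md or any filesystem:

#include <linux/bio.h>
#include <linux/blkdev.h>       /* blk_start_plug(), blk_finish_plug() */

/* Illustrative only: submit a batch of already-prepared bios behind a plug,
 * so that the underlying driver (e.g. raid5) sees them released together. */
static void write_many(struct bio **bios, int nr)
{
        struct blk_plug plug;
        int i;

        blk_start_plug(&plug);                  /* requests may now be held back */
        for (i = 0; i < nr; i++)
                generic_make_request(bios[i]);  /* queued, not yet issued */
        blk_finish_plug(&plug);                 /* pending requests are flushed to
                                                 * the drivers; the same flush also
                                                 * happens implicitly if the task
                                                 * schedules while plugged */
}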
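The "queued internally until raid5_unplug()" behaviour is built on the
unplug-callback hook. Below is a hedged sketch of that pattern:
blk_check_plugged(), struct blk_plug_cb and the bio_list helpers are real
kernel interfaces, but deferred_plug, deferred_unplug() and defer_if_plugged()
are simplified stand-ins, not the actual raid5 code:

#include <linux/bio.h>
#include <linux/blkdev.h>
#include <linux/kernel.h>       /* container_of() */

struct deferred_plug {
        struct blk_plug_cb cb;          /* must come first: blk_check_plugged()
                                         * hands the allocation back as a cb */
        struct bio_list deferred_bios;  /* bios held while the plug is active;
                                         * the allocation is zeroed, so this
                                         * starts out as an empty list */
};

static void deferred_unplug(struct blk_plug_cb *cb, bool from_schedule)
{
        struct deferred_plug *dp = container_of(cb, struct deferred_plug, cb);
        struct bio *bio;

        /* Runs when the plug is finished (or the task schedules): release
         * everything that was held back.  from_schedule is ignored in this
         * sketch; the real raid5_unplug() uses it to decide whether to hand
         * the work to raid5d instead. */
        while ((bio = bio_list_pop(&dp->deferred_bios)))
                generic_make_request(bio);
}

/* Returns true if @bio was deferred because a plug is in force;
 * false means the caller should submit it immediately itself. */
static bool defer_if_plugged(struct bio *bio, void *owner)
{
        struct blk_plug_cb *cb;
        struct deferred_plug *dp;

        cb = blk_check_plugged(deferred_unplug, owner,
                               sizeof(struct deferred_plug));
        if (!cb)
                return false;
        dp = container_of(cb, struct deferred_plug, cb);
        bio_list_add(&dp->deferred_bios, bio);  /* released by deferred_unplug() */
        return true;
}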
Once you have digested all that, ask some more questions :-)

NeilBrown

>
> >
> > Then determine what the trace "should" look like and come up with a way for
> > raid5 to figure that out and do it.
> > I suspect that might involve a more "clever" queuing algorithm, possibly
> > keeping all the stripe_heads sorted, possibly storing them in an RB-tree.
> >
> > Once you have that queuing in place, so that the pattern of write requests
> > submitted to the drives makes sense, then it is time to analyse CPU efficiency
> > and find out where double-handling is happening, or when "batching" or
> > re-ordering of operations can make a difference.
> > If the queuing algorithm collects contiguous sequences of stripe_heads
> > together, then processing a batch of them in succession may provide the same
> > improvements as processing fewer, larger stripe_heads.
> >
> > So: first step is to get the IO patterns optimal. Then look for ways to
> > optimise for CPU time.
> >
> > NeilBrown
>
> Markus
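For anyone wanting to experiment with the RB-tree idea quoted above, a minimal
sketch of keeping stripe_heads sorted by sector with the kernel rbtree API
might look like this. struct stripe_head here is a stub with only the fields
the sketch needs, not the real definition in drivers/md/raid5.h, and none of
this is existing raid5 code:

#include <linux/rbtree.h>
#include <linux/types.h>

struct stripe_head {
        sector_t        sector;         /* starting sector of this 4K strip */
        struct rb_node  rb_node;        /* links the stripe_head into the tree */
};

/* Insert @sh into @root keyed by sector, so the daemon could later walk the
 * tree in sector order and issue contiguous runs of stripe_heads together. */
static void stripe_tree_insert(struct rb_root *root, struct stripe_head *sh)
{
        struct rb_node **link = &root->rb_node, *parent = NULL;

        while (*link) {
                struct stripe_head *cur;

                parent = *link;
                cur = rb_entry(parent, struct stripe_head, rb_node);
                if (sh->sector < cur->sector)
                        link = &parent->rb_left;
                else
                        link = &parent->rb_right;
        }
        rb_link_node(&sh->rb_node, parent, link);
        rb_insert_color(&sh->rb_node, root);
}

/* Pluck the lowest-sector stripe_head, or NULL if the tree is empty. */
static struct stripe_head *stripe_tree_pop_first(struct rb_root *root)
{
        struct rb_node *node = rb_first(root);

        if (!node)
                return NULL;
        rb_erase(node, root);
        return rb_entry(node, struct stripe_head, rb_node);
}

A daemon could then repeatedly pop the lowest-sector stripe_head (or walk a
contiguous run with rb_next()) so that writes reach the drives in order.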