Re: raid5 write latency is 10x the drive latency

On Mon, 3 Mar 2014 15:28:02 +0200 Alexander Lyakas <alex.bolshoy@xxxxxxxxx>
wrote:

> Hi Neil,
> 
> We are testing a fully random 8K write IOMETER workload on a raid5 md,
> composed of 5 drives. Initial resync is fully completed.
> 
> We see that the write latency that the MD device demonstrates is 10
> times the latency of individual drives. Typical iostat output is:
> 
> Device:         rrqm/s   wrqm/s     r/s     w/s    rkB/s     wkB/s avgrq-sz avgqu-sz   await r_await w_await  svctm  %util
> dm-28             0.00     0.00  235.05  883.51   940.21   3448.97     7.85    77.88   62.86   50.68   66.10   0.67  75.46
> dm-30             0.00     0.00  230.93  881.44   923.71   3440.72     7.85    68.90   55.31   59.45   54.22   0.70  77.94
> dm-32             0.00     0.00  185.57  846.39   742.27   3300.52     7.84    59.39   51.10   44.98   52.44   0.56  57.73
> dm-33             0.00     0.00  232.99  864.95   931.96   3374.74     7.85    46.72   34.92   31.06   35.96   0.45  49.48
> dm-35             0.00     0.00  265.98  895.88  1063.92   3498.45     7.85    62.14   46.43   41.09   48.01   0.56  65.15
> md4               0.00     0.00    0.00  263.92     0.00  12338.14    93.50     0.00    0.00    0.00    0.00   0.00   0.00
> dm-37             0.00     0.00    0.00  263.92     0.00  12338.14    93.50   128.54  505.81    0.00  505.81   3.89 102.68
> 
> dm-28,30,32,33,35 are MD individual devices.
> dm-37 is a linear Device Mapper device stacked on top of md4, to see
> the MD performance (for some reason MD doesn't show r/w latencies in
> iostat).
> 
> I understand that with the RMW that raid5 is doing, we can expect 2x
> the drive latency (1x to load the stripe-head, 1x to update the
> required stripe-head blocks on disk).  However, I am trying to
> understand where the rest of the latency is coming from.
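
As a rough sanity check of that 2x expectation against the numbers above,
here is a small back-of-the-envelope sketch (it uses the approximate
r_await/w_await figures from the iostat sample, rounded; it is only an
illustration, not a model of the md code):

# Back-of-the-envelope check of the "2x drive latency" RMW expectation,
# using rounded figures from the first iostat sample above.  Illustrative
# only -- not a model of the md code.

dev_r_await_ms = 45.0    # typical r_await on dm-28..dm-35
dev_w_await_ms = 60.0    # typical w_await on dm-28..dm-35
md_w_await_ms  = 505.0   # w_await reported by dm-37 (stacked on md4)

# RMW: one read pass to load the stripe-head, then one write pass to
# update the data and parity blocks, serialized.
expected_ms = dev_r_await_ms + dev_w_await_ms

print(f"expected ~{expected_ms:.0f} ms per 8K write under RMW")
print(f"observed ~{md_w_await_ms:.0f} ms at the array level")
print(f"unaccounted: ~{md_w_await_ms - expected_ms:.0f} ms")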
> 
> One thing I see is that with such random workload, raid5d thread is
> busy updating the bitmap in stack like this:
> [<ffffffff81569f05>] md_super_wait+0x55/0x90
> [<ffffffff815709e8>] bitmap_unplug.part.21+0x158/0x160
> [<ffffffff81570a12>] bitmap_unplug+0x22/0x30
> [<ffffffffa0702f57>] raid5d+0xe7/0x570 [raid456]
> [<ffffffff8156344d>] md_thread+0x10d/0x140
> [<ffffffff8107f050>] kthread+0xc0/0xd0
> [<ffffffff816f61ec>] ret_from_fork+0x7c/0xb0
> [<ffffffffffffffff>] 0xffffffffffffffff
> 
> So I understand that, given the single-threaded nature of raid5, when
> raid5d() is doing the bitmap update, it delays the processing of all
> stripe-heads.
> 
> So I commented out the bitmap_unplug call, just to see if there is
> some other problem. Things are better now, but still:
> Device:         rrqm/s   wrqm/s     r/s      w/s    rkB/s     wkB/s avgrq-sz avgqu-sz   await r_await w_await  svctm  %util
> dm-27             0.00     0.00  254.74  1317.89  1018.95   5271.58     8.00   137.68   87.55   92.98   86.50   0.49  77.05
> dm-29             0.00     0.00  269.47  1345.26  1077.89   5381.05     8.00   146.44  104.24   81.36  108.82   0.52  83.79
> dm-32             0.00     0.00  267.37  1345.26  1069.47   5381.05     8.00   130.10   94.34   73.72   98.44   0.48  76.63
> dm-33             0.00     0.00  267.37  1343.16  1069.47   5372.63     8.00    87.61   54.40   52.19   54.84   0.34  55.16
> dm-36             0.00     0.00  271.58  1355.79  1086.32   5423.16     8.00   136.33   97.45   81.77  100.59   0.47  76.63
> md4               0.00     0.00    0.00   454.74     0.00  21911.58    96.37     0.00    0.00    0.00    0.00   0.00   0.00
> dm-37             0.00     0.00    0.00   454.74     0.00  21911.58    96.37   132.05  330.82    0.00  330.82   2.31 105.26

Peter already hinted something along these lines, but to make it explicit:

Consider the wkB/s column: the number for md4 is about the same as the
sum of any 4 of the numbers for dm-27..dm-36.  This is expected, as the
writes are spread evenly across the devices and parity writes are added.
Now consider the rkB/s column in comparison: reads run at only about 1/5
the rate of the writes.  For a pure read-modify-write cycle you would
see equal numbers.  The 1:5 ratio suggests that on average only about
one device is read for each whole-stripe write.

Now look at w/s.  Each device sees about 1300 write requests per second;
md4 sees about 450.  So the array sees about 4 times the kB/sec of a
single device, but only 1/3 of the requests per second.  That means the
requests seen by the array are on average 12 times the size of the
requests seen by the devices.  The device requests are at least 4K, so
the array requests are at least 48K.  A 48K write to a 5-drive raid5
with 8k chunks fills 1.5 full stripes, so it is no wonder that most of
the writes don't require a read.
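
To make that arithmetic concrete, here is the same calculation spelled
out (a small sketch using rounded figures from the second iostat sample
above):

# The request-size arithmetic above, spelled out with rounded figures
# from the second iostat sample.  Illustrative only.

dev_wkBps = 5380.0     # wkB/s on one member device (dm-27..dm-36)
dev_wps   = 1345.0     # w/s on one member device
md_wkBps  = 21911.0    # wkB/s on md4
md_wps    = 455.0      # w/s on md4

dev_req_kB = dev_wkBps / dev_wps      # ~4K per device request
md_req_kB  = md_wkBps / md_wps        # ~48K per array request
ratio      = md_req_kB / dev_req_kB   # ~12x

drives, chunk_kB = 5, 8
stripe_data_kB = (drives - 1) * chunk_kB   # 32K of data per full stripe

print(f"avg device write: {dev_req_kB:.1f} KiB")
print(f"avg array write:  {md_req_kB:.1f} KiB ({ratio:.0f}x larger)")
print(f"a {md_req_kB:.0f} KiB write covers {md_req_kB / stripe_data_kB:.1f} stripes")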

I don't really know how to interpret the *await fields.  When I care
about latency I prefer to run blktrace and extract the actual events I'm
interested in, then find the latency for each and graph them.
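
For what it's worth, a minimal sketch of that kind of extraction could
look like the following; it assumes blkparse's default text output,
pairs queue (Q) and completion (C) events by starting sector, and
ignores merges, re-queues and split requests, so treat it as a starting
point only:

#!/usr/bin/env python3
# Minimal sketch: read blkparse's default text output on stdin, pair
# Q (queue) and C (complete) events by starting sector, and print the
# per-request latency in milliseconds.  Ignores merges, re-queues and
# split requests -- a rough starting point, not a finished tool.
import sys

queued = {}                       # starting sector -> queue timestamp
for line in sys.stdin:
    fields = line.split()
    if len(fields) < 8:
        continue
    try:
        ts = float(fields[3])     # timestamp in seconds
        sector = int(fields[7])   # starting sector of the request
    except ValueError:
        continue                  # skip summary/message lines
    action = fields[5]
    if action == 'Q':
        queued[sector] = ts
    elif action == 'C' and sector in queued:
        latency_ms = (ts - queued.pop(sector)) * 1000.0
        print(f"{ts:.6f} {sector} {latency_ms:.3f}")

Running it over a trace of md4 and over a trace of one of the member
devices, and comparing the two distributions, should show where the
extra latency sits.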

By the way, commenting out bitmap_unplug effectively removes the bitmap
protection.  It allows data to be written while the bit is still clear
in the bitmap, so after a crash the resync might not fix everything.
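
To spell out the ordering the bitmap is there to enforce, here is a
much-simplified model (illustrative Python, not the actual md code
paths):

# Much-simplified model of write-intent bitmap ordering.  Illustrative
# only -- not the actual md code.

class BitmapModel:
    def __init__(self):
        self.in_memory = set()    # bits set in memory
        self.on_disk = set()      # bits that have reached stable storage

    def mark_and_flush(self, region):
        self.in_memory.add(region)
        # This is the step bitmap_unplug()/md_super_wait() waits for:
        # the bit must be durable *before* any data write is issued.
        self.on_disk = set(self.in_memory)

def array_write(region, bitmap, dirty_regions):
    bitmap.mark_and_flush(region)   # skip this (as in the experiment
                                    # above) and a crash during the data
                                    # write leaves the bit clear on disk
    dirty_regions.add(region)       # the data write to the member disks

def resync_after_crash(bitmap, dirty_regions):
    # Resync only looks at regions whose bit made it to disk; a dirty
    # region with a clear bit is silently skipped.
    return dirty_regions & bitmap.on_disk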

NeilBrown



> 
> So it looks like about 100ms are lost somewhere (assuming ~100ms +
> ~100ms for reading and writing the stripe-head).
> 
> I tried to profile raid5d looking for some other blocking operation it
> might be doing, and I don't see one.  Typically - without the bitmap
> update - a raid5d call takes 400-500us, so I don't understand how the
> additional ~100ms of latency is gained.  (With the bitmap update,
> raid5d is significantly delayed by it, which delays all the
> processing.)  Do you have an idea why there is a latency difference?
> 
> Thanks,
> Alex.
