On Thu, May 12, 2011 at 3:47 AM, Milan Broz <mbroz@xxxxxxxxxx> wrote:
> Hi,
>
> On 05/11/2011 09:11 PM, Chris Lais wrote:
>> I've recently installed a system with dm-crypt placed over a software
>> RAID5 array, and have noticed some very severe issues with write
>> performance due to the way dm-crypt works.
>>
>> Almost all of these problems are caused by dm-crypt re-ordering bios
>> to an extreme degree (as shown by blktrace), such that it is very hard
>> for the raid layer to merge them in to full stripes, leading to many
>> extra reads and writes. There are minor problems with losing
>> io_context and seeking for CFQ, but they have far less impact.
>
> There is no explicit reordering of bios in dmcrypt.
>
> There are basically two situations were dmcrypt can reorder request:
>
> First is when crypto layer process request asynchronously
> (probably not a case here - according to your system spec you should
> be probably using AES-NI, right?)

No, the i7-870 does not have AES-NI.

>
> The second possible reordering can happen if you run 2.6.38 kernel and
> above, where the encryption run always on the cpu core which submitted it.
>
> First thing is to check what's really going on your system and why.
>
> - What's the io pattern here? Several applications issues writes
>   in parallel? Can you provide commands how do you tested it?
>

The I/O pattern is a single dd command, using a block size of 1M or 2M
(which does not make a substantial difference). And before you ask,
this /is/ one of the major intended workloads, not a failed attempt at
a benchmark.

For the purposes of testing, I'm reading from /dev/zero, but normally
the input will be an attached drive, which will sometimes be slower
than 180MB/s and sometimes faster, but will always be substantially
faster than 30MB/s.

The I/O is being submitted by the background writeback (flusher)
thread, which is jumping cores periodically (and whose CPU affinity I
don't think I can set reliably). I don't know why the caches aren't
able to cope without very large cache sizes (and *still* fail to
assemble full stripes frequently), unless the switching is happening
very often and is splitting between stripes (very likely, with a
stripe size of 1MB). Even with perfect splitting (as would be the case
with a parallel workload and no reordering), the cache size for
merging stripes will have to be at least stripe_size*threads.

I have to think we'd get far better performance (for any media with
large physical block sizes) by keeping the bios for each block/stripe
together starting from the uppermost block layer, but the system
doesn't seem to be designed in a way that makes this easy at all.
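(For concreteness, the cache I'm referring to above is md's stripe
cache. Raising it looks something like the following - the array name
and the value here are just examples for this box, not a
recommendation:

  # let md hold more partially-assembled stripes before writing them out
  echo 8192 > /sys/block/md0/md/stripe_cache_size

Even with values far above the default it still frequently fails to
assemble full stripes.)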
dd if=/dev/zero of=test bs=1048576:

submitted to dm-crypt layer (top-level) [dm-5]:

254,5 5 419 1.019698208 1533 Q W 761892040 + 8 [flush-254:5]
254,5 5 420 1.019699440 1533 Q W 761892048 + 8 [flush-254:5]
254,5 5 421 1.019700449 1533 Q W 761892056 + 8 [flush-254:5]
254,5 5 422 1.019701510 1533 Q W 761892064 + 8 [flush-254:5]
254,5 5 423 1.019702466 1533 Q W 761892072 + 8 [flush-254:5]
254,5 5 424 1.019703528 1533 Q W 761892080 + 8 [flush-254:5]
[snip]
254,5 1 418 1.030607158 1533 Q W 761959960 + 8 [flush-254:5]
254,5 1 419 1.030608679 1533 Q W 761959968 + 8 [flush-254:5]
254,5 1 420 1.030610084 1533 Q W 761959976 + 8 [flush-254:5]
254,5 1 421 1.030611534 1533 Q W 761959984 + 8 [flush-254:5]
254,5 1 422 1.030612991 1533 Q W 761959992 + 8 [flush-254:5]
254,5 1 423 1.030614446 1533 Q W 761960000 + 8 [flush-254:5]
[snip]
254,5 3 423 1.062605245 1533 Q W 762049928 + 8 [flush-254:5]
254,5 3 424 1.062606044 1533 Q W 762049936 + 8 [flush-254:5]
254,5 3 425 1.062606853 1533 Q W 762049944 + 8 [flush-254:5]
254,5 3 426 1.062607616 1533 Q W 762049952 + 8 [flush-254:5]
254,5 3 427 1.062609579 1533 Q W 762049960 + 8 [flush-254:5]
254,5 3 428 1.062610503 1533 Q W 762049968 + 8 [flush-254:5]
254,5 3 429 1.062611306 1533 Q W 762049976 + 8 [flush-254:5]
254,5 3 430 1.062612079 1533 Q W 762049984 + 8 [flush-254:5]
254,5 3 431 1.062612851 1533 Q W 762049992 + 8 [flush-254:5]

submitted to LVM2 logical volume layer (directly below dm-5) [dm-3]:

254,3 1 34 1.055642427 6282 Q W 761959960 + 8 [kworker/1:2]
254,3 3 39 1.055676830 6402 Q W 762049928 + 8 [kworker/3:0]
254,3 5 35 1.055707355 6349 Q W 761892040 + 8 [kworker/5:1]
254,3 3 40 1.055720657 6402 Q W 762049936 + 8 [kworker/3:0]
254,3 1 35 1.055720737 6282 Q W 761959968 + 8 [kworker/1:2]
254,3 3 41 1.055768875 6402 Q W 762049944 + 8 [kworker/3:0]
254,3 5 36 1.055782164 6349 Q W 761892048 + 8 [kworker/5:1]
254,3 1 36 1.055798939 6282 Q W 761959976 + 8 [kworker/1:2]
254,3 3 42 1.055813807 6402 Q W 762049952 + 8 [kworker/3:0]
254,3 5 37 1.055858505 6349 Q W 761892056 + 8 [kworker/5:1]
254,3 3 43 1.055858595 6402 Q W 762049960 + 8 [kworker/3:0]
254,3 1 37 1.055873828 6282 Q W 761959984 + 8 [kworker/1:2]
254,3 3 44 1.055906790 6402 Q W 762049968 + 8 [kworker/3:0]
254,3 5 38 1.055937878 6349 Q W 761892064 + 8 [kworker/5:1]
254,3 3 45 1.055950798 6402 Q W 762049976 + 8 [kworker/3:0]
254,3 1 38 1.055950939 6282 Q W 761959992 + 8 [kworker/1:2]
254,3 3 46 1.055999370 6402 Q W 762049984 + 8 [kworker/3:0]
254,3 5 39 1.056011893 6349 Q W 761892072 + 8 [kworker/5:1]
254,3 1 39 1.056028144 6282 Q W 761960000 + 8 [kworker/1:2]
254,3 3 47 1.056044505 6402 Q W 762049992 + 8 [kworker/3:0]
254,3 5 40 1.056088439 6349 Q W 761892080 + 8 [kworker/5:1]

http://zenthought.org/tmp/dm-crypt+raid5/dm-5,dm-3.single-thread.dd.zero.1M.tar.gz

> - Can you test older kernel (2.6.37) and check blktrace?
>   Does it behave differently (it should - no reordering but all
>   encryption just on one core.)
>
> - Also 2.6.39-rc (with flush changes) can have influence here,
>   if you can test that the problems is still here, it would be nice
>   (any fix will be based on this version).

I will test both of these when I'm able (should be in the next few
days), but I suspect 2.6.37 will perform much better if it's doing it
on one core with no re-ordering. I'll have to let you know on
2.6.39-rc*.

>
> Anyway, we need to find what's really going before suggesting any fix.
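Agreed. For reference, the per-device traces above can be captured
with something along these lines (the exact invocation is from memory;
dm-5 is the dm-crypt device and dm-3 the LV directly below it here):

  # trace write requests queued to the dm-crypt device and the LV under it
  blktrace -d /dev/dm-5 -d /dev/dm-3 -a write -o - | blkparse -i -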
>
>> Using RAID5/6 without dm-crypt does /not/ have these problems in my
>> setup, even with standard queue sizes, because the raid layer can
>> handle the stripe merging when the bios are not so far out of order.
>> Using lower RAID levels even with dm-crypt also does not have these
>> problems to such an extreme degree, because they don't need
>> read-parity-write cycles for partial stripes.
>
> Ah, so you are suggesting that the problem is caused by read/write
> interleaving (parity blocks)?
> Or you are talking about degraded mode as well?

Yes, it seems to be caused almost entirely by multiple partial stripe
writes to the same stripes, leading to extra unnecessary reads and
parity calculations (I do suspect that the reads themselves have much
more impact on this system, however). I'm not talking about degraded
mode (I don't expect that to perform well).

>
> Milan
>

--
Chris

--
dm-devel mailing list
dm-devel@xxxxxxxxxx
https://www.redhat.com/mailman/listinfo/dm-devel