On Tue, Nov 02, 2010 at 01:18:42PM +0100, Christian Ehrhardt wrote: > Hi, > this is about an issue newer kernels show, bysplitting direct I/O requests > into 4k pieces to directly merge them in the Block Device Layer afterwards. > > If anyone is interested in own tests just use a simple command like > dd if=/mnt/test/test-dd1 of=/dev/null iflag=direct bs=64k count=1 > in combination with blktrace. > > The following patch is more a proposal for discussion than a solution, well > thats what RFC's are about right. > I'm unsure about names, but also if the approach in general is the right way. > It should apply to every 2.6.36 and 2.6.37 kernel. > > I put everyone on CC who was involved in the patches leading to the current > behavior. > > Grüsse / regards, > Christian Ehrhardt > IBM Linux Technology Center, System z Linux Performance > > --- cut here --- > > Subject: [RFC][PATCH] direct-io: btrfs: avoid splitting dio requests for non-btrfs filesystems > > From: Christian Ehrhardt <ehrhardt@xxxxxxxxxxxxxxxxxx> > > Commit c2c6ca41 by Josef Bacik <josef@xxxxxxxxxx> caused all direct I/O's to > be split into 4k requests before arriving in the block device layer. > This was later on partially fixed by Jeff Moyer <jmoyer@xxxxxxxxxx> in > 7a801ac6. > > Jeffs fix improved the situation a lot, but eventually it still splits I/Os > for non-btrfs file systems as well were it wouldn't have to. > > Eventually in my example on a ext2 filesystem it splits it every 4Mb where > dio->boundary is evaluated to true. > > In blktrace this looks like: > dd-910 [002] 38.762523: 94,8 A R 131264 + 8 <- (94,9) 131072 > dd-910 [002] 38.762531: 94,8 Q R 131264 + 8 [dd] > dd-910 [002] 38.762535: 94,8 G R 131264 + 8 [dd] > dd-910 [002] 38.762537: 94,8 P N [dd] > dd-910 [002] 38.762539: 94,8 I R 131264 + 8 [dd] > dd-910 [002] 38.762544: 94,8 A R 131272 + 8 <- (94,9) 131080 > dd-910 [002] 38.762544: 94,8 Q R 131272 + 8 [dd] > dd-910 [002] 38.762546: 94,8 M R 131272 + 8 [dd] > dd-910 [002] 38.762550: 94,8 A R 131280 + 8 <- (94,9) 131088 > dd-910 [002] 38.762551: 94,8 Q R 131280 + 8 [dd] > dd-910 [002] 38.762551: 94,8 M R 131280 + 8 [dd] > dd-910 [002] 38.762556: 94,8 A R 131288 + 8 <- (94,9) 131096 > dd-910 [002] 38.762557: 94,8 Q R 131288 + 8 [dd] > dd-910 [002] 38.762557: 94,8 M R 131288 + 8 [dd] > dd-910 [002] 38.762562: 94,8 A R 131296 + 8 <- (94,9) 131104 > dd-910 [002] 38.762563: 94,8 Q R 131296 + 8 [dd] > dd-910 [002] 38.762564: 94,8 M R 131296 + 8 [dd] > dd-910 [002] 38.762568: 94,8 A R 131304 + 8 <- (94,9) 131112 > dd-910 [002] 38.762569: 94,8 Q R 131304 + 8 [dd] > dd-910 [002] 38.762570: 94,8 M R 131304 + 8 [dd] > dd-910 [002] 38.762577: 94,8 A R 131312 + 8 <- (94,9) 131120 > dd-910 [002] 38.762578: 94,8 Q R 131312 + 8 [dd] > dd-910 [002] 38.762579: 94,8 M R 131312 + 8 [dd] > dd-910 [002] 38.762584: 94,8 A R 131320 + 8 <- (94,9) 131128 > dd-910 [002] 38.762584: 94,8 Q R 131320 + 8 [dd] > dd-910 [002] 38.762585: 94,8 M R 131320 + 8 [dd] > dd-910 [002] 38.762590: 94,8 A R 131328 + 8 <- (94,9) 131136 > dd-910 [002] 38.762590: 94,8 Q R 131328 + 8 [dd] > dd-910 [002] 38.762591: 94,8 M R 131328 + 8 [dd] > dd-910 [002] 38.762596: 94,8 A R 131336 + 8 <- (94,9) 131144 > dd-910 [002] 38.762597: 94,8 Q R 131336 + 8 [dd] > dd-910 [002] 38.762598: 94,8 M R 131336 + 8 [dd] > dd-910 [002] 38.762605: 94,8 A R 131344 + 16 <- (94,9) 131152 > dd-910 [002] 38.762607: 94,8 Q R 131344 + 16 [dd] > dd-910 [002] 38.762608: 94,8 M R 131344 + 16 [dd] > dd-910 [002] 38.762611: 94,8 A R 131368 + 32 <- (94,9) 131176 > dd-910 [002] 38.762612: 94,8 Q R 131368 + 32 [dd] > dd-910 [002] 38.762616: 94,8 G R 131368 + 32 [dd] > dd-910 [002] 38.762617: 94,8 I R 131368 + 32 [dd] > dd-910 [002] 38.762619: 94,8 U N [dd] 2 > dd-910 [002] 38.762621: 94,8 D R 131264 + 96 [dd] > dd-910 [002] 38.762625: 94,8 D R 131368 + 32 [dd] > <idle>-0 [012] 38.763363: 94,8 C R 131264 + 96 [0] > <idle>-0 [015] 38.763797: 94,8 C R 131368 + 32 [0] > > The usual behavior before both commits was: > dd-919 [002] 37.513685: 94,8 A R 7824 + 96 <- (94,9) 7632 > dd-919 [002] 37.513693: 94,8 Q R 7824 + 96 [dd] > dd-919 [002] 37.513697: 94,8 G R 7824 + 96 [dd] > dd-919 [002] 37.513700: 94,8 P N [dd] > dd-919 [002] 37.513701: 94,8 I R 7824 + 96 [dd] > dd-919 [002] 37.513794: 94,8 A R 7928 + 32 <- (94,9) 7736 > dd-919 [002] 37.513795: 94,8 Q R 7928 + 32 [dd] > dd-919 [002] 37.513800: 94,8 G R 7928 + 32 [dd] > dd-919 [002] 37.513802: 94,8 I R 7928 + 32 [dd] > dd-919 [002] 37.513804: 94,8 U N [dd] 2 > dd-919 [002] 37.513806: 94,8 D R 7824 + 96 [dd] > dd-919 [002] 37.513810: 94,8 D R 7928 + 32 [dd] > <idle>-0 [011] 37.514362: 94,8 C R 7824 + 96 [0] > <idle>-0 [014] 37.514728: 94,8 C R 7928 + 32 [0] > > That remaining split is cause by the test for: > "dio->final_block_in_bio != dio->cur_page_block". > As this was in the code for a long time I just assume it is right. > > So eventually for the 64k request in the example this patch improves the > effective submissions that get to the block device layer from: > 10x4k, 1x8k, 1x16k to 1x48k & 1x16k which is much better. > > Througput impact is small, but in terms of cpu consumption this is visible > by a single digit percentage depending on the incoming request size. > > The solution looking for comments or alternatives in this RFC patch adds a new > kiocb flag that let filesystems specify if they need these workaround to > separate meta data reads - if not, like all pre-btrfs filesystems the dio code > doesn't have to cause this extra work. > So this brings up an important question, which is how badly do we want to honor buffer_boundary()? The whole reason I added the logical offset check was because we ignore buffer_boundary() if there is no bio currently. So for BTRFS we would do this map extent set buffer boundary but if this is the first part of the IO and there isn't a bio already setup, dio_new_bio clears dio->boundary, so the next extent we would get may not be logically contiguous and hilarity would ensue. I chose not to fix the dio->boundary clearing because I was worried I would affect other fs's workloads (which I did anyway because of my bug). So maybe the right idea is to rip out my logical offset tests altogether and fix dio so we treat buffer_boundary() like gospel. That way Btrfs can get what it needs without having this weird special code, and then we can look at how other fs's set buffer_boundary (I'm pretty sure ext2/3 are the only ones) and make sure they are setting it when they really mean to. Thanks, Josef -- To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in the body of a message to majordomo@xxxxxxxxxxxxxxx More majordomo info at http://vger.kernel.org/majordomo-info.html