Jens Axboe <jens.axboe@xxxxxxxxxx> writes:

> On Wed, Jun 11 2008, Alan D. Brunelle wrote:
>> Jens Axboe wrote:
>> > On Wed, Jun 11 2008, Alan D. Brunelle wrote:
>> >> Dmitri Monakhov wrote:
>> >>
>> >> Could it be that in the first case you will have merges, thus creating
>> >> fewer/larger I/O requests? Running iostat -x during the two runs, and
>> >> watching the output is a good first place to start.
>> >
>> > I think it's mostly down to whether a specific drive is good at doing
>> > 124kb writes + 4k seek (and repeat) compared to regular streaming
>> > writes. The tested disk was SATA with write back caching, there should
>> > be no real command overhead gain in those size ranges.
>> >
>> Probably true, I'd think the iostat -x data would be very helpful though.
>
> Definitely, the more data the better :). I already asked for blktrace
> data, that should give us everything we need.

It seems to be a hardware issue.

Test: I disabled the request merging logic in __make_request() and
restricted bio->bi_size to <= 128 sectors via merge_bvec_fn.
I/O scheduler: noop.

I/O patterns tested (a minimal userspace sketch of the CW/WH patterns is
appended after the SATA results):
1. CW (continuous writes):  write(,, PAGE_SIZE*16)
2. WH (writes with holes):  write(,, PAGE_SIZE*15); lseek(, PAGE_SIZE, SEEK_CUR)
3. CR (continuous reads):   read(,, PAGE_SIZE*16)
4. RH (reads with holes):   same as WH, but bios are submitted directly in
   order to explicitly bypass the read-ahead logic.

I tested a SATA disk with NCQ on AHCI, and a SCSI disk.

Results for the SATA disk:
The performance drop caused by the restricted bio size was negligible for
all I/O patterns, so this is definitely not a queue starvation issue.
BIOs submitted by pdflush were ordered in all cases (as expected).
For all I/O patterns except WH, driver completions were also ordered.
But for the WH pattern the drive seems to go crazy:

Dispatched requests:
  8,0    1       14     0.000050684  3485  D   W    0 + 128 [pdflush]
  8,0    1       15     0.000055906  3485  D   W  136 + 128 [pdflush]
  8,0    1       16     0.000059269  3485  D   W  272 + 128 [pdflush]
  8,0    1       17     0.000062625  3485  D   W  408 + 128 [pdflush]
  8,0    1       31     0.000133306  3485  D   W  544 + 128 [pdflush]
  8,0    1       32     0.000136043  3485  D   W  680 + 128 [pdflush]
  8,0    1       33     0.000140446  3485  D   W  816 + 128 [pdflush]
  8,0    1       34     0.000142961  3485  D   W  952 + 128 [pdflush]
  8,0    1       48     0.000204734  3485  D   W 1088 + 128 [pdflush]
  8,0    1       49     0.000207358  3485  D   W 1224 + 128 [pdflush]
  8,0    1       50     0.000209505  3485  D   W 1360 + 128 [pdflush]
  ....

Completed requests:
  8,0    0        1     0.045342874  3907  C   W 2856 + 128 [0]
  8,0    0        3     0.045374650  3907  C   W 2992 + 128 [0]
  8,0    0        5     0.057461715     0  C   W 1768 + 128 [0]
  8,0    0        7     0.057491967     0  C   W 1904 + 128 [0]
  8,0    0        9     0.060058695     0  C   W  680 + 128 [0]
  8,0    0       11     0.060075666     0  C   W  816 + 128 [0]
  8,0    0       13     0.063015540     0  C   W 1360 + 128 [0]
  8,0    0       15     0.063028859     0  C   W 1496 + 128 [0]
  8,0    0       17     0.073802939     0  C   W 3672 + 128 [0]
  8,0    0       19     0.073817422     0  C   W 3808 + 128 [0]
  8,0    0       21     0.075664013     0  C   W  544 + 128 [0]
  8,0    0       23     0.078348416     0  C   W 1088 + 128 [0]
  8,0    0       25     0.078362380     0  C   W 1224 + 128 [0]
  8,0    0       27     0.089371470     0  C   W 3400 + 128 [0]
  8,0    0       29     0.089385247     0  C   W 3536 + 128 [0]
  8,0    0       31     0.092328327     0  C   W  272 + 128 [0]
  ....

As you can see, completions arrive in semi-random order. This happens
regardless of whether the hardware write cache is enabled or disabled.
So this is hardware crap.

Note: I got the same performance drop on a Mac mini running Mac OS.
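For reference, here is a minimal userspace sketch of the CW and WH write
patterns described above (CR/RH reads are analogous). This is not my
original test harness: the file names, chunk count and the final fsync()
are placeholders. Writes are buffered, so pdflush/writeback ends up
issuing the bios, as in the traces above.

/*
 * Sketch of the CW and WH patterns: write 16 pages per chunk (CW), or
 * write 15 pages and skip one page with lseek() to leave a hole (WH).
 */
#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>

static void run_pattern(const char *path, int with_holes)
{
	long page = sysconf(_SC_PAGESIZE);
	size_t chunk = (with_holes ? 15 : 16) * page;
	char *buf = malloc(16 * page);
	int fd, i;

	if (!buf)
		exit(1);
	memset(buf, 0xab, 16 * page);

	fd = open(path, O_WRONLY | O_CREAT | O_TRUNC, 0644);
	if (fd < 0) {
		perror("open");
		exit(1);
	}

	for (i = 0; i < 1024; i++) {
		if (write(fd, buf, chunk) != (ssize_t)chunk) {
			perror("write");
			exit(1);
		}
		if (with_holes)
			/* leave a one-page hole after every chunk */
			lseek(fd, page, SEEK_CUR);
	}

	fsync(fd);	/* force writeback so the disk does the real work */
	close(fd);
	free(buf);
}

int main(void)
{
	run_pattern("cw.dat", 0);	/* CW: write(,, PAGE_SIZE*16) */
	run_pattern("wh.dat", 1);	/* WH: write(,, PAGE_SIZE*15); lseek(, PAGE_SIZE, SEEK_CUR) */
	return 0;
}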
Results for the SCSI disk (bio size was restricted to 256 sectors):
All requests were dispatched and completed in normal order, but for some
unknown reason it takes more time to serve "write with holes" requests.

Disk driver request completion timeline comparison:

   write(,, 32*PG_SZ)       ||  write(,, 31*PG_SZ); lseek(, PG_SZ, SEEK_CUR)
---------------------------++-----------------------------------------------
  time (s)   sector + len  ||   time (s)   sector + len
---------------------------++-----------------------------------------------
  0.001028   131072 + 96   ||   0.001020   131072 + 96
  0.010916   131176 + 256  ||   0.015471   131176 + 152
  0.018810   131432 + 256  ||   0.022863   131336 + 248
  0.020248   131688 + 256  ||   0.024771   131592 + 248
  0.021674   131944 + 256  ||   0.031986   131848 + 248
  0.023090   132200 + 256  ||   0.039276   132104 + 248
  0.024575   132456 + 256  ||   0.046587   132360 + 248
  0.026069   132712 + 256  ||   0.054503   132616 + 248
  0.027566   132968 + 256  ||   0.061797   132872 + 248
  0.029063   133224 + 256  ||   0.069087   133128 + 248
  0.030558   133480 + 256  ||   0.076388   133384 + 248
  0.032053   133736 + 256  ||   0.083756   133640 + 248
  0.033544   133992 + 256  ||   0.085657   133896 + 248
  0.035042   134248 + 256  ||   0.092878   134152 + 248
  0.036518   134504 + 256  ||   0.100176   134408 + 248
  0.038009   134760 + 256  ||   0.107473   134664 + 248
  0.039510   135016 + 256  ||   0.115323   134920 + 248
  0.041005   135272 + 256  ||   0.122638   135176 + 248
  0.042500   135528 + 256  ||   0.129933   135432 + 248
  0.043992   135784 + 256  ||   0.137224   135688 + 248

IMHO it is also a hardware issue.
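In case anyone wants to double-check the two observations (out-of-order
completions on the SATA WH run, and the larger per-request gaps in the
SCSI WH column), here is a quick sketch of a helper. It is hypothetical,
not the exact tool used for the traces above: it reads blkparse text
output on stdin, keeps only completion ('C') events, and reports
completions whose sector goes backwards plus the mean time gap between
consecutive completions.

/*
 * Hypothetical blkparse post-processor: flag out-of-order completions
 * and compute the mean gap between consecutive completion timestamps.
 */
#include <stdio.h>
#include <string.h>

int main(void)
{
	char line[512], dev[32], act[8], rwbs[8];
	unsigned int cpu, seq;
	unsigned long pid;
	unsigned long long sector, nr, prev_sector = 0;
	double t, prev_t = 0.0, gap_sum = 0.0;
	long completions = 0, out_of_order = 0;

	while (fgets(line, sizeof(line), stdin)) {
		/* expected: dev cpu seq time pid action rwbs sector + nr [comm] */
		if (sscanf(line, "%31s %u %u %lf %lu %7s %7s %llu + %llu",
			   dev, &cpu, &seq, &t, &pid, act, rwbs,
			   &sector, &nr) != 9)
			continue;
		if (strcmp(act, "C"))	/* completions only */
			continue;
		if (completions) {
			gap_sum += t - prev_t;
			if (sector < prev_sector) {
				out_of_order++;
				printf("out of order: %llu completed after %llu (t=%.6f)\n",
				       sector, prev_sector, t);
			}
		}
		prev_sector = sector;
		prev_t = t;
		completions++;
	}

	if (completions > 1)
		printf("%ld completions, %ld out of order, mean gap %.6f s\n",
		       completions, out_of_order, gap_sum / (completions - 1));
	return 0;
}

Feed it the text output of blkparse on stdin; on the SATA WH trace it
should flag the backwards jumps visible above, and on the SCSI traces the
mean gap should show the ~5x difference between the two columns.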