On Tue 03-05-16 09:42:40, Chris Mason wrote: > On Tue, May 03, 2016 at 03:06:09PM +0200, Jan Kara wrote: > > On Tue 03-05-16 08:40:11, Chris Mason wrote: > > > On Tue, May 03, 2016 at 02:17:19PM +0200, Jan Kara wrote: > > > > On Thu 28-04-16 12:46:41, Jens Axboe wrote: > > > > > >>- rwb->wb_max = 1 + ((depth - 1) >> min(31U, rwb->scale_step)); > > > > > >>- rwb->wb_normal = (rwb->wb_max + 1) / 2; > > > > > >>- rwb->wb_background = (rwb->wb_max + 3) / 4; > > > > > >>+ if (rwb->queue_depth == 1) { > > > > > >>+ rwb->wb_max = rwb->wb_normal = 2; > > > > > >>+ rwb->wb_background = 1; > > > > > > > > > > > >This breaks the detection of too big scale_step in scale_up() where we key > > > > > >of wb_max == 1 value. However even with that fixed no luck :(: > > > > > > > > > > Yeah, I need to look at that. For QD=1, I think the only sensible values for > > > > > max/normal/bg is 2/2/1 and 1/1/1 if we step down. > > > > > > > > > > >dd if=/dev/zero of=/mnt/file bs=1M count=10000 conv=fsync > > > > > >Runtime: 105.126 107.125 105.641 > > > > > > > > > > > >So about the same as before. I'll try to debug this later today... > > > > > > > > > > Thanks, I'm very interested in what you find! > > > > > > > > OK, so the reason was relatively standard in the end. I was using ext3 (or > > > > more exactly ext4 without delayed allocation) for the test. The throttling > > > > of background writes gave more priority to writes from the journalling > > > > thread which happen with WRITE_SYNC and thus are not throttled. Thus the > > > > journalling thread ended up having to do more data writeback to be able to > > > > commit a transaction (due to requirements of data=ordered mode) and it is > > > > less efficient at that than the normal flusher thread. > > > > > > > > So this is an example where throttling background writeback effectively > > > > just pushes more work into another context which does it less efficiently > > > > and indirectly makes everyone wait for it. ext3 has been always sensitive to > > > > issues like this. ext4 is using delayed allocation and thus only data > > > > writes into holes end up being part of a transaction -> simple dd test case > > > > doesn't hit that path. And indeed when I repeat the same test with ext4, > > > > the numbers with and without your patch are exactly the same. > > > > > > > > The question remains how common a pattern where throttling of background > > > > writeback delays also something else is. I'll schedule a couple of > > > > benchmarks to measure impact of your patches for a wider range of workloads > > > > (but sadly pretty limited set of hw). If ext3 is the only one seeing > > > > issues, I would be willing to accept that ext3 takes the hit since it is > > > > doing something rather stupid (but inherent in its journal design) and we > > > > have a way to deal with this either by enabling delayed allocation or by > > > > turning off the writeback throttling... > > > > > > At least in the case of io that we know is going to be data=ordered, we > > > can bump the prio of those pages? > > > > But how would flusher thread, which is submitting IO, know that? We would > > have to somehow mark inodes that are part of the running transaction and > > flusher thread could give more priority to such writeback - e.g. by using > > WRITE_SYNC or at least plain writes. Hmm, if we use an inode flag for that, > > it could be doable. > > This would be specific to the data=ordered code in the FS. If there's > some way to test for an inode or a page's status in the data=ordered > list, the FS writepages call could flag the IO as higher prio? Oh, right, we could do that. I can experiment with that later. Honza -- Jan Kara <jack@xxxxxxxx> SUSE Labs, CR -- To unsubscribe from this list: send the line "unsubscribe linux-block" in the body of a message to majordomo@xxxxxxxxxxxxxxx More majordomo info at http://vger.kernel.org/majordomo-info.html