> > > > > @@ -585,7 +597,8 @@ void writeback_inodes_wb(struct bdi_writ
> > > > >          if (!wbc->wb_start)
> > > > >                  wbc->wb_start = jiffies; /* livelock avoidance */
> > > > >          spin_lock(&inode_wb_list_lock);
> > > > > -        if (!wbc->for_kupdate || list_empty(&wb->b_io))
> > > > > +
> > > > > +        if (list_empty(&wb->b_io))
> > > > >                  queue_io(wb, wbc);
> > > > >
> > > > >          while (!list_empty(&wb->b_io)) {
> > > > > @@ -612,7 +625,7 @@ static void __writeback_inodes_sb(struct
> > > > >          WARN_ON(!rwsem_is_locked(&sb->s_umount));
> > > > >
> > > > >          spin_lock(&inode_wb_list_lock);
> > > > > -        if (!wbc->for_kupdate || list_empty(&wb->b_io))
> > > > > +        if (list_empty(&wb->b_io))
> > > > >                  queue_io(wb, wbc);
> > > > >          writeback_sb_inodes(sb, wb, wbc, true);
> > > > >          spin_unlock(&inode_wb_list_lock);
> > > >
> > > > That changes the order in which we queue inodes for writeback.
> > > > Instead of calling every time to move b_more_io inodes onto the b_io
> > > > list and expiring more aged inodes, we only ever do it when the list
> > > > is empty. That is, it seems to me that this will tend to give
> > > > b_more_io inodes a smaller share of writeback because they are being
> > > > moved back to the b_io list less frequently where there are lots of
> > > > other inodes being dirtied. Have you tested the impact of this
> > > > change on mixed workload performance? Indeed, can you starve
> > > > writeback of a large file simply by creating lots of small files in
> > > > another thread?
> > >   Yeah, this change looks suspicious to me as well.
> >
> > The exact behaviors are indeed rather complex. I personally feel the
> > new "always refill iff empty" policy is more consistent, clean and easy
> > to understand.
>
> That may be so, but that doesn't make the change good from an IO
> perspective. You said you'd only done light testing, and that's not
> sufficient to gauge the impact of such a change.
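
To make the hunk quoted above concrete, here is a minimal sketch of the
two refill conditions as standalone predicates. The function names are
made up for illustration and are not kernel symbols; only the boolean
expressions are taken from the diff.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical helper names; only the conditions come from the diff. */

/* Old condition: refill b_io on every call, unless this is a
 * for_kupdate pass that still has inodes queued on b_io. */
static bool old_should_refill(bool for_kupdate, bool b_io_empty)
{
	return !for_kupdate || b_io_empty;
}

/* New condition: refill only once b_io has been fully drained. */
static bool new_should_refill(bool b_io_empty)
{
	return b_io_empty;
}

/* True when the two policies make different decisions. */
static bool policies_differ(bool for_kupdate, bool b_io_empty)
{
	return old_should_refill(for_kupdate, b_io_empty) !=
	       new_should_refill(b_io_empty);
}

/* Walk the whole truth table: exactly one combination differs. */
static void check_policies(void)
{
	assert(policies_differ(false, false));
	assert(!policies_differ(false, true));
	assert(!policies_differ(true, false));
	assert(!policies_differ(true, true));
}
```

The two conditions only disagree for a non-kupdate pass that still has
inodes on b_io: the old code calls queue_io() there, the new code does
not.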
>
> > It basically says: at each round started by a b_io refill, set up a
> > _fixed_ work set with all currently expired inodes (or all currently
> > dirtied inodes if none is expired) and walk through it. "Fixed" work
> > set means no new inodes will be added to the work set during the walk.
> > When a complete walk is done, start over with a new set of inodes that
> > are eligible at the time.
>
> Yes, I know what it does - I can read the code. You haven't, however,
> answered why it is a good change from an IO perspective.
>
> > The figure on page 14 illustrates the "rounds" idea:
> > http://www.kernel.org/pub/linux/kernel/people/wfg/writeback/slides/linux-writeback-queues.pdf
> >
> > This procedure provides fairness among the inodes and guarantees each
> > inode to be synced once and only once at each round. So it's free from
> > starvation.
>
> Perhaps you should add some of this commentary to the commit
> message? That talks about the VM and LRU writeback, but that has
> nothing to do with writeback fairness. The commit message or
> comments in the code need to explain why something is being
> changed....

OK, added to changelog.

> > If you are worried about performance, here is a simple tar+dd benchmark.
> > Both commands are actually running faster with this patchset:
> .....
> > The base kernel is 2.6.39-rc3+ plus IO-less patchset plus large write
> > chunk size. The test box has 3G mem and runs XFS. Test script is:
>
> <sigh>
>
> The numbers are meaningless to me - you've got a large number of
> other changes that are affecting writeback behaviour, and that's
> especially important because, at minimum, the change in write chunk
> size will hide any differences in IO patterns that this change will

The previous benchmarks are surely still valuable and more future
proof, assuming that we are going to do IO-less and larger writeback
soon.

> make. Please test against a vanilla kernel if that is what you are
> aiming these patches for.
> If you aren't aiming for a vanilla kernel,
> please say so in the patch series header...

Here are the test results for a vanilla kernel. They again show better
numbers for dd, tar and overall run time.

                2.6.39-rc3   2.6.39-rc3-dyn-expire+
------------------------------------------------
all elapsed        256.043   252.367
stddev              24.381    12.530

tar elapsed         30.097    28.808
dd  elapsed         13.214    11.782

wfg /tmp% g cpu log-no-moving-expire-vanilla log-moving-expire-vanilla|g tar
log-no-moving-expire-vanilla:tar jxf /dev/shm/linux-2.6.38.3.tar.bz2 12.59s user 4.00s system 47% cpu 35.221 total
log-no-moving-expire-vanilla:tar jxf /dev/shm/linux-2.6.38.3.tar.bz2 12.62s user 4.19s system 51% cpu 32.358 total
log-no-moving-expire-vanilla:tar jxf /dev/shm/linux-2.6.38.3.tar.bz2 12.43s user 4.11s system 51% cpu 32.356 total
log-no-moving-expire-vanilla:tar jxf /dev/shm/linux-2.6.38.3.tar.bz2 12.28s user 4.09s system 60% cpu 26.914 total
log-no-moving-expire-vanilla:tar jxf /dev/shm/linux-2.6.38.3.tar.bz2 12.25s user 4.12s system 59% cpu 27.345 total
log-no-moving-expire-vanilla:tar jxf /dev/shm/linux-2.6.38.3.tar.bz2 12.55s user 4.21s system 63% cpu 26.347 total
log-no-moving-expire-vanilla:tar jxf /dev/shm/linux-2.6.38.3.tar.bz2 12.39s user 3.97s system 44% cpu 36.360 total
log-no-moving-expire-vanilla:tar jxf /dev/shm/linux-2.6.38.3.tar.bz2 12.44s user 3.88s system 58% cpu 28.046 total
log-no-moving-expire-vanilla:tar jxf /dev/shm/linux-2.6.38.3.tar.bz2 12.40s user 4.09s system 56% cpu 29.000 total
log-no-moving-expire-vanilla:tar jxf /dev/shm/linux-2.6.38.3.tar.bz2 12.50s user 3.95s system 60% cpu 27.020 total
log-moving-expire-vanilla:tar jxf /dev/shm/linux-2.6.38.3.tar.bz2 12.44s user 4.03s system 56% cpu 28.939 total
log-moving-expire-vanilla:tar jxf /dev/shm/linux-2.6.38.3.tar.bz2 12.63s user 4.06s system 56% cpu 29.488 total
log-moving-expire-vanilla:tar jxf /dev/shm/linux-2.6.38.3.tar.bz2 12.43s user 3.95s system 51% cpu 31.666 total
log-moving-expire-vanilla:tar jxf /dev/shm/linux-2.6.38.3.tar.bz2 12.46s user 3.99s system 63% cpu 25.768 total
log-moving-expire-vanilla:tar jxf /dev/shm/linux-2.6.38.3.tar.bz2 12.14s user 4.26s system 54% cpu 29.838 total
log-moving-expire-vanilla:tar jxf /dev/shm/linux-2.6.38.3.tar.bz2 12.43s user 4.09s system 63% cpu 25.855 total
log-moving-expire-vanilla:tar jxf /dev/shm/linux-2.6.38.3.tar.bz2 12.61s user 4.36s system 57% cpu 29.588 total
log-moving-expire-vanilla:tar jxf /dev/shm/linux-2.6.38.3.tar.bz2 12.36s user 4.13s system 63% cpu 25.816 total
log-moving-expire-vanilla:tar jxf /dev/shm/linux-2.6.38.3.tar.bz2 12.49s user 3.94s system 55% cpu 29.499 total
log-moving-expire-vanilla:tar jxf /dev/shm/linux-2.6.38.3.tar.bz2 12.53s user 3.92s system 51% cpu 31.625 total

wfg /tmp% g cpu log-no-moving-expire-vanilla log-moving-expire-vanilla|g dd
log-no-moving-expire-vanilla:dd if=/dev/zero of=/fs/zero bs=1M count=1000 0.00s user 1.34s system 9% cpu 14.084 total
log-no-moving-expire-vanilla:dd if=/dev/zero of=/fs/zero bs=1M count=1000 0.00s user 1.27s system 8% cpu 14.240 total
log-no-moving-expire-vanilla:dd if=/dev/zero of=/fs/zero bs=1M count=1000 0.00s user 1.25s system 9% cpu 13.437 total
log-no-moving-expire-vanilla:dd if=/dev/zero of=/fs/zero bs=1M count=1000 0.00s user 1.21s system 9% cpu 12.783 total
log-no-moving-expire-vanilla:dd if=/dev/zero of=/fs/zero bs=1M count=1000 0.00s user 1.23s system 9% cpu 12.614 total
log-no-moving-expire-vanilla:dd if=/dev/zero of=/fs/zero bs=1M count=1000 0.00s user 1.25s system 9% cpu 12.733 total
log-no-moving-expire-vanilla:dd if=/dev/zero of=/fs/zero bs=1M count=1000 0.00s user 1.25s system 10% cpu 12.438 total
log-no-moving-expire-vanilla:dd if=/dev/zero of=/fs/zero bs=1M count=1000 0.00s user 1.21s system 9% cpu 12.356 total
log-no-moving-expire-vanilla:dd if=/dev/zero of=/fs/zero bs=1M count=1000 0.00s user 1.21s system 8% cpu 14.724 total
log-no-moving-expire-vanilla:dd if=/dev/zero of=/fs/zero bs=1M count=1000 0.00s user 1.26s system 9% cpu 12.734 total
log-moving-expire-vanilla:dd if=/dev/zero of=/fs/zero bs=1M count=1000 0.00s user 1.57s system 13% cpu 12.002 total
log-moving-expire-vanilla:dd if=/dev/zero of=/fs/zero bs=1M count=1000 0.00s user 1.30s system 9% cpu 14.049 total
log-moving-expire-vanilla:dd if=/dev/zero of=/fs/zero bs=1M count=1000 0.00s user 1.36s system 11% cpu 12.031 total
log-moving-expire-vanilla:dd if=/dev/zero of=/fs/zero bs=1M count=1000 0.00s user 1.25s system 10% cpu 11.679 total
log-moving-expire-vanilla:dd if=/dev/zero of=/fs/zero bs=1M count=1000 0.00s user 1.26s system 11% cpu 11.276 total
log-moving-expire-vanilla:dd if=/dev/zero of=/fs/zero bs=1M count=1000 0.00s user 1.25s system 10% cpu 11.501 total
log-moving-expire-vanilla:dd if=/dev/zero of=/fs/zero bs=1M count=1000 0.00s user 1.20s system 10% cpu 11.344 total
log-moving-expire-vanilla:dd if=/dev/zero of=/fs/zero bs=1M count=1000 0.00s user 1.24s system 10% cpu 11.345 total
log-moving-expire-vanilla:dd if=/dev/zero of=/fs/zero bs=1M count=1000 0.00s user 1.27s system 11% cpu 11.280 total
log-moving-expire-vanilla:dd if=/dev/zero of=/fs/zero bs=1M count=1000 0.00s user 1.22s system 10% cpu 11.312 total

wfg /tmp% g elapsed log-no-moving-expire-vanilla log-moving-expire-vanilla
log-no-moving-expire-vanilla:elapsed: 317.59000000000196
log-no-moving-expire-vanilla:elapsed: 269.16999999999825
log-no-moving-expire-vanilla:elapsed: 271.61000000000058
log-no-moving-expire-vanilla:elapsed: 233.08000000000175
log-no-moving-expire-vanilla:elapsed: 238.20000000000073
log-no-moving-expire-vanilla:elapsed: 240.68999999999505
log-no-moving-expire-vanilla:elapsed: 257.43000000000029
log-no-moving-expire-vanilla:elapsed: 249.45000000000437
log-no-moving-expire-vanilla:elapsed: 251.55000000000291
log-no-moving-expire-vanilla:elapsed: 231.65999999999622
log-moving-expire-vanilla:elapsed: 270.54999999999927
log-moving-expire-vanilla:elapsed: 254.34000000000015
log-moving-expire-vanilla:elapsed: 248.61000000000058
log-moving-expire-vanilla:elapsed: 238.18000000000029
log-moving-expire-vanilla:elapsed: 263.5
log-moving-expire-vanilla:elapsed: 234.15999999999985
log-moving-expire-vanilla:elapsed: 266.81000000000131
log-moving-expire-vanilla:elapsed: 238.14999999999782
log-moving-expire-vanilla:elapsed: 263.14999999999782
log-moving-expire-vanilla:elapsed: 246.22000000000116

> Anyway, I'm going to put some numbers into a hypothetical steady
> state situation to demonstrate the differences in algorithms.
> Let's say we have lots of inodes with 100 dirty pages being created,
> and one large writeback going on. We expire 8 new inodes for every
> 1024 pages we write back.
>
> With the old code, we do:
>
>         b_more_io (large inode) -> b_io (1l)
>         8 newly expired inodes -> b_io (1l, 8s)
>
>         writeback  large inode 1024 pages -> b_more_io
>
>         b_more_io (large inode) -> b_io (8s, 1l)
>         8 newly expired inodes -> b_io (8s, 1l, 8s)
>
>         writeback  8 small inodes 800 pages
>                    1 large inode 224 pages -> b_more_io
>
>         b_more_io (large inode) -> b_io (8s, 1l)
>         8 newly expired inodes -> b_io (8s, 1l, 8s)
>         .....
>
> Your new code:
>
>         b_more_io (large inode) -> b_io (1l)
>         8 newly expired inodes -> b_io (1l, 8s)
>
>         writeback  large inode 1024 pages -> b_more_io
>         (b_io == 8s)
>         writeback  8 small inodes 800 pages
>
>         b_io empty: (1800 pages written)
>                 b_more_io (large inode) -> b_io (1l)
>                 14 newly expired inodes -> b_io (1l, 14s)
>
>         writeback  large inode 1024 pages -> b_more_io
>         (b_io == 14s)
>         writeback  10 small inodes 1000 pages
>                    1 small inode 24 pages -> b_more_io (1l, 1s(24))
>         writeback  5 small inodes 500 pages
>         b_io empty: (2548 pages written)
>                 b_more_io (large inode) -> b_io (1l, 1s(24))
>                 20 newly expired inodes -> b_io (1l, 1s(24), 20s)
>         ......
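
The "new code" walkthrough above can be iterated mechanically. Below is
a rough sketch, assuming an idealised version of the model: the number
of small inodes is allowed to be fractional, and partially written small
inodes are ignored. The constants are the ones quoted above; the
function names are invented for this example.

```c
#include <assert.h>

#define LARGE_CHUNK  1024.0          /* pages written to the large inode per round */
#define SMALL_PAGES   100.0          /* dirty pages per small inode */
#define EXPIRE_RATE  (8.0 / 1024.0)  /* small inodes expired per page written */

/* Pages written in one round of the new policy: one chunk of the large
 * inode plus every small inode that was queued at refill time. */
static double round_pages(double nr_small)
{
	return LARGE_CHUNK + SMALL_PAGES * nr_small;
}

/* Small inodes that expire while such a round is being written; they
 * form the small-inode part of the next round's fixed work set. */
static double next_nr_small(double nr_small)
{
	return EXPIRE_RATE * round_pages(nr_small);
}

/* Share of writeback going to the large file in the round reached after
 * `refills` further b_io refills, starting from 8 small inodes. */
static double large_share_after(int refills)
{
	double nr_small = 8.0;

	assert(refills >= 0);
	for (int i = 0; i < refills; i++)
		nr_small = next_nr_small(nr_small);
	return LARGE_CHUNK / round_pages(nr_small);
}
```

Under these assumptions the recurrence is N' = 8 + 0.78125 * N, whose
fixed point is about 36.6 small inodes per round (about 4681 pages), so
the large file's share settles at 224/1024, roughly 21.9%. That is the
same share as the old code gets in this model, and close to the rough
figures in the table that follows.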
>
> Rough progression of pages written at b_io refill:
>
> Old code:
>
>         total   large file      % of writeback
>         1024    224             21.9% (fixed)
>
> New code:
>         total   large file      % of writeback
>         1800    1024            ~55%
>         2550    1024            ~40%
>         3050    1024            ~33%
>         3500    1024            ~29%
>         3950    1024            ~26%
>         4250    1024            ~24%
>         4500    1024            ~22.7%
>         4700    1024            ~21.7%
>         4800    1024            ~21.3%
>         4800    1024            ~21.3%
>         (pretty much steady state from here)
>
> Ok, so the steady state is reached with a similar percentage of
> writeback to the large file as the existing code. Ok, that's good,
> but providing some evidence that it doesn't change the share of
> writeback to the large file should be in the commit message ;)
>
> The other advantage to this is that we always write 1024 page chunks
> to the large file, rather than smaller "whatever remains" chunks. I
> think this will have a bigger effect on a vanilla kernel than on the
> kernel you tested on above because of the smaller writeback chunk
> size.

Good analysis! I've included it in the changelog :)

> I'm convinced that the refilling only when the queue is empty is a
> sane change now. You need to separate this from the
> move_expired_inodes() changes because it is doing something very
> different to writeback.

OK. It actually depends on the patch "writeback: try more writeback as
long as something was written". So I'll include it as the last one in
the next post.

Thanks,
Fengguang
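
P.S. For anyone who wants to re-derive the "all elapsed" and "stddev"
rows from the raw logs, here is a small sketch. It assumes the stddev
row is the population standard deviation (dividing by N), which is what
matches the table. The elapsed values are rounded to two decimals, and
the helper names are hypothetical.

```c
#include <assert.h>
#include <stddef.h>

#define NR_RUNS 10

/* "elapsed:" values from the logs above, rounded to two decimals. */
static const double elapsed_base[NR_RUNS] = {
	317.59, 269.17, 271.61, 233.08, 238.20,
	240.69, 257.43, 249.45, 251.55, 231.66,
};
static const double elapsed_dyn_expire[NR_RUNS] = {
	270.55, 254.34, 248.61, 238.18, 263.50,
	234.16, 266.81, 238.15, 263.15, 246.22,
};

/* Newton iteration for the square root, used here only so the example
 * does not need to link against libm. */
static double sqrt_newton(double x)
{
	double r = x > 1.0 ? x : 1.0;

	assert(x >= 0.0);
	for (int i = 0; i < 64; i++)
		r = 0.5 * (r + x / r);
	return r;
}

static double mean(const double *v, size_t n)
{
	double sum = 0.0;

	assert(n > 0);
	for (size_t i = 0; i < n; i++)
		sum += v[i];
	return sum / (double)n;
}

/* Population standard deviation: sqrt of the mean squared deviation. */
static double pop_stddev(const double *v, size_t n)
{
	double m = mean(v, n);
	double sq = 0.0;

	for (size_t i = 0; i < n; i++)
		sq += (v[i] - m) * (v[i] - m);
	return sqrt_newton(sq / (double)n);
}
```

Calling mean() and pop_stddev() on the two arrays gives values that
agree with the summary table to the precision shown there.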