On Mon, 1 Mar 2010 22:06:19 -0500 tytso@xxxxxxx wrote: > On Mon, Mar 01, 2010 at 01:34:35PM +0100, Michal Schmidt wrote: > > mpage_da_submit_io() may process tens of thousands of pages at a > > time. Unless full preemption is enabled, it causes scheduling > > latencies in the order of tens of milliseconds. > > > > It can be reproduced simply by writing a big file on ext4 > > repeatedly with dd if=/dev/zero of=/tmp/dummy bs=10M count=50 > > > > The patch fixes it by allowing to reschedule in the loop. > > Have you tested for any performance regressions as a result of this > patch, using some file system benchmarks? I used the 'fio' benchmark to test sequential write speed. Here are the results: Test kernel aggregate bandwidth ------------------------------------------------------ hdd-multi 2.6.33.nopreempt 32.7 ± 3.5 MB/s hdd-multi 2.6.33.reduce 33.8 ± 3.7 MB/s hdd-multi 2.6.33.preempt 33.4 ± 3.1 MB/s hdd-single 2.6.33.nopreempt 35.9 ± 2.1 MB/s hdd-single 2.6.33.reduce 36.6 ± 2.3 MB/s hdd-single 2.6.33.preempt 35.9 ± 2.0 MB/s ramdisk-multi 2.6.33.nopreempt 189.7 ± 9.2 MB/s ramdisk-multi 2.6.33.reduce 191.4 ± 9.5 MB/s ramdisk-multi 2.6.33.preempt 163.5 ± 9.4 MB/s ramdisk-single 2.6.33.nopreempt 152.3 ± 10.9 MB/s ramdisk-single 2.6.33.reduce 171.3 ± 17.0 MB/s ramdisk-single 2.6.33.preempt 144.2 ± 15.2 MB/s The tests were run on a laptop with dual AMD Turion 2 GHz, 2 GB RAM. A newly created filesystem was used for every fio run. In the 'hdd' tests the filesystem was on a 24 GB LV on a harddisk. These tests were repeated 12 times. - In the '-single' variant a single process wrote a 5 GB file. - In the '-multi' variant 5 processes wrote a 1 GB file each. In the 'ramdisk' tests the filesystem was on a 1.5 GB ramdisk. These tests were repeated >40 times. - In the '-single' variant a single process wrote a 1400 MB file. - In the '-multi' variant 5 processes wrote a 280 MB file each. The kernels were: '2.6.33.nopreempt' - vanilla 2.6.33 with CONFIG_PREEMPT_NONE '2.6.33.reduce' - the same + the patch to add the cond_resched() '2.6.33.preempt' - 2.6.33 with CONFIG_PREEMPT (for curiosity) The data for 'aggregate bandwidth' were taken from fio's 'aggrb' result. The margin of error as reported in the table is 2 * standard deviation. Conclusion: Adding the cond_resched() did not result in any measurable performance decrease of sequential writes. (The results show a performance increase, but it's within the margin of error.) > I don't think this is the best way to fix this problem, though. The > real right answer is to change how the code is structued. All of the > callsites that call mpage_da_submit_io() are immediately preceeded by > mpage_da_map_blocks(). These two functions should be combined and > instead of calling ext4_writepage() for each page, > mpage_da_map_and_write_blocks() should make a single call to > submit_bio() for each extent. That should far more CPU efficient, > solving both your scheduling latency issue as well as helping out for > benchmarks that strive to stress both the disk and CPU simultaneously > (such as for example the TPC benchmarks). > > This will also make our blktrace results much more compact, and Chris > Mason will be very happy about that! You're almost certainly right, but I'm not likely to make such a change in the near future. Michal -- To unsubscribe from this list: send the line "unsubscribe linux-ext4" in the body of a message to majordomo@xxxxxxxxxxxxxxx More majordomo info at http://vger.kernel.org/majordomo-info.html