On Wed, 24 Jun 2015, Vlastimil Babka wrote:

> [add some CC's]
>
> On 06/19/2015 05:16 PM, Mark Hills wrote:
> > I noticed that any change to vm.dirty_ratio causes write throughput
> > to plummet -- to around 5Mbyte/sec.
> >
> >   <system bootup, kernel 4.0.5>
> >
> >   # dd if=/dev/zero of=/path/to/file bs=1M
> >
> >   # sysctl vm.dirty_ratio
> >   vm.dirty_ratio = 20
> >   <all ok; writes at ~150Mbyte/sec>
> >
> >   # sysctl vm.dirty_ratio=20
> >   <all continues to be ok>
> >
> >   # sysctl vm.dirty_ratio=21
> >   <writes drop to ~5Mbyte/sec>
> >
> >   # sysctl vm.dirty_ratio=20
> >   <writes continue to be slow at ~5Mbyte/sec>
> >
> > The test shows that returning to the previous value does not restore
> > the old behaviour; I had to return the system to a usable state with
> > a reboot.
> >
> > Reads continue to be fast and are not affected.
> >
> > A quick look at the code suggests differing behaviour from
> > writeback_set_ratelimit() on startup, and that some of the
> > calculations (eg. global_dirty_limit) are badly behaved once the
> > system has booted.
>
> Hmm, so the only thing that dirty_ratio_handler() changes, apart from
> vm_dirty_ratio itself, is ratelimit_pages, through
> writeback_set_ratelimit(). So I assume the problem is with
> ratelimit_pages. There's num_online_cpus() used in the calculation,
> which I think would differ between the initial system state (where we
> are called by page_writeback_init()) and later, when all CPUs are
> onlined. But I don't see the CPU onlining code updating the limit
> (unlike memory hotplug, which does that), so that's suspicious.
>
> Another suspicious thing is that global_dirty_limits() looks at the
> current process's flags. It seems odd to me that the process calling
> the sysctl would determine a value global to the system.

Yes, I also spotted this. The fragment of code is:

	tsk = current;
	if (tsk->flags & PF_LESS_THROTTLE || rt_task(tsk)) {
		background += background / 4;
		dirty += dirty / 4;
	}

It seems to imply the code was not always used from the /proc interface.
It's relevant in a moment...

> If you are brave enough (and have the kernel configured properly, and
> with debuginfo),

I'm brave... :) I hadn't seen this tool before; thanks for introducing
me to it. I will use it more now, I'm sure.

> you can verify how the value of the ratelimit_pages variable changes
> on the live system, using the crash tool. Just start it, and if
> everything works, you can inspect the live system. It's a bit
> complicated since there are two static variables called
> "ratelimit_pages" in the kernel, so we can't print them easily (or I
> don't know how). First we have to get the variable address:
>
>   crash> sym ratelimit_pages
>   ffffffff81e67200 (d) ratelimit_pages
>   ffffffff81ef4638 (d) ratelimit_pages
>
> One will be absurdly high (probably less so on your 32bit), so it's
> not the one we want:
>
>   crash> rd -d ffffffff81ef4638 1
>   ffffffff81ef4638:  4294967328768
>
> The second will have a smaller value (my system after boot, with
> dirty ratio = 20):
>
>   crash> rd -d ffffffff81e67200 1
>   ffffffff81e67200:  1577
>
> (after changing to 21)
>
>   crash> rd -d ffffffff81e67200 1
>   ffffffff81e67200:  1570
>
> (after changing back to 20)
>
>   crash> rd -d ffffffff81e67200 1
>   ffffffff81e67200:  1496

In my case there's only one such symbol (perhaps because this kernel
config is quite slimmed down?):

  crash> sym ratelimit_pages
  c148b618 (d) ratelimit_pages

  (bootup with dirty_ratio = 20)
  crash> rd -d ratelimit_pages
  c148b618:  78

  (after changing to 21)
  crash> rd -d ratelimit_pages
  c148b618:  16

  (after changing back to 20)
  crash> rd -d ratelimit_pages
  c148b618:  16

Compared to your system, even the bootup value seems pretty low.

I am new to this code, but I took a look, and it seems we're basically
hitting the lower bound of 16:

	void writeback_set_ratelimit(void)
	{
		unsigned long background_thresh;
		unsigned long dirty_thresh;

		global_dirty_limits(&background_thresh, &dirty_thresh);
		global_dirty_limit = dirty_thresh;
		ratelimit_pages = dirty_thresh / (num_online_cpus() * 32);
		if (ratelimit_pages < 16)
			ratelimit_pages = 16;
	}

From this code, we don't have dirty_thresh preserved, but we do have
global_dirty_limit:

  crash> rd -d global_dirty_limit
  c1545080:  0

And if that is zero, then:

  ratelimit_pages = 0 / (num_online_cpus() * 32) = 0

which the clamp raises to 16. So it seems like this is the path to
follow.
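As a sanity check on that path, here's a throwaway user-space sketch of
the same arithmetic (my own toy code, not the kernel's; the 8 CPUs match
this machine, and the second threshold is just an illustrative number
chosen to land near the healthy value that appears further down):

  #include <stdio.h>

  /*
   * Toy model of the writeback_set_ratelimit() arithmetic quoted
   * above; online_cpus stands in for num_online_cpus().
   */
  static unsigned long ratelimit(unsigned long dirty_thresh,
                                 unsigned long online_cpus)
  {
          unsigned long pages = dirty_thresh / (online_cpus * 32);

          return pages < 16 ? 16 : pages;  /* the clamp we're hitting */
  }

  int main(void)
  {
          /* dirty_thresh == 0, as global_dirty_limit suggests */
          printf("%lu\n", ratelimit(0, 8));       /* -> 16, as observed */

          /* an illustrative, healthy threshold of ~560k pages */
          printf("%lu\n", ratelimit(559700, 8));  /* -> 2186 */

          return 0;
  }

A dirty_thresh of zero is enough to explain the 16, so the question
becomes why global_dirty_limits() hands back (nearly) zero.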
The function global_dirty_limits() produces the value for dirty_thresh
and, aside from the potential 25% boost for PF_LESS_THROTTLE/rt tasks
(the 'task dependent' fragment quoted earlier), the value is derived as:

	if (vm_dirty_bytes)
		dirty = DIV_ROUND_UP(vm_dirty_bytes, PAGE_SIZE);
	else
		dirty = (vm_dirty_ratio * available_memory) / 100;

I checked the vm_dirty_bytes codepath, and that works:

  (vm.dirty_bytes = 1048576000, i.e. 1000Mb)
  crash> rd -d ratelimit_pages
  c148b618:  1000

Therefore it's the 'else' case that misbehaves, and this points to
available_memory being zero, or near it (in my case < 5).

That value is the direct result of global_dirtyable_memory(), which I've
annotated with some values:

	static unsigned long global_dirtyable_memory(void)
	{
		unsigned long x;

		x = global_page_state(NR_FREE_PAGES);		/* 2648091    */
		x -= min(x, dirty_balance_reserve);		/* -   175522 */

		x += global_page_state(NR_INACTIVE_FILE);	/* +   156369 */
		x += global_page_state(NR_ACTIVE_FILE);		/* +     3475 */
							/* = 2632413 */
		if (!vm_highmem_is_dirtyable)
			x -= highmem_dirtyable_memory(x);

		return x + 1;	/* Ensure that we never return 0 */
	}

If I'm correct here, these global counters include the highmem pages,
which implies that highmem_dirtyable_memory() is returning a value only
slightly less than, or equal to, the sum of the others.
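Plugging the numbers in: the annotations above, plus the HighMem zone
from the zoneinfo at the end of this mail. (The two snapshots were taken
at different moments, so they don't cross-match exactly; I also ignore
the highmem share of dirty_balance_reserve, which is too small to change
the outcome, and as far as I can see highmem_dirtyable_memory() caps its
result at the total it is passed.)

  #include <stdio.h>

  int main(void)
  {
          /* global counters, as annotated above */
          unsigned long x = 2648091;      /* NR_FREE_PAGES         */
          x -= 175522;                    /* dirty_balance_reserve */
          x += 156369;                    /* NR_INACTIVE_FILE      */
          x += 3475;                      /* NR_ACTIVE_FILE        */

          /*
           * HighMem zone, from the zoneinfo below:
           * free + inactive_file + active_file
           */
          unsigned long highmem = 2536526 + 80138 + 273523;

          /* min(highmem, x), as highmem_dirtyable_memory() does */
          unsigned long hm = highmem < x ? highmem : x;

          printf("x = %lu, highmem = %lu\n", x, highmem);
                                          /* 2632413, 2890187 */
          printf("dirtyable = %lu\n", x - hm + 1);        /* 1 */

          return 0;
  }

So on this machine virtually everything that is free or page cache lives
in highmem; subtract it and global_dirtyable_memory() collapses to its
floor of 1 page.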
To test, I flipped vm_highmem_is_dirtyable (which had no effect until I
forced it to re-evaluate ratelimit_pages):

  $ echo 1 > /proc/sys/vm/highmem_is_dirtyable
  $ echo 21 > /proc/sys/vm/dirty_ratio
  $ echo 20 > /proc/sys/vm/dirty_ratio

  crash> rd -d ratelimit_pages
  c148b618:  2186

The value is now healthy; more so, even, than the value we started with
on bootup.

My questions and observations are:

* What does highmem_is_dirtyable actually mean, and should it really
  default to 1? Is it actually a misnomer? Since it's only used in
  global_dirtyable_memory(), it doesn't actually prevent dirtying of
  highmem; it just attempts to place a limit that corresponds to the
  amount of non-highmem. I have limited understanding at the moment,
  but that would be something different.

* The codepath for setting highmem_is_dirtyable from /proc is broken:
  it also needs to make a call to writeback_set_ratelimit(). A sketch
  of what I mean follows this list.

* Even with highmem_is_dirtyable=1, there's still a sizeable difference
  between the value on bootup (78) and the evaluation once booted
  (2186). This goes in the wrong direction, and is far too big a
  difference to be solely num_online_cpus() switching from 1 to 8. The
  machine is 32-bit with 12GiB of RAM.

For info, I posted a typical zoneinfo below, after my signature.
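On the second point, this is the kind of handler I have in mind --
modelled on dirty_ratio_handler(), completely untested, and the function
name is my own invention:

  /*
   * Hypothetical: give vm.highmem_is_dirtyable a proper handler
   * (instead of plain proc_dointvec_minmax() in kernel/sysctl.c),
   * so that a write re-derives ratelimit_pages, the same way
   * dirty_ratio_handler() does.
   */
  int highmem_is_dirtyable_handler(struct ctl_table *table, int write,
                  void __user *buffer, size_t *lenp, loff_t *ppos)
  {
          int old = vm_highmem_is_dirtyable;
          int ret;

          ret = proc_dointvec_minmax(table, write, buffer, lenp, ppos);
          if (ret == 0 && write && vm_highmem_is_dirtyable != old)
                  writeback_set_ratelimit();

          return ret;
  }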
> So yes, it does differ but not drastically. A difference between 1
> and 8 online CPUs would look different, I think. So my theory above
> is questionable. But you might try what it looks like on your
> system...
>
> > The system is an HP xw6600, running an i686 kernel. This happens
> > whether an internal SATA HDD, SSD or external USB drive is used. I
> > first saw this on kernel 4.0.4, and 4.0.5 is also affected.
>
> So what was the last version where you did change the dirty ratio and
> it worked fine?

Sorry, I don't know when it broke. I don't immediately have access to
an old kernel to test, but I could do that if necessary.

> > It would surprise me if I'm the only person who was setting
> > dirty_ratio.
> >
> > Have others seen this behaviour? Thanks

Thanks, I hope you find this useful.

--
Mark

Node 0, zone      DMA
  pages free     1566
        min      196
        low      245
        high     294
        scanned  0
        spanned  4095
        present  3989
        managed  3970
    nr_free_pages                 1566
    nr_alloc_batch                49
    nr_inactive_anon              0
    nr_active_anon                0
    nr_inactive_file              163
    nr_active_file                1129
    nr_unevictable                0
    nr_mlock                      0
    nr_anon_pages                 0
    nr_mapped                     0
    nr_file_pages                 1292
    nr_dirty                      0
    nr_writeback                  0
    nr_slab_reclaimable           842
    nr_slab_unreclaimable         162
    nr_page_table_pages           17
    nr_kernel_stack               4
    nr_unstable                   0
    nr_bounce                     0
    nr_vmscan_write               0
    nr_vmscan_immediate_reclaim   0
    nr_writeback_temp             0
    nr_isolated_anon              0
    nr_isolated_file              0
    nr_shmem                      0
    nr_dirtied                    661
    nr_written                    661
    nr_pages_scanned              0
    workingset_refault            0
    workingset_activate           0
    workingset_nodereclaim        0
    nr_anon_transparent_hugepages 0
    nr_free_cma                   0
        protection: (0, 377, 12165, 12165)
  pagesets (vm stats threshold: 8)
    cpu 0: count 0, high 0, batch 1
    cpu 1: count 0, high 0, batch 1
    cpu 2: count 0, high 0, batch 1
    cpu 3: count 0, high 0, batch 1
    cpu 4: count 0, high 0, batch 1
    cpu 5: count 0, high 0, batch 1
    cpu 6: count 0, high 0, batch 1
    cpu 7: count 0, high 0, batch 1
  all_unreclaimable: 0
  start_pfn:         1
  inactive_ratio:    1

Node 0, zone   Normal
  pages free     37336
        min      4789
        low      5986
        high     7183
        scanned  0
        spanned  123902
        present  123902
        managed  96773
    nr_free_pages                 37336
    nr_alloc_batch                331
    nr_inactive_anon              0
    nr_active_anon                0
    nr_inactive_file              4016
    nr_active_file                26672
    nr_unevictable                0
    nr_mlock                      0
    nr_anon_pages                 0
    nr_mapped                     1
    nr_file_pages                 30684
    nr_dirty                      4
    nr_writeback                  0
    nr_slab_reclaimable           19865
    nr_slab_unreclaimable         4673
    nr_page_table_pages           1027
    nr_kernel_stack               281
    nr_unstable                   0
    nr_bounce                     0
    nr_vmscan_write               0
    nr_vmscan_immediate_reclaim   0
    nr_writeback_temp             0
    nr_isolated_anon              0
    nr_isolated_file              0
    nr_shmem                      0
    nr_dirtied                    14354
    nr_written                    21672
    nr_pages_scanned              0
    workingset_refault            0
    workingset_activate           0
    workingset_nodereclaim        0
    nr_anon_transparent_hugepages 0
    nr_free_cma                   0
        protection: (0, 0, 94302, 94302)
  pagesets (vm stats threshold: 24)
    cpu 0: count 78,  high 186, batch 31
    cpu 1: count 140, high 186, batch 31
    cpu 2: count 116, high 186, batch 31
    cpu 3: count 100, high 186, batch 31
    cpu 4: count 70,  high 186, batch 31
    cpu 5: count 82,  high 186, batch 31
    cpu 6: count 144, high 186, batch 31
    cpu 7: count 59,  high 186, batch 31
  all_unreclaimable: 0
  start_pfn:         4096
  inactive_ratio:    1

Node 0, zone  HighMem
  pages free     2536526
        min      128
        low      37501
        high     74874
        scanned  0
        spanned  3214338
        present  3017668
        managed  3017668
    nr_free_pages                 2536526
    nr_alloc_batch                10793
    nr_inactive_anon              2118
    nr_active_anon                118021
    nr_inactive_file              80138
    nr_active_file                273523
    nr_unevictable                3475
    nr_mlock                      3475
    nr_anon_pages                 119672
    nr_mapped                     48158
    nr_file_pages                 357567
    nr_dirty                      0
    nr_writeback                  0
    nr_slab_reclaimable           0
    nr_slab_unreclaimable         0
    nr_page_table_pages           0
    nr_kernel_stack               0
    nr_unstable                   0
    nr_bounce                     0
    nr_vmscan_write               0
    nr_vmscan_immediate_reclaim   0
    nr_writeback_temp             0
    nr_isolated_anon              0
    nr_isolated_file              0
    nr_shmem                      2766
    nr_dirtied                    1882996
    nr_written                    1695681
    nr_pages_scanned              0
    workingset_refault            0
    workingset_activate           0
    workingset_nodereclaim        0
    nr_anon_transparent_hugepages 151
    nr_free_cma                   0
        protection: (0, 0, 0, 0)
  pagesets (vm stats threshold: 64)
    cpu 0: count 171, high 186, batch 31
    cpu 1: count 80,  high 186, batch 31
    cpu 2: count 91,  high 186, batch 31
    cpu 3: count 173, high 186, batch 31
    cpu 4: count 114, high 186, batch 31
    cpu 5: count 159, high 186, batch 31
    cpu 6: count 130, high 186, batch 31
    cpu 7: count 62,  high 186, batch 31
  all_unreclaimable: 0
  start_pfn:         127998
  inactive_ratio:    10
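P.S. A quick cross-check of the dump above (my arithmetic, assuming
4KiB pages):

  #include <stdio.h>

  int main(void)
  {
          /* "managed" pages per zone, from the dump above */
          unsigned long dma = 3970, normal = 96773, highmem = 3017668;
          unsigned long total = dma + normal + highmem;   /* 3118411 */

          printf("total   = %.1f GiB\n", total * 4096.0 / (1 << 30));
          printf("highmem = %.0f%% of managed pages\n",
                 100.0 * highmem / total);

          return 0;
  }

That gives ~11.9 GiB in total, consistent with the 12GiB figure, with
~97% of it in HighMem -- which is why excluding highmem leaves this
machine with almost no dirtyable memory.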