On Mon, Mar 02, 2015 at 08:48:05AM +1100, Dave Chinner wrote:
> On Sat, Feb 28, 2015 at 05:15:58PM -0500, Johannes Weiner wrote:
> > On Sat, Feb 28, 2015 at 11:41:58AM -0500, Theodore Ts'o wrote:
> > > On Sat, Feb 28, 2015 at 11:29:43AM -0500, Johannes Weiner wrote:
> > > >
> > > > I'm trying to figure out if the current nofail allocators can get
> > > > their memory needs figured out beforehand. And reliably so - what
> > > > good are estimates that are right 90% of the time, when failing the
> > > > allocation means corrupting user data? What is the contingency plan?
> > >
> > > In the ideal world, we can figure out the exact memory needs
> > > beforehand. But we live in an imperfect world, and given that block
> > > devices *also* need memory, the answer is "of course not". We can't
> > > be perfect. But we can least give some kind of hint, and we can offer
> > > to wait before we get into a situation where we need to loop in
> > > GFP_NOWAIT --- which is the contingency/fallback plan.
> >
> > Overestimating should be fine, the result would a bit of false memory
> > pressure. But underestimating and looping can't be an option or the
> > original lockups will still be there. We need to guarantee forward
> > progress or the problem is somewhat mitigated at best - only now with
> > quite a bit more complexity in the allocator and the filesystems.
>
> The additional complexity in XFS is actually quite minor, and
> initial "rough worst case" memory usage estimates are not that hard
> to measure....

And, just to point out that the OOM killer can be invoked without a
single transaction-based filesystem ENOMEM failure, here's what
xfs/084 does on 4.0-rc1:

[ 148.820369] resvtest invoked oom-killer: gfp_mask=0x201da, order=0, oom_score_adj=0
[ 148.822113] resvtest cpuset=/ mems_allowed=0
[ 148.823124] CPU: 0 PID: 4342 Comm: resvtest Not tainted 4.0.0-rc1-dgc+ #825
[ 148.824648] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS Bochs 01/01/2011
[ 148.826471] 0000000000000000 ffff88003ba2b988 ffffffff81dcb570 000000000000000c
[ 148.828220] ffff88003bb06380 ffff88003ba2ba08 ffffffff81dc5c2f 0000000000000000
[ 148.829958] 0000000000000000 ffff88003ba2b9a8 0000000000000206 ffff88003ba2b9d8
[ 148.831734] Call Trace:
[ 148.832325] [<ffffffff81dcb570>] dump_stack+0x4c/0x65
[ 148.833493] [<ffffffff81dc5c2f>] dump_header.isra.12+0x79/0x1cb
[ 148.834855] [<ffffffff8117db69>] oom_kill_process+0x1c9/0x3b0
[ 148.836195] [<ffffffff810a7105>] ? has_capability_noaudit+0x25/0x40
[ 148.837633] [<ffffffff8117e0c5>] __out_of_memory+0x315/0x500
[ 148.838925] [<ffffffff8117e44b>] out_of_memory+0x5b/0x80
[ 148.840162] [<ffffffff811830d9>] __alloc_pages_nodemask+0x7d9/0x810
[ 148.841592] [<ffffffff811c0531>] alloc_pages_current+0x91/0x100
[ 148.842950] [<ffffffff8117a427>] __page_cache_alloc+0xa7/0xc0
[ 148.844286] [<ffffffff8117c688>] filemap_fault+0x1b8/0x420
[ 148.845545] [<ffffffff811a05ed>] __do_fault+0x3d/0x70
[ 148.846706] [<ffffffff811a4478>] handle_mm_fault+0x988/0x1230
[ 148.848042] [<ffffffff81090305>] __do_page_fault+0x1a5/0x460
[ 148.849333] [<ffffffff81090675>] trace_do_page_fault+0x45/0x130
[ 148.850681] [<ffffffff8108b8ce>] do_async_page_fault+0x1e/0xd0
[ 148.852025] [<ffffffff81dd1567>] ? schedule+0x37/0x90
[ 148.853187] [<ffffffff81dd8b88>] async_page_fault+0x28/0x30
[ 148.854456] Mem-Info:
[ 148.854986] Node 0 DMA per-cpu:
[ 148.855727] CPU 0: hi: 0, btch: 1 usd: 0
[ 148.856820] Node 0 DMA32 per-cpu:
[ 148.857600] CPU 0: hi: 186, btch: 31 usd: 0
[ 148.858688] active_anon:119251 inactive_anon:119329 isolated_anon:0
[ 148.858688] active_file:19 inactive_file:2 isolated_file:0
[ 148.858688] unevictable:0 dirty:0 writeback:0 unstable:0
[ 148.858688] free:1965 slab_reclaimable:2816 slab_unreclaimable:2184
[ 148.858688] mapped:3 shmem:2 pagetables:1259 bounce:0
[ 148.858688] free_cma:0
[ 148.865606] Node 0 DMA free:3916kB min:60kB low:72kB high:88kB active_anon:5100kB inactive_anon:5324kB active_file:0kB inactive_file:8kB unevictable:0kB isolated(as
[ 148.874431] lowmem_reserve[]: 0 966 966 966
[ 148.875504] Node 0 DMA32 free:3944kB min:3944kB low:4928kB high:5916kB active_anon:471904kB inactive_anon:471992kB active_file:76kB inactive_file:0kB unevictable:0s
[ 148.884817] lowmem_reserve[]: 0 0 0 0
[ 148.885770] Node 0 DMA: 1*4kB (M) 1*8kB (U) 2*16kB (UM) 3*32kB (UM) 1*64kB (M) 1*128kB (M) 0*256kB 1*512kB (M) 1*1024kB (M) 1*2048kB (R) 0*4096kB = 3916kB
[ 148.889385] Node 0 DMA32: 8*4kB (UEM) 2*8kB (UR) 3*16kB (M) 1*32kB (M) 2*64kB (MR) 1*128kB (R) 0*256kB 1*512kB (R) 1*1024kB (R) 1*2048kB (R) 0*4096kB = 3968kB
[ 148.893068] Node 0 hugepages_total=0 hugepages_free=0 hugepages_surp=0 hugepages_size=2048kB
[ 148.894949] 47361 total pagecache pages
[ 148.895816] 47334 pages in swap cache
[ 148.896657] Swap cache stats: add 124669, delete 77335, find 83/169
[ 148.898057] Free swap = 0kB
[ 148.898714] Total swap = 497976kB
[ 148.899470] 262044 pages RAM
[ 148.900145] 0 pages HighMem/MovableOnly
[ 148.901006] 10253 pages reserved
[ 148.901735] [ pid ] uid tgid total_vm rss nr_ptes nr_pmds swapents oom_score_adj name
[ 148.903637] [ 1204] 0 1204 6039 1 15 3 163 -1000 udevd
[ 148.905571] [ 1323] 0 1323 6038 1 14 3 165 -1000 udevd
[ 148.907499] [ 1324] 0 1324 6038 1 14 3 164 -1000 udevd
[ 148.909439] [ 2176] 0 2176 2524 0 6 2 571 0 dhclient
[ 148.911427] [ 2227] 0 2227 9267 0 22 3 95 0 rpcbind
[ 148.913392] [ 2632] 0 2632 64981 30 29 3 136 0 rsyslogd
[ 148.915391] [ 2686] 0 2686 1062 1 6 3 36 0 acpid
[ 148.917325] [ 2826] 0 2826 4753 0 12 2 44 0 atd
[ 148.919209] [ 2877] 0 2877 6473 0 17 3 66 0 cron
[ 148.921120] [ 2911] 104 2911 7078 1 17 3 81 0 dbus-daemon
[ 148.923150] [ 3591] 0 3591 13731 0 28 2 165 -1000 sshd
[ 148.925073] [ 3603] 0 3603 22024 0 43 2 215 0 winbindd
[ 148.927066] [ 3612] 0 3612 22024 0 42 2 216 0 winbindd
[ 148.929062] [ 3636] 0 3636 3722 1 11 3 41 0 getty
[ 148.930981] [ 3637] 0 3637 3722 1 11 3 40 0 getty
[ 148.932915] [ 3638] 0 3638 3722 1 11 3 39 0 getty
[ 148.934835] [ 3639] 0 3639 3722 1 11 3 40 0 getty
[ 148.936789] [ 3640] 0 3640 3722 1 11 3 40 0 getty
[ 148.938704] [ 3641] 0 3641 3722 1 10 3 38 0 getty
[ 148.940635] [ 3642] 0 3642 3677 1 11 3 40 0 getty
[ 148.942550] [ 3643] 0 3643 25894 2 52 2 248 0 sshd
[ 148.944469] [ 3649] 0 3649 146652 1 35 4 320 0 console-kit-dae
[ 148.946578] [ 3716] 0 3716 48287 1 31 4 171 0 polkitd
[ 148.948552] [ 3722] 1000 3722 25894 0 51 2 250 0 sshd
[ 148.950457] [ 3723] 1000 3723 5435 3 15 3 495 0 bash
[ 148.952375] [ 3742] 0 3742 17157 1 37 2 160 0 sudo
[ 148.954275] [ 3743] 0 3743 3365 1 11 3 516 0 check
[ 148.956229] [ 4130] 0 4130 3334 1 11 3 484 0 084
[ 148.958108] [ 4342] 0 4342 314556 191159 619 4 119808 0 resvtest
[ 148.960104] [ 4343] 0 4343 3334 0 11 3 485 0 084
[ 148.961990] [ 4344] 0 4344 3334 0 11 3 485 0 084
[ 148.963876] [ 4345] 0 4345 3305 0 11 3 36 0 sed
[ 148.965766] [ 4346] 0 4346 3305 0 11 3 37 0 sed
[ 148.967652] Out of memory: Kill process 4342 (resvtest) score 803 or sacrifice child
[ 148.969390] Killed process 4342 (resvtest) total-vm:1258224kB, anon-rss:764636kB, file-rss:0kB
[ 149.415288] XFS (vda): Unmounting Filesystem
[ 150.211229] XFS (vda): Mounting V5 Filesystem
[ 150.292092] XFS (vda): Ending clean mount
[ 150.342307] XFS (vda): Unmounting Filesystem
[ 150.346522] XFS (vdb): Unmounting Filesystem
[ 151.264135] XFS: kmalloc allocations by trans type
[ 151.265195] XFS: 3: count 7, bytes 3992, fails 0, max_size 1024
[ 151.266479] XFS: 4: count 3, bytes 400, fails 0, max_size 144
[ 151.267735] XFS: 7: count 9, bytes 2784, fails 0, max_size 536
[ 151.269022] XFS: 16: count 1, bytes 696, fails 0, max_size 696
[ 151.270286] XFS: 26: count 1, bytes 384, fails 0, max_size 384
[ 151.271550] XFS: 35: count 1, bytes 696, fails 0, max_size 696
[ 151.272833] XFS: slab allocations by trans type
[ 151.273818] XFS: 3: count 22, bytes 0, fails 0, max_size 0
[ 151.275010] XFS: 4: count 13, bytes 0, fails 0, max_size 0
[ 151.276212] XFS: 7: count 12, bytes 0, fails 0, max_size 0
[ 151.277406] XFS: 15: count 2, bytes 0, fails 0, max_size 0
[ 151.278595] XFS: 16: count 10, bytes 0, fails 0, max_size 0
[ 151.279854] XFS: 18: count 2, bytes 0, fails 0, max_size 0
[ 151.281080] XFS: 26: count 3, bytes 0, fails 0, max_size 0
[ 151.282275] XFS: 35: count 2, bytes 0, fails 0, max_size 0
[ 151.283476] XFS: vmalloc allocations by trans type
[ 151.284535] XFS: page allocations by trans type

Those XFS allocation stats are the largest measured allocations done
under transaction context, broken down by allocation and transaction
type. No failures that would result in looping, even though the
system invoked the OOM killer on a filesystem workload....

I need to break the slab allocations down further by cache (other
workloads are generating over 50 slab allocations per transaction),
but another hour's work and a few days of observation of the stats
in my normal day-to-day work will get me all the information I need
to do a decent first pass at memory reservation requirements for
XFS.

Cheers,

Dave.
-- 
Dave Chinner
david@xxxxxxxxxxxxx
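
For anyone wanting to collect the same numbers over several runs, the
"allocations by trans type" lines in the log above are regular enough
to scrape out of dmesg. What follows is only a minimal sketch: the line
format is assumed from the output shown above, the field names are
taken at face value from their labels, and the helper names
(parse_xfs_alloc_stats, total_fails, largest_by_bytes) and the SAMPLE
text are hypothetical, not part of XFS or of any existing tool.

```python
import re

# Assumed formats, copied from the log output above (not a stable kernel ABI).
SECTION_RE = re.compile(r"XFS: (\w+) allocations by trans type")
STAT_RE = re.compile(
    r"XFS: (\d+): count (\d+), bytes (\d+), fails (\d+), max_size (\d+)"
)

# Hypothetical sample: a few lines lifted from the log above.
SAMPLE = """\
[ 151.264135] XFS: kmalloc allocations by trans type
[ 151.265195] XFS: 3: count 7, bytes 3992, fails 0, max_size 1024
[ 151.266479] XFS: 4: count 3, bytes 400, fails 0, max_size 144
[ 151.272833] XFS: slab allocations by trans type
[ 151.273818] XFS: 3: count 22, bytes 0, fails 0, max_size 0
[ 151.283476] XFS: vmalloc allocations by trans type
"""


def parse_xfs_alloc_stats(log_text):
    """Return {allocator: {trans_type: {count, bytes, fails, max_size}}}."""
    stats = {}
    section = None
    for line in log_text.splitlines():
        header = SECTION_RE.search(line)
        if header:
            # A new "<allocator> allocations by trans type" block starts here.
            section = header.group(1)
            stats.setdefault(section, {})
            continue
        entry = STAT_RE.search(line)
        if entry and section is not None:
            ttype, count, nbytes, fails, max_size = map(int, entry.groups())
            stats[section][ttype] = {
                "count": count,
                "bytes": nbytes,
                "fails": fails,
                "max_size": max_size,
            }
    return stats


def total_fails(stats):
    """Sum the 'fails' field over every allocator and transaction type."""
    return sum(
        entry["fails"]
        for per_type in stats.values()
        for entry in per_type.values()
    )


def largest_by_bytes(stats, allocator):
    """Return (trans_type, bytes) with the largest 'bytes' value, or None."""
    per_type = stats.get(allocator, {})
    if not per_type:
        return None
    ttype, entry = max(per_type.items(), key=lambda kv: kv[1]["bytes"])
    return ttype, entry["bytes"]
```

Feeding each run's dmesg through parse_xfs_alloc_stats() and keeping the
maximum seen per transaction type would give one possible starting
point for a rough worst-case table; whether that is a sufficient basis
for a reservation is a separate question the script does not answer.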