The question is: does it become unreclaimable, period (i.e. it stays full even when you force the kernel to drop caches)... or is it only unreclaimable by other processes allocating via GFP_NOFS?

Also, you should be able to port the fscache changes onto your tree by looking at David's commit history.

- Milosz
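One quick way to tell those two cases apart is to drop the caches by hand and watch whether the page cache actually shrinks. A minimal sketch (Python, run as root; the before/after check against nr_file_pages is only an illustration, not a tool from this thread):

#!/usr/bin/env python
# Drop clean caches and see whether the file page cache actually shrinks.
# Needs root. Writing "3" drops the pagecache plus dentries and inodes;
# dirty or pinned pages are left alone, which is what we want to measure.
import os

def nr_file_pages():
    # nr_file_pages in /proc/vmstat is the page cache size in pages.
    with open('/proc/vmstat') as f:
        return int(dict(line.split() for line in f)['nr_file_pages'])

os.system('sync')                      # flush dirty pages out of the way first
before = nr_file_pages()
with open('/proc/sys/vm/drop_caches', 'w') as f:
    f.write('3\n')
after = nr_file_pages()
print('nr_file_pages: before=%d after=%d freed=%d'
      % (before, after, before - after))

If nr_file_pages barely moves, that would suggest the pages are pinned outright (a leaked reference); if it collapses, reclaim would appear to be failing only in the GFP_NOFS allocation path.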
On Fri, Jul 25, 2014 at 11:15 AM, Shantanu Goel <sgoel01@xxxxxxxxx> wrote:
> Our codebase is a bit older than 3.15.5, so the only patch I was able to integrate immediately and deploy is the one which papers over the wait_on_page_write deadlock. We are still evaluating the others. The underlying problem of a page reference leak remains, which I'm still trying to chase down. It literally gets to the point where the entire file page cache becomes unreclaimable.
>
> Thanks,
> Shantanu
>
>
>> On Friday, July 25, 2014 11:04 AM, Milosz Tanski <milosz@xxxxxxxxx> wrote:
>>
>> Did you have a chance to try any of my patches?
>>
>> On Tue, Jul 22, 2014 at 10:11 AM, Milosz Tanski <milosz@xxxxxxxxx> wrote:
>>> Shantanu,
>>>
>>> You are correct that there is a correlation between those messages and fscache hangs. The correlation is that there are a number of bugs in the fscache code when things fail (like when allocation fails). I have 3 different patches that I'm going to submit later on for review for these issues against 3.15.5. I'll post them later today when I get a free minute.
>>>
>>> Also, sorry for not CCing the list in the first place.
>>>
>>> On Tue, Jul 22, 2014 at 8:56 AM, Shantanu Goel <sgoel01@xxxxxxxxx> wrote:
>>>> Hi Milosz,
>>>>
>>>> We thought they were harmless too initially; however, we have seen quite a few hangs at this point on hosts where we enabled fscache + cachefiles and think there is some correlation with these messages. It could be some generic VM issue which is exacerbated by the use of fscache. We posted our initial message to get some ideas about where the problem might be, since we are not certain of where the problem lies.
>>>>
>>>> Thanks,
>>>> Shantanu
>>>>
>>>>> On Monday, July 21, 2014 9:31 PM, Milosz Tanski <milosz@xxxxxxxxx> wrote:
>>>>>
>>>>> Shantanu,
>>>>>
>>>>> This message is harmless. I see this a lot as well, since we have fscache on SSD disks that do 1.1GB/s transfer, a demanding application, and a lot of pages waiting for write out. The kernel cannot force a write out because fscache has to allocate pages with the GFP_NOFS flag (to prevent a recursive hang).
>>>>>
>>>>> On our production machines we changed the vm.min_free_kbytes kernel tunable to 256MB (the machine has a lot of RAM) and this happens a lot less often.
>>>>>
>>>>> - Milosz
>>>>>
>>>>> On Wed, Jul 9, 2014 at 12:35 PM, Shantanu Goel <sgoel01@xxxxxxxxx> wrote:
>>>>>> Hi,
>>>>>>
>>>>>> We are running Linux 3.10 + fscache-20130702 + commit bae6b235905ab9dc6659395f7802c1d36fb63f15 from dhowells' git tree and suspect there might be a page reference leak somewhere, as evidenced by the low reclaim ratio. We also see the kernel complaining about memory allocation failures in the readahead code, as seen below.
>>>>>>
>>>>>> The machine has plenty of file cache and there isn't any heavy write activity which could also pin the pages.
>>>>>>
>>>>>> We wrote a script to monitor /proc/vmstat and see the following output with a one second interval.
>>>>>>
>>>>>> Refill      Scan    Free  FrRatio   Slab   Runs  Stalls
>>>>>>      0    154663    6855        4   1024      2       0
>>>>>>      0      7152    7152      100      0      4       0
>>>>>>      0      5960    5960      100      0      4       0
>>>>>>      0      7152    7152      100   1024      3       0
>>>>>>      0    152407   21698       14   1024      2       0
>>>>>>
>>>>>> The fields are:
>>>>>>
>>>>>> Refill  - # active pages scanned, i.e. deactivated
>>>>>> Scan    - # inactive pages scanned
>>>>>> Free    - # pages freed
>>>>>> FrRatio - free / scan ratio as percentage
>>>>>> Runs    - # times kswapd ran
>>>>>> Stalls  - # times direct reclaim ran
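The script itself is not shown in the thread; a minimal sketch of a once-a-second poller producing columns like these, assuming the standard /proc/vmstat counters (pgrefill*, pgscan_kswapd*/pgscan_direct*, pgsteal*, slabs_scanned, pageoutrun, allocstall) are the ones mapped onto Refill, Scan, Free, Slab, Runs and Stalls:

#!/usr/bin/env python
# Sample /proc/vmstat once a second and print per-interval deltas.
# The counter-to-column mapping is an assumption; on 3.10 several of these
# counters are split per zone (e.g. pgscan_kswapd_normal), so we sum by prefix.
import time

def read_vmstat():
    with open('/proc/vmstat') as f:
        return dict((name, int(value))
                    for name, value in (line.split() for line in f))

def total(counters, *prefixes):
    # Fold per-zone variants together by summing every matching prefix.
    return sum(v for k, v in counters.items()
               if any(k.startswith(p) for p in prefixes))

prev = read_vmstat()
print('%10s %10s %10s %8s %8s %6s %7s'
      % ('Refill', 'Scan', 'Free', 'FrRatio', 'Slab', 'Runs', 'Stalls'))
while True:
    time.sleep(1)
    cur = read_vmstat()
    delta = lambda *p: total(cur, *p) - total(prev, *p)
    scan = delta('pgscan_kswapd', 'pgscan_direct')
    free = delta('pgsteal')
    ratio = free * 100 // scan if scan else 0
    print('%10d %10d %10d %8d %8d %6d %7d'
          % (delta('pgrefill'), scan, free, ratio, delta('slabs_scanned'),
             delta('pageoutrun'), delta('allocstall')))
    prev = cur

Scan here folds kswapd and direct reclaim together, and FrRatio is Free/Scan as a percentage, matching the field definitions above.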
>>>>>> The free memory looks like this (free -m):
>>>>>>
>>>>>>              total       used       free     shared    buffers     cached
>>>>>> Mem:         24175      23986        188          0         16      10453
>>>>>> -/+ buffers/cache:      13517      10657
>>>>>> Swap:         8191         85       8106
>>>>>>
>>>>>> Are you aware of any such issue with fscache / cachefiles? If not, could you suggest what other information we could gather to debug it further?
>>>>>>
>>>>>> Thanks,
>>>>>> Shantanu
>>>>>>
>>>>>> SysRq : HELP : loglevel(0-9) reboot(b) crash(c) terminate-all-tasks(e) memory-full-oom-kill(f) kill-all-tasks(i) thaw-filesystems(j) sak(k) show-backtrace-all-active-cpus(l) show-memory-usage(m) nice-all-RT-tasks(n) poweroff(o) show-registers(p) show-all-timers(q) unraw(r) sync(s) show-task-states(t) unmount(u) show-blocked-tasks(w) dump-ftrace-buffer(z)
>>>>>> python: page allocation failure: order:0, mode:0x11110
>>>>>> CPU: 5 PID: 18997 Comm: python Tainted: G W O 3.10.36-el5.ia32e.lime.0 #1
>>>>>> Hardware name: Supermicro X8STi/X8STi, BIOS 2.0 06/03/2010
>>>>>> 0000000000000000 ffff880104d39578 ffffffff81426c43 ffff880104d39608
>>>>>> ffffffff810fdcfc ffff88061fffd468 0000004000000000 ffff880104d395a8
>>>>>> ffffffff810fd506 0000000000000000 ffffffff8109120d 0001111000000000
>>>>>> Call Trace:
>>>>>> [<ffffffff81426c43>] dump_stack+0x19/0x1e
>>>>>> [<ffffffff810fdcfc>] warn_alloc_failed+0xfc/0x140
>>>>>> [<ffffffff810fd506>] ? drain_local_pages+0x16/0x20
>>>>>> [<ffffffff8109120d>] ? on_each_cpu_mask+0x4d/0x70
>>>>>> [<ffffffff810fef1c>] __alloc_pages_nodemask+0x6cc/0x910
>>>>>> [<ffffffff81139a9a>] alloc_pages_current+0xba/0x160
>>>>>> [<ffffffff810f6617>] __page_cache_alloc+0xa7/0xc0
>>>>>> [<ffffffffa050a4ea>] cachefiles_read_backing_file+0x2ba/0x7b0 [cachefiles]
>>>>>> [<ffffffffa050ac87>] cachefiles_read_or_alloc_pages+0x2a7/0x3d0 [cachefiles]
>>>>>> [<ffffffff81062c4f>] ? wake_up_bit+0x2f/0x40
>>>>>> [<ffffffffa04da24a>] ? fscache_run_op+0x5a/0xa0 [fscache]
>>>>>> [<ffffffffa04daacb>] ? fscache_submit_op+0x1db/0x4c0 [fscache]
>>>>>> [<ffffffffa04dbd65>] __fscache_read_or_alloc_pages+0x1f5/0x2c0 [fscache]
>>>>>> [<ffffffffa0531e9e>] __nfs_readpages_from_fscache+0x7e/0x1b0 [nfs]
>>>>>> [<ffffffffa052bc6a>] nfs_readpages+0xca/0x1f0 [nfs]
>>>>>> [<ffffffff81139a9a>] ? alloc_pages_current+0xba/0x160
>>>>>> [<ffffffff81101ae2>] __do_page_cache_readahead+0x1b2/0x260
>>>>>> [<ffffffff81101bb1>] ra_submit+0x21/0x30
>>>>>> [<ffffffff81101e85>] ondemand_readahead+0x115/0x240
>>>>>> [<ffffffff81102038>] page_cache_async_readahead+0x88/0xb0
>>>>>> [<ffffffff810f5f5e>] ? find_get_page+0x1e/0xa0
>>>>>> [<ffffffff810f81fc>] generic_file_aio_read+0x4dc/0x720
>>>>>> [<ffffffffa05224b9>] nfs_file_read+0x89/0x100 [nfs]
>>>>>> [<ffffffff81158a4f>] do_sync_read+0x7f/0xb0
>>>>>> [<ffffffff8115a3a5>] vfs_read+0xc5/0x190
>>>>>> [<ffffffff8115a57f>] SyS_read+0x5f/0xa0
>>>>>> [<ffffffff814327bb>] tracesys+0xdd/0xe2
>>>>>> Mem-Info:
>>>>>> Node 0 DMA32 per-cpu:
>>>>>> CPU 0: hi: 186, btch: 31 usd: 2
>>>>>> CPU 1: hi: 186, btch: 31 usd: 159
>>>>>> CPU 2: hi: 186, btch: 31 usd: 82
>>>>>> CPU 3: hi: 186, btch: 31 usd: 182
>>>>>> CPU 4: hi: 186, btch: 31 usd: 57
>>>>>> CPU 5: hi: 186, btch: 31 usd: 0
>>>>>> Node 0 Normal per-cpu:
>>>>>> CPU 0: hi: 186, btch: 31 usd: 22
>>>>>> CPU 1: hi: 186, btch: 31 usd: 117
>>>>>> CPU 2: hi: 186, btch: 31 usd: 16
>>>>>> CPU 3: hi: 186, btch: 31 usd: 79
>>>>>> CPU 4: hi: 186, btch: 31 usd: 59
>>>>>> CPU 5: hi: 186, btch: 31 usd: 0
>>>>>> active_anon:3461555 inactive_anon:451550 isolated_anon:0
>>>>>> active_file:236104 inactive_file:1899298 isolated_file:0
>>>>>> unevictable:467 dirty:109 writeback:0 unstable:0
>>>>>> free:33860 slab_reclaimable:19654 slab_unreclaimable:5110
>>>>>> mapped:15284 shmem:2 pagetables:10231 bounce:0
>>>>>> free_cma:0
>>>>>> Node 0 DMA32 free:91764kB min:4824kB low:6028kB high:7236kB active_anon:2185528kB inactive_anon:566976kB active_file:108808kB inactive_file:670996kB unevictable:0kB isolated(anon):0kB isolated(file):0kB present:3660960kB managed:3646160kB mlocked:0kB dirty:152kB writeback:0kB mapped:1344kB shmem:0kB slab_reclaimable:17300kB slab_unreclaimable:580kB kernel_stack:80kB pagetables:6448kB unstable:0kB bounce:0kB free_cma:0kB writeback_tmp:0kB pages_scanned:0 all_unreclaimable? no
>>>>>> lowmem_reserve[]: 0 20607 20607
>>>>>> Node 0 Normal free:39164kB min:27940kB low:34924kB high:41908kB active_anon:11663332kB inactive_anon:1239224kB active_file:835608kB inactive_file:6926636kB unevictable:1868kB isolated(anon):0kB isolated(file):0kB present:21495808kB managed:21101668kB mlocked:1868kB dirty:612kB writeback:0kB mapped:60120kB shmem:8kB slab_reclaimable:61536kB slab_unreclaimable:19860kB kernel_stack:2088kB pagetables:34804kB unstable:0kB bounce:0kB free_cma:0kB writeback_tmp:0kB pages_scanned:0 all_unreclaimable? no
>>>>>> lowmem_reserve[]: 0 0 0
>>>>>> Node 0 DMA32: 506*4kB (UEM) 545*8kB (UEM) 569*16kB (UEM) 897*32kB (UEM) 541*64kB (UEM) 87*128kB (UM) 4*256kB (UM) 2*512kB (M) 0*1024kB 0*2048kB 0*4096kB = 92000kB
>>>>>> Node 0 Normal: 703*4kB (UE) 285*8kB (UEM) 176*16kB (UEM) 836*32kB (UEM) 7*64kB (UEM) 0*128kB 0*256kB 0*512kB 0*1024kB 0*2048kB 0*4096kB = 35108kB
>>>>>> Node 0 hugepages_total=0 hugepages_free=0 hugepages_surp=0 hugepages_size=2048kB
>>>>>> 2137331 total pagecache pages
>>>>>> 1215 pages in swap cache
>>>>>> Swap cache stats: add 7056879, delete 7055664, find 633567/703482
>>>>>> Free swap = 8301172kB
>>>>>> Total swap = 8388604kB
>>>>>> 6291455 pages RAM
>>>>>> 102580 pages reserved
>>>>>> 4391935 pages shared
>>>>>> 4746779 pages non-shared
>>>>>>
>>>>>> --
>>>>>> Linux-cachefs mailing list
>>>>>> Linux-cachefs@xxxxxxxxxx
>>>>>> https://www.redhat.com/mailman/listinfo/linux-cachefs
>>>>>
>>>>> --
>>>>> Milosz Tanski
>>>>> CTO
>>>>> 16 East 34th Street, 15th floor
>>>>> New York, NY 10016
>>>>>
>>>>> p: 646-253-9055
>>>>> e: milosz@xxxxxxxxx
>>>
>>> --
>>> Milosz Tanski
>>> CTO
>>> 16 East 34th Street, 15th floor
>>> New York, NY 10016
>>>
>>> p: 646-253-9055
>>> e: milosz@xxxxxxxxx
>>
>> --
>> Milosz Tanski
>> CTO
>> 16 East 34th Street, 15th floor
>> New York, NY 10016
>>
>> p: 646-253-9055
>> e: milosz@xxxxxxxxx

--
Milosz Tanski
CTO
16 East 34th Street, 15th floor
New York, NY 10016

p: 646-253-9055
e: milosz@xxxxxxxxx

--
Linux-cachefs mailing list
Linux-cachefs@xxxxxxxxxx
https://www.redhat.com/mailman/listinfo/linux-cachefs