Re: [patch 10/10] mm: keep page cache radix tree nodes in check

[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

 



On Tue, Feb 04, 2014 at 03:07:56PM -0800, Andrew Morton wrote:
> On Mon,  3 Feb 2014 19:53:42 -0500 Johannes Weiner <hannes@xxxxxxxxxxx> wrote:
> 
> > Previously, page cache radix tree nodes were freed after reclaim
> > emptied out their page pointers.  But now reclaim stores shadow
> > entries in their place, which are only reclaimed when the inodes
> > themselves are reclaimed.  This is problematic for bigger files that
> > are still in use after they have a significant amount of their cache
> > reclaimed, without any of those pages actually refaulting.  The shadow
> > entries will just sit there and waste memory.  In the worst case, the
> > shadow entries will accumulate until the machine runs out of memory.
> > 
> > To get this under control, the VM will track radix tree nodes
> > exclusively containing shadow entries on a per-NUMA node list.
> > Per-NUMA rather than global because we expect the radix tree nodes
> > themselves to be allocated node-locally and we want to reduce
> > cross-node references of otherwise independent cache workloads.  A
> > simple shrinker will then reclaim these nodes on memory pressure.

    ^^^^^^^^^^^^^^^
> > A few things need to be stored in the radix tree node to implement the
> > shadow node LRU and allow tree deletions coming from the list:
> > 
> > 1. There is no index available that would describe the reverse path
> >    from the node up to the tree root, which is needed to perform a
> >    deletion.  To solve this, encode in each node its offset inside the
> >    parent.  This can be stored in the unused upper bits of the same
> >    member that stores the node's height at no extra space cost.
> > 
> > 2. The number of shadow entries needs to be counted in addition to the
> >    regular entries, to quickly detect when the node is ready to go to
> >    the shadow node LRU list.  The current entry count is an unsigned
> >    int but the maximum number of entries is 64, so a shadow counter
> >    can easily be stored in the unused upper bits.
> > 
> > 3. Tree modification needs tree lock and tree root, which are located
> >    in the address space, so store an address_space backpointer in the
> >    node.  The parent pointer of the node is in a union with the 2-word
> >    rcu_head, so the backpointer comes at no extra cost as well.
> > 
> > 4. The node needs to be linked to an LRU list, which requires a list
> >    head inside the node.  This does increase the size of the node, but
> >    it does not change the number of objects that fit into a slab page.
> 
> changelog forgot to mention that this reclaim is performed via a
> shrinker...

Uhm...  see above? :)

> How expensive is that list walk in scan_shadow_nodes()?  I assume in
> the best case it will bale out after nr_to_scan iterations?

Yes, it scans sc->nr_to_scan radix tree nodes, cleans their pointers,
and frees them.

I ran a worst-case scenario on an 8G machine that creates one 8T
sparse file and faults one page per 64-page radix tree node, i.e. one
node per sparse file fault at CPU speed.  The profile:

     1       9.21%     radixblow  [kernel.kallsyms]   [k] memset
     2       7.23%     radixblow  [kernel.kallsyms]   [k] do_mpage_readpage
     3       4.76%     radixblow  [kernel.kallsyms]   [k] copy_user_generic_string
     4       3.85%     radixblow  [kernel.kallsyms]   [k] __radix_tree_lookup
     5       3.32%       kswapd0  [kernel.kallsyms]   [k] shadow_lru_isolate
     6       2.92%     radixblow  [kernel.kallsyms]   [k] get_page_from_freelist
     7       2.81%       kswapd0  [kernel.kallsyms]   [k] __delete_from_page_cache
     8       2.50%     radixblow  [kernel.kallsyms]   [k] radix_tree_node_ctor
     9       1.79%     radixblow  [kernel.kallsyms]   [k] _raw_spin_lock_irq
    10       1.70%       kswapd0  [kernel.kallsyms]   [k] __mem_cgroup_uncharge_common

Same scenario with 4 pages per 64-page radix tree node:

    13       1.39%       kswapd0  [kernel.kallsyms]   [k] shadow_lru_isolate

16 pages per 64-page node:

    75       0.20%       kswapd0  [kernel.kallsyms]   [k] shadow_lru_isolate

So I doubt this will bother anyone, especially since most use-once
streamers should have a better population density and populate cache
at disk speed, not CPU speed.

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@xxxxxxxxx.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@xxxxxxxxx";> email@xxxxxxxxx </a>




[Index of Archives]     [Linux ARM Kernel]     [Linux ARM]     [Linux Omap]     [Fedora ARM]     [IETF Annouce]     [Bugtraq]     [Linux]     [Linux OMAP]     [Linux MIPS]     [ECOS]     [Asterisk Internet PBX]     [Linux API]