On Tue, Feb 04, 2014 at 03:07:56PM -0800, Andrew Morton wrote:
> On Mon, 3 Feb 2014 19:53:42 -0500 Johannes Weiner <hannes@xxxxxxxxxxx> wrote:
>
> > Previously, page cache radix tree nodes were freed after reclaim
> > emptied out their page pointers. But now reclaim stores shadow
> > entries in their place, which are only reclaimed when the inodes
> > themselves are reclaimed. This is problematic for bigger files that
> > are still in use after they have a significant amount of their cache
> > reclaimed, without any of those pages actually refaulting. The shadow
> > entries will just sit there and waste memory. In the worst case, the
> > shadow entries will accumulate until the machine runs out of memory.
> >
> > To get this under control, the VM will track radix tree nodes
> > exclusively containing shadow entries on a per-NUMA node list.
> > Per-NUMA rather than global because we expect the radix tree nodes
> > themselves to be allocated node-locally and we want to reduce
> > cross-node references of otherwise independent cache workloads. A
> > simple shrinker will then reclaim these nodes on memory pressure.
    ^^^^^^^^^^^^^^^
> >
> > A few things need to be stored in the radix tree node to implement
> > the shadow node LRU and allow tree deletions coming from the list:
> >
> > 1. There is no index available that would describe the reverse path
> >    from the node up to the tree root, which is needed to perform a
> >    deletion. To solve this, encode in each node its offset inside the
> >    parent. This can be stored in the unused upper bits of the same
> >    member that stores the node's height at no extra space cost.
> >
> > 2. The number of shadow entries needs to be counted in addition to
> >    the regular entries, to quickly detect when the node is ready to
> >    go to the shadow node LRU list. The current entry count is an
> >    unsigned int but the maximum number of entries is 64, so a shadow
> >    counter can easily be stored in the unused upper bits.
> >
> > 3. Tree modification needs tree lock and tree root, which are located
> >    in the address space, so store an address_space backpointer in the
> >    node. The parent pointer of the node is in a union with the 2-word
> >    rcu_head, so the backpointer comes at no extra cost as well.
> >
> > 4. The node needs to be linked to an LRU list, which requires a list
> >    head inside the node. This does increase the size of the node, but
> >    it does not change the number of objects that fit into a slab page.
>
> changelog forgot to mention that this reclaim is performed via a
> shrinker...

Uhm... see above? :)

> How expensive is that list walk in scan_shadow_nodes()? I assume in
> the best case it will bale out after nr_to_scan iterations?

Yes, it scans sc->nr_to_scan radix tree nodes, cleans their pointers,
and frees them.

I ran a worst-case scenario on an 8G machine that creates one 8T
sparse file and faults one page per 64-page radix tree node, i.e. one
node per sparse file fault at CPU speed.
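The test boils down to something like this (an illustrative sketch
only, not the actual radixblow source; file name, the fadvise call,
and error handling are arbitrary):

#define _FILE_OFFSET_BITS 64
#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>

int main(void)
{
        const long page = sysconf(_SC_PAGESIZE);
        const off_t size = (off_t)8 << 40;      /* 8T sparse file */
        const off_t stride = 64 * page;         /* one page per 64-slot node */
        off_t off;
        char buf;
        int fd;

        fd = open("sparse-file", O_CREAT | O_RDWR | O_TRUNC, 0600);
        if (fd < 0 || ftruncate(fd, size) < 0) {
                perror("setup");
                return 1;
        }
        /* Keep readahead from filling in the neighboring pages */
        posix_fadvise(fd, 0, 0, POSIX_FADV_RANDOM);

        for (off = 0; off < size; off += stride) {
                /* Populate exactly one page in each 64-page chunk */
                if (pread(fd, &buf, 1, off) < 0) {
                        perror("pread");
                        return 1;
                }
        }
        return 0;
}

The POSIX_FADV_RANDOM is what keeps readahead from filling in the
neighboring slots, so every read really does instantiate a new radix
tree node holding a single page.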
The profile:

     1    9.21%    radixblow  [kernel.kallsyms]  [k] memset
     2    7.23%    radixblow  [kernel.kallsyms]  [k] do_mpage_readpage
     3    4.76%    radixblow  [kernel.kallsyms]  [k] copy_user_generic_string
     4    3.85%    radixblow  [kernel.kallsyms]  [k] __radix_tree_lookup
     5    3.32%      kswapd0  [kernel.kallsyms]  [k] shadow_lru_isolate
     6    2.92%    radixblow  [kernel.kallsyms]  [k] get_page_from_freelist
     7    2.81%      kswapd0  [kernel.kallsyms]  [k] __delete_from_page_cache
     8    2.50%    radixblow  [kernel.kallsyms]  [k] radix_tree_node_ctor
     9    1.79%    radixblow  [kernel.kallsyms]  [k] _raw_spin_lock_irq
    10    1.70%      kswapd0  [kernel.kallsyms]  [k] __mem_cgroup_uncharge_common

Same scenario with 4 pages per 64-page radix tree node:

    13    1.39%      kswapd0  [kernel.kallsyms]  [k] shadow_lru_isolate

16 pages per 64-page node:

    75    0.20%      kswapd0  [kernel.kallsyms]  [k] shadow_lru_isolate

So I doubt this will bother anyone, especially since most use-once
streamers should have a better population density and populate cache
at disk speed, not CPU speed.
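For completeness, the per-node work behind those shadow_lru_isolate
hits is essentially the following. This is only a sketch of the idea,
not the code as it will land: it glosses over the LRU lock vs.
mapping->tree_lock ordering, the shadow counter kept in the upper bits
of node->count, and the address_space accounting, and helper names
like __radix_tree_delete_node may still change.

/*
 * Heavily simplified sketch of the per-node isolate step; the real
 * code also has to deal with lock ordering, the shadow counter in
 * node->count, and nodes that got repopulated with pages meanwhile.
 */
static enum lru_status shadow_lru_isolate(struct list_head *item,
                                          spinlock_t *lru_lock,
                                          void *arg)
{
        struct radix_tree_node *node;
        struct address_space *mapping;
        unsigned int i;

        /* The node is linked to the LRU through its list head (point 4) */
        node = container_of(item, struct radix_tree_node, private_list);
        /* ...and carries an address_space backpointer (point 3) */
        mapping = node->private_data;

        /* Take it off the shadow node LRU */
        list_del_init(item);

        spin_lock_irq(&mapping->tree_lock);
        /* Nothing but shadow entries left in here, clear them all out */
        for (i = 0; i < RADIX_TREE_MAP_SIZE; i++)
                node->slots[i] = NULL;
        /* Delete the empty node, walking up via the encoded offsets (point 1) */
        __radix_tree_delete_node(&mapping->page_tree, node);
        spin_unlock_irq(&mapping->tree_lock);

        return LRU_REMOVED;
}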