On Wed 26-10-16 14:15:09, Johannes Weiner wrote: > On Wed, Oct 26, 2016 at 11:21:07AM +0200, Jan Kara wrote: > > On Mon 24-10-16 14:47:39, Johannes Weiner wrote: > > > From: Johannes Weiner <hannes@xxxxxxxxxxx> > > > Date: Mon, 17 Oct 2016 09:00:04 -0400 > > > Subject: [PATCH] mm: workingset: restore single-page file refault tracking > > > > > > Currently, we account shadow entries in the page cache in the upper > > > bits of the radix_tree_node->count, behind the back of the radix tree > > > implementation. Because the radix tree code has no awareness of them, > > > we have to prevent shadow entries from going through operations where > > > the tree implementation relies on or modifies node->count: extending > > > and shrinking the tree from and to a single direct root->rnode entry. > > > > > > As a consequence, we cannot store shadow entries for files that only > > > have index 0 populated, and thus cannot detect refaults from them, > > > which in turn degrades the thrashing compensation in LRU reclaim. > > > > > > Another consequence is that we rely on subtleties throughout the radix > > > tree code, such as the node->count != 1 check in the shrinking code, > > > which is meant to exclude multi-entry nodes but also skips nodes with > > > only one shadow entry since they are accounted in the upper bits. This > > > is error prone, and has in fact caused the bug fixed in d3798ae8c6f3 > > > ("mm: filemap: don't plant shadow entries without radix tree node"). > > > > > > To fix this, this patch moves the shadow counter from the upper bits > > > of node->count into the new node->exceptional counter, where all > > > exceptional entries are explicitely tracked by the radix tree. > > > node->count then counts all tree entries again, including shadows. > > > > > > Switching from a magic node->count to accounting exceptional entries > > > natively in the radix tree code removes the fragile subtleties > > > mentioned above. It also allows us to store shadow entries for > > > single-page files again, as the radix tree recognizes exceptional > > > entries when extending the tree from the root->rnode singleton, and > > > thus restore refault detection and thrashing compensation for them. > > > > I like this solution. > > Thanks Jan. > > > Just one suggestion: I think radix_tree_replace_slot() can now do > > the node counter update on its own and that would save us having to > > do quite a bit of accounting outside of the radix tree code itself > > and it would be less prone to bugs (forgotten updates of a > > counter). What do you think? > > This would be nice indeed, but it's bigger surgery. We need the node > in the context of existing users that do slot lookup and replacement, > which is easier for individual lookups, and harder for gang lookups > (e.g. drivers/sh/intc/virq.c::intc_subgroup_map). And they'd all get > more complicated, AFAICS, without even using exceptional entries. Hum, I agree. But actually looking at e.g. the usage of radix_tree_replace_slot() in mm/khugepaged.c I think this is even buggy when replacing a slot referencing a page with NULL without updating node->count. It is in the error bail-out path so I'm not surprised we did not stumble over it. So a relatively easy solution looks like: Create new function like radix_tree_replace_node_slot() taking both node & slot and updating the accounting information as needed. Make this function WARN if node is NULL and accounting information should be updated. Make original radix_tree_replace_slot() a wrapper around radix_tree_replace_node_slot() passing NULL as node. This should provide safe accounting info updates without forcing all users to work with node pointers... Honza -- Jan Kara <jack@xxxxxxxx> SUSE Labs, CR -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@xxxxxxxxx. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: <a href=mailto:"dont@xxxxxxxxx"> email@xxxxxxxxx </a>