Re: [PATCHv12 34/37] thp: introduce deferred_split_huge_page()

Andrea Arcangeli <aarcange@xxxxxxxxxx> · Thu, 21 Jan 2016 02:22:37 +0100

Hello Kirill,

On Tue, Oct 06, 2015 at 06:24:01PM +0300, Kirill A. Shutemov wrote:
> +static unsigned long deferred_split_scan(struct shrinker *shrink,
> +		struct shrink_control *sc)
> +{
> +	unsigned long flags;
> +	LIST_HEAD(list), *pos, *next;
> +	struct page *page;
> +	int split = 0;
> +
> +	spin_lock_irqsave(&split_queue_lock, flags);
> +	list_splice_init(&split_queue, &list);
> +
> +	/* Take pin on all head pages to avoid freeing them under us */
> +	list_for_each_safe(pos, next, &list) {
> +		page = list_entry((void *)pos, struct page, mapping);
> +		page = compound_head(page);
> +		/* race with put_compound_page() */
> +		if (!get_page_unless_zero(page)) {
> +			list_del_init(page_deferred_list(page));
> +			split_queue_len--;
> +		}
> +	}
> +	spin_unlock_irqrestore(&split_queue_lock, flags);

While rebasing I noticed this loop looks a bit too heavy. There's no
lockbreak and no cap on the list size, and million of THP pages could
have been partially unmapped but not be entirely freed yet, and sit
there for a while (there are other scenarios but this is the one that
could more realistically happen with certain allocators). Then as
result of random memory pressure we'd be calling millions of
get_page_unless_zero across multiple NUMA nodes thrashing cachelines
at every list entry, with irq disabled too for the whole period.

I haven't verified it, but I guess that in some large NUMA (i.e. 4TiB)
system that could take down a CPU for a second or more with irq
disabled.

I think it needs to isolate a certain number of pages, not splice
(userland programs can invoke the shrinker through direct reclaim too
and they can't stuck there for too long) and perhaps use
sc->nr_to_scan to achieve that.

The split_queue can also be moved from global to the "struct
pglist_data" and then you can do NODE_DATA(sc->nid)->split_queue, same
for the spinlock. That will make it more scalable for the lock and
more efficient in freeing memory so we don't split THP from nodes
reclaim isn't currently interested about (reclaim will later try again
on the zones in the other nodes by itself if needed).

Thanks,
Andrea

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@xxxxxxxxx.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@xxxxxxxxx";> email@xxxxxxxxx </a>