On Wed, Feb 08, 2017 at 12:51:24PM -0800, Andy Lutomirski wrote:
> On Tue, Feb 7, 2017 at 10:56 AM, Andy Lutomirski <luto@xxxxxxxxxx> wrote:
> > Quite a few people have expressed interest in enabling PCID on (x86)
> > Linux. Here's the code:
> >
> > https://git.kernel.org/cgit/linux/kernel/git/luto/linux.git/log/?h=x86/pcid
> >
> > The main hold-up is that the code needs to be reviewed very carefully.
> > It's quite subtle. In particular, "x86/mm: Try to preserve old TLB
> > entries using PCID" ought to be looked at carefully to make sure the
> > locking is right, but there are plenty of other ways this could all
> > break.
> >
> > Anyone want to take a look or maybe scare up some other reviewers?
> > (Kees, you seemed *really* excited about getting this in.)
>
> Nadav pointed out that this doesn't work right with
> ARCH_WANT_BATCHED_UNMAP_TLB_FLUSH. Mel, here's the issue:
>

I confess I didn't read the thread or patch; I'm only looking at this
question.

> I want to add ASID (Intel calls it PCID) support to x86. This means
> that "flush the TLB on a given CPU" will no longer be a particularly
> well defined operation because it's not clear which ASID tag to flush.
> Instead there's "flush the TLB for a given mm on a given CPU".
>

Understood.

> If I'm understanding the batched flush code, all it's trying to do is
> to flush more than one mm at a time.

More or less. It would be more accurate to say it's flushing any CPU TLB
that potentially accessed a list of pages recently. It's not flushing
"multiple mm's at a time" as such, it's flushing "only the CPUs that
recently accessed pages mapped by an mm". The distinction may not matter.

The requirements do matter though. mm/vmscan.c is where TTU_BATCH_FLUSH
is set, and that is processing a list of pages that can be mapped by
multiple mm's. The TLBs must be flushed before either IO starts
(try_to_unmap_flush_dirty) or the pages are freed to the page allocator
(try_to_unmap_flush). To do this, all it has to track is a simple mask
of CPUs, whether a flush is necessary and whether any of the PTEs were
dirty. This is trivial to assemble during an rmap walk as it's a PTE
check and a cpumask_or. try_to_unmap_flush then flushes the entire TLB
because the cost of targeting a specific page to flush was so high (both
maintaining the PFNs and the individual flush operations).

> Would it make sense to add a new
> arch API to flush more than one mm? Presumably it would take a linked
> list, and the batched flush code would fall back to flushing in pieces
> if it can't allocate a new linked list node when needed.
>

Conceptually it's ok, but the details are a headache. The deferred flush
code would need to maintain a list of mm's (or ASIDs) that is unbounded
in size to match the IPIs the current code sends, as opposed to a simple
cpumask. There are SWAP_CLUSTER_MAX pages to consider, with each page
potentially mapped by an arbitrary number of mm's. The same mm's could
exist on multiple lists, one for each active kswapd instance and direct
reclaimer. As multiple reclaimers are interested in the same mm, that
pretty much rules out linking the entries off mm_struct unless the
locking serialised the parallel reclaimers and prevented an mm existing
on more than one list at a time. You could try allowing multiple tasks
to share the one list (I'm not sure how to find that list quickly), but
each entry would have to be locked, and as each user can flush at any
time, multiple reclaimers would potentially have to block while an IPI
is being sent.
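
For comparison, the bookkeeping the existing batching relies on is
roughly the following. This is a simplified sketch approximating the
current mm/rmap.c code, not the exact source; the local-CPU flush path
and vmstat/trace accounting are left out.

/*
 * Simplified sketch of the existing batching; assumes
 * CONFIG_ARCH_WANT_BATCHED_UNMAP_TLB_FLUSH and the usual mm/tlbflush
 * headers. See mm/rmap.c for the real thing.
 */
struct tlbflush_unmap_batch {
	/* CPUs that may hold a stale TLB entry for one of the unmapped pages */
	struct cpumask cpumask;
	/* true if any bit in cpumask is set */
	bool flush_required;
	/* a dirty PTE was unmapped, so a flush is needed before IO starts */
	bool writable;
};

/* Called during the rmap walk for each mm that maps the page. */
static void set_tlb_ubc_flush_pending(struct mm_struct *mm, bool writable)
{
	struct tlbflush_unmap_batch *tlb_ubc = &current->tlb_ubc;

	/* Accumulate the CPUs that may have TLB entries for this mm. */
	cpumask_or(&tlb_ubc->cpumask, &tlb_ubc->cpumask, mm_cpumask(mm));
	tlb_ubc->flush_required = true;
	if (writable)
		tlb_ubc->writable = true;
}

/*
 * Flush everything pending. The whole TLB is flushed on every CPU in the
 * mask rather than tracking and flushing individual PFNs.
 */
void try_to_unmap_flush(void)
{
	struct tlbflush_unmap_batch *tlb_ubc = &current->tlb_ubc;

	if (!tlb_ubc->flush_required)
		return;

	flush_tlb_others(&tlb_ubc->cpumask, NULL, 0, TLB_FLUSH_ALL);
	cpumask_clear(&tlb_ubc->cpumask);
	tlb_ubc->flush_required = false;
	tlb_ubc->writable = false;
}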
It's hard to see how this could be scaled to match the existing code. It
would be easier to track via an array stored in task_struct, but the
number of mm's is unknown in advance so all you can do is guess a
reasonable size. It would have to flush if the array fills, resulting in
more IPIs than the current code, depending on how many mm's map the list
of pages.

-- 
Mel Gorman
SUSE Labs
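
P.S. Purely to illustrate that last point, a hypothetical fixed-size
array variant might look like the sketch below. tlbflush_unmap_batch_mm,
TLB_UBC_MAX_MMS, tlb_ubc_mm_add and try_to_unmap_flush_mms are invented
names, not existing kernel API, and the guessed bound is exactly where
the extra flushes come from.

/* Hypothetical illustration only; none of these names exist in the kernel. */
#define TLB_UBC_MAX_MMS	32	/* guessed bound; the number of mm's is unknown in advance */

struct tlbflush_unmap_batch_mm {
	/* mm's (i.e. ASIDs) with flushes pending; a real version would
	 * also need to hold a reference on each mm */
	struct mm_struct *mms[TLB_UBC_MAX_MMS];
	unsigned int nr_mms;
	bool writable;
};

/* Hypothetical: flush every pending mm and empty the array. */
static void try_to_unmap_flush_mms(struct tlbflush_unmap_batch_mm *batch)
{
	unsigned int i;

	for (i = 0; i < batch->nr_mms; i++)
		flush_tlb_mm(batch->mms[i]);	/* one flush (and IPI round) per mm */

	batch->nr_mms = 0;
	batch->writable = false;
}

/*
 * Record that @mm needs a flush. If the guessed array size fills, flush
 * early; that early flush is the source of the extra IPIs compared with
 * the existing cpumask scheme.
 */
static void tlb_ubc_mm_add(struct tlbflush_unmap_batch_mm *batch,
			   struct mm_struct *mm, bool writable)
{
	unsigned int i;

	if (writable)
		batch->writable = true;

	for (i = 0; i < batch->nr_mms; i++)
		if (batch->mms[i] == mm)
			return;		/* already pending */

	if (batch->nr_mms == TLB_UBC_MAX_MMS)
		try_to_unmap_flush_mms(batch);

	batch->mms[batch->nr_mms++] = mm;
}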