On Wed, Feb 8, 2017 at 4:10 PM, Mel Gorman <mgorman@xxxxxxxxxxxxxxxxxxx> wrote:
> On Wed, Feb 08, 2017 at 12:51:24PM -0800, Andy Lutomirski wrote:
>> On Tue, Feb 7, 2017 at 10:56 AM, Andy Lutomirski <luto@xxxxxxxxxx> wrote:
>> > Quite a few people have expressed interest in enabling PCID on (x86)
>> > Linux.  Here's the code:
>> >
>> > https://git.kernel.org/cgit/linux/kernel/git/luto/linux.git/log/?h=x86/pcid
>> >
>> > The main hold-up is that the code needs to be reviewed very carefully.
>> > It's quite subtle.  In particular, "x86/mm: Try to preserve old TLB
>> > entries using PCID" ought to be looked at carefully to make sure the
>> > locking is right, but there are plenty of other ways this could all
>> > break.
>> >
>> > Anyone want to take a look or maybe scare up some other reviewers?
>> > (Kees, you seemed *really* excited about getting this in.)
>>
>> Nadav pointed out that this doesn't work right with
>> ARCH_WANT_BATCHED_UNMAP_TLB_FLUSH.  Mel, here's the issue:
>>
>
> I confess I didn't read the thread or patch, I'm only looking at this
> question.

That's fine with me :)

>
>> I want to add ASID (Intel calls it PCID) support to x86.  This means
>> that "flush the TLB on a given CPU" will no longer be a particularly
>> well-defined operation, because it's not clear which ASID tag to flush.
>> Instead there's "flush the TLB for a given mm on a given CPU".
>>
>
> Understood.
>
>> If I'm understanding the batched flush code, all it's trying to do is
>> to flush more than one mm at a time.
>
> More or less.  It would be more accurate to say it's flushing any CPU TLB
> that potentially accessed a list of pages recently.  It's not flushing
> "multiple MMs at a time" as such, it's flushing "only CPUs that accessed
> pages mapped by a mm recently".  The distinction may not matter.
>
> The requirements do matter though.
>
> mm/vmscan.c is where TTU_BATCH_FLUSH is set and that is processing a list
> of pages that can be mapped by multiple MMs.
>
> The TLBs must be flushed before either IO starts (try_to_unmap_flush_dirty)
> or they are freed to the page allocator (try_to_unmap_flush).
>
> To do this, all it has to track is a simple mask of CPUs, whether a flush
> is necessary and whether any of the PTEs were dirty.  This is trivial to
> assemble during an rmap walk as it's a PTE check and a cpumask_or.
>
> try_to_unmap_flush then flushes the entire TLB, as the cost of targeting
> a specific page to flush was so high (both maintaining the PFNs and the
> individual flush operations).

I could maybe make it possible to remotely poke a CPU to record which
mms need flushing, but the possible races there are a bit terrifying.

>
>> Would it make sense to add a new
>> arch API to flush more than one mm?  Presumably it would take a linked
>> list, and the batched flush code would fall back to flushing in pieces
>> if it can't allocate a new linked list node when needed.
>>
>
> Conceptually it's ok but the details are a headache.
>
> The defer code would need to maintain a list of mm's (or ASIDs) that is
> unbounded in size, to match the number of IPIs sent by the current code,
> as opposed to a simple cpumask.  There are SWAP_CLUSTER_MAX pages to
> consider, with each page potentially mapped by an arbitrary number of MMs.
> The same mm's could exist on multiple lists for each active kswapd
> instance and direct reclaimer.
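
Just to keep the asymmetry in view: if I'm reading mm/rmap.c right, the
per-task state behind that cpumask_or is roughly this (paraphrasing from
memory, so the exact names may be off):

	struct tlbflush_unmap_batch {
		/* CPUs that may hold a stale entry for the unmapped pages */
		struct cpumask cpumask;
		/* a PTE was cleared, so a flush is needed before free/IO */
		bool flush_required;
		/* at least one PTE was dirty/writable: flush before IO starts */
		bool writable;
	};

i.e. fixed-size, per-task, and assembled with nothing more than a PTE
check and a cpumask_or() -- which is exactly the property that any
per-mm list or array gives up.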
>
> As multiple reclaimers are interested in the same mm, that pretty much
> rules out linking them off mm_struct unless the locking would serialise
> the parallel reclaimers and prevent an mm existing on more than one list
> at a time.  You could try allowing multiple tasks to share the one list
> (not sure how to find that list quickly), but each entry would have to
> be locked, and as each user can flush at any time, multiple reclaimers
> would potentially have to block while an IPI is being sent.  It's hard
> to see how this could be scaled to match the existing code.
>
> It would be easier to track via an array stored in task_struct, but the
> number of MMs is unknown in advance, so all you can do is guess a
> reasonable size.  It would have to flush if the array fills, resulting
> in more IPIs than the current code, depending on how many MMs map the
> list of pages.

What if I just allocate a smallish array on the stack and then extend
it with kmalloc(GFP_ATOMIC) as needed?  An allocation failure would just
force an immediate flush, so there shouldn't be any deadlock risk.

Anyway, I need to rework the arch code to make this work at all.
Currently I'm taking a spinlock per mm when flushing that mm, but that
would mean I'd need to *sort* the list to flush more than one at a time,
and that just sounds nasty.  I can probably get rid of the spinlock.

--Andy
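
P.S.  To be concrete about the "smallish array plus kmalloc" idea, I
mean something along these lines -- invented names, not even
compile-tested, and it glosses over holding a reference on each mm:

#define FLUSH_BATCH_INLINE	8	/* smallish on-stack part */

struct mm_flush_batch {
	struct mm_struct *inline_mms[FLUSH_BATCH_INLINE];
	struct mm_struct **mms;		/* inline_mms or kmalloc'd copy */
	unsigned int nr;
	unsigned int max;
};

static void mm_flush_batch_init(struct mm_flush_batch *batch)
{
	batch->mms = batch->inline_mms;
	batch->nr = 0;
	batch->max = FLUSH_BATCH_INLINE;
}

/* Returns false if the caller should flush immediately instead. */
static bool mm_flush_batch_add(struct mm_flush_batch *batch,
			       struct mm_struct *mm)
{
	if (batch->nr == batch->max) {
		unsigned int new_max = batch->max * 2;
		struct mm_struct **bigger;

		bigger = kmalloc_array(new_max, sizeof(*bigger), GFP_ATOMIC);
		if (!bigger)
			return false;
		memcpy(bigger, batch->mms, batch->nr * sizeof(*bigger));
		if (batch->mms != batch->inline_mms)
			kfree(batch->mms);
		batch->mms = bigger;
		batch->max = new_max;
	}
	batch->mms[batch->nr++] = mm;
	return true;
}

A false return just means "flush what you've accumulated so far and
start the batch over", so a GFP_ATOMIC failure costs extra IPIs rather
than correctness.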