On Fri, Feb 10, 2017 at 1:57 PM, Mel Gorman <mgorman@xxxxxxxxxxxxxxxxxxx> wrote:
> On Fri, Feb 10, 2017 at 08:44:04AM -0800, Andy Lutomirski wrote:
>> On Fri, Feb 10, 2017 at 3:01 AM, Mel Gorman <mgorman@xxxxxxxxxxxxxxxxxxx> wrote:
>> > On Thu, Feb 09, 2017 at 06:46:57PM -0800, Andy Lutomirski wrote:
>> >> > try_to_unmap_flush then flushes the entire TLB as the cost of targeting
>> >> > a specific page to flush was so high (both maintaining the PFNs and the
>> >> > individual flush operations).
>> >>
>> >> I could just maybe make it possible to remotely poke a CPU to record
>> >> which mms need flushing, but the possible races there are a bit
>> >> terrifying.
>> >>
>> >
>> > The overhead is concerning. You may incur a remote cache miss accessing
>> > the data, which is costly, or you have to send an IPI, which is also
>> > severe. You could attempt to do the same as the scheduler and modify
>> > directly if the CPUs share cache and IPI otherwise, but you're looking
>> > at a lot of overhead either way.
>>
>> I think all of these approaches suck and I'll give up on this particular
>> avenue.
>>
>
> Ok, probably for the best, albeit that is based on an inability to figure
> out how it could be done efficiently and a suspicion that if it could be
> done, the scheduler would be doing it already.
>

FWIW, I am doing a bit of this. For remote CPUs that aren't currently
running a given mm, I just bump a per-mm generation count so that they
know to flush next time around in switch_mm(). I'll need to add a new
hook to the batched flush code to get this right, and I'll cc you on
that. Stay tuned.

> It's possible that covering all of this is overkill, but these are the
> avenues of concern I'd expect if I was working on ASID support.

Agreed.

>
> [1] I could be completely wrong; I'm basing this on how people have
>     behaved in the past during TLB-flush related discussions. They
>     might have changed their minds. We'll see.

The main benchmark that I'm relying on (so far) is that context
switches get way faster, just ping-ponging back and forth. I suspect
that the TLB refill cost is only a small part.

>
> [2] This could be covered already in the specifications and other
>     discussions. Again, I didn't actually look into what's truly new with
>     the Intel ASID.

I suspect I could find out how many ASIDs there really are under NDA,
but even that would be challenging and only dubiously useful. For now,
I'm using a grand total of four ASIDs. :)

>
>> > I recognise that you'll be trying to balance this against processes
>> > that are carefully isolated and do not want interference from unrelated
>> > processes doing a TLB flush, but it'll be hard to prove that it's worth it.
>> >
>> > It's almost certain that this will be Linus' primary concern
>> > given his contributions to similar conversations in the past
>> > (e.g. https://lkml.org/lkml/2015/6/25/666). It's also likely to be of
>> > major concern to Ingo (e.g. https://lkml.org/lkml/2015/6/9/276) as he had
>> > valid objections against clever flushing at the time the batching was
>> > introduced. Based on previous experience, I have my own concerns, but I
>> > don't count as I'm highlighting them now :P
>>
>> I fully agree with those objections, but back then we didn't have the
>> capability to avoid a flush when switching mms.
>>
>
> True, so watch for questions on what the odds are that switching an mm
> will flush the TLB contents anyway due to replacement policies.
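To make the generation-count idea above concrete, here is a minimal
standalone C model of it. The names (fake_mm, fake_cpu, tlb_gen,
local_tlb_gen, defer_flush) are made up for illustration; they are not
the actual kernel structures or fields:

    /*
     * Standalone model of the per-mm generation count. All names here
     * are illustrative, not the real kernel code.
     */
    #include <stdatomic.h>
    #include <stdio.h>

    struct fake_mm {
            atomic_ulong tlb_gen;        /* bumped when this mm needs a flush */
    };

    struct fake_cpu {
            unsigned long local_tlb_gen; /* generation this CPU last flushed at */
    };

    /*
     * Unmap path: instead of IPIing a remote CPU that is not currently
     * running this mm, bump the mm's generation so the flush happens
     * lazily at that CPU's next switch_mm().
     */
    static void defer_flush(struct fake_mm *mm)
    {
            atomic_fetch_add(&mm->tlb_gen, 1);
    }

    /* switch_mm() path: flush only if the mm's generation has moved on. */
    static void switch_mm_model(struct fake_cpu *cpu, struct fake_mm *next)
    {
            unsigned long gen = atomic_load(&next->tlb_gen);

            if (cpu->local_tlb_gen != gen) {
                    printf("flush: local gen %lu, mm gen %lu\n",
                           cpu->local_tlb_gen, gen);
                    cpu->local_tlb_gen = gen;  /* stands in for the flush */
            } else {
                    printf("no flush needed\n");
            }
    }

    int main(void)
    {
            struct fake_mm mm = { .tlb_gen = 0 };
            struct fake_cpu cpu = { .local_tlb_gen = 0 };

            switch_mm_model(&cpu, &mm);  /* no flush needed */
            defer_flush(&mm);            /* remote unmap defers a flush */
            switch_mm_model(&cpu, &mm);  /* flush */
            return 0;
    }

The point of the scheme is that the unmap path never interrupts a CPU
that isn't running the mm: it pays only an atomic increment, and the
stale CPU pays for the flush lazily at its next switch_mm().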
>
>> >
>> > The outcome of the TLB batch flushing discussion was that it was way
>> > cheaper to flush the full TLB and take the refill cost than flushing
>> > individual pages, which had the cost of tracking the PFNs and the cost
>> > of each individual page flush operation.
>> >
>> > The current code is basically "build a cpumask and flush the TLB for
>> > multiple entries". We're talking about complex tracking of mm's with
>> > difficult locking, potential remote cache misses, potentially more IPIs,
>> > or alternatively doing allocations from reclaim context. It'll be
>> > difficult to prove that doing this in the name of per-ASID flushing is
>> > cheaper and universally a good idea compared to just flushing the
>> > entire TLB.
>> >
>>
>> Maybe there's a middle ground. I could keep track of whether more
>> than one mm is targeted in a deferred flush and just flush everything
>> if so.
>
> That would work and side-steps most of the state tracking concerns. It
> might even be a good fit for use cases like "limited number of VMs on a
> machine" or "one major application that must be isolated and some admin
> processes with little CPU time or kthreads", because you don't want to
> get burned with the "only a microbenchmark sees any benefit" hammer[3].
>
>> As a future improvement, I or someone else could add:
>>
>>     struct mm_struct *mms[16];
>>     int num_mms;
>>
>> to struct tlbflush_unmap_batch. If num_mms > 16, then this just means
>> that we've given up on tracking them all and we do the global flush,
>> and, if not, we could teach the IPI handler to understand a list of
>> target mms.
>
> I *much* prefer a fallback to a full flush over kmallocing additional
> space. It's also something that could feasibly be switchable at runtime:
> a union of a cpumask and an array of mms, selected by CPU capability,
> with static branches determining which is used to minimise overhead.
> That would have only minor overhead, and with a debugging patch it could
> allow switching between them at boot time for like-for-like comparisons
> on a range of workloads.

Sounds good. This means I need to make my code understand the concept
of a full flush, but that's manageable.

>
> [3] Can you tell I've been burned a few times by the "only
>     microbenchmarks care" feedback?

:)
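For illustration, here is a standalone C model of the bounded tracking
with a full-flush fallback that the thread converges on above. The
struct names, the limit of 16, and the printf stand-ins are all
illustrative; the real struct tlbflush_unmap_batch differs:

    /*
     * Standalone model of "track up to 16 mms in the deferred batch,
     * fall back to a full flush on overflow". Illustrative only.
     */
    #include <stdbool.h>
    #include <stdio.h>

    #define BATCH_MM_MAX 16

    struct fake_mm {
            int id;
    };

    struct fake_batch {
            struct fake_mm *mms[BATCH_MM_MAX];
            int num_mms;
            bool full_flush;  /* set once tracking is abandoned; no kmalloc */
    };

    static void batch_add_mm(struct fake_batch *b, struct fake_mm *mm)
    {
            int i;

            if (b->full_flush)
                    return;               /* already committed to a full flush */

            for (i = 0; i < b->num_mms; i++)
                    if (b->mms[i] == mm)
                            return;       /* already tracked */

            if (b->num_mms == BATCH_MM_MAX) {
                    b->full_flush = true; /* overflow: flush everything instead */
                    return;
            }
            b->mms[b->num_mms++] = mm;
    }

    static void batch_flush(struct fake_batch *b)
    {
            int i;

            if (b->full_flush) {
                    printf("full TLB flush\n");
            } else {
                    for (i = 0; i < b->num_mms; i++)
                            printf("flush mm %d\n", b->mms[i]->id);
            }
            b->num_mms = 0;
            b->full_flush = false;
    }

    int main(void)
    {
            struct fake_batch batch = { .num_mms = 0, .full_flush = false };
            struct fake_mm a = { .id = 1 }, b = { .id = 2 };

            batch_add_mm(&batch, &a);
            batch_add_mm(&batch, &b);
            batch_flush(&batch);          /* flushes mm 1 and mm 2 */
            return 0;
    }

The simpler "middle ground" floated first in the thread is just the
BATCH_MM_MAX == 1 case of the same structure: track a single mm, and
flush everything as soon as a second one shows up.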