On Mon, Jun 24, 2019 at 10:34:02AM +0000, qi.fuli@xxxxxxxxxxx wrote:
> On 6/18/19 2:03 AM, Will Deacon wrote:
> > On Mon, Jun 17, 2019 at 11:32:53PM +0900, Takao Indoh wrote:
> >> From: Takao Indoh <indou.takao@xxxxxxxxxxx>
> >>
> >> I found a performance issue related to the implementation of Linux's TLB
> >> flush for arm64.
> >>
> >> When I run a single-threaded test program on moderate environment, it
> >> usually takes 39ms to finish its work. However, when I put a small
> >> application, which just calls mprotect() continuously, on one of sibling
> >> cores and run it simultaneously, the test program slows down significantly.
> >> It becomes 49ms(125%) on ThunderX2. I also detected the same problem on
> >> ThunderX1 and Fujitsu A64FX.
> > This is a problem for any applications that share hardware resources with
> > each other, so I don't think it's something we should be too concerned about
> > addressing unless there is a practical DoS scenario, which there doesn't
> > appear to be in this case. It may be that the real answer is "don't call
> > mprotect() in a loop".
> I think there has been a misunderstanding, please let me explain.
> This application is just an example used for reproducing the
> performance issue we found.
> Our original purpose is reducing OS jitter by this series.
> The OS jitter on massively parallel processing systems has been known
> and studied for many years.
> The 2.5% OS jitter can result in over a factor of 20 slowdown for the
> same application [1].

I think it's worth pointing out that the system in question was neither
ARM-based nor running Linux, so I'd be cautious in applying the conclusions
of that paper directly to our TLB invalidation code. Furthermore, the noise
being generated in their experiments uses a timer interrupt, which has a
/vastly/ different profile to a DVM message in terms of both system impact
and frequency.
> Though it may be an extreme example, reducing the OS jitter has been an
> issue in HPC environment.
>
> [1] Ferreira, Kurt B., Patrick Bridges, and Ron Brightwell.
> "Characterizing application sensitivity to OS interference using
> kernel-level noise injection." Proceedings of the 2008 ACM/IEEE
> conference on Supercomputing. IEEE Press, 2008.
>
> >> I suppose the root cause of this issue is the implementation of Linux's TLB
> >> flush for arm64, especially use of TLBI-is instruction which is a broadcast
> >> to all processor core on the system. In case of the above situation,
> >> TLBI-is is called by mprotect().
> > On the flip side, Linux is providing the hardware with enough information
> > not to broadcast to cores for which the remote TLBs don't have entries
> > allocated for the ASID being invalidated. I would say that the root cause
> > of the issue is that this filtering is not taking place.
>
> Do you mean that the filter should be implemented in hardware?

Yes. If you're building a large system and you care about "jitter", then you
either need to partition it in such a way that sources of noise are
contained, or you need to introduce filters to limit their scope. Rewriting
the low-level memory-management parts of the operating system is a red
herring and imposes a needless burden on everybody else without solving the
real problem, which is that contended use of shared resources doesn't scale.

Will
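
For anyone wanting to reproduce the scenario discussed above, the noise
generator described in the quoted report (a small application that just
calls mprotect() continuously) might look roughly like the sketch below.
This is not the reporter's actual test program. The function name
mprotect_noise() and the choice of a single anonymous page are assumptions
made for illustration, and whether each call really results in a broadcast
TLB invalidation depends on the kernel version and the hardware.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <stddef.h>
#include <sys/mman.h>
#include <unistd.h>

/*
 * Hypothetical helper (not from the original thread): map one anonymous
 * page and flip its protection `iterations` times. Returns the number of
 * mprotect() calls that succeeded, or -1 if the setup failed.
 */
long mprotect_noise(long iterations)
{
    assert(iterations >= 0);

    long page = sysconf(_SC_PAGESIZE);
    if (page <= 0)
        return -1;

    char *buf = mmap(NULL, (size_t)page, PROT_READ | PROT_WRITE,
                     MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (buf == MAP_FAILED)
        return -1;

    /* Touch the page once so that a real translation exists. */
    buf[0] = 1;

    long done = 0;
    for (long i = 0; i < iterations; i++) {
        /* Alternate between read-only and read-write protection. */
        int prot = (i & 1) ? (PROT_READ | PROT_WRITE) : PROT_READ;

        if (mprotect(buf, (size_t)page, prot) != 0)
            break;
        done++;
    }

    munmap(buf, (size_t)page);
    return done;
}
```

To use it as a noise source, call mprotect_noise() with a large iteration
count from a process pinned to one core (for example with taskset) while the
timed single-threaded workload runs on a sibling core.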