Hi Will,

Thanks for your comments.

On 6/27/19 7:27 PM, Will Deacon wrote:
> On Mon, Jun 24, 2019 at 10:34:02AM +0000, qi.fuli@xxxxxxxxxxx wrote:
>> On 6/18/19 2:03 AM, Will Deacon wrote:
>>> On Mon, Jun 17, 2019 at 11:32:53PM +0900, Takao Indoh wrote:
>>>> From: Takao Indoh <indou.takao@xxxxxxxxxxx>
>>>>
>>>> I found a performance issue related on the implementation of Linux's TLB
>>>> flush for arm64.
>>>>
>>>> When I run a single-threaded test program on moderate environment, it
>>>> usually takes 39ms to finish its work. However, when I put a small
>>>> application, which just calls mprotect() continuously, on one of sibling
>>>> cores and run it simultaneously, the test program slows down significantly.
>>>> It becomes 49ms(125%) on ThunderX2. I also detected the same problem on
>>>> ThunderX1 and Fujitsu A64FX.
>>> This is a problem for any applications that share hardware resources with
>>> each other, so I don't think it's something we should be too concerned about
>>> addressing unless there is a practical DoS scenario, which there doesn't
>>> appear to be in this case. It may be that the real answer is "don't call
>>> mprotect() in a loop".
>> I think there has been a misunderstanding, please let me explain.
>> This application is just an example using for reproducing the
>> performance issue we found.
>> Our original purpose is reducing OS jitter by this series.
>> The OS jitter on massively parallel processing systems have been known
>> and studied for many years.
>> The 2.5% OS jitter can result in over a factor of 20 slowdown for the
>> same application [1].
> I think it's worth pointing out that the system in question was neither
> ARM-based nor running Linux, so I'd be cautious in applying the conclusions
> of that paper directly to our TLB invalidation code. Furthermore, the noise
> being generated in their experiments uses a timer interrupt, which has a
> /vastly/ different profile to a DVM message in terms of both system impact
> and frequency.

My original purpose in referencing this paper was to explain that OS jitter
is a vital issue for large-scale HPC environments. Please allow me to
describe the issue that occurred in our HPC environment.

We ran FWQ [1] on one node of our HPC environment. We expected a maximum OS
jitter of tens of microseconds, but measured hundreds of microseconds, which
did not meet our requirement. We tried to find the cause using ftrace, but
could not find any process that would cause the noise; we could only see
that the processing time was extended. We then checked the CPU instruction
count through the CPU PMU and found no change there either. However, we
found that the noise increased as the number of TLB flush calls increased.
From this we understood that the cause of this issue is the implementation
of Linux's TLB flush for arm64, especially the use of the TLBI-is
instruction, which is broadcast to all processor cores in the system.
Therefore, we made this patch set to fix the issue. After several rounds of
testing, the noise was reduced and our original goal was achieved, so we do
think this patch makes sense.

As I mentioned, OS jitter is a vital issue for large-scale HPC environments,
and we have tried many things to reduce it. One of them is task separation
between the CPUs used for computing and the CPUs used for maintenance. All
of the daemon processes and I/O interrupts are bound to the maintenance
CPUs. Furthermore, we used nohz_full to avoid the noise caused by
interruptions of the computing CPUs. However, all of the CPUs were still
affected by the TLBI-is instruction, so the task separation did not work.
Therefore, we would like to use this patch set to perform the TLB flush on
the minimal set of CPUs and so reduce the OS jitter.

[1] https://asc.llnl.gov/sequoia/benchmarks/FTQ_summary_v1.1.pdf

Thanks,
QI Fuli

>> Though it may be an extreme example, reducing the OS jitter has been an
>> issue in HPC environment.
>>
>> [1] Ferreira, Kurt B., Patrick Bridges, and Ron Brightwell.
>> "Characterizing application sensitivity to OS interference using
>> kernel-level noise injection." Proceedings of the 2008 ACM/IEEE
>> conference on Supercomputing. IEEE Press, 2008.
>>
>>>> I suppose the root cause of this issue is the implementation of Linux's TLB
>>>> flush for arm64, especially use of TLBI-is instruction which is a broadcast
>>>> to all processor core on the system. In case of the above situation,
>>>> TLBI-is is called by mprotect().
>>> On the flip side, Linux is providing the hardware with enough information
>>> not to broadcast to cores for which the remote TLBs don't have entries
>>> allocated for the ASID being invalidated. I would say that the root cause
>>> of the issue is that this filtering is not taking place.
>> Do you mean that the filter should be implemented in hardware?
> Yes. If you're building a large system and you care about "jitter", then
> you either need to partition it in such a way that sources of noise are
> contained, or you need to introduce filters to limit their scope. Rewriting
> the low-level memory-management parts of the operating system is a red
> herring and imposes a needless burden on everybody else without solving
> the real problem, which is that contended use of shared resources doesn't
> scale.
>
> Will
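
As a side note for readers of this thread: the kind of measurement discussed
above (run a fixed amount of work repeatedly and look at how much the elapsed
time varies) can be approximated with a small user-space program. The
following is only a minimal sketch of the fixed-work-quantum idea, not the
actual FWQ benchmark. The names fwq_run and struct fwq_result, the work loop,
and the choice of CLOCK_MONOTONIC are all assumptions made for illustration.

```c
#ifndef _POSIX_C_SOURCE
#define _POSIX_C_SOURCE 199309L   /* for clock_gettime() */
#endif
#include <stdint.h>
#include <time.h>

/* Hypothetical result type: shortest and longest time seen for one quantum. */
struct fwq_result {
    uint64_t min_ns;
    uint64_t max_ns;
};

/* Read the monotonic clock in nanoseconds. */
static uint64_t now_ns(void)
{
    struct timespec ts;
    clock_gettime(CLOCK_MONOTONIC, &ts);
    return (uint64_t)ts.tv_sec * 1000000000ull + (uint64_t)ts.tv_nsec;
}

/*
 * Run `samples` quanta, each doing the same `work` loop iterations, and
 * record the minimum and maximum elapsed time. On a quiet CPU the two stay
 * close together; (max_ns - min_ns) is a rough upper bound on the noise
 * that hit the measuring CPU during the run.
 */
struct fwq_result fwq_run(int samples, long work)
{
    struct fwq_result r = { UINT64_MAX, 0 };
    volatile long sink = 0;   /* volatile so the loop is not optimized away */

    for (int s = 0; s < samples; s++) {
        uint64_t t0 = now_ns();
        for (long i = 0; i < work; i++)
            sink += i;
        uint64_t dt = now_ns() - t0;

        if (dt < r.min_ns)
            r.min_ns = dt;
        if (dt > r.max_ns)
            r.max_ns = dt;
    }
    return r;
}
```

A caller would pin the process to one computing CPU (for example with
taskset), call fwq_run() with a large sample count, and compare
max_ns - min_ns with and without a noise source running on another core.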