On 9/11/19 6:03 PM, Mike Kravetz wrote:
> On 9/11/19 8:44 AM, Waiman Long wrote:
>> On 9/11/19 4:14 PM, Matthew Wilcox wrote:
>>> On Wed, Sep 11, 2019 at 04:05:37PM +0100, Waiman Long wrote:
>>>> When allocating a large amount of static hugepages (~500-1500GB) on a
>>>> system with a large number of CPUs (4, 8 or even 16 sockets), performance
>>>> degradation (random multi-second delays) was observed when thousands
>>>> of processes are trying to fault in the data into the huge pages. The
>>>> likelihood of the delay increases with the number of sockets and hence
>>>> the CPUs a system has. This only happens in the initial setup phase
>>>> and will be gone after all the necessary data are faulted in.
>>> Can't the application just specify MAP_POPULATE?
>> Originally, I thought that this happened in the startup phase when the
>> pages were faulted in. The problem persists after steady state had been
>> reached though. Every time you have a new user process created, it will
>> have its own page table.
> This is still at fault time. Although, for the particular application it
> may be after the 'startup phase'.
>
>> It is the sharing of the huge page shared
>> memory that is causing the problem. Of course, it depends on how the
>> application is written.
> It may be the case that some applications would find the delays acceptable
> for the benefit of shared pmds once they reach steady state. As you say, of
> course this depends on how the application is written.
>
> I know that Oracle DB would not like it if PMD sharing is disabled for them.
> Based on what I know of their model, all processes which share PMDs perform
> faults (write or read) during the startup phase. This is in environments as
> big or bigger than you describe above. I have never looked at/for delays in
> these environments around pmd sharing (page faults), but that does not mean
> they do not exist. I will try to get the DB group to give me access to one
> of their large environments for analysis.
>
> We may want to consider making the timeout value and disable threshold user
> configurable.

Making it configurable is certainly doable. They can be sysctl parameters
so that the users can re-enable PMD sharing by making those parameters
larger.

Cheers,
Longman
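
For reference, the MAP_POPULATE suggestion quoted above can be sketched from user space roughly as follows. This is a minimal Python sketch under stated assumptions, not code from the thread: `map_prefaulted` and `REGION_SIZE` are hypothetical names, the mapping uses ordinary anonymous shared memory rather than a hugetlbfs-backed region, and it falls back to a plain mapping where `mmap.MAP_POPULATE` is unavailable (the flag is Linux-only and exposed by Python 3.10 and later).

```python
import mmap

# Assumption for illustration: 2 MiB, a common huge page size on x86-64.
# The region itself is backed by regular pages, not hugetlbfs.
REGION_SIZE = 2 * 1024 * 1024


def map_prefaulted(length):
    """Map `length` bytes of anonymous shared memory, pre-faulted if possible."""
    flags = mmap.MAP_SHARED
    # MAP_POPULATE asks the kernel to fault the pages in at mmap() time
    # instead of on first access. Fall back to no flag if it is missing.
    flags |= getattr(mmap, "MAP_POPULATE", 0)
    # fileno -1 requests an anonymous mapping.
    return mmap.mmap(-1, length, flags=flags)


if __name__ == "__main__":
    region = map_prefaulted(REGION_SIZE)
    region[0:5] = b"hello"  # touching the memory needs no fault if populated
    print(len(region))
    region.close()
```

MAP_POPULATE only populates the page tables of the process that calls mmap(), which fits the point made above that every newly created process has its own page table and still faults on its own.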