On Thu, Sep 17, 2020 at 12:42:11PM -0700, Linus Torvalds wrote: > Because the whole "do page pinning without MADV_DONTFORK and then fork > the area" is I feel a very very invalid load. It sure as hell isn't > something we should care about performance for, and in fact it is > something we should very well warn for exactly to let people know > "this process is doing bad things". It is easy for things like iouring that can just allocate the queue memory they care about and MADV_DONTFORK it. Other things work more like O_DIRECT - the data it is working on is arbtiary app memory, not controlled in anyway. In RDMA we have this ugly scheme were we automatically call MADV_DONTFORK on the virtual address and hope it doesn't explode. It is very hard to call MADV_DONTFORK if you don't control the allocation. Don't want to break huge pages, have to hope really really hard that a fork doesn't need that memory. Hope you don't run out of vmas beause it causes a vma split. So ugly. So much overhead. Considering almost anything can do a fork() - we've seen app authors become confused. They say stuff is busted, support folks ask if they use fork, author says no.. Investigation later shows some hidden library did system() or whatever. In this case the tests that found this failed because they were written in Python and buried in there was some subprocess.call(). I would prefer the kernel consider it a valid work load with the semantics the sketch patch shows.. > Is there possibly somethign else we can filter on than just > GUP_PIN_COUNTING_BIAS? Because it could be as simple as just marking > the vma itself and saying "this vma has had a page pinning event done > on it". We'd have to give up pin_user_pages_fast() to do that as we can't fast walk and get vmas? Hmm, there are many users. I remember that the hfi1 folks really wanted the fast version for some reason.. > Because if we only start copying the page *iff* the vma is marked by > that "this vma had page pinning" _and_ the page count is bigger than > GUP_PIN_COUNTING_BIAS, than I think we can rest pretty easily knowing > that we aren't going to hit some regular old-fashioned UNIX server > cases with a lot of forks.. Agree Given that this is a user visible regression, it is nearly rc6, what do you prefer for next steps? Sorting out this for fork, especially if it has the vma change is probably more than a weeks time. Revert this patch and try again next cycle? Thanks, Jason