On Tue, Sep 15, 2020 at 05:33:30PM -0400, Peter Xu wrote: > > RDMA doesn't ever use !WRITE > > > > I'm guessing #5 is the issue, just with a different ordering. If the > > #3 pin_user_pages() preceeds the #2 fork, don't we get to the same #5? > > Right, but only if without MADV_DONTFORK? Yes, results are that MADV_DONTFORK resolves the issue for initial tests. I should know if a bigger test suite passes in a few days. > > > If this is a problem, we may still need the fix patch (maybe not as urgent as > > > before at least). But I'd like to double confirm, just in case I miss some > > > obvious facts above. > > > > I'm worred that the sudden need to have MAD_DONTFORK is going to be a > > turn into a huge regression. It already blew up our first level of > > synthetic test cases. I'm worried what we will see when the > > application suite is run in a few months :\ > > For my own preference I'll consider changing kernel behavior if the impact is > still under control (the performance report of 30%+ boost is also attractive > after the simplify-cow patch). The other way is to maintain the old reuse > logic forever, that'll be another kind of burden. Seems no easy way on either > side... It seems very strange that a physical page exclusively owned by a process can become copied if pin_user_pages() is active and the process did fork() at some point. Could the new pin_user_pages() logic help here? eg the GUP_PIN_COUNTING_BIAS stuff? Could the COW code consider a refcount of GUP_PIN_COUNTING_BIAS + 1 as being owned by the current mm and not needing COW? The DMA pin would be 'invisible' for COW purposes? Perhaps an ideal thing would be to have fork() not set the write protect on pages that have GUP_PIN_COUNTING_BIAS (ie are under DMA), instead immediately copy during fork(). Then we don't get into this situation and also don't need MADV_DONTFORK anymore (yay!!). Feels like this could be low cost as fork() must already be touching the refcount? It looks like RDMA, media, vfio, vdpa, io_uring (!!) and xdp all use FOLL_LONGTERM and may be at regression risk. I can't say at this point the scope of the problem with RDMA. *Technicaly* apps forking without MADV_DONTFORK are against the defined programming model, but since the old kernel didn't fail robustly there could be misses. FWIW the failure is catastrophic, the app just breaks completely. io_uring seems like something that would have interest to mix with fork.. I see mentions of MADV_DONTFORK in io_uring documentation, however it is not documented as a 'if you ever call fork() you have to use this API'. That seems worrying. > > > IMHO it worked because the page to do RDMA has mapcount==1, so it was reused > > > previously just as-is even after the fork without MADV_DONTFORK and after the > > > child quits. > > > > That would match the results we see.. So this patch changes things so > > it is not re-used as-is, but replaced with Y? > > Yes. The patch lets "replaced with Y" (cow) happen earlier at step #3. Then > with MADV_DONTFORK, reuse should not happen any more. ?? Step #3 is pin_user_pages(), why would that replace with COW? Thanks, Jason