On Fri, Sep 18, 2020 at 9:40 AM Peter Xu <peterx@xxxxxxxxxx> wrote: > > Firstly in the draft patch mm->has_pinned is introduced and it's written to 1 > as long as FOLL_GUP is called once. It's never reset after set. That's fine. That was what I was expecting you to do. It only needs to be cleared at mm creation time (fork/exec), and I see you added that in mm_init() already. Since this only matters for fork(), and is really just a filter for the (very) special behavior, and people who _actually_ do the page pinning won't likely be mixing that with thousands of forks anyway, it really doesn't matter. It's literally just about "I'm a normal process, I've never done any special rdma page pinning, let me fork a lot with no overhead" flag. The only change I'd make is to make it a "int" and put it next to the "int map_count", since that will pack better on 64-bit (assuming you don't do the randomize_layout thing, in which case it's all moot). Even if we were to expand it to an actual page count, I'm not convinced we'd ever want anybody to pin more than 2 billion pages. That's a minimum of 8 TB of RAM. Even if that were physically possibly on some machines, it doesn't seem reasonable. So even if that flag were to ever become an actual count, more than 32 bits seems pointless and wrong. And as a flag, it most certainly doesn't need "unsigned long". > One more thing (I think) we need to do is to pass the new vma from > copy_page_range() down into the end because if we want to start cow during > fork() then we need to operate on that new vma too when new page linked to it > rather than the parent's. Ahh. Because you pass the new vma down to the rmap routines. I actually think it's unnecessary, because all the rmap routines *really* care about is not the vma, but the anonvma associated with it. Which will be the same for the parent and the child. But we'd probably have to change the calling convention for rmap for that to be obvious, so your solution seems ok. Maybe not optimal, but I think we're going for "let's make things as clear as possible" rather than optimal right now. My main worry here is that it makes the calls really ugly, and we generally try to avoid having that many arguments, but it was bad before, and these are generally inlined, so changing it to use a argument structure wouldn't even help code generation. So it's not pretty. But it is what it is. > One issue is when we charge for cgroup we probably can't do that onto the new > mm/task, since copy_namespaces() is called after copy_mm(). That cannot possibly matter as far as I can see. Copying the page in between those two calls is already possible since we've already dropped the mmap_lock by the time copy_namespaces() is called. So if the parent was threaded, and another thread did a write access, the parent would have caused that COW that we did early. And any page copying cost should be to the parent anyway, since that is who did the pinning that caused the copy in the first place. So for both of those reasons - the COW can already happen between copy_mm() and copy_namespaces(), *and* charging it to the parent namespace is proper anyway - I think your cgroup worry is not relevant. I'm not even sure anything relevant to accounting is created, but my point is that even if it is, I don't see how it could be an issue. > The other thing is on how to fail. E.g., when COW failed due to either > charging of cgroup or ENOMEM, ideally we should fail fork() too. Though that > might need more changes - current patch silently kept the shared page for > simplicity. We already can fail forkind due to memory allocations failing. Again, not an issue. It happens. The only real thing to worry about would be that this doesn't affect normal programs, and that mm flag takes care of that. Linus