Hi John, Thanks for the review and suggestions. On Sun, Aug 2, 2020 at 8:28 PM John Hubbard <jhubbard@xxxxxxxxxx> wrote: > > On 7/31/20 1:32 PM, Peter Collingbourne wrote: > ... > > Hi, > > I can see why you want to do this. A few points to consider, below. > > btw, the patch would *not* apply for me, via `git am`. I finally used > patch(1) and that worked. Probably good to mention which tree and branch > this applies to, as a first step to avoiding that, but I'm not quite sure > what else went wrong. Maybe use stock git, instead of > 2.28.0.163.g6104cc2f0b6-goog? Just guessing. Sorry about that. It might have been because I had another patch applied underneath this one when I created the patch. In the v2 that I'm about to send I'm based directly on master. > > @@ -1684,9 +1695,33 @@ static inline int accountable_mapping(struct file *file, vm_flags_t vm_flags) > > return (vm_flags & (VM_NORESERVE | VM_SHARED | VM_WRITE)) == VM_WRITE; > > } > > > > +static vm_fault_t refpage_fault(struct vm_fault *vmf) > > +{ > > + struct page *page; > > + > > + if (get_user_pages((unsigned long)vmf->vma->vm_private_data, 1, 0, > > + &page, 0) != 1) > > + return VM_FAULT_SIGSEGV; > > + > > This will end up overflowing the page->_refcount in some situations. > > Some thoughts: > > In order to implement this feature, the reference pages need to be made > at least a little bit more special, and probably little bit more like > zero pages. At one extreme, for example, zero pages could be a special > case of reference pages, although I'm not sure of a clean way to > implement that. > > > The reason that more special-ness is required, is that things such as > reference counting and locking can be special-cased with zero pages. > Doing so allows avoiding page->_refcount overflows, for example. Your > patch here, however, allows normal pages to be treated *almost* like a > zero page, in that it's a page full of constant value data. But because > a refpage can be any page, not just a special one that is defined at a > single location, that leads to problems with refcounts. You're right, there is a potential reference count issue here. But it looks like the issue is not with _refcount but with _mapcount. For example, a program could create a reference page mapping with 2^32 pages, fault every page in the mapping and thereby overflow _mapcount. It looks like we can avoid this issue by aligning the handling of reference pages with that of the zero page, as you suggested. Like the zero page, _mapcount is now not modified on reference pages to track PTEs (this is done by causing vm_normal_page() to return null for these pages, as we do for the zero page). Ownership is moved to the struct file created by the new refpage_create (bikeshed colors welcome) syscall which returns a file descriptor, per Kirill's suggestion. A struct file's reference count is an atomic_long_t, which I assume cannot realistically overflow. A pointer to the reference page is stored in the VMA's vm_private_data, but this is mostly for convenience because the page is kept alive by the VMA's struct file reference. The VMA's vm_ops is now set to null, which causes us to follow the code path for anonymous pages, which has been modified to handle reference pages. That's all implemented in the v2 that I'm about to send. I considered having reference page mappings continue to provide a custom vm_ops, but this would require changes to the interface to preserve the specialness of the reference page. For example, vm_normal_page() would need to know to return null for the reference page in order to prevent _mapcount from overflowing, which could probably be done by adding a new interface to vm_ops, but that seemed more complicated than changing the anonymous page code path. > > + vmf->page = page; > > + return VM_FAULT_LOCKED; > > Is the page really locked, or is this a case of "the page is special and > we can safely claim it is locked"? Maybe I'm just confused about the use > of VM_FAULT_LOCKED: I thought you only should set it after locking the > page. You're right, it isn't locked at this point. I had confused locking the page with incrementing its _refcount via get_user_pages(). But with the new implementation we no longer need this fault handler. > > +} > > + > > +static void refpage_close(struct vm_area_struct *vma) > > +{ > > + /* This function exists only to prevent is_mergeable_vma from allowing a > > + * reference page mapping to be merged with an anonymous mapping. > > + */ > > While it is true that implementing a vma's .close() method will prevent > vma merging, this is an abuse of that function: it depends on how that > function is implemented. And given that refpages represent significant > new capability, I think they deserve their own "if" clause (and perhaps > a VMA flag) in is_mergeable_vma(), instead of this kind of minor hack. It turns out that with the change to use a file descriptor we do not need a change to is_mergeable_vma() because the function bails out if the struct file pointers in the VMAs are different. Thanks, Peter