On Tue, Feb 4, 2025 at 8:31 PM Ackerley Tng <ackerleytng@xxxxxxxxxx> wrote:
>
> Vishal Annapurve <vannapurve@xxxxxxxxxx> writes:
>
> > On Thu, Jan 23, 2025 at 1:51 AM Fuad Tabba <tabba@xxxxxxxxxx> wrote:
> >>
> >> On Wed, 22 Jan 2025 at 22:16, Ackerley Tng <ackerleytng@xxxxxxxxxx> wrote:
> >> >
> >> > Fuad Tabba <tabba@xxxxxxxxxx> writes:
> >> >
> >> > Hey Fuad, I'm still working on verifying all this, but for now this is
> >> > one issue. I think this can be fixed by checking whether folio->mapping
> >> > is NULL. If it's NULL, then the folio has been disassociated from the
> >> > inode, and during the disassociation (removal from the filemap), the
> >> > mappability can either
> >> >
> >> > 1. Be unset so that the default mappability can be set up based on
> >> >    GUEST_MEMFD_FLAG_INIT_MAPPABLE, or
> >> > 2. Be directly restored based on GUEST_MEMFD_FLAG_INIT_MAPPABLE.
> >>
> >> Thanks for pointing this out. I hadn't considered this case. I'll fix
> >> it in the respin.
> >>
> >
> > Can the below scenario cause trouble?
> > 1) Userspace converts a certain range of a guest_memfd to shared and
> >    grabs some refcounts on the shared memory pages through existing
> >    kernel-exposed mechanisms.
> > 2) Userspace converts the same range back to private, which causes the
> >    corresponding mappability attributes to become *MAPPABILITY_NONE.
> > 3) Userspace truncates the range, which removes the page from the
> >    pagecache.
> > 4) Userspace does the fallocate again, leading to a new page being
> >    allocated without freeing the older page, which is still refcounted
> >    (step 1).
> >
> > Effectively this could allow userspace to keep allocating multiple
> > pages for the same guest_memfd range.
>
> I'm still verifying this, but for now here's the flow Vishal described in
> greater detail:
>
> + guest_memfd starts without GUEST_MEMFD_FLAG_INIT_MAPPABLE
> + All new pages will start with mappability = GUEST
> + guest uses a page
> + Get new page
> + Add page to filemap
> + guest converts page to shared
> + Mappability is now ALL
> + host uses page
> + host takes transient refcounts on page
> + Refcount on the page is now (a) filemap's refcount, (b) vma's refcount,
>   (c) transient refcount
> + guest converts page to private
> + Page is unmapped
> + Refcount on the page is now (a) filemap's refcount, (b) transient
>   refcount
> + Since the refcount is elevated, the mappabilities are left as NONE
> + Filemap's refcounts are removed from the page
> + Refcount on the page is now (a) transient refcount
> + host punches hole to deallocate page
> + Since mappability was NONE, restore filemap's refcount
> + Refcount on the page is now (a) transient refcount, (b) filemap's
>   refcount
> + Mappabilities are reset to GUEST for the truncated range
> + Folio is removed from filemap
> + Refcount on the page is now (a) transient refcount
> + Callback remains registered so that when the transient refcounts are
>   dropped, cleanup can happen - this is where merging will happen with
>   1G page support
> + host fallocate()s in the same address range
> + will get a new page
>
> Though the host does manage to get a new page while the old one stays
> around, I think this is working as intended, since the transient
> refcounts are truly holding the old folio around. When the transient
> refcounts go away, the old folio will still get cleaned up (with 1G page
> support: merged and returned) as expected. The new page will also be
> freed at some point later.
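To make the sequence above concrete, here is a rough userspace sketch of
steps 3) and 4) (truncate, then re-allocate the same range). Obtaining the
guest_memfd, the shared/private conversions, and the mechanism that takes
the extra refcounts are series- and VMM-specific, so they are only hinted
at in comments; the only real APIs used are the fallocate() calls, i.e.
this is an illustrative sketch, not a reproducer.

/*
 * Illustrative sketch of steps 3) and 4) above.  gmem_fd is assumed to be
 * a guest_memfd whose first page is currently allocated; obtaining it and
 * doing the shared/private conversions are outside this sketch.
 */
#define _GNU_SOURCE
#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>

void truncate_then_reallocate(int gmem_fd)
{
	long page = sysconf(_SC_PAGESIZE);

	/* Steps 1) and 2) (not shown): convert the range to shared, take
	 * extra refcounts on the page, then convert it back to private so
	 * its mappability attribute becomes *MAPPABILITY_NONE. */

	/* Step 3): truncate the range.  The folio is removed from the
	 * filemap, but the extra refcounts keep the old folio alive. */
	if (fallocate(gmem_fd, FALLOC_FL_PUNCH_HOLE | FALLOC_FL_KEEP_SIZE,
		      0, page))
		perror("punch hole");

	/* Step 4): fallocate() again.  A new folio now backs the same range
	 * while the old, still-referenced folio has not been freed yet. */
	if (fallocate(gmem_fd, 0, 0, page))
		perror("fallocate");
}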
>
> If the userspace program decides to keep taking transient refcounts to hold
> pages around, then the userspace program is truly leaking memory and it
> shouldn't be guest_memfd's bug.

I wouldn't call such references transient. But a similar scenario is
applicable to shmem files as well, so it makes sense to call out this
behavior as working as intended (WAI).
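For comparison, below is a minimal sketch of the analogous shmem case,
assuming vmsplice() is the kernel-exposed mechanism userspace uses to hold
page references. It pins a shmem page into a pipe, punches a hole, and then
re-allocates the same range, so two pages briefly back the same offset,
mirroring the guest_memfd scenario above. The file name and sizes are
illustrative.

/* Pin a shmem page via vmsplice(), punch a hole, re-allocate the range. */
#define _GNU_SOURCE
#include <fcntl.h>
#include <string.h>
#include <sys/mman.h>
#include <sys/uio.h>
#include <unistd.h>

int main(void)
{
	long page = sysconf(_SC_PAGESIZE);
	int fd = memfd_create("shmem-demo", 0);
	int pipefd[2];
	struct iovec iov;
	char *map;

	if (fd < 0 || pipe(pipefd) || ftruncate(fd, page))
		return 1;

	map = mmap(NULL, page, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
	if (map == MAP_FAILED)
		return 1;
	memset(map, 0xaa, page);	/* fault in the shmem page */

	/* The pipe buffer now holds a reference to the shmem page. */
	iov.iov_base = map;
	iov.iov_len = page;
	if (vmsplice(pipefd[1], &iov, 1, 0) < 0)
		return 1;

	/* Punch a hole: the page leaves the file, but the pipe's reference
	 * keeps it allocated. */
	if (fallocate(fd, FALLOC_FL_PUNCH_HOLE | FALLOC_FL_KEEP_SIZE, 0, page))
		return 1;

	/* Allocate the range again: a second page now backs offset 0 while
	 * the old one is still held by the pipe. */
	return fallocate(fd, 0, 0, page) ? 1 : 0;
}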