Re: [RFC][Patch v12 1/2] mm: page_reporting: core infrastructure

Alexander Duyck <alexander.duyck@xxxxxxxxx> · Wed, 14 Aug 2019 09:11:09 -0700

On Wed, Aug 14, 2019 at 8:49 AM Nitesh Narayan Lal <nitesh@xxxxxxxxxx> wrote:
>
>
> On 8/12/19 2:47 PM, Alexander Duyck wrote:
> > On Mon, Aug 12, 2019 at 6:13 AM Nitesh Narayan Lal <nitesh@xxxxxxxxxx> wrote:
> >> This patch introduces the core infrastructure for free page reporting in
> >> virtual environments. It enables the kernel to track the free pages which
> >> can be reported to its hypervisor so that the hypervisor could
> >> free and reuse that memory as per its requirement.
> >>
> >> While the pages are getting processed in the hypervisor (e.g.,
> >> via MADV_DONTNEED), the guest must not use them, otherwise, data loss
> >> would be possible. To avoid such a situation, these pages are
> >> temporarily removed from the buddy. The amount of pages removed
> >> temporarily from the buddy is governed by the backend(virtio-balloon
> >> in our case).
> >>
> >> To efficiently identify free pages that can to be reported to the
> >> hypervisor, bitmaps in a coarse granularity are used. Only fairly big
> >> chunks are reported to the hypervisor - especially, to not break up THP
> >> in the hypervisor - "MAX_ORDER - 2" on x86, and to save space. The bits
> >> in the bitmap are an indication whether a page *might* be free, not a
> >> guarantee. A new hook after buddy merging sets the bits.
> >>
> >> Bitmaps are stored per zone, protected by the zone lock. A workqueue
> >> asynchronously processes the bitmaps, trying to isolate and report pages
> >> that are still free. The backend (virtio-balloon) is responsible for
> >> reporting these batched pages to the host synchronously. Once reporting/
> >> freeing is complete, isolated pages are returned back to the buddy.
> >>
> >> Signed-off-by: Nitesh Narayan Lal <nitesh@xxxxxxxxxx>
> >
> [...]
> >> +}
> >> +
> >> +/**
> >> + * __page_reporting_enqueue - tracks the freed page in the respective zone's
> >> + * bitmap and enqueues a new page reporting job to the workqueue if possible.
> >> + */
> >> +void __page_reporting_enqueue(struct page *page)
> >> +{
> >> +       struct page_reporting_config *phconf;
> >> +       struct zone *zone;
> >> +
> >> +       rcu_read_lock();
> >> +       /*
> >> +        * We should not process this page if either page reporting is not
> >> +        * yet completely enabled or it has been disabled by the backend.
> >> +        */
> >> +       phconf = rcu_dereference(page_reporting_conf);
> >> +       if (!phconf)
> >> +               return;
> >> +
> >> +       zone = page_zone(page);
> >> +       bitmap_set_bit(page, zone);
> >> +
> >> +       /*
> >> +        * We should not enqueue a job if a previously enqueued reporting work
> >> +        * is in progress or we don't have enough free pages in the zone.
> >> +        */
> >> +       if (atomic_read(&zone->free_pages) >= phconf->max_pages &&
> >> +           !atomic_cmpxchg(&phconf->refcnt, 0, 1))
> > This doesn't make any sense to me. Why are you only incrementing the
> > refcount if it is zero? Combining this with the assignment above, this
> > isn't really a refcnt. It is just an oversized bitflag.
>
>
> The intent for having an extra variable was to ensure that at a time only one
> reporting job is enqueued. I do agree that for that purpose I really don't need
> a reference counter and I should have used something like bool
> 'page_hinting_active'. But with bool, I think there could be a possible chance
> of race. Maybe I should rename this variable and keep it as atomic.
> Any thoughts?

You could just use a bitflag to achieve what you are doing here. That
is the primary use case for many of the test_and_set_bit type
operations. However one issue with doing it as a bitflag is that you
have no way of telling that you took care of all requesters. That is
where having an actual reference count comes in handy as you know
exactly how many zones are requesting to be reported on.

> >
> > Also I am pretty sure this results in the opportunity to miss pages
> > because there is nothing to prevent you from possibly missing a ton of
> > pages you could hint on if a large number of pages are pushed out all
> > at once and then the system goes idle in terms of memory allocation
> > and freeing.
>
>
> I was looking at how you are enqueuing/processing reporting jobs for each zone.
> I am wondering if I should also consider something on similar lines as having
> that I might be able to address the concern which you have raised above. But it
> would also mean that I have to add an additional flag in the zone_flags. :)

You could do it either in the zone or outside the zone as yet another
bitmap. I decided to put the flags inside the zone because there was a
number of free bits there and it should be faster since we were
already using the zone structure.