Re: [PATCH 5/5] PM/Hibernate: Do not release preallocated memory unnecessarily (rev. 2)

Wu Fengguang <fengguang.wu@xxxxxxxxx> · Wed, 6 May 2009 21:52:53 +0800

On Wed, May 06, 2009 at 07:05:09AM +0800, Rafael J. Wysocki wrote:
> On Tuesday 05 May 2009, Wu Fengguang wrote:
> > On Mon, May 04, 2009 at 08:22:38AM +0800, Rafael J. Wysocki wrote:
> > > From: Rafael J. Wysocki <rjw@xxxxxxx>
> > > 
> > > Since the hibernation code is now going to use allocations of memory
> > > to create enough room for the image, it can also use the page frames
> > > allocated at this stage as image page frames.  The low-level
> > > hibernation code needs to be rearranged for this purpose, but it
> > > allows us to avoid freeing a great number of pages and allocating
> > > these same pages once again later, so it generally is worth doing.
> > > 
> > > [rev. 2: Change the strategy of preallocating memory to allocate as
> > >  many pages as needed to get the right image size in one shot (the
> > >  excessive allocated pages are released afterwards).]
> > 
> > Rafael, I tried out your patches and found doubled memory shrink speed!
> >
> > [  579.641781] PM: Preallocating image memory ... done (allocated 383900 pages, 128000 image pages kept)
> > [  583.087875] PM: Allocated 1535600 kbytes in 3.43 seconds (447.69 MB/s)
> 
> Unfortunately, I'm observing a regression and a huge one.
> 
> On my Atom-based test box with 1 GB of RAM after a fresh boot and starting X
> with KDE 4 there are ~256 MB free.  To create an image we need to free ~300 MB
> and that takes ~2 s with the old code and ~15 s with the new one.
> 
> It helps to call shrink_all_memory() once with a sufficiently large argument
> before the preallocation.

Yes there are some strange behaviors. I tried to populate the page
cache with 1/30 mapped file pages and others normal file pages, all
referenced once. I get this on "echo disk > /sys/power/state":

[  462.820098] PM: Marking nosave pages: 0000000000001000 - 0000000000006000
[  462.827161] PM: Marking nosave pages: 000000000009f000 - 0000000000100000
[  462.834249] PM: Basic memory bitmaps created
[  462.838631] PM: Syncing filesystems ... done.
[  463.167805] Freezing user space processes ... (elapsed 0.00 seconds) done.
[  463.175738] Freezing remaining freezable tasks ... (elapsed 0.00 seconds) done.
[  463.183834] PM: Preallocating image memory ... done (allocated 383898 pages, 128000 image pages kept)
[  469.605741] PM: Allocated 1535592 kbytes in 6.41 seconds (239.56 MB/s)
[  469.612325]
[  469.768796] Restarting tasks ... done.
[  469.775044] PM: Basic memory bitmaps freed

Immediately after that, I copied a big sparse file into memory, and get this:

[  508.097913] PM: Marking nosave pages: 0000000000001000 - 0000000000006000
[  508.104799] PM: Marking nosave pages: 000000000009f000 - 0000000000100000
[  508.111702] PM: Basic memory bitmaps created
[  508.116073] PM: Syncing filesystems ... done.
[  509.208608] Freezing user space processes ... (elapsed 0.00 seconds) done.
[  509.216692] Freezing remaining freezable tasks ... (elapsed 0.00 seconds) done.
[  509.224708] PM: Preallocating image memory ... done (allocated 383872 pages, 128000 image pages kept)
[  520.951882] PM: Allocated 1535488 kbytes in 11.71 seconds (131.12 MB/s)

It's much worse.

Your patches are really interesting exercises for the vmscan code ;-)

> > +       error = memory_bm_create(&orig_bm, GFP_IMAGE, PG_ANY);
> > +       if (error)
> > +               goto err_out;
> > +
> > +       error = memory_bm_create(&copy_bm, GFP_IMAGE, PG_ANY);
> > +       if (error)
> > +               goto err_out;
> > 
> > memory_bm_create() is called a number of times, each time it will
> > call create_mem_extents()/memory_bm_free(). Can they be optimized to
> > be called only once?
> 
> Possibly, but not right now if you please?  This is just moving code BTW.

OK.

> 
> > A side note: there are somehow duplicated *_extent_*() logics in the
> > filesystems, is it possible that we abstract out some of the common code?
> 
> I think we can do it, but it really is low priority to me at the moment.

OK. Just was a wild thought.

> 
> > +       for_each_populated_zone(zone) {
> > +               size += snapshot_additional_pages(zone);
> > +               count += zone_page_state(zone, NR_FREE_PAGES);
> > +               if (!is_highmem(zone))
> > +                       count -= zone->lowmem_reserve[ZONE_NORMAL];
> > +       }
> > 
> > Why [ZONE_NORMAL] instead of [zone]? ZONE_NORMAL may not always be the largest zone,
> > for example, My 4GB laptop has a tiny ZONE_NORMAL and a large ZONE_DMA32.
> 
> Ah, this is a leftover and it should be changed or even dropped.  Can you
> please remind me how exactly lowmem_reserve[] is supposed to work?

totalreserve_pages could be better. When free memory drops below that
threshold(it actually works per zone), kswapd will wake up trying to
reclaim pages. If the total reclaimable+free pages are as low as
totalreserve_pages, that would drive kswapd mad - scanning the whole
zones, trying to squeeze the last pages out of them.  Sure kswapd will
stop somewhere, but the resulting scan:reclaim ratio would be pretty
high and therefore hurt performance.

So we shall stop preallocation when reclaimable pages go down to
something like (5*totalreserve_pages). The vmscan mad may come earlier
because of unbalanced distributions of reclaimable pages among the zones.

> > At last, I'd express my major concern about the transition to preallocate
> > based memory shrinking: will it lead to more random swapping IOs?
> 
> Hmm.  I don't see immediately why would it.  Maybe the regression I'm seeing
> is related to that ...

OK. Anyway a preallocate based shrinking policy could be far from optimal. 
I'd suggest to switch to user space directed shrinking via fadvise(DONTNEED),
and leave the kernel one a fail safe path.  The user space tool could
gather page information from the filecache interface which I've been
maintaining out of tree, and to drop inactive/active pages from large
files first. That should be a better policy at least for rotational disks.

Thanks,
Fengguang

--
To unsubscribe from this list: send the line "unsubscribe kernel-testers" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at  http://vger.kernel.org/majordomo-info.html