Hi,

First, thanks for reviewing.

On Tuesday 25 April 2006 12:28, Nick Piggin wrote:
> Rafael J. Wysocki wrote:
>
> > Index: linux-2.6.17-rc1-mm3/mm/rmap.c
> > ===================================================================
> > --- linux-2.6.17-rc1-mm3.orig/mm/rmap.c	2006-04-22 10:34:33.000000000 +0200
> > +++ linux-2.6.17-rc1-mm3/mm/rmap.c	2006-04-23 00:41:19.000000000 +0200
> > @@ -851,3 +851,22 @@ int try_to_unmap(struct page *page, int
> >  	return ret;
> >  }
> >
> > +#ifdef CONFIG_SOFTWARE_SUSPEND
> > +int page_mapped_by_current(struct page *page)
> > +{
> > +	struct vm_area_struct *vma;
> > +	int ret = 0;
> > +
> > +	spin_lock(&current->mm->page_table_lock);
> > +
> > +	for (vma = current->mm->mmap; vma; vma = vma->vm_next)
> > +		if (page_address_in_vma(page, vma) != -EFAULT) {
> > +			ret = 1;
> > +			break;
> > +		}
> > +
> > +	spin_unlock(&current->mm->page_table_lock);
> > +
> > +	return ret;
> > +}
> > +#endif /* CONFIG_SOFTWARE_SUSPEND */
>
> Why not just pass in the task struct? It might make the interface more
> useful elsewhere.

Good idea, I'll do that (sketch below).

> > Index: linux-2.6.17-rc1-mm3/include/linux/rmap.h
> > ===================================================================
> > --- linux-2.6.17-rc1-mm3.orig/include/linux/rmap.h	2006-04-22 10:34:33.000000000 +0200
> > +++ linux-2.6.17-rc1-mm3/include/linux/rmap.h	2006-04-22 10:34:45.000000000 +0200
> > @@ -104,6 +104,12 @@ pte_t *page_check_address(struct page *,
> >   */
> >  unsigned long page_address_in_vma(struct page *, struct vm_area_struct *);
> >
> > +#ifdef CONFIG_SOFTWARE_SUSPEND
> > +int page_mapped_by_current(struct page *);
> > +#else
> > +static inline int page_mapped_by_current(struct page *page) { return 0; }
> > +#endif /* CONFIG_SOFTWARE_SUSPEND */
>
> This is just wrong. What if some random unrelated code calls
> page_mapped_by_current? You should only fold inlines if their result
> folds accordingly.
>
> For now just leave it out completely if !CONFIG_SOFTWARE_SUSPEND.

OK

> > -unsigned int count_data_pages(void)
> > +/**
> > + * need_to_copy - determine if a page needs to be copied before saving.
> > + * Returns false if the page can be saved without copying.
> > + */
> > +
> > +static int need_to_copy(struct page *page)
> > +{
> > +	if (!PageLRU(page) || PageCompound(page))
> > +		return 1;
> > +	if (page_mapped(page))
> > +		return page_mapped_by_current(page);
> > +
> > +	return 1;
> > +}
>
> I'd much rather VM internal type stuff get moved *out* of kernel/power :(

Well, I kind of agree, but I don't know where to place it under mm/.

> It needs more comments too. Also, how important is it for the page to be
> off the LRU?

Hm, I'm not sure if that's what you're asking about, but the pages off the
LRU are handled in the usual way, ie. copied when snapshotting the system.
The pages _on_ the LRU may be included in the snapshot image without
copying, but I additionally require them to be (a) mapped by someone and
(b) not mapped by the current task.

> What if kswapd has taken it off the LRU temporarily (or is it
> guaranteed not to at this point)? What if it is in an lru_add pagevec?

At this point kswapd is frozen, as is everything else except for the
current task (ie. us).  IOW, no one is expected to be processing the LRU
lists.
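Here's roughly what I have in mind for the task_struct version (untested
sketch, the name page_mapped_by_task() is tentative):

#ifdef CONFIG_SOFTWARE_SUSPEND
/*
 * Sketch: like page_mapped_by_current() above, but with the task passed
 * in explicitly, so the interface can be used outside of swsusp too.
 */
int page_mapped_by_task(struct page *page, struct task_struct *tsk)
{
	struct vm_area_struct *vma;
	int ret = 0;

	spin_lock(&tsk->mm->page_table_lock);

	for (vma = tsk->mm->mmap; vma; vma = vma->vm_next)
		if (page_address_in_vma(page, vma) != -EFAULT) {
			ret = 1;
			break;
		}

	spin_unlock(&tsk->mm->page_table_lock);

	return ret;
}
#endif /* CONFIG_SOFTWARE_SUSPEND */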
> > +
> > +unsigned int count_data_pages(unsigned int *total)
> >  {
> >  	struct zone *zone;
> >  	unsigned long zone_pfn;
> > -	unsigned int n = 0;
> > +	unsigned int n, m;
> >
> > +	n = m = 0;
> >  	for_each_zone (zone) {
> >  		if (is_highmem(zone))
> >  			continue;
> >  		mark_free_pages(zone);
> > -		for (zone_pfn = 0; zone_pfn < zone->spanned_pages; ++zone_pfn)
> > -			n += saveable(zone, &zone_pfn);
> > +		for (zone_pfn = 0; zone_pfn < zone->spanned_pages; ++zone_pfn) {
> > +			unsigned long pfn = zone_pfn + zone->zone_start_pfn;
> > +
> > +			if (saveable(pfn)) {
> > +				n++;
> > +				if (!need_to_copy(pfn_to_page(pfn)))
> > +					m++;
> > +			}
> > +		}
> >  	}
> > -	return n;
> > +	if (total)
> > +		*total = n;
> > +
> > +	return n - m;
> >  }
>
> Maybe we could add some commenting to these functions while changing them
> around.

OK, I will.

> > +static LIST_HEAD(active_pages);
> > +static LIST_HEAD(inactive_pages);
> > +
> > +/**
> > + * protect_data_pages - move data pages that need to be protected from
> > + * being reclaimed out of their respective LRU lists
> > + */
> > +
> > +static void protect_data_pages(struct pbe *pblist)
> > +{
> > +	struct pbe *p;
> > +
> > +	for_each_pbe (p, pblist)
> > +		if (p->address == p->orig_address) {
> > +			struct page *page = virt_to_page(p->address);
> > +			struct zone *zone = page_zone(page);
> > +
> > +			spin_lock(&zone->lru_lock);
> > +			if (PageActive(page)) {
> > +				del_page_from_active_list(zone, page);
> > +				list_add(&page->lru, &active_pages);
> > +			} else {
> > +				del_page_from_inactive_list(zone, page);
> > +				list_add(&page->lru, &inactive_pages);
> > +			}
> > +			spin_unlock(&zone->lru_lock);
> > +			ClearPageLRU(page);
> > +		}
> > +}
>
> a) How do you know the page is on the LRU?

Because in copy_data_pages() p->address is only set to p->orig_address for
pages that are on the LRU.  No one can change the state of the pages in
between, because everyone else is frozen and we have local IRQs disabled.

> b) Why are you clearing PageLRU outside the spinlock?

Well, good question. ;-)  Moreover, it seems I don't need to acquire the
spinlock at all, because this is done on one CPU with IRQs disabled and
the other tasks frozen.

> c) This code isn't going outside mm/ even if it is correct.

OK, so could you please suggest where to put it (or a corrected version)
under mm/?

> d) swsusp seems to be quite underdocumented as far as race conditions
>    vs the rest of the kernel go. In some places only a single CPU is
>    running with interrupts off, in other places all processes have
>    frozen, etc. etc. So it is hard to even know if synchronisation is
>    correct or not.
>
>    Suggest more documentation goes into this.

OK

> > +
> > +/**
> > + * restore_active_inactive_lists - if suspend fails, the pages protected
> > + * with protect_data_pages() have to be moved back to their respective
> > + * lists
> > + */
> > +
> > +static void restore_active_inactive_lists(void)
>
> Ditto, mostly.
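For reference, restore_active_inactive_lists() is essentially the reverse
of protect_data_pages(); simplified, it does something like this (and
relies on the same single-CPU, IRQs-off, everyone-frozen assumptions as
above):

static void restore_active_inactive_lists(void)
{
	struct page *page;

	/* Put the pages moved off the LRU by protect_data_pages() back
	 * on the per-zone lists they were taken from. */
	while (!list_empty(&active_pages)) {
		page = list_entry(active_pages.next, struct page, lru);
		list_del(&page->lru);
		SetPageLRU(page);
		add_page_to_active_list(page_zone(page), page);
	}
	while (!list_empty(&inactive_pages)) {
		page = list_entry(inactive_pages.next, struct page, lru);
		list_del(&page->lru);
		SetPageLRU(page);
		add_page_to_inactive_list(page_zone(page), page);
	}
}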
> > +static int enough_free_mem(unsigned int nr_pages, unsigned int copy_pages)
> >  {
> >  	struct zone *zone;
> > -	unsigned int n = 0;
> > +	long n = 0;
> > +
> > +	pr_debug("swsusp: pages needed: %u + %lu + %lu, free: %u\n",
> > +		copy_pages,
> > +		(nr_pages + PBES_PER_PAGE - 1) / PBES_PER_PAGE,
> > +		EXTRA_PAGES, nr_free_pages());
> >
> >  	for_each_zone (zone)
> > -		if (!is_highmem(zone))
> > +		if (!is_highmem(zone) && populated_zone(zone)) {
> >  			n += zone->free_pages;
> > -	pr_debug("swsusp: available memory: %u pages\n", n);
> > -	return n > (nr_pages + PAGES_FOR_IO +
> > -		(nr_pages + PBES_PER_PAGE - 1) / PBES_PER_PAGE);
> > -}
> > -
> > -static int alloc_data_pages(struct pbe *pblist, gfp_t gfp_mask, int safe_needed)
> > -{
> > -	struct pbe *p;
> > +			n -= zone->lowmem_reserve[ZONE_NORMAL];
>
> Comment for this? Eg. why are you taking the ZONE_NORMAL watermark
> in particular?

This is to keep enough_free_mem() in sync with swsusp_shrink_memory() (in
swsusp.c).  [We allocate with GFP_ATOMIC, ie. from the normal zone, and
the lowmem reserves in the lower zones are effectively unavailable to us
because of the check in zone_watermark_ok().]

Thanks a lot for the comments.  I'll do my best to follow your suggestions
in the next version of the patch.

Greetings,
Rafael
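P.S.  Re your "Comment for this?" above, I'm thinking of something along
these lines for the next version (wording tentative):

	for_each_zone (zone)
		if (!is_highmem(zone) && populated_zone(zone)) {
			n += zone->free_pages;
			/*
			 * We allocate with GFP_ATOMIC, ie. from ZONE_NORMAL
			 * and below, so the part of this zone's free memory
			 * reserved for ZONE_NORMAL allocations is not really
			 * available to us: zone_watermark_ok() will refuse
			 * to dip into it.  Keep this in sync with
			 * swsusp_shrink_memory().
			 */
			n -= zone->lowmem_reserve[ZONE_NORMAL];
		}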