Re: Question on swapping

Martin Maletinsky <maletinsky@scs.ch> · Fri, 06 Dec 2002 15:34:42 +0100

Hello Raghav,

Thanks for your reply.

> But I have my own doubts on swapping which I would like to get
> cleared. I am unable to get the reason why at all a shared page
> gets transferred to the disk as long as it is in use.
>
> Why won't the following steps work:
> (i) In case the page is shared and the 1st time try_to_swap_out()
> is called : the page is transferred to swap cache and
> __free_page() is called the page count is not zero. Then do not
> transfer the page to disk.
> (ii) When the last process that shared the page calls
> try_to_swap_out: the pagecount hits 0.

It would only drop to 1, since the swap cache also holds a reference to that page (?)

> Then transfer the page to
> disk.
> This way only one disk transfer (which is expensive) gets done
> for shared pages.

I was wondering about that as well. However, I could not find the problem in
2.4.2. Below is my commented try_to_swap_out() code from 2.2.18 and 2.4.2
respectively (I marked my comments with ### M. Maletinsky).
See especially the comments at the end of each of the code excerpts.
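
The reference-count point made above (the count would drop to 1, not 0) can be
sketched with a toy model. This is only an illustration, assuming that the swap
cache takes its own reference when the page is added to it. The type rc_page
and the rc_* helpers are hypothetical names and are not kernel code.

```c
#include <assert.h>
#include <stdbool.h>

/* Toy stand-in for the kernel's struct page; NOT the real layout. */
struct rc_page {
    int  count;          /* number of references to the page */
    bool in_swap_cache;  /* true once the swap cache holds a reference */
};

/* Hypothetical helper: the swap cache takes its own reference (assumed). */
void rc_add_to_swap_cache(struct rc_page *p)
{
    if (!p->in_swap_cache) {
        p->in_swap_cache = true;
        p->count++;
    }
}

/* Hypothetical helper: one process drops its mapping of the page. */
void rc_unmap(struct rc_page *p)
{
    assert(p->count > 0);
    p->count--;
}

/* Returns the count left after all `nprocs` sharers have unmapped the page. */
int rc_count_after_all_unmapped(int nprocs)
{
    struct rc_page p = { .count = nprocs, .in_swap_cache = false };

    rc_add_to_swap_cache(&p);      /* done on the first swap-out attempt */
    for (int i = 0; i < nprocs; i++)
        rc_unmap(&p);
    return p.count;
}
```

Under that assumption rc_count_after_all_unmapped(3) returns 1, so a test for
"count hits 0" would never trigger while the page is still in the swap cache.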

with best regards
Martin Maletinsky

---------------------------------
2.2.18:

 static int try_to_swap_out(struct task_struct * tsk, struct vm_area_struct* vma,
         unsigned long address, pte_t * page_table, int gfp_mask)
 {
         pte_t pte;
         unsigned long entry;
         unsigned long page;
         struct page * page_map;

         pte = *page_table;
         if (!pte_present(pte))
                 return 0;
         page = pte_page(pte);
         if (MAP_NR(page) >= max_mapnr)
                 return 0;
         page_map = mem_map + MAP_NR(page);

         if (pte_young(pte)) {
                 /*
                  * Transfer the "accessed" bit from the page
                  * tables to the global page map.
                  */
                 set_pte(page_table, pte_mkold(pte));
                 flush_tlb_page(vma, address);
                 set_bit(PG_referenced, &page_map->flags);
                 return 0;
         }

         if (PageReserved(page_map)
             || PageLocked(page_map)
             || ((gfp_mask & __GFP_DMA) && !PageDMA(page_map)))
                 return 0;

         /*
          * Is the page already in the swap cache? If so, then
          * we can just drop our reference to it without doing
          * any IO - it's already up-to-date on disk.

### M. Maletinsky:
Why is that? The page may have been dirtied by the process from which it is
currently being unmapped (i.e. tsk). In that case the in-memory image differs
from the on-disk image, while the page descriptor does not have its PG_dirty
bit set. Moreover, *pte (whose dirty bit was set by the MMU when the process
wrote to the page) is discarded by the subsequent lines of code, with the
result that the information that the page was written to is lost.

          *
          * Return 0, as we didn't actually free any real
          * memory, and we should just continue our scan.
          */
         if (PageSwapCache(page_map)) {
                entry = page_map->offset;
                 swap_duplicate(entry);
                 set_pte(page_table, __pte(entry));
 drop_pte:
                 vma->vm_mm->rss--;
                 flush_tlb_page(vma, address);

### M. Maletinsky:
This is the latest point at which I would expect the page to be marked dirty
(in the 2.4.2 code the page is in fact marked dirty at more or less this
point, see below).

                 __free_page(page_map);
                return 0;
        }

        /*
         * Is it a clean page? Then it must be recoverable
         * by just paging it in again, and we can just drop
         * it..
         *
         * However, this won't actually free any real
         * memory, as the page will just be in the page cache
         * somewhere, and as such we should just continue
         * our scan.
         *
         * Basically, this just makes it possible for us to do
         * some real work in the future in "shrink_mmap()".
         */
        if (!pte_dirty(pte)) {
                flush_cache_page(vma, address);
                pte_clear(page_table);
                goto drop_pte;
        }

        /*
         * Don't go down into the swap-out stuff if
         * we cannot do I/O! Avoid recursing on FS
         * locks etc.
         */
        if (!(gfp_mask & __GFP_IO))
                return 0;

        /*
         * Ok, it's really dirty. That means that
         * we should either create a new swap cache
         * entry for it, or we should write it back
         * to its own backing store.
         *
         * Note that in neither case do we actually
         * know that we make a page available, but
         * as we potentially sleep we can no longer
         * continue scanning, so we migth as well
         * assume we free'd something.
         *
         * NOTE NOTE NOTE! This should just set a
         * dirty bit in page_map, and just drop the
         * pte. All the hard work would be done by
         * shrink_mmap().
         *
         * That would get rid of a lot of problems.
         */
        flush_cache_page(vma, address);
        if (vma->vm_ops && vma->vm_ops->swapout) {
                pid_t pid = tsk->pid;
                pte_clear(page_table);
                flush_tlb_page(vma, address);
                vma->vm_mm->rss--;

                if (vma->vm_ops->swapout(vma, page_map))
                        kill_proc(pid, SIGBUS, 1);
                __free_page(page_map);
                return 1;
        }

        /*
         * This is a dirty, swappable page.  First of all,
         * get a suitable swap entry for it, and make sure
         * we have the swap cache set up to associate the
         * page with that swap entry.
         */
        entry = get_swap_page();
        if (!entry)
                return 0; /* No swap space left */

        vma->vm_mm->rss--;
        tsk->nswap++;
        set_pte(page_table, __pte(entry));
        flush_tlb_page(vma, address);
        swap_duplicate(entry);  /* One for the process, one for the swap cache */
        add_to_swap_cache(page_map, entry);
        /* We checked we were unlocked way up above, and we
           have been careful not to stall until here */
        set_bit(PG_locked, &page_map->flags);

### M. Maletinsky:
This, I think, is the point you (Raghav) mentioned in your mail. Why write a
page to disk that may still be referenced by another process?
Wouldn't it make more sense to write it to disk only once, when the last reference is dropped?

        /* OK, do a physical asynchronous write to swap.  */
        rw_swap_page(WRITE, entry, (char *) page, 0);

        __free_page(page_map);
        return 1;
}
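
The dirty-bit concern raised in the comments on the swap-cache branch above can
be modelled in a few lines. This is a sketch only: db_pte, db_page and the db_*
functions are hypothetical stand-ins and not kernel code, and whether the real
2.2.18 code can actually reach this state is exactly the open question.

```c
#include <assert.h>
#include <stdbool.h>

/* Toy stand-ins; NOT the kernel's pte_t or struct page. */
struct db_pte  { bool present; bool dirty; };
struct db_page { bool in_swap_cache; bool dirty; };

/* Models the swap-cache branch as excerpted from 2.2.18: the pte is
 * overwritten with the swap entry and its dirty bit is never inspected. */
void db_drop_pte_no_check(struct db_pte *pte, struct db_page *page)
{
    assert(pte->present);     /* callers only get here for present ptes */
    (void)page;               /* page state is left untouched */
    pte->present = false;
    pte->dirty = false;       /* the hardware dirty bit is discarded */
}

/* Variant that copies the dirty bit to the page before dropping the pte. */
void db_drop_pte_keep_dirty(struct db_pte *pte, struct db_page *page)
{
    assert(pte->present);
    if (pte->dirty)
        page->dirty = true;
    pte->present = false;
    pte->dirty = false;
}

/* Returns true if the write is still recorded somewhere after the drop. */
bool db_write_remembered(bool keep_dirty)
{
    struct db_pte  pte  = { .present = true, .dirty = true };
    struct db_page page = { .in_swap_cache = true, .dirty = false };

    if (keep_dirty)
        db_drop_pte_keep_dirty(&pte, &page);
    else
        db_drop_pte_no_check(&pte, &page);
    return pte.dirty || page.dirty;
}
```

In this model db_write_remembered(false) returns false, which is the lost-write
scenario described above, while db_write_remembered(true) returns true.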

---------------------------------
2.4.2:

static void try_to_swap_out(struct mm_struct * mm, struct vm_area_struct* vma, unsigned long address, pte_t * page_table, struct page *page)
{
        pte_t pte;
        swp_entry_t entry;

        /* Don't look at this pte if it's been accessed recently. */
        if (ptep_test_and_clear_young(page_table)) {
                page->age += PAGE_AGE_ADV;
                if (page->age > PAGE_AGE_MAX)
                        page->age = PAGE_AGE_MAX;
                return;
        }

        if (TryLockPage(page))
                return;

        /* From this point on, the odds are that we're going to
         * nuke this pte, so read and clear the pte.  This hook
         * is needed on CPUs which update the accessed and dirty
         * bits in hardware.
         */
        pte = ptep_get_and_clear(page_table);
        flush_tlb_page(vma, address);

        /*
         * Is the page already in the swap cache? If so, then
         * we can just drop our reference to it without doing
         * any IO - it's already up-to-date on disk.
         */
        if (PageSwapCache(page)) {
                entry.val = page->index;

### M.Maletinsky:
This seems to fix the problem I mentioned above. However, does that mean that the code in 2.2.18 did not work correctly?

                if (pte_dirty(pte))
                        set_page_dirty(page);
set_swap_pte:
                swap_duplicate(entry);
                set_pte(page_table, swp_entry_to_pte(entry));
drop_pte:
                mm->rss--;
                if (!page->age)
                        deactivate_page(page);
                UnlockPage(page);
                page_cache_release(page);
                return;
        }

        /*
         * Is it a clean page? Then it must be recoverable
         * by just paging it in again, and we can just drop
         * it..
         *
         * However, this won't actually free any real
         * memory, as the page will just be in the page cache
         * somewhere, and as such we should just continue
         * our scan.
         *
         * Basically, this just makes it possible for us to do
         * some real work in the future in "refill_inactive()".
         */
        flush_cache_page(vma, address);
        if (!pte_dirty(pte))
                goto drop_pte;

        /*
         * Ok, it's really dirty. That means that
         * we should either create a new swap cache
         * entry for it, or we should write it back
         * to its own backing store.
         */
        if (page->mapping) {
                set_page_dirty(page);
                goto drop_pte;
        }

        /*
         * This is a dirty, swappable page.  First of all,
         * get a suitable swap entry for it, and make sure
         * we have the swap cache set up to associate the
         * page with that swap entry.
         */
        entry = get_swap_page();
        if (!entry.val)
                goto out_unlock_restore; /* No swap space left */

        /* Add it to the swap cache and mark it dirty */
        add_to_swap_cache(page, entry);
        set_page_dirty(page);
        goto set_swap_pte;

### M. Maletinsky:
From what I can see, the page is *NOT* written to disk here, contrary to what you (Raghav) write in your mail.

out_unlock_restore:
        set_pte(page_table, pte);
        UnlockPage(page);
        return;
}
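
The difference between the two excerpts (an immediate write in 2.2.18 versus
only marking the page dirty in 2.4.2) can be sketched by counting simulated
disk writes. This is a toy model under stated assumptions: wb_page and all
wb_* functions are hypothetical names, and wb_writeback() merely stands in for
whatever later stage writes dirty pages out. It is not taken from the kernel.

```c
#include <assert.h>
#include <stdbool.h>

/* Toy page state; NOT the kernel's struct page. */
struct wb_page {
    bool in_swap_cache;
    bool dirty;
    int  disk_writes;   /* number of simulated writes to swap */
};

/* Eager policy, modelled on the 2.2.18 excerpt: write to swap as soon
 * as the page first enters the swap cache. */
void wb_swap_out_eager(struct wb_page *p, bool pte_dirty)
{
    if (p->in_swap_cache || !pte_dirty)
        return;                 /* just drop the pte, no I/O */
    p->in_swap_cache = true;
    p->disk_writes++;           /* stands in for rw_swap_page(WRITE, ...) */
}

/* Deferred policy, modelled on the 2.4.2 excerpt: only mark the page dirty. */
void wb_swap_out_deferred(struct wb_page *p, bool pte_dirty)
{
    if (!pte_dirty)
        return;
    p->in_swap_cache = true;
    p->dirty = true;            /* stands in for set_page_dirty() */
}

/* Hypothetical later stage that writes the page once if it is still dirty. */
void wb_writeback(struct wb_page *p)
{
    if (p->dirty) {
        p->disk_writes++;
        p->dirty = false;
    }
}

/* Writes issued while `nprocs` sharers are unmapped, before any writeback. */
int wb_writes_during_unmap(int nprocs, bool deferred)
{
    struct wb_page p = { false, false, 0 };

    assert(nprocs > 0);
    for (int i = 0; i < nprocs; i++) {
        if (deferred)
            wb_swap_out_deferred(&p, true);
        else
            wb_swap_out_eager(&p, true);
    }
    return p.disk_writes;
}
```

With three sharers the eager model issues its single write on the first call,
while the deferred model issues none until wb_writeback() runs, at which point
it writes the page once.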

--
Supercomputing System AG          email: maletinsky@scs.ch
Martin Maletinsky                 phone: +41 (0)1 445 16 05
Technoparkstrasse 1               fax:   +41 (0)1 445 16 10
CH-8005 Zurich

--
Kernelnewbies: Help each other learn about the Linux kernel.
Archive:       http://mail.nl.linux.org/kernelnewbies/
FAQ:           http://kernelnewbies.org/faq/