Hello Raghav,

Thanks for your reply.

> But I have my own doubts on swapping which I would like to get
> cleared. I am unable to get the reason why at all a shared page
> gets transferred to the disk as long as it is in use.
>
> Why won't the following steps work:
> (i) In case the page is shared and the 1st time try_to_swap_out()
> is called : the page is transferred to swap cache and
> __free_page() is called the page count is not zero. Then do not
> transfer the page to disk.
> (ii) When the last process that shared the page calls
> try_to_swap_out: the pagecount hits 0.

It would only drop to 1, since the swap cache also has a reference on that page(?)

> Then transfer the page to
> disk.
> This way for shared pages only one disk transfer (which is
> expensive) gets done for shared pages.

I was wondering about that as well. However, I could not find the problem in 2.4.2 (see below my commented try_to_swap_out() code from 2.2.18 and 2.4.2; I marked my comments with ### M. Maletinsky). See especially the comments at the end of each of the code excerpts.

with best regards
Martin Maletinsky

---------------------------------
2.2.18:

static int try_to_swap_out(struct task_struct * tsk, struct vm_area_struct* vma,
	unsigned long address, pte_t * page_table, int gfp_mask)
{
	pte_t pte;
	unsigned long entry;
	unsigned long page;
	struct page * page_map;

	pte = *page_table;
	if (!pte_present(pte))
		return 0;
	page = pte_page(pte);
	if (MAP_NR(page) >= max_mapnr)
		return 0;
	page_map = mem_map + MAP_NR(page);

	if (pte_young(pte)) {
		/*
		 * Transfer the "accessed" bit from the page
		 * tables to the global page map.
		 */
		set_pte(page_table, pte_mkold(pte));
		flush_tlb_page(vma, address);
		set_bit(PG_referenced, &page_map->flags);
		return 0;
	}

	if (PageReserved(page_map)
	    || PageLocked(page_map)
	    || ((gfp_mask & __GFP_DMA) && !PageDMA(page_map)))
		return 0;

	/*
	 * Is the page already in the swap cache?
	 * If so, then
	 * we can just drop our reference to it without doing
	 * any IO - it's already up-to-date on disk.

### M. Maletinsky: Why is that? The page may have been dirtied by the
### process from which it is currently being unmapped (i.e. tsk). In this
### case the in-memory image differs from the on-disk image, while the page
### descriptor does not have its PG_dirty bit set. Moreover, *pte (whose
### dirty bit was set by the MMU when the process wrote to the page) is
### discarded by the subsequent lines of code, with the result that the
### information that the page was written to is lost.

	 *
	 * Return 0, as we didn't actually free any real
	 * memory, and we should just continue our scan.
	 */
	if (PageSwapCache(page_map)) {
		entry = page_map->offset;
		swap_duplicate(entry);
		set_pte(page_table, __pte(entry));
drop_pte:
		vma->vm_mm->rss--;
		flush_tlb_page(vma, address);

### M. Maletinsky: This is the latest point where I would expect the page
### to be marked dirty (in the 2.4.2 code the page is marked dirty at more
### or less this point - see below).

		__free_page(page_map);
		return 0;
	}

	/*
	 * Is it a clean page? Then it must be recoverable
	 * by just paging it in again, and we can just drop
	 * it..
	 *
	 * However, this won't actually free any real
	 * memory, as the page will just be in the page cache
	 * somewhere, and as such we should just continue
	 * our scan.
	 *
	 * Basically, this just makes it possible for us to do
	 * some real work in the future in "shrink_mmap()".
	 */
	if (!pte_dirty(pte)) {
		flush_cache_page(vma, address);
		pte_clear(page_table);
		goto drop_pte;
	}

	/*
	 * Don't go down into the swap-out stuff if
	 * we cannot do I/O! Avoid recursing on FS
	 * locks etc.
	 */
	if (!(gfp_mask & __GFP_IO))
		return 0;

	/*
	 * Ok, it's really dirty. That means that
	 * we should either create a new swap cache
	 * entry for it, or we should write it back
	 * to its own backing store.
	 *
	 * Note that in neither case do we actually
	 * know that we make a page available, but
	 * as we potentially sleep we can no longer
	 * continue scanning, so we might as well
	 * assume we free'd something.
	 *
	 * NOTE NOTE NOTE! This should just set a
	 * dirty bit in page_map, and just drop the
	 * pte. All the hard work would be done by
	 * shrink_mmap().
	 *
	 * That would get rid of a lot of problems.
	 */
	flush_cache_page(vma, address);
	if (vma->vm_ops && vma->vm_ops->swapout) {
		pid_t pid = tsk->pid;
		pte_clear(page_table);
		flush_tlb_page(vma, address);
		vma->vm_mm->rss--;

		if (vma->vm_ops->swapout(vma, page_map))
			kill_proc(pid, SIGBUS, 1);
		__free_page(page_map);
		return 1;
	}

	/*
	 * This is a dirty, swappable page. First of all,
	 * get a suitable swap entry for it, and make sure
	 * we have the swap cache set up to associate the
	 * page with that swap entry.
	 */
	entry = get_swap_page();
	if (!entry)
		return 0; /* No swap space left */

	vma->vm_mm->rss--;
	tsk->nswap++;
	set_pte(page_table, __pte(entry));
	flush_tlb_page(vma, address);
	swap_duplicate(entry);	/* One for the process, one for the swap cache */
	add_to_swap_cache(page_map, entry);
	/* We checked we were unlocked way up above, and we
	   have been careful not to stall until here */
	set_bit(PG_locked, &page_map->flags);

### M. Maletinsky: This, I think, is the point you (Raghav) mentioned in
### your mail. Why write a page to disk when it may still be referenced by
### another process? Wouldn't it make more sense to write it to disk only
### once, when the last reference is dropped?

	/* OK, do a physical asynchronous write to swap. */
	rw_swap_page(WRITE, entry, (char *) page, 0);

	__free_page(page_map);
	return 1;
}

---------------------------------
2.4.2:

static void try_to_swap_out(struct mm_struct * mm, struct vm_area_struct* vma,
	unsigned long address, pte_t * page_table, struct page *page)
{
	pte_t pte;
	swp_entry_t entry;

	/* Don't look at this pte if it's been accessed recently.
	 */
	if (ptep_test_and_clear_young(page_table)) {
		page->age += PAGE_AGE_ADV;
		if (page->age > PAGE_AGE_MAX)
			page->age = PAGE_AGE_MAX;
		return;
	}

	if (TryLockPage(page))
		return;

	/* From this point on, the odds are that we're going to
	 * nuke this pte, so read and clear the pte. This hook
	 * is needed on CPUs which update the accessed and dirty
	 * bits in hardware.
	 */
	pte = ptep_get_and_clear(page_table);
	flush_tlb_page(vma, address);

	/*
	 * Is the page already in the swap cache? If so, then
	 * we can just drop our reference to it without doing
	 * any IO - it's already up-to-date on disk.
	 */
	if (PageSwapCache(page)) {
		entry.val = page->index;

### M. Maletinsky: This seems to fix the problem I mentioned above.
### However, does that mean, the code in 2.2.18 did not work correctly?

		if (pte_dirty(pte))
			set_page_dirty(page);
set_swap_pte:
		swap_duplicate(entry);
		set_pte(page_table, swp_entry_to_pte(entry));
drop_pte:
		mm->rss--;
		if (!page->age)
			deactivate_page(page);
		UnlockPage(page);
		page_cache_release(page);
		return;
	}

	/*
	 * Is it a clean page? Then it must be recoverable
	 * by just paging it in again, and we can just drop
	 * it..
	 *
	 * However, this won't actually free any real
	 * memory, as the page will just be in the page cache
	 * somewhere, and as such we should just continue
	 * our scan.
	 *
	 * Basically, this just makes it possible for us to do
	 * some real work in the future in "refill_inactive()".
	 */
	flush_cache_page(vma, address);
	if (!pte_dirty(pte))
		goto drop_pte;

	/*
	 * Ok, it's really dirty. That means that
	 * we should either create a new swap cache
	 * entry for it, or we should write it back
	 * to its own backing store.
	 */
	if (page->mapping) {
		set_page_dirty(page);
		goto drop_pte;
	}

	/*
	 * This is a dirty, swappable page. First of all,
	 * get a suitable swap entry for it, and make sure
	 * we have the swap cache set up to associate the
	 * page with that swap entry.
	 */
	entry = get_swap_page();
	if (!entry.val)
		goto out_unlock_restore; /* No swap space left */

	/* Add it to the swap cache and mark it dirty */
	add_to_swap_cache(page, entry);
	set_page_dirty(page);
	goto set_swap_pte;

### M. Maletinsky: From what I can see, the page is *NOT* written to disk,
### in contradiction to what you (Raghav) write in your mail.

out_unlock_restore:
	set_pte(page_table, pte);
	UnlockPage(page);
	return;
}

--
Supercomputing System AG          email: maletinsky@scs.ch
Martin Maletinsky                 phone: +41 (0)1 445 16 05
Technoparkstrasse 1               fax:   +41 (0)1 445 16 10
CH-8005 Zurich

--
Kernelnewbies: Help each other learn about the Linux kernel.
Archive:       http://mail.nl.linux.org/kernelnewbies/
FAQ:           http://kernelnewbies.org/faq/
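
To make the reference counting in the 2.4.2 excerpt a bit more concrete, here is a toy user-space model in C. It is only a sketch under stated assumptions: struct toy_page, toy_unmap() and toy_launder() are made-up names that do not exist in the kernel, and toy_launder() in particular is an assumption standing in for whatever later pass writes dirty swap-cache pages out (the excerpt itself does not show that pass, and the real kernel may well write earlier). toy_unmap() follows the order of checks in try_to_swap_out() from 2.4.2. With a page shared by two processes, the model performs a single write, after the second unmap, when only the swap cache still holds a reference (count is 1, not 0).

```c
#include <assert.h>
#include <stdbool.h>

/* Toy stand-in for struct page; every name in this block is hypothetical. */
struct toy_page {
	int  count;          /* one reference per mapping, plus one for the swap cache */
	bool in_swap_cache;  /* models PageSwapCache(page) */
	bool dirty;          /* page-level dirty flag, models set_page_dirty(page) */
	int  disk_writes;    /* number of simulated writes to swap */
};

/*
 * One process drops its mapping. pte_dirty says whether that process
 * wrote to the page through its own pte.
 */
static void toy_unmap(struct toy_page *p, bool pte_dirty)
{
	assert(p->count > 0);

	if (p->in_swap_cache) {
		/* Carry the pte dirty bit over to the page (absent in 2.2.18). */
		if (pte_dirty)
			p->dirty = true;
	} else if (pte_dirty) {
		/* Models add_to_swap_cache(): the cache takes its own reference. */
		p->in_swap_cache = true;
		p->count++;
		p->dirty = true;
	}

	/* Models page_cache_release() for this mapping. */
	p->count--;
}

/*
 * Assumed cleaning pass (not taken from the kernel source): write the
 * page only once the swap cache is the last remaining user.
 */
static void toy_launder(struct toy_page *p)
{
	if (p->in_swap_cache && p->dirty && p->count == 1) {
		p->disk_writes++;
		p->dirty = false;
	}
}
```

Starting from count = 2 with the other fields zeroed, a first round of toy_unmap(&p, true) followed by toy_launder(&p) leaves disk_writes untouched, because the second process still maps the page; only the second round triggers the one write.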