On Mon, 14 May 2012, Andrew Morton wrote: > On Sat, 12 May 2012 04:59:56 -0700 (PDT) > Hugh Dickins <hughd@xxxxxxxxxx> wrote: > > > > We'd like to continue to support GMA500, so now add a new > > shmem_should_replace_page() check on the zone when about to move > > a page from swapcache to filecache (in swapin and swapoff cases), > > with shmem_replace_page() to allocate and substitute a suitable page > > (given gma500/gem.c's mapping_set_gfp_mask GFP_KERNEL | __GFP_DMA32). > > ... > > + gfp = mapping_gfp_mask(mapping); > > + if (shmem_should_replace_page(*pagep, gfp)) { > > + mutex_unlock(&shmem_swaplist_mutex); > > + error = shmem_replace_page(pagep, gfp, info, index); > > + mutex_lock(&shmem_swaplist_mutex); > > + /* > > + * We needed to drop mutex to make that restrictive page > > + * allocation; but the inode might already be freed by now, > > + * and we cannot refer to inode or mapping or info to check. > > + * However, we do hold page lock on the PageSwapCache page, > > + * so can check if that still has our reference remaining. > > + */ > > + if (!page_swapcount(*pagep)) > > + error = -ENOENT; > > This has my head spinning a bit. What is "our reference"? I'd expect > that to mean a temporary reference which was taken by this thread of > control. (I'm sure you'll prefer a reworking of that comment in an incremental fixes patch, but let me try to explain better here too.) No, I didn't mean a temporary reference taken by this (swapoff) thread, but the reference (swap entry) which has just been located in the inode's radix_tree, just before this hunk: which would be tracked by page_swapcount 1 (there's also a page swapcache bit in the swap_map along with the count, corresponding to the reference from the swapcache page itself, but that's not included in page_swapcount). > But such a thing has no relevance when trying to determine > the state of the page and/or data structures which refer to it. I don't understand you there, but maybe it won't matter. > > Also, what are we trying to determine here with this test? Whether the > page was removed from swapcache under our feet? Presumably not, as it > is locked. > > So perhaps you could spell out in more detail what we're trying to do > here, and what contributes to page_swapcount() here? The danger here is that the inode we're dealing with has gone through shmem_evict_inode() while we dropped shmem_swaplist_mutex: inode was certainly in use before, and shmem_swaplist_mutex (together with inode being on shmem_swaplist) holds it up from being evicted and freed; but once we drop the mutex, it could go away at any moment. We cannot determine that by looking at struct inode or struct address_space or struct shmem_inode_info, they're all part of what would be freed; but we cannot proceed to shmem_add_to_page_cache() once they're freed. How to tell whether it's been freed? Once upon a time I "solved" it with igrab() and iput(), but Konstantin demonstrated how that gives no safety against unmounting, and I remain reluctant to go back to relying upon filesystem semantics to solve this. It occurred to me that the inode cannot be freed until that radix_tree entry has been removed (by shmem_evict_inode's shmem_truncate_range), and the act or removing that entry (free_swap_and_cache) brings page_swapcount down from 1 to 0. You're thinking that the page cannot be removed from swapcache while we hold page lock: correct, but... free_swap_and_cache() only does a trylock_page(), and happily leaves the swapcache page to be garbage collected later if it cannot get the page lock. (And I certainly would not want to change it to wait for page lock.) So, the inode can get evicted while the page is still in swapcache: the page lock gives no protection against that, until the page itself gets into the radix_tree. I doubt that writing this essay into a comment there will be the right thing to do (and I may still be losing you); but I shall try to rewrite it, and if there's one missing fact that needs highlighting, it probably is that last, that free_swap_and_cache() only does a trylock, so our page lock does not protect the inode from eviction. (At this moment, I can't think what is the relevance of my comment "we do hold page lock on the PageSwapCache page": in other contexts it would be important, but here in swapoff we know that that swap cannot get reused, or not before we're done.) > > @@ -660,7 +679,14 @@ int shmem_unuse(swp_entry_t swap, struct > > struct list_head *this, *next; > > struct shmem_inode_info *info; > > int found = 0; > > - int error; > > + int error = 0; > > + > > + /* > > + * There's a faint possibility that swap page was replaced before > > + * caller locked it: it will come back later with the right page. > > So a caller locked the page then failed to check that it's still the > right sort of page? Shouldn't the caller locally clean up its own mess > rather than requiring a callee to know about the caller's intricate > shortcomings? The caller being try_to_unuse(). You're certainly not the first to argue that way. Perhaps I'm a bit perverse, in letting code which works even in the surprising cases, remain as it is without weeding out those surprising cases. And on this occasion didn't want to add an additional dependence on a slight subtle change in mm/swapfile.c functionality. Hmm, yes, I do still prefer to have the check here in shmem.c: particularly since it is this "shmem_replace_page" business which is increasing the likelihood of such a race, and making further demands on it (if we're going to make the copied page PageSwapCache, then we need to be sure that the page it's replacing was PageSwapCache - though that's something I need to think through again in the light of the race which I thought of in responding to Cong). > > + newpage = shmem_alloc_page(gfp, info, index); > > + if (!newpage) > > + return -ENOMEM; > > + VM_BUG_ON(shmem_should_replace_page(newpage, gfp)); > > + > > + *pagep = newpage; > > + page_cache_get(newpage); > > + copy_highpage(newpage, oldpage); > > copy_highpage() doesn't do flush_dcache_page() - did we need copy_user_highpage()? Ooh, I'm pretty sure you're right that we do need flush_dcache_page() there: good catch, thank you. We can't use copy_user_highpage() because in general we don't know any address and vma; but should be following the shmem_getpage_gfp() pattern of clear_highpage+flush_dcache_page+SetUptodate. > > shmem_replace_page() is a fairly generic and unexceptional sounding > thing. Methinks shmem_substitute_page() would be a better name. Okay, shmem_replace_page() seemed appropriate to me (especially thinking of it as "re-place"), but I don't mind changing to shmem_substitute_page(). The flush_dcache_page() addition is important, but until people are using GMA500 on ARM or something (I doubt that combination) with more than 4GB, this code is not coming into play - so I'm not breaking anyone's system if it sneaks into linux-next before I fix that. The main thing I need to think through quietly is the slippery swap race: I'll send you an incremental patch to fix all these up once I'm satisfied on that. Thanks, Hugh -- To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in the body of a message to majordomo@xxxxxxxxxxxxxxx More majordomo info at http://vger.kernel.org/majordomo-info.html