On Tue, Apr 22, 2014 at 8:08 PM, Hugh Dickins <hughd@xxxxxxxxxx> wrote: > > At first I thought you were right, and set_page_dirty_lock() needed; > but now I think not. We only(?) need set_page_dirty_lock() when > there's a danger that the struct address_space itself might be > evicted beneath us. > > But here (even without relying on Al's delayed_fput) the fput() > is done by remove_vma() called _after_ the tlb_finish_mmu(). > So I see no danger of struct address_space being freed: Yeah, that's what the comments say. I'm a bit worried about "page->mapping" races, though. So I don't think the address_space gets freed, but I worry about truncate NULL'ing out page->mapping under us. There, the page being either mapped in a page table (with the page table lock held), or holding the page lock should protect us against concurrent truncate doing that. So I'm still a bit worried. > page->mapping might be truncated to NULL at any moment without page > lock and without mapping->tree_lock (and without page table lock > helping to serialize page_mapped against unmap_mapping_range); but > __set_page_dirty_nobuffers() (admittedly not the only set_page_dirty) > is careful about that, and it's just not the problem Dave is seeing. Yeah, I guess you're right. And no, page->mapping becoming NULL doesn't actually explain Dave's issue anyway. > However... Virginia and I get rather giddy when it comes to the > clear_page_dirty_for_io() page_mkclean() dance: we trust you and > Peter and Jan on that, and page lock there has some part to play. > > My suspicion, not backed up at all, is that by leaving the > set_page_dirty() until after the page has been unmapped (and page > table lock dropped), that dance has been disturbed in such a way > that ext4 can be tricked into freeing page buffers before it will > need them again for a final dirty writeout. So I don't think it's new, but I agree that it opens a *much* wider window where the dirty bit is not visible in either the page tables or (yet) in the page dirty bit. The same does happen in try_to_unmap_one() and in clear_page_dirty_for_io(), but in both of those cases we hold the page lock for the entire duration of the sequence, so this whole "page is not visibly dirty and page lock is not held" is new. And yes, I could imagine that that makes some code go "Ahh, we don't need buffers for that page any more then". In short, I think your suspicion is on the right track. I will have to think about this. But I'm starting to consider this whole thing to be a 3.16 issue by now. It wasn't as simple as it looked, and while our old location of set_page_dirty() is clearly wrong, and DaveH even got a test-case for it (which I initially doubted would even be possible), I still seriously doubt that anybody sane who cares about data consistency will do concurrent unmaps (or MADV_DONTNEED) while another writer is actively using that mapping. Pretty much by definition, if you care about data consistency, you'd never do insane things like that. You'd make damn sure that every writer is all done before you unmap the area they are writing to. So this is a "oops, we're clearly doing something wrong by marking the page dirty too early early, but anybody who really hits it has it coming to them" kind of situation. We want to fix it, but it doesn't seem to be a high priority. Linus -- To unsubscribe from this list: send the line "unsubscribe linux-arch" in the body of a message to majordomo@xxxxxxxxxxxxxxx More majordomo info at http://vger.kernel.org/majordomo-info.html