On Wed, Jan 13, 2016 at 10:35:25AM +0100, Jan Kara wrote: > On Wed 13-01-16 00:30:19, Ross Zwisler wrote: > > > And secondly: You must write-protect all mappings of the flushed range so > > > that you get fault when the sector gets written-to again. We spoke about > > > this in the past already but somehow it got lost and I forgot about it as > > > well. You need something like rmap_walk_file()... > > > > The code that write protected mappings and then cleaned the radix tree entries > > did get written, and was part of v2: > > > > https://lkml.org/lkml/2015/11/13/759 > > > > I removed all the code that cleaned PTE entries and radix tree entries for v3. > > The reason behind this was that there was a race that I couldn't figure out > > how to solve between the cleaning of the PTEs and the cleaning of the radix > > tree entries. > > > > The race goes like this: > > > > Thread 1 (write) Thread 2 (fsync) > > ================ ================ > > wp_pfn_shared() > > pfn_mkwrite() > > dax_radix_entry() > > radix_tree_tag_set(DIRTY) > > dax_writeback_mapping_range() > > dax_writeback_one() > > radix_tag_clear(DIRTY) > > pgoff_mkclean() > > ... return up to wp_pfn_shared() > > wp_page_reuse() > > pte_mkdirty() > > > > After this sequence we end up with a dirty PTE that is writeable, but with a > > clean radix tree entry. This means that users can write to the page, but that > > a follow-up fsync or msync won't flush this dirty data to media. > > > > The overall issue is that in the write path that goes through wp_pfn_shared(), > > the DAX code has control over when the radix tree entry is dirtied but not > > when the PTE is made dirty and writeable. This happens up in wp_page_reuse(). > > This means that we can't easily add locking, etc. to protect ourselves. > > > > I spoke a bit about this with Dave Chinner and with Dave Hansen, but no really > > easy solutions presented themselves in the absence of a page lock. I do have > > one idea, but I think it's pretty invasive and will need to wait for another > > kernel cycle. > > > > The current code that leaves the radix tree entry will give us correct > > behavior - it'll just be less efficient because we will have an ever-growing > > dirty set to flush. > > Ahaa! Somehow I imagined tag_pages_for_writeback() clears DIRTY radix tree > tags but it does not (I should have known, I have written that functions > few years ago ;). Makes sense. Thanks for clarification. > > > > > @@ -791,15 +976,12 @@ EXPORT_SYMBOL_GPL(dax_pmd_fault); > > > > * dax_pfn_mkwrite - handle first write to DAX page > > > > * @vma: The virtual memory area where the fault occurred > > > > * @vmf: The description of the fault > > > > - * > > > > */ > > > > int dax_pfn_mkwrite(struct vm_area_struct *vma, struct vm_fault *vmf) > > > > { > > > > - struct super_block *sb = file_inode(vma->vm_file)->i_sb; > > > > + struct file *file = vma->vm_file; > > > > > > > > - sb_start_pagefault(sb); > > > > - file_update_time(vma->vm_file); > > > > - sb_end_pagefault(sb); > > > > + dax_radix_entry(file->f_mapping, vmf->pgoff, NO_SECTOR, false, true); > > > > > > Why is NO_SECTOR argument correct here? > > > > Right - so NO_SECTOR means "I expect there to already be an entry in the radix > > tree - just make that entry dirty". This works because pfn_mkwrite() always > > follows a normal __dax_fault() or __dax_pmd_fault() call. These fault calls > > will insert the radix tree entry, regardless of whether the fault was for a > > read or a write. If the fault was for a write, the radix tree entry will also > > be made dirty. > > > > For reads the radix tree entry will be inserted but left clean. When the > > first write happens we will get a pfn_mkwrite() call, which will call > > dax_radix_entry() with the NO_SECTOR argument. This will look up the radix > > tree entry & set the dirty tag. > > So the explanation of this should be somewhere so that everyone knows that > we must have radix tree entries even for clean mapped blocks. Because upto > know that was not clear to me. Also __dax_pmd_fault() seems to insert > entries only for write fault so the assumption doesn't seem to hold there? Ah, right, sorry, the read fault() -> pfn_mkwrite() sequence only happens for 4k pages. You are right about our handling of 2MiB pages - for a read followed by a write we will just call into the normal __dax_pmd_fault() code again, which will do the get_block() call and insert a dirty radix tree entry. Because we have to go all the way through the fault handler again at write time there isn't a benefit to inserting a clean radix tree entry on read, so we just skip it. > I'm somewhat uneasy that a bug in this logic can be hidden as a simple race > with hole punching. But I guess I can live with that. > > Honza > -- > Jan Kara <jack@xxxxxxxx> > SUSE Labs, CR -- To unsubscribe from this list: send the line "unsubscribe linux-ext4" in the body of a message to majordomo@xxxxxxxxxxxxxxx More majordomo info at http://vger.kernel.org/majordomo-info.html