On Thu 19-02-15 08:55:23, Dave Chinner wrote: > On Wed, Feb 18, 2015 at 11:40:09AM +0100, Jan Kara wrote: > > On Tue 17-02-15 08:37:45, Matthew Wilcox wrote: > > > On Tue, Feb 17, 2015 at 09:52:00AM +0100, Jan Kara wrote: > > > > > > This got added to fix a problem that Dave Chinner pointed out. We need > > > > > > the allocated extent to either be zeroed (as ext2 does), or marked as > > > > > > unwritten (ext4, XFS) so that a racing read/page fault doesn't return > > > > > > uninitialized data. If it's marked as unwritten, we need to convert it > > > > > > to a written extent after we've initialised the contents. We use the > > > > > > b_end_io() callback to do this, and it's called from the DAX code, not in > > > > > > softirq context. > > > > > OK, I see. But I didn't find where ->b_end_io gets called from dax code > > > > > (specifically I don't see it anywhere in dax_do_IO() or dax_io()). Can you > > > > > point me please? > > > > > > For faults, we call it in dax_insert_mapping(), the very last thing > > > before returning in the fault path. The normal I/O path gets to use > > > the dio_iodone_t for the same purpose. > > I see. I didn't think of races with reads (hum, I actually wonder whether > > we don't have this data exposure problem for ext4 for mmapped write into > > a hole vs direct read as well). So I guess we do need those unwritten > > extent dances after all (or we would need to have a page covering hole when > > writing to it via mmap but I guess unwritten extent dances are somewhat > > more standard). > > Right, that was the reason for doing it that way - it leveraged all > the existing methods we have for avoiding data exposure races in > XFS. but it's also not just for races - it's for ensuring that if we > crash between the allocation and the write to the persistent store > we don't expose the underlying contents when the system next comes > up. Well, ext3/4 handles the crash situation differently - we make sure we flush data to allocated blocks before committing a transaction that allocates them. That works perfectly for crashes but doesn't avoid the race with DIO. > > > > > Also abusing b_end_io of a phony buffer for that looks ugly to me (we are > > > > > trying to get away from passing phony bh around and this would entangle us > > > > > even more into that mess). Normally I would think that end_io() callback > > > > > passed into dax_do_io() should perform necessary conversions and for > > > > > dax_fault() we could do necessary conversions inside foofs_page_mkwrite()... > > > > > > Dave sees to be the one trying the hardest to get rid of the phony BHs > > > ... and it was his idea to (ab)use b_end_io for this. The problem with > > > doing the conversion in ext4_page_mkwrite() is that we don't know at > > > that point whether the BH is unwritten or not. > > I see, thanks for explanation. So it would be enough to pass a bit of > > information (unwritten / written) to a caller of do_dax_fault() and that > > can then call conversion function. IMO doing that (either in a return value > > or in a separate argument of do_dax_fault()) would be cleaner and more > > readable than playing with b_private and b_end_io... Thoughts? > > I'm happy for a better mechanism to be thought up. The current one > was expedient, but not pretty. The need for the end_io function was > because unwritten conversion needed to happen after the page was > zeroed. If we change dax_fault() to also take a "end_io" function > callback (already takes a get_blocks callback), then we can avoid > the need to use the phony bh for this purpose. i.e filesystems that > allocate unwritten extents can pass a completion function Yeah, that's probably even better. > And speaking of phony BHs, the pnfs block layout changes introduce > an struct iomap and a "map_blocks" method to the export_ops in > exportfs.h. This is the model what we should be using instead of > phony BHs for block mapping/allocation operations... Yup, that'd be nice. Honza -- Jan Kara <jack@xxxxxxx> SUSE Labs, CR -- To unsubscribe from this list: send the line "unsubscribe linux-ext4" in the body of a message to majordomo@xxxxxxxxxxxxxxx More majordomo info at http://vger.kernel.org/majordomo-info.html