On Wed, Feb 18, 2015 at 11:40:09AM +0100, Jan Kara wrote: > On Tue 17-02-15 08:37:45, Matthew Wilcox wrote: > > On Tue, Feb 17, 2015 at 09:52:00AM +0100, Jan Kara wrote: > > > > > This got added to fix a problem that Dave Chinner pointed out. We need > > > > > the allocated extent to either be zeroed (as ext2 does), or marked as > > > > > unwritten (ext4, XFS) so that a racing read/page fault doesn't return > > > > > uninitialized data. If it's marked as unwritten, we need to convert it > > > > > to a written extent after we've initialised the contents. We use the > > > > > b_end_io() callback to do this, and it's called from the DAX code, not in > > > > > softirq context. > > > > OK, I see. But I didn't find where ->b_end_io gets called from dax code > > > > (specifically I don't see it anywhere in dax_do_IO() or dax_io()). Can you > > > > point me please? > > > > For faults, we call it in dax_insert_mapping(), the very last thing > > before returning in the fault path. The normal I/O path gets to use > > the dio_iodone_t for the same purpose. > I see. I didn't think of races with reads (hum, I actually wonder whether > we don't have this data exposure problem for ext4 for mmapped write into > a hole vs direct read as well). So I guess we do need those unwritten > extent dances after all (or we would need to have a page covering hole when > writing to it via mmap but I guess unwritten extent dances are somewhat > more standard). Right, that was the reason for doing it that way - it leveraged all the existing methods we have for avoiding data exposure races in XFS. but it's also not just for races - it's for ensuring that if we crash between the allocation and the write to the persistent store we don't expose the underlying contents when the system next comes up. These problems were found long ago on XFS and that's one of the reasons why all direct IO block allocation uses unwritten extents - until the data is on disk there is a window for stale data exposure and things like crashing systems running high throughput IO are very good at exposing such small windows of exposure. Hence I thought it best to have DAX close that hole for everyone. > So in that case you need ext4_get_block_write() in the above call to > do_dax_fault() (note that we still do need ext4 version of the fault > function like above due to lock ranking and retry on ENOSPC). And please > comment about the read races at that call site so that we have that > subtlety documented somewhere. Actually, I thought it was obvious that ;) > > > > > Also abusing b_end_io of a phony buffer for that looks ugly to me (we are > > > > trying to get away from passing phony bh around and this would entangle us > > > > even more into that mess). Normally I would think that end_io() callback > > > > passed into dax_do_io() should perform necessary conversions and for > > > > dax_fault() we could do necessary conversions inside foofs_page_mkwrite()... > > > > Dave sees to be the one trying the hardest to get rid of the phony BHs > > ... and it was his idea to (ab)use b_end_io for this. The problem with > > doing the conversion in ext4_page_mkwrite() is that we don't know at > > that point whether the BH is unwritten or not. > I see, thanks for explanation. So it would be enough to pass a bit of > information (unwritten / written) to a caller of do_dax_fault() and that > can then call conversion function. IMO doing that (either in a return value > or in a separate argument of do_dax_fault()) would be cleaner and more > readable than playing with b_private and b_end_io... Thoughts? I'm happy for a better mechanism to be thought up. The current one was expedient, but not pretty. The need for the end_io function was because unwritten conversion needed to happen after the page was zeroed. If we change dax_fault() to also take a "end_io" function callback (already takes a get_blocks callback), then we can avoid the need to use the phony bh for this purpose. i.e filesystems that allocate unwritten extents can pass a completion function I've got to update the XFS DAX patches I have for the next merge cycle, so I'll take another at it when I page that information back into my brain. And speaking of phony BHs, the pnfs block layout changes introduce an struct iomap and a "map_blocks" method to the export_ops in exportfs.h. This is the model what we should be using instead of phony BHs for block mapping/allocation operations... Cheers, Dave. -- Dave Chinner david@xxxxxxxxxxxxx -- To unsubscribe from this list: send the line "unsubscribe linux-ext4" in the body of a message to majordomo@xxxxxxxxxxxxxxx More majordomo info at http://vger.kernel.org/majordomo-info.html