On Thu, Oct 31, 2019 at 05:54:16PM +0100, Jan Kara wrote:
> On Thu 31-10-19 20:16:41, Matthew Bobrowski wrote:
> > On Wed, Oct 30, 2019 at 12:39:18PM +0100, Jan Kara wrote:
> > > On Wed 30-10-19 12:26:52, Jan Kara wrote:
> > > Hum, actually no. This write from fsx output:
> > >
> > > 24( 24 mod 256): WRITE 0x23000 thru 0x285ff (0x5600 bytes)
> > >
> > > should have allocated blocks to where the failed write was going (0x24000).
> > > But still I'd expect some interaction between how buffered writes to holes
> > > interact with following direct IO writes... One of the subtle differences
> > > we have introduced with iomap conversion is that the old code in
> > > __generic_file_write_iter() did fsync & invalidate written range after
> > > buffered write fallback and we don't seem to do that now (probably should
> > > be fixed regardless of relation to this bug).
> >
> > After performing some debugging this afternoon, I quickly realised
> > that the fix for this is rather trivial. Within the previous direct
> > I/O implementation, we passed EXT4_GET_BLOCKS_CREATE to
> > ext4_map_blocks() for any writes to inodes without extents. I seem to
> > have missed that here and consequently block allocation for a write
> > wasn't performing correctly in such cases.
>
> No, this is not correct. For inodes without extents we used
> ext4_dio_get_block() and we pass DIO_SKIP_HOLES to __blockdev_direct_IO().
> Now DIO_SKIP_HOLES means that if starting block is within i_size, we pass
> 'create == 0' to get_blocks() function and thus ext4_dio_get_block() uses
> '0' argument to ext4_map_blocks() similarly to what you do.

Ah right, I missed that part. :(

> And indeed for inodes without extents we must fallback to buffered IO for
> filling holes inside a file to avoid stale data exposure (racing DIO read
> could read block contents before data is written to it if we used
> EXT4_GET_BLOCKS_CREATE).

Well in this case I'm pretty sure I know exactly where the problem
resides. I seem to be falling back to buffered I/O from
ext4_dio_write_iter() without actually taking into account any of the
data that may have partially been written by the direct I/O. So, when
returning the bytes written back to userspace it's whatever actually
is returned by ext4_buffered_write_iter(), which may not necessarily
be the amount of bytes that were expected, so it should rather be
ext4_dio_write_iter() + ext4_buffered_write_iter()...

> > Also, I agree, the fsync + page cache invalidation bits need to be
> > implemented. I'm just thinking to branch out within
> > ext4_buffered_write_iter() and implement those bits there i.e.
> >
> >     ...
> >     ret = generic_perform_write();
> >
> >     if (ret > 0 && iocb->ki_flags & IOCB_DIRECT) {
> >         err = filemap_write_and_wait_range();
> >
> >         if (!err)
> >             invalidate_mapping_pages();
> >     ...
> >
> > AFAICT, this would be the most appropriate place to put it? Or, did
> > you have something else in mind?
>
> Yes, either this, or maybe in ext4_dio_write_iter() after returning from
> ext4_buffered_write_iter() would be even more logical.

Yes, let's stick with doing it within ext4_dio_write_iter().

--<M>--
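
To make the byte accounting described above concrete, here is a rough
user-space sketch. It is not kernel code and not the eventual patch: the
model_* functions are hypothetical stand-ins, and it simply assumes that the
direct I/O part stops short at the first hole while the buffered fallback
completes whatever is left.

```c
#include <assert.h>
#include <stddef.h>
#include <sys/types.h>	/* ssize_t */

/*
 * Hypothetical stand-in for the direct I/O path: it writes only the
 * bytes that precede the first hole and returns a short count.
 */
static ssize_t model_dio_write(size_t len, size_t bytes_before_hole)
{
	return (ssize_t)(len < bytes_before_hole ? len : bytes_before_hole);
}

/*
 * Hypothetical stand-in for the buffered fallback: in this model it
 * always writes everything it is asked to write.
 */
static ssize_t model_buffered_write(size_t len)
{
	return (ssize_t)len;
}

/*
 * Models the suspected problem: after a short direct write, only the
 * count from the buffered fallback is reported to the caller.
 */
static ssize_t model_write_buggy(size_t len, size_t bytes_before_hole)
{
	ssize_t dio = model_dio_write(len, bytes_before_hole);

	if ((size_t)dio == len)
		return dio;

	return model_buffered_write(len - (size_t)dio);
}

/*
 * Models the intended accounting: the caller sees the bytes written by
 * the direct path plus the bytes written by the buffered fallback.
 */
static ssize_t model_write_fixed(size_t len, size_t bytes_before_hole)
{
	ssize_t dio = model_dio_write(len, bytes_before_hole);
	ssize_t buffered;

	assert(dio >= 0 && (size_t)dio <= len);

	if ((size_t)dio == len)
		return dio;

	buffered = model_buffered_write(len - (size_t)dio);
	return dio + buffered;
}
```

With a 0x5600 byte request where the direct path manages 0x1000 bytes before
reaching a hole, model_write_buggy() reports 0x4600 while model_write_fixed()
reports the full 0x5600, which is the difference between the two return
values discussed above.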