On Wed, May 13, 2009 at 06:28:47PM -0400, Theodore Tso wrote: > On Wed, May 13, 2009 at 01:56:56PM -0500, Eric Sandeen wrote: > > The comment: > > > > /* > > + * The above get_blocks can cause the buffer to be > > + * marked unwritten. So clear the same. > > + */ > > + clear_buffer_unwritten(bh); > > > > is imho not helpful; to me it says "we -just- set this, so clear it!" > > It leaves me scratching my head. > > I've updated it the comment to say this instead. > > /* > * When we call get_blocks without the create flag, the > * BH_Unwritten flag could have gotten set if the blocks > * requested were part of a uninitialized extent. We need to > * clear this flag now that we are committed to convert all or > * part of the uninitialized extent to be an initialized > * extent. This is because we need to avoid the combination > * of BH_Unwritten and BH_Mapped flags being simultaneously > * set on the buffer_head. > */ The last line in the above comment is not a problem with the latest kernel with all the patches in the patch queue. The patch that does that is "ext4: Mark the unwritten buffer_head as mapped during write_begin" The unwritten and mapped state together was a problem with the code path we had before in ext4_da_get_block_prep we had: ret = ext4_get_blocks(NULL, inode, iblock, 1, bh_result, 0); ..... .. } else if (ret > 0) { bh_result->b_size = (ret << inode->i_blkbits); if (buffer_unwritten(bh_result)) { bh_result.b_blocknr = ~0; .... } } The above can result in 1) We do a multi block delayed alloc to prealloc space. That would get us multiple buffer_heads marked with BH_Unwritten. (say 10, 11, 12) 2) pdflush attempt to write some pages (say mapping block 10) which cause a get_block call with create = 1. That would attempt to convert uninitialized extent to initialized one. This can cause multiple blocks to be marked initialized. ( say 10, 11 , 12) 3) We do an overwrite of block 11. That would mean we call ext4_da_get_block_prep, which would again do a get_block for block 11 with create = 0. But remember we already have buffer_head marked with BH_Unwritten flag. But the buffer was unmapped because it is unwritten ( We are fixing this mess in the patch for 2.6.31). 4) The get_block call will find the buffer mapped due to step b. And mark the buffer_head mapped. There we go . We end up with buffer_head mapped and unwritten 5) later in ext4_da_get_block_prep we check whether the buffer_head in marked BH_Unwritten if so we set the block number to ~0. This is introduced by [PATCH -V4 1/2] Fix sub-block zeroing for buffered writes into unwritten extents 6) So now we have a buffer_head that is mapped, unwritten, with b_blocknr = ~0. That would result in the I/O error. Now that we have the actual block number in bh_result.b_blocknr I guess we should be ok to have the unwritten flag set. But then i guess it is against the VFS assumption of buffer_head state. The state unwritten indicate the blocks are not yet written. So if we don't do a clear_buffer_unwritten as in the patch we end up with a block that is written ( done via create = 1 get_block call) and still marked unwritten. (see the 6 step example above) Now to explain what "ext4: Mark the unwritten buffer_head as mapped during write_begin" does. It marks the buffer_head as mapped in the write_begin phase. And that helps in making sure we don't end up calling get_block multiple times. So with delayed allocation we now have between the write_begin and writepage phase a buffer_head which is mapped and unwritten for blocks in the fallocate space. Once we do writepage for the block we will have buffer_head which is mapped and unwritten flag cleared. The unwritten get cleared when we do a get_block call with create = 1. To achieve the above we need to make sure writepage code path looks at the unwritten flag and does a get_blocks call with create = 1. With mainline we have in writepage code path mpage_da_map_blocks(...) if ((mpd->b_state & (1 << BH_Mapped)) && !(mpd->b_state & (1 << BH_Delay))) return 0; ... .. ext4_da_get_block_write() If we start marking buffer_head mapped for fallocate blocks in the write_begin phase, then the above code will return without doing any get_block(create = 1) call. That means we don't convert the uninitialized extent to initialized one. So along with marking buffer_head mapped and unwritten in the write_begin phase we also need changes to writepage code path which does something like below mpage_da_map_blocks(...) if ((mpd->b_state & (1 << BH_Mapped)) && !(mpd->b_state & (1 << BH_Delay)) && !(mpd->b_state & (1 << BH_Unwritten))) return 0; ... .. ext4_da_get_block_write() this is done in patch "ext4: Mark the unwritten buffer_head as mapped during write_begin". -aneesh -- To unsubscribe from this list: send the line "unsubscribe linux-ext4" in the body of a message to majordomo@xxxxxxxxxxxxxxx More majordomo info at http://vger.kernel.org/majordomo-info.html