Hi Ted, I've been looking into the bug Amir Goldstein found (way back in August) as well as the patch you created for it. After examining both the patch and some the code for delayed allocation and fallocate, I find that I don't understand what is actually going on to cause the bug. I'll outline what I've found so far below. Maybe you can help me fill in some of the gaps or point me in the right direction if I've made a mistake. Delayed allocation: Delayed allocation is enabled if the `da` set of aops is used in the ext4 code (set sometime close to file system mount time). When a delayed allocation write occurs, a function higher up the storage stack calls into ext4_da_write_begin. This function will go through the rb tree representing the extents for a file and either 1. find blocks already allocated that can be used for the write or 2. reserve space in the extent tree. The bh returned from this function informs later code about whether preexisting blocks were found or if space was reserved. ext4_da_write_begin also appears to create a journal handle (with a handle on the stack), but never closes the handle unless an error occurs. After ext4_da_write_begin, control returns to some functions higher in the storage stack before going into ext4_da_write_end. This function conditionally updates the i_disksize value of the file if non-delay allocated extents were used and the file size was extended. It also closes the journal handle opened in ext4_da_begin write. Furthermore, it calls generic_write_end. generic_write_end updates the i_size value if needed, and also marks the inode as dirty. By marking the inode as dirty, a bdi_writeback is scheduled later that will call ext4_writepages. ext4_writepages resolves the delayed allocation block mappings and updates i_disksize as needed. Each update to the extent tree on disk and i_disksize is done using a single journal handle. fallocate (just fallocate to change the file size, not to collapse ranges or anything else): fallocate will check if offset + len > the current i_size value and, if it is, set new_size. Later, in a call to ext4_alloc_file_blocks (which modifies the extent tree, using a journal handle it makes, for the fallocate call), new_size is checked to determine if i_disksize needs to be updated, and, if it does, it writes that update to the current journal handle. Otherwise, no update to i_disksize is made. zero_range: The main portion of zero_range appears to be copy/paste from fallocate code. However, one notable difference is that is has the possibility of opening three different journal handles (2 from calls to ext4_alloc_file_blocks, and one later in zero_range for writing out blocks and the updated inode size. In this function, new_size is used to update i_disksize in both the calls to ext4_alloc_file_blocks and at the bottom of the function (so they will all either update i_disksize or not update it). What I don't really understand from the above is how the patch for Amir's bug fixes the bug. It appears that what is happening in the bug is an old value of i_disksize is being persisted while a new extent tree is being persisted. I am confused how adding an extra clause to the if statement in fallocate and zero_range (which causes new_size to be set on more conditions) later translates into correct behavior. It seems like in order to get incorrect behavior in the first place, you would have to have the file size updated in a different journal transaction than the extent tree so that, on replay, one is updated but the other is not. Another piece of the puzzle I am missing is when the extent tree is journaled in fallocate and zero_range. I can clearly see the i_disksize updates being journaled in the call to ext4_mark_inode_dirty, but it doesn't appear that that function also persists the extent tree. Could you give me any clarifications on when the extent tree is journaled and some more of the logic behind the patch for Amir's bug? Thanks, Ashlie