On Jul 18, 2008 09:43 +0900, m-ota@xxxxxxxxxxxxx wrote:
> ext4 online defrag exchanges the data blocks using the following
> procedure:
>
> 1. Create a temporary inode and allocate contiguous blocks for it.
> 2. Read data from the original file into memory pages with
>    write_begin().
> 3. Swap the blocks between the original inode and the temporary
>    inode.  Update the extent tree and register the blocks with the
>    transaction via ext4_journal_dirty_metadata().
> 4. Write the data in the memory pages to the new blocks with
>    write_end().
>
> In the current implementation, when the block swap fails the data
> cannot be moved to the new blocks, so the defrag process exits
> without calling write_end().  If we then try to defrag the same file
> again, the defrag process stalls, and afterwards all access to the
> file system (for example an "ls" command) stalls as well.  Both
> processes are waiting on j_wait_transaction_locked.
>
> If the block exchange between write_begin() and write_end() fails,
> what should I do?

It sounds like you are not closing the transaction correctly in the
case of the failed block swap.  One important rule when writing
ext3/ext4 code is to try to ensure that all possible failure
conditions are handled BEFORE starting the journal operation.

It does not seem necessary to do the allocation and writing of the
temporary inode under the same transaction as the block swapping, as
long as the temporary inode is on the orphan inode list with
i_nlink == 0.  A first transaction can be started to allocate the
temporary inode and add it to the orphan list, and then be closed.
If the system crashes during the defrag, the temporary inode will be
removed and all of its allocated blocks freed at e2fsck/remount time,
just like an open-unlinked file would be.

Multiple transactions may be needed for the file copying, depending
on how much data is being copied.  Lustre could always do 1MB writes
in a single transaction without problems, without doing data
journaling.  You can try to start a single transaction large enough
to allocate, say, min(file size, 4MB) worth of blocks, and if
journal_start() returns -ENOSPC, reduce the allocation size by 1/2
each time.  A separate transaction can be used to do the copying of
the data into the temporary inode (with journal_dirty_metadata(), as
you say, to avoid the need to always fsync).  Then, once the copy is
finished, a separate transaction should be started to do the final
swap of the i_block[] array between the inodes and the freeing of the
temporary inode.  It shouldn't really be possible to fail at that
point.

The other question I had about the defragmenter is that it would be
excellent if it were possible to "defragment" a block-mapped file
into an extent-mapped file.  This should be relatively easy, as long
as the whole file is "defragmented", the i_block[] array is then
swapped with the original inode's, and EXT4_EXTENTS_FL is set on the
inode.

Cheers, Andreas
--
Andreas Dilger
Sr. Staff Engineer, Lustre Group
Sun Microsystems of Canada, Inc.
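
P.S.  A rough sketch of the "halve the request on -ENOSPC" sizing loop
described above.  journal_try_start() is only an illustrative stand-in
for ext4_journal_start() so the example can be compiled and run in user
space; it is not a real ext4 interface, and the numbers are made up.

        #include <errno.h>
        #include <stdio.h>

        #define DEFRAG_MAX_CHUNK_BLOCKS 1024    /* 4MB with 4KB blocks */

        /*
         * Stand-in for ext4_journal_start(): pretend the journal can
         * only reserve enough credits for 256 blocks at a time.
         */
        static int journal_try_start(unsigned int nblocks)
        {
                return nblocks > 256 ? -ENOSPC : 0;
        }

        /*
         * Pick how many blocks to copy per transaction: start at
         * min(file size, 4MB) and halve the request whenever the
         * journal cannot reserve enough credits.
         */
        static int defrag_pick_chunk(unsigned long long file_blocks,
                                     unsigned int *chunk_out)
        {
                unsigned int chunk = DEFRAG_MAX_CHUNK_BLOCKS;
                int err;

                if (file_blocks < chunk)
                        chunk = file_blocks;

                while (chunk > 0) {
                        err = journal_try_start(chunk);
                        if (err == 0) {
                                *chunk_out = chunk;
                                return 0;
                        }
                        if (err != -ENOSPC)
                                return err;
                        chunk /= 2;
                }
                return -ENOSPC;
        }

        int main(void)
        {
                unsigned int chunk;

                if (defrag_pick_chunk(1ULL << 18, &chunk) == 0)
                        printf("copy %u blocks per transaction\n", chunk);
                return 0;
        }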