On May 08, 2006 13:42 +0100, Stephen C. Tweedie wrote:
> On Mon, 2006-05-08 at 14:34 +0200, Erik Mouw wrote:
>
> > > Trouble is, there's no guarantee that that transaction would actually
> > > fit into the journal. Most of the time it will, but if it doesn't, then
> > > we deadlock or risk data corruption.
> >
> > Is there some way to determine in advance if a transaction fits into
> > the journal?
>
> For truncate/delete, no, not easily. Or rather, it's possible, but only
> for trivially short files. The trouble is that we can't assume that all
> of the file's blocks are on the same block groups, so each block in the
> file is potentially an update to a new group descriptor and a new block
> bitmap (ie. 2 blocks of journal per block of file.)

Actually, the upper limit is the number of block groups in the
filesystem.  In many cases this is a sufficient upper bound, and it can
be checked with essentially no effort.  Given that the default journal
size is now 128MB (32k blocks), this would guarantee full-file truncates
in the worst case for up to (32k blocks / 4 / (1 + 1/128 blocks/group))
= 8128 groups = 1016GB before we even have to consider walking the file
(a single handle can take at most 1/4 of the journal in credits, and
each group costs one bitmap block plus 1/128 of a shared group
descriptor block).  We could also bound the worst case by the number of
blocks in the file (2 journal blocks per file block, as above) and take
whichever limit is smaller; a rough sketch of this check is appended
below.

> That's hugely pessimistic, of course, but it is the genuine worst-case
> scenario and we have to be prepared for it. We only work out that we
> need less once we actually start walking the file's indirect tree, at
> which point the truncate is already under way.
>
> We _could_ walk the tree twice, but that would be unnecessarily
> expensive, especially for large files.

That was actually my thought, and I don't think it is expensive.  The
current implementation already has to read all of these blocks from
disk (in reverse order, I might add) and then _write_ to each of the
indirect blocks (1/1024 of the whole file size) to zero them out.

Instead, we could walk the file tree in forward order, doing async
readahead for the indirect, dindirect, and tindirect blocks, then a
second batch of readahead for all of the indirect blocks from the
dindirect block and the dindirect blocks from the tindirect block,
rinse, repeat.  While waiting for each batch of blocks to complete IO
we can walk the blocks already read and count the block groups affected
(= the transaction size); a sketch of this walk is also appended below.
Since we need to read these blocks in any case, doing the readahead
efficiently will likely improve the performance of this step.

Rather than hurting performance, I think this will actually improve
truncate performance, because we no longer need to do ANY indirect
block writes: with 4kB blocks that saves one block write per 4MB of
file data (each indirect block maps 1024 data blocks).  We only do the
already-required writes to the group descriptors and bitmaps.  Given
that we try hard to do contiguous allocations for files, those will
usually be a relatively small number of blocks, as few as one bitmap
(plus a share of a descriptor block) per 128MB of file, the block group
size limit.

If we do the file walk instead of being completely pessimistic, we also
reduce the pressure on journal flushes, which I suspect cost more than
the walk itself.  The fact that this also fixes undelete is actually a
side effect, IMHO.

Cheers, Andreas
--
Andreas Dilger
Principal Software Engineer
Cluster File Systems, Inc.
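
P.S.  Here is a rough sketch of the cheap worst-case check described
above.  This is illustrative pseudocode, not patch-ready kernel code:
the function names are made up, the journal/4 figure is JBD's cap on a
single handle's credits, and the 1 + 1/128 per-group cost is one bitmap
block plus 1/128th of a group descriptor block (128 32-byte descriptors
per 4kB block):

/* Illustrative sketch only; names and interfaces are invented. */

/*
 * Worst case is 2 journal blocks (block bitmap + group descriptor) per
 * data block, capped by the number of block groups in the filesystem:
 * however fragmented the file, we can never dirty more groups than
 * actually exist.
 */
static unsigned long worst_case_truncate_credits(unsigned long file_blocks,
						 unsigned long groups)
{
	unsigned long by_blocks = 2 * file_blocks;
	/* one bitmap per group, plus the shared descriptor blocks */
	unsigned long by_groups = groups + (groups + 127) / 128;

	return by_blocks < by_groups ? by_blocks : by_groups;
}

/*
 * With a 128MB journal (32768 4kB blocks) a single handle may take
 * about journal_blocks / 4 = 8192 credits, which covers
 * 8192 / (1 + 1/128) = 8128 groups = 1016GB of filesystem before we
 * ever have to walk the file.
 */
static int truncate_fits_in_journal(unsigned long file_blocks,
				    unsigned long groups,
				    unsigned long journal_blocks)
{
	return worst_case_truncate_credits(file_blocks, groups) <=
	       journal_blocks / 4;
}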
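
P.P.S.  And a sketch of the forward-order walk with readahead that
counts the affected block groups.  meta_readahead() and meta_read() are
placeholders (submit async reads for a batch of metadata blocks; wait
for one block and return its 1024 pointers), not real kernel
interfaces, and the i_block[] layout is the usual ext2/ext3 one
(12 direct pointers, then indirect, dindirect, tindirect):

typedef unsigned int u32;

#define PTRS_PER_BLOCK 1024	/* 4kB block / 4-byte block numbers */

struct walk_state {
	unsigned long blocks_per_group;
	unsigned char *group_seen;	/* one flag per block group */
	unsigned long groups_touched;	/* drives the transaction size */
};

static void note_block(struct walk_state *ws, u32 blk)
{
	unsigned long group = blk / ws->blocks_per_group;

	if (!ws->group_seen[group]) {
		ws->group_seen[group] = 1;
		ws->groups_touched++;
	}
}

/* Placeholder IO helpers, not real kernel interfaces. */
void meta_readahead(const u32 *blocks, int count);	/* async, batched */
const u32 *meta_read(u32 blk);				/* waits for IO */

/*
 * Walk one [td]indirect tree top-down: start readahead on the next
 * level of metadata, then count groups for this level while that IO
 * completes.  Data blocks are counted but never read.
 */
static void walk_meta(struct walk_state *ws, u32 blk, int depth)
{
	const u32 *p;
	int i;

	if (!blk)
		return;			/* hole */
	note_block(ws, blk);		/* the metadata block itself */

	p = meta_read(blk);
	if (depth > 1)			/* queue the next metadata level */
		meta_readahead(p, PTRS_PER_BLOCK);

	for (i = 0; i < PTRS_PER_BLOCK; i++) {
		if (!p[i])
			continue;
		if (depth == 1)
			note_block(ws, p[i]);	/* data block */
		else
			walk_meta(ws, p[i], depth - 1);
	}
}

/* Count the groups a whole-file truncate would touch. */
static unsigned long count_truncate_groups(struct walk_state *ws,
					   const u32 i_block[15])
{
	int i;

	for (i = 0; i < 12; i++)	/* direct blocks */
		if (i_block[i])
			note_block(ws, i_block[i]);
	walk_meta(ws, i_block[12], 1);	/* indirect */
	walk_meta(ws, i_block[13], 2);	/* dindirect */
	walk_meta(ws, i_block[14], 3);	/* tindirect */

	return ws->groups_touched;
}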