On May 08, 2006 13:42 +0100, Stephen C. Tweedie wrote:
> On Mon, 2006-05-08 at 14:34 +0200, Erik Mouw wrote:
>
> > > Trouble is, there's no guarantee that that transaction would actually
> > > fit into the journal. Most of the time it will, but if it doesn't, then
> > > we deadlock or risk data corruption.
> >
> > Is there some way to determine in advance if a transaction fits into
> > the journal?
>
> For truncate/delete, no, not easily. Or rather, it's possible, but only
> for trivially short files. The trouble is that we can't assume that all
> of the file's blocks are on the same block groups, so each block in the
> file is potentially an update to a new group descriptor and a new block
> bitmap (ie. 2 blocks of journal per block of file.)

Actually, the upper limit is the number of block groups in the
filesystem.  In many cases this is a sufficient upper bound, and it can
be checked with essentially no effort.  Given that the default journal
size is now 128MB (32k blocks), this would guarantee full-file truncates
in the worst case for up to (32k blocks / 4 / (1 + 1/128 blocks/group))
= 8128 groups = 1016GB before we even have to consider walking the file
(a single handle can take at most 1/4 of the journal in credits, and
each group costs one bitmap block plus 1/128 of a shared group
descriptor block).  We could also bound the worst case by the number of
blocks in the file (2 journal blocks per file block, as above) and take
whichever limit is smaller; a rough sketch of this check is appended
below.

> That's hugely pessimistic, of course, but it is the genuine worst-case
> scenario and we have to be prepared for it. We only work out that we
> need less once we actually start walking the file's indirect tree, at
> which point the truncate is already under way.
>
> We _could_ walk the tree twice, but that would be unnecessarily
> expensive, especially for large files.

That was actually my thought, and I don't think it is expensive.  The
current implementation already has to read all of these blocks from
disk (in reverse order, I might add) and then _write_ to each of the
indirect blocks (1/1024 of the whole file size) to zero them out.

Instead, we could walk the file tree in forward order, doing async
readahead for the indirect, dindirect, and tindirect blocks, then a
second batch of readahead for all of the indirect blocks from the
dindirect block and the dindirect blocks from the tindirect block,
rinse, repeat.  While waiting for each batch of blocks to complete IO
we can walk the blocks already read and count the block groups affected
(= the transaction size); a sketch of this walk is also appended below.
Since we need to read these blocks in any case, doing the readahead
efficiently will likely improve the performance of this step.

Rather than hurting performance, I think this will actually improve
truncate performance, because we no longer need to do ANY indirect
block writes: with 4kB blocks that saves one block write per 4MB of
file data (each indirect block maps 1024 data blocks).  We only do the
already-required writes to the group descriptors and bitmaps.  Given
that we try hard to do contiguous allocations for files, those will
usually be a relatively small number of blocks, as few as one bitmap
(plus a share of a descriptor block) per 128MB of file, the block group
size limit.

If we do the file walk instead of being completely pessimistic, we also
reduce the pressure on journal flushes, which I suspect cost more than
the walk itself.  The fact that this also fixes undelete is actually a
side effect, IMHO.

Cheers, Andreas
--
Andreas Dilger
Principal Software Engineer
Cluster File Systems, Inc.
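
P.S.  Here is a rough sketch of the cheap worst-case check described
above.  This is illustrative pseudocode, not patch-ready kernel code:
the function names are made up, the journal/4 figure is JBD's cap on a
single handle's credits, and the 1 + 1/128 per-group cost is one bitmap
block plus 1/128th of a group descriptor block (128 32-byte descriptors
per 4kB block):

/* Illustrative sketch only; names and interfaces are invented. */

/*
 * Worst case is 2 journal blocks (block bitmap + group descriptor) per
 * data block, capped by the number of block groups in the filesystem:
 * however fragmented the file, we can never dirty more groups than
 * actually exist.
 */
static unsigned long worst_case_truncate_credits(unsigned long file_blocks,
						 unsigned long groups)
{
	unsigned long by_blocks = 2 * file_blocks;
	/* one bitmap per group, plus the shared descriptor blocks */
	unsigned long by_groups = groups + (groups + 127) / 128;

	return by_blocks < by_groups ? by_blocks : by_groups;
}

/*
 * With a 128MB journal (32768 4kB blocks) a single handle may take
 * about journal_blocks / 4 = 8192 credits, which covers
 * 8192 / (1 + 1/128) = 8128 groups = 1016GB of filesystem before we
 * ever have to walk the file.
 */
static int truncate_fits_in_journal(unsigned long file_blocks,
				    unsigned long groups,
				    unsigned long journal_blocks)
{
	return worst_case_truncate_credits(file_blocks, groups) <=
	       journal_blocks / 4;
}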
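
P.P.S.  And a sketch of the forward-order walk with readahead that
counts the affected block groups.  meta_readahead() and meta_read() are
placeholders (submit async reads for a batch of metadata blocks; wait
for one block and return its 1024 pointers), not real kernel
interfaces, and the i_block[] layout is the usual ext2/ext3 one
(12 direct pointers, then indirect, dindirect, tindirect):

typedef unsigned int u32;

#define PTRS_PER_BLOCK 1024	/* 4kB block / 4-byte block numbers */

struct walk_state {
	unsigned long blocks_per_group;
	unsigned char *group_seen;	/* one flag per block group */
	unsigned long groups_touched;	/* drives the transaction size */
};

static void note_block(struct walk_state *ws, u32 blk)
{
	unsigned long group = blk / ws->blocks_per_group;

	if (!ws->group_seen[group]) {
		ws->group_seen[group] = 1;
		ws->groups_touched++;
	}
}

/* Placeholder IO helpers, not real kernel interfaces. */
void meta_readahead(const u32 *blocks, int count);	/* async, batched */
const u32 *meta_read(u32 blk);				/* waits for IO */

/*
 * Walk one [td]indirect tree top-down: start readahead on the next
 * level of metadata, then count groups for this level while that IO
 * completes.  Data blocks are counted but never read.
 */
static void walk_meta(struct walk_state *ws, u32 blk, int depth)
{
	const u32 *p;
	int i;

	if (!blk)
		return;			/* hole */
	note_block(ws, blk);		/* the metadata block itself */

	p = meta_read(blk);
	if (depth > 1)			/* queue the next metadata level */
		meta_readahead(p, PTRS_PER_BLOCK);

	for (i = 0; i < PTRS_PER_BLOCK; i++) {
		if (!p[i])
			continue;
		if (depth == 1)
			note_block(ws, p[i]);	/* data block */
		else
			walk_meta(ws, p[i], depth - 1);
	}
}

/* Count the groups a whole-file truncate would touch. */
static unsigned long count_truncate_groups(struct walk_state *ws,
					   const u32 i_block[15])
{
	int i;

	for (i = 0; i < 12; i++)	/* direct blocks */
		if (i_block[i])
			note_block(ws, i_block[i]);
	walk_meta(ws, i_block[12], 1);	/* indirect */
	walk_meta(ws, i_block[13], 2);	/* dindirect */
	walk_meta(ws, i_block[14], 3);	/* tindirect */

	return ws->groups_touched;
}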