tytso@xxxxxxx writes: > On Sun, Dec 27, 2009 at 10:51:59PM -0500, tytso@xxxxxxx wrote: >> OK, i've been able to reproduce the problem using xfsqa test #74 >> (fstest) when an ext3 file system is mounted the ext4 file system >> driver. I was then able to bisect it down to commit d21cd8f6, which >> was introduced between 2.6.33-rc1 and 2.6.33-rc2, as part of >> quota/ext4 patch series pushed by Jan. > > OK, here's a patch which I think should avoid the BUG in > fs/ext4/inode.c. It should fix the regression, but in the long run we > need to pretty seriously rethink how we account for the need for > potentially new meta-data blocks when doing delayed allocation. > > The remaining problem with this machinery is that > ext4_da_update_reserve_space() and ext4_da_release_space() is that > they both try to calculate how many metadata blocks will potentially > required by calculating ext4_calc_metadata_amount() based on the > number of delayed allocation blocks found in i_reserved_data_blocks. > The problem is that ext4_calc_metadata_amount() assumes that the > number of blocks passed to it is contiguous, and what might be left > remaining to be written in the page cache could be anything but > contiguous. This is a problem which has always been there, so it's > not a regression per se; just a design flaw. Hello, I've finally able to reproduce the issue. I'm agree with your diagnose. But while looking in to code i've found some questions see late in the message. > > The patch below should fixes the regression caused by commit d21cd8f, > but we need to look much more closely to find a better way of > accounting for the potential need for metadata for inodes facing > delayed allocation. Could people who are having problems with the BUG > in line 1063 of fs/ext4/inode.c try this patch? > > Thanks!! > > - Ted > > > commit 48b71e562ecd35ab12f6b6420a92fb3c9145da92 > Author: Theodore Ts'o <tytso@xxxxxxx> > Date: Wed Dec 30 00:04:04 2009 -0500 > > ext4: Patch up how we claim metadata blocks for quota purposes > > Commit d21cd8f triggered a BUG in the function > ext4_da_update_reserve_space() found in fs/ext4/inode.c, which was > caused by fact that ext4_calc_metadata_amount() can over-estimate how > many metadata blocks will be needed, especially when using direct > block-mapped files. Work around this by not claiming any excess > metadata blocks than we are prepared to claim at this point. > > Signed-off-by: "Theodore Ts'o" <tytso@xxxxxxx> > > diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c > index 3e3b454..d6e84b4 100644 > --- a/fs/ext4/inode.c > +++ b/fs/ext4/inode.c > @@ -1058,14 +1058,23 @@ static void ext4_da_update_reserve_space(struct inode *inode, int used) > mdb_free = EXT4_I(inode)->i_reserved_meta_blocks - mdb; > > if (mdb_free) { > - /* Account for allocated meta_blocks */ > + /* > + * Account for allocated meta_blocks; it is possible > + * for us to have allocated more meta blocks than we > + * are prepared to free at this point. This is > + * because ext4_calc_metadata_amount can over-estimate > + * how many blocks are still needed. So we may not be > + * able to claim all of the allocated meta blocks > + * right away. The accounting will work out in the end... > + */ > mdb_claim = EXT4_I(inode)->i_allocated_meta_blocks; > - BUG_ON(mdb_free < mdb_claim); > + if (mdb_free < mdb_claim) > + mdb_claim = mdb_free; > mdb_free -= mdb_claim; > > /* update fs dirty blocks counter */ > percpu_counter_sub(&sbi->s_dirtyblocks_counter, mdb_free); > - EXT4_I(inode)->i_allocated_meta_blocks = 0; > + EXT4_I(inode)->i_allocated_meta_blocks -= mdb_claim; > EXT4_I(inode)->i_reserved_meta_blocks = mdb; > } > > @@ -1845,7 +1854,7 @@ repeat: > static void ext4_da_release_space(struct inode *inode, int to_free) > { > struct ext4_sb_info *sbi = EXT4_SB(inode->i_sb); > - int total, mdb, mdb_free, release; > + int total, mdb, mdb_free, mdb_claim, release; > > if (!to_free) > return; /* Nothing to release, exit */ > @@ -1874,6 +1883,16 @@ static void ext4_da_release_space(struct inode *inode, int to_free) > BUG_ON(mdb > EXT4_I(inode)->i_reserved_meta_blocks); > mdb_free = EXT4_I(inode)->i_reserved_meta_blocks - mdb; > > + if (mdb_free) { > + /* Account for allocated meta_blocks */ > + mdb_claim = EXT4_I(inode)->i_allocated_meta_blocks; > + if (mdb_free < mdb_claim) > + mdb_claim = mdb_free; > + mdb_free -= mdb_claim; > + > + EXT4_I(inode)->i_allocated_meta_blocks -= mdb_claim; > + } > + Seems what this is not enough. Just imagine, we may have following call-trace: userspace pwrite(fd, d, 1000, off) ->ext4_da_reserve_space(inode, 1000) ->dq_reserve_space(1000 + md_needed) userspace ftruncate(fd, off) /* "off" is the same as in pwrite call */ ->ext4_da_invalidatepage() ->ext4_da_page_release_reservation() ->ext4_da_release_space() <<< And we decrease ->i_allocated_meta_blocks only if (mdb_free > 0) userspace close(fd) So reserved metadata blocks will leak. I'm able to reproduce it like this: quotacheck -cu /mnt quotaon /mnt fsstres -p 16 -d /mnt -l999999999 -n99999999& sleep 180 killall -9 fsstress sync; sync; cp /mnt/aquota.user > q1 quotaoff /mnt quotacheck -cu /mnt/ # recaculate real quota usage. cp /mnt/aquota.user > q2 diff -up q1 q2 # in my case i've found 1 block leaked. IMHO we may drop i_allocated_meta_block in ext4_release_file() But while looking in to this function i've found another question about locking static int ext4_release_file(struct inode *inode, struct file *filp) { if (EXT4_I(inode)->i_state & EXT4_STATE_DA_ALLOC_CLOSE) { ext4_alloc_da_blocks(inode); EXT4_I(inode)->i_state &= ~EXT4_STATE_DA_ALLOC_CLOSE; <<< Seems what i_state modification must being protected by i_mutex, but currently caller don't have to hold it. ..... } > release = to_free + mdb_free; > > /* update fs dirty blocks counter for truncate case */ > -- > To unsubscribe from this list: send the line "unsubscribe linux-ext4" in > the body of a message to majordomo@xxxxxxxxxxxxxxx > More majordomo info at http://vger.kernel.org/majordomo-info.html -- To unsubscribe from this list: send the line "unsubscribe linux-ext4" in the body of a message to majordomo@xxxxxxxxxxxxxxx More majordomo info at http://vger.kernel.org/majordomo-info.html