在 2008-07-23三的 13:12 +0530,Aneesh Kumar K.V写道: > Hi Mingming, > > Comments below > On Tue, Jul 22, 2008 at 05:51:51PM -0700, Mingming Cao wrote: > > Ext4: journal credits reservation fixes for DIO, fallocate and delalloc writepages > > > > From: Mingming Cao <cmm@xxxxxxxxxx> > > > > With delalloc, at writepages() time, we need to reserve enough credits to start > > a new handle, to allow possible multiple segment of block allocations under a > > single call mapge_da_writepages(), so that metadata updates could fit into a single > > transaction. This patch fixed this by calculating the needed credits for > > write-out given number of dirty pages, with the consideration of discontinues > > block allocations. It fixed both extent files and non extent files. > > > > This patch also fixed the journal credit reservation for DIO. Currently the > > estimated credits for DIO is only based on non extent format file. That credit > > is not enough for mballoc a single extent on extent based file. This patch > > fixed that. > > > > The fallocate double booking credits for modifying super block etc, this patch > > fixed that. > > > > This also fix credit reservation in migration and defrag code. > > > > Signed-off-by: Mingming Cao <cmm@xxxxxxxxxx> > > --- > > fs/ext4/defrag.c | 4 +- > > fs/ext4/ext4.h | 4 +- > > fs/ext4/extents.c | 37 ++++++++++++--------- > > fs/ext4/inode.c | 93 ++++++++++++++++++++++++------------------------------ > > fs/ext4/migrate.c | 4 +- > > 5 files changed, 72 insertions(+), 70 deletions(-) > > > > Index: linux-2.6.26-git6/fs/ext4/ext4.h > > =================================================================== > > --- linux-2.6.26-git6.orig/fs/ext4/ext4.h 2008-07-21 17:35:17.000000000 -0700 > > +++ linux-2.6.26-git6/fs/ext4/ext4.h 2008-07-21 17:36:17.000000000 -0700 > > @@ -1149,7 +1149,7 @@ extern void ext4_truncate (struct inode > > extern void ext4_set_inode_flags(struct inode *); > > extern void ext4_get_inode_flags(struct ext4_inode_info *); > > extern void ext4_set_aops(struct inode *inode); > > -extern int ext4_writepage_trans_blocks(struct inode *); > > +extern int ext4_writepages_trans_blocks(struct inode *, int num); > > extern int ext4_block_truncate_page(handle_t *handle, > > struct address_space *mapping, loff_t from); > > extern int ext4_page_mkwrite(struct vm_area_struct *vma, struct page *page); > > @@ -1314,7 +1314,7 @@ extern const struct inode_operations ext > > > > /* extents.c */ > > extern int ext4_ext_tree_init(handle_t *handle, struct inode *); > > -extern int ext4_ext_writepage_trans_blocks(struct inode *, int); > > +extern int ext4_ext_writepages_trans_blocks(struct inode *inode, int); > > extern int ext4_ext_get_blocks(handle_t *handle, struct inode *inode, > > ext4_lblk_t iblock, > > unsigned long max_blocks, struct buffer_head *bh_result, > > Index: linux-2.6.26-git6/fs/ext4/extents.c > > =================================================================== > > --- linux-2.6.26-git6.orig/fs/ext4/extents.c 2008-07-21 17:35:17.000000000 -0700 > > +++ linux-2.6.26-git6/fs/ext4/extents.c 2008-07-21 17:36:17.000000000 -0700 > > @@ -1887,11 +1887,12 @@ static int ext4_ext_rm_idx(handle_t *han > > > > /* > > * ext4_ext_calc_credits_for_insert: > > - * This routine returns max. credits that the extent tree can consume. > > + * This routine returns max. credits that needed to insert an extent > > + * to the extent tree. > > * It should be OK for low-performance paths like ->writepage() > > * To allow many writing processes to fit into a single transaction, > > - * the caller should calculate credits under i_data_sem and > > - * pass the actual path. > > + * When pass the actual path, the caller should calculate credits > > + * under i_data_sem. > > */ > > int ext4_ext_calc_credits_for_insert(struct inode *inode, > > struct ext4_ext_path *path) > > @@ -1930,9 +1931,6 @@ int ext4_ext_calc_credits_for_insert(str > > */ > > needed += (depth * 2) + (depth * 2); > > > > - /* any allocation modifies superblock */ > > - needed += 1; > > - > > > Why are we dropping the super block modification credit ? An insert of > an extent can result in super block modification also. If the goal is to > use ext4_writepages_trans_blocks everywhere then this change is correct. > But i see many place not using ext4_writepages_trans_blocks. > ext4_writepages_trans_blocks() will take care of the credits need for updating superblock, for just once.ext4_ext_calc_credits_for_insert() calculate the credits needed for inserting a single extent, You could see in ext4_writepages_trans_blocks(), it will multiple it will the total numberof blocks. If we account for superblock credit in ext4_ext_calc_credits_for_insert(), then super block updates for multiple extents allocation will be overaccounted, and have to remove that later in ext4_writepages_trans_blocks() Other places calling ext4_ext_calc_credits_for_insert() (mostly migrate.c and defrag,c) are updated to add extra 4 credits, including superblock, inode block and quota blocks. > You also need to update the function comment saying that super block > update is not accounted.Also it doesn't account for block bitmap, > group descriptor and inode block update. > Credits for block bitmap and group descriptors are accounted in ext4_ext_calc_credits_for_insert() inode block update is only accounted once per writepages, so it's accounted in ext4_writepages_trans_blocks()/ext4_ext_writepages_trans_blocks() > > return needed; > > } > > > > @@ -2940,8 +2938,8 @@ void ext4_ext_truncate(struct inode *ino > > /* > > * probably first extent we're gonna free will be last in block > > */ > > - err = ext4_writepage_trans_blocks(inode) + 3; > > - handle = ext4_journal_start(inode, err); > > + handle = ext4_journal_start(inode, > > + ext4_writepages_trans_blocks(inode, 1) + 3); > > > What is +3 for ? Can you add a comment on that. I understand that it > was there before. I guess +3 is not needed.? > I guess it was there for superblock +inode block + quota block, but actually superblock and inode block and quota blocks are already accounted, it probably could be removed. I did not pay a lot attention to it, I guess a little overbooking probably safer, I could remove it if you feel strong about it. > > > > if (IS_ERR(handle)) > > return; > > > > @@ -2994,18 +2992,28 @@ out_stop: > > } > > > > /* > > - * ext4_ext_writepage_trans_blocks: > > + * ext4_ext_writepages_trans_blocks: > > * calculate max number of blocks we could modify > > * in order to allocate new block for an inode > > */ > > -int ext4_ext_writepage_trans_blocks(struct inode *inode, int num) > > +int ext4_ext_writepages_trans_blocks(struct inode *inode, int num) > > { > > I guess the name is misleading. @num above is number of blocks. how > about ext4_ext_writeblocks_trans(struct inode *node, int nrblocks) > > > Also add a comment stating we consider the worst case where each block > can result in an extent. > I will add a comment about the worse case, and change num to nrblocks. > > int needed; > > > > + /* cost of adding a single extent: > > + * index blocks, leafs, bitmaps, > > + * groupdescp > > + */ > > needed = ext4_ext_calc_credits_for_insert(inode, NULL); > > > > - /* caller wants to allocate num blocks, but note it includes sb */ > > - needed = needed * num - (num - 1); > > + /* > > + * For data=journalled mode need to account for the data blocks > > + * Also need to add super block and inode block > > + */ > > + if (ext4_should_journal_data(inode)) > > + needed = num * (needed + 1) + 2; > > + else > > + needed = num * needed + 2; > > > > > We are not accounting here for the bitmap and group descriptor. > ext4_ext_calc_credits_for_insert is not accounting for the credit need > to update bitmap and group descriptor. > We also need to account updating > bitmap and group descriptor for new blocks that would be allocated > as a part of extent insert. > > No need for that. It's being accounted in ext4_ext_calc_credits_for_insert(). In the worse case the tree depth is 5, inserting a extent would require a 2 updates (bitmap and group descriptor) for each level (including the leaf, the new blocks), for old tree and new tree. /* * Index split can happen, we would need: * allocate intermediate indexes (bitmap + group) * + change two blocks at each level, but root (already included) */ needed += (depth * 2) + (depth * 2); > > #ifdef CONFIG_QUOTA > > needed += 2 * EXT4_QUOTA_TRANS_BLOCKS(inode->i_sb); > > @@ -3074,10 +3082,9 @@ long ext4_fallocate(struct inode *inode, > > max_blocks = (EXT4_BLOCK_ALIGN(len + offset, blkbits) >> blkbits) > > - block; > > /* > > - * credits to insert 1 extent into extent tree + buffers to be able to > > - * modify 1 super block, 1 block bitmap and 1 group descriptor. > > + * credits to insert 1 extent into extent tree > > */ > > - credits = EXT4_DATA_TRANS_BLOCKS(inode->i_sb) + 3; > > + credits = EXT4_DATA_TRANS_BLOCKS(inode->i_sb); > > > > Why is this change ? modify for block bitmap and group descriptor is included in EXT4_DATA_TRANS_BLOCKS(inode->i_sb)-> (EXT4_SINGLEDATA_TRANS_BLOCKS(sb) /* Define the number of blocks we need to account to a transaction to * modify one block of data. * * We may have to touch one inode, one bitmap buffer, up to three * indirection blocks, the group and superblock summaries, and the data * block to complete the transaction. * * For extents-enabled fs we may have to allocate and modify up to * 5 levels of tree + root which are stored in the inode. */ #define EXT4_SINGLEDATA_TRANS_BLOCKS(sb) \ (EXT4_HAS_INCOMPAT_FEATURE(sb, EXT4_FEATURE_INCOMPAT_EXTENTS) \ || test_opt(sb, EXTENTS) ? 27U : 8U) others including superblock, inode block, quota and xattr blocks are also included in /* Define the minimum size for a transaction which modifies data. This * needs to take into account the fact that we may end up modifying two * quota files too (one for the group, one for the user quota). The * superblock only gets updated once, of course, so don't bother * counting that again for the quota updates. */ #define EXT4_DATA_TRANS_BLOCKS(sb) (EXT4_SINGLEDATA_TRANS_BLOCKS(sb) + \ EXT4_XATTR_TRANS_BLOCKS - 2 + \ 2*EXT4_QUOTA_TRANS_BLOCKS(sb)) > > > mutex_lock(&inode->i_mutex); > > retry: > > while (ret >= 0 && ret < max_blocks) { > > Index: linux-2.6.26-git6/fs/ext4/inode.c > > =================================================================== > > --- linux-2.6.26-git6.orig/fs/ext4/inode.c 2008-07-21 17:35:17.000000000 -0700 > > +++ linux-2.6.26-git6/fs/ext4/inode.c 2008-07-21 17:44:45.000000000 -0700 > > @@ -1015,15 +1015,6 @@ static void ext4_da_update_reserve_space > > > > /* Maximum number of blocks we map for direct IO at once. */ > > #define DIO_MAX_BLOCKS 4096 > > -/* > > - * Number of credits we need for writing DIO_MAX_BLOCKS: > > - * We need sb + group descriptor + bitmap + inode -> 4 > > - * For B blocks with A block pointers per block we need: > > - * 1 (triple ind.) + (B/A/A + 2) (doubly ind.) + (B/A + 2) (indirect). > > - * If we plug in 4096 for B and 256 for A (for 1KB block size), we get 25. > > - */ > > -#define DIO_CREDITS 25 > > - > > > > /* > > * > > @@ -1142,13 +1133,13 @@ int ext4_get_block(struct inode *inode, > > handle_t *handle = ext4_journal_current_handle(); > > int ret = 0, started = 0; > > unsigned max_blocks = bh_result->b_size >> inode->i_blkbits; > > + int dio_credits = EXT4_DATA_TRANS_BLOCKS(inode->i_sb); > > Again this should be > int dio_credits = ext4_writeblocks_trans(inode, DIO_MAX_BLOCKS); > No. DIO case is different than writepages(). When get_block() is called from DIO path, the get_block() is called in a loop, and the credit is reserved inside the loop. Each time get_block(),will return a single extent, so we should use EXT4_DATA_TRANS_BLOCKS(inode->i_sb) which is calculate a single chunk of allocation credits. ext4_da_writepages() is different, we have to reserve the credits outside the loop of calling get_block(), since the underlying get_block() could be called multiple times, we need to worse case, so ext4_writepages_trans_blocks() is called to handling the worse case. > > > > if (create && !handle) { > > /* Direct IO write... */ > > if (max_blocks > DIO_MAX_BLOCKS) > > max_blocks = DIO_MAX_BLOCKS; > > - handle = ext4_journal_start(inode, DIO_CREDITS + > > - 2 * EXT4_QUOTA_TRANS_BLOCKS(inode->i_sb)); > > + handle = ext4_journal_start(inode, dio_credits); > > if (IS_ERR(handle)) { > > ret = PTR_ERR(handle); > > goto out; > > @@ -1327,7 +1318,7 @@ static int ext4_write_begin(struct file > > struct page **pagep, void **fsdata) > > { > > struct inode *inode = mapping->host; > > - int ret, needed_blocks = ext4_writepage_trans_blocks(inode); > > + int ret, needed_blocks = ext4_writepages_trans_blocks(inode, 1); > > handle_t *handle; > > int retries = 0; > > struct page *page; > > @@ -2179,18 +2170,7 @@ static int ext4_da_writepage(struct page > > return ret; > > } > > > > -/* > > - * For now just follow the DIO way to estimate the max credits > > - * needed to write out EXT4_MAX_WRITEBACK_PAGES. > > - * todo: need to calculate the max credits need for > > - * extent based files, currently the DIO credits is based on > > - * indirect-blocks mapping way. > > - * > > - * Probably should have a generic way to calculate credits > > - * for DIO, writepages, and truncate > > - */ > > #define EXT4_MAX_WRITEBACK_PAGES DIO_MAX_BLOCKS > > -#define EXT4_MAX_WRITEBACK_CREDITS DIO_CREDITS > > > > static int ext4_da_writepages(struct address_space *mapping, > > struct writeback_control *wbc) > > @@ -2210,13 +2190,8 @@ static int ext4_da_writepages(struct add > > if (!mapping->nrpages) > > return 0; > > > > - /* > > - * Estimate the worse case needed credits to write out > > - * EXT4_MAX_BUF_BLOCKS pages > > - */ > > - needed_blocks = EXT4_MAX_WRITEBACK_CREDITS; > > - > > to_write = wbc->nr_to_write; > > + > > if (!wbc->range_cyclic) { > > /* > > * If range_cyclic is not set force range_cont > > @@ -2227,6 +2202,20 @@ static int ext4_da_writepages(struct add > > } > > > > while (!ret && to_write) { > > + /* > > + * set the max dirty pages could be write at a time > > + * to fit into the reserved transaction credits > > + */ > > + if (wbc->nr_to_write > EXT4_MAX_WRITEBACK_PAGES) > > + wbc->nr_to_write = EXT4_MAX_WRITEBACK_PAGES; > > + > > + /* > > + * Estimate the worse case needed credits to write out > > + * to_write pages > > + */ > > + needed_blocks = ext4_writepages_trans_blocks(inode, > > + wbc->nr_to_write); > > + > > /* start a new transaction*/ > > handle = ext4_journal_start(inode, needed_blocks); > > if (IS_ERR(handle)) { > > @@ -2246,12 +2235,6 @@ static int ext4_da_writepages(struct add > > } > > > > } > > - /* > > - * set the max dirty pages could be write at a time > > - * to fit into the reserved transaction credits > > - */ > > - if (wbc->nr_to_write > EXT4_MAX_WRITEBACK_PAGES) > > - wbc->nr_to_write = EXT4_MAX_WRITEBACK_PAGES; > > > > to_write -= wbc->nr_to_write; > > ret = mpage_da_writepages(mapping, wbc, > > @@ -2612,7 +2595,8 @@ static int __ext4_journalled_writepage(s > > * references to buffers so we are safe */ > > unlock_page(page); > > > > - handle = ext4_journal_start(inode, ext4_writepage_trans_blocks(inode)); > > + handle = ext4_journal_start(inode, > > + ext4_writepages_trans_blocks(inode, 1)); > > if (IS_ERR(handle)) { > > ret = PTR_ERR(handle); > > goto out; > > @@ -4286,20 +4270,20 @@ int ext4_getattr(struct vfsmount *mnt, s > > /* > > * How many blocks doth make a writepage()? > > * > > - * With N blocks per page, it may be: > > - * N data blocks > > + * With N blocks per page, and P pages, it may be: > > + * N*P data blocks > > * 2 indirect block > > * 2 dindirect > > * 1 tindirect > > - * N+5 bitmap blocks (from the above) > > - * N+5 group descriptor summary blocks > > + * N*P+5 bitmap blocks (from the above) > > + * N*P+5 group descriptor summary blocks > > * 1 inode block > > * 1 superblock. > > * 2 * EXT4_SINGLEDATA_TRANS_BLOCKS for the quote files > > * > > - * 3 * (N + 5) + 2 + 2 * EXT4_SINGLEDATA_TRANS_BLOCKS > > + * 3 * (N*P + 5) + 2 + 2 * EXT4_SINGLEDATA_TRANS_BLOCKS > > * > > - * With ordered or writeback data it's the same, less the N data blocks. > > + * With ordered or writeback data it's the same, less the N*P data blocks. > > * > > * If the inode's direct blocks can hold an integral number of pages then a > > * page cannot straddle two indirect blocks, and we can only touch one indirect > > @@ -4310,19 +4294,15 @@ int ext4_getattr(struct vfsmount *mnt, s > > * block and work out the exact number of indirects which are touched. Pah. > > */ > > > > -int ext4_writepage_trans_blocks(struct inode *inode) > > +static int ext4_writepages_trans_blocks_old(struct inode *inode, int num) > > { > > a better name would be > > static int ext4_writeblocks_trans_old(struct inode *inode, int nrblocks) > > > > > - int bpp = ext4_journal_blocks_per_page(inode); > > - int indirects = (EXT4_NDIR_BLOCKS % bpp) ? 5 : 3; > > + int indirects = (EXT4_NDIR_BLOCKS % num) ? 5 : 3; > > int ret; > > > > - if (EXT4_I(inode)->i_flags & EXT4_EXTENTS_FL) > > - return ext4_ext_writepage_trans_blocks(inode, bpp); > > - > > if (ext4_should_journal_data(inode)) > > - ret = 3 * (bpp + indirects) + 2; > > + ret = 3 * (num + indirects) + 2; > > else > > - ret = 2 * (bpp + indirects) + 2; > > + ret = 2 * (num + indirects) + 2; > > > With non journalled moded we should just decrease num. I guess the above > should be > if (ext4_should_journal_data(inode)) > ret = 3 * (num + indirects) + 2; > else > ret = 3 * (num + indirects) + 2 - num; > > > Well, I think in the journalled mode we need to journal the content of the indirect/double indirect blocks also, no? Mingming -- To unsubscribe from this list: send the line "unsubscribe linux-ext4" in the body of a message to majordomo@xxxxxxxxxxxxxxx More majordomo info at http://vger.kernel.org/majordomo-info.html