Ext4: journal credits reservation fixes for DIO, fallocate and delalloc writepages

From: Mingming Cao <cmm@xxxxxxxxxx>

With delalloc, at writepages() time we need to reserve enough credits
when starting a new handle, so that the metadata updates for possibly
multiple segments of block allocation under a single call to
mpage_da_writepages() fit into a single transaction. This patch fixes
this by calculating the credits needed for write-out from the number of
dirty pages, taking discontiguous block allocations into account. It
fixes both extent files and non-extent files.

This patch also fixes the journal credit reservation for DIO. Currently
the estimated credits for DIO are based only on the non-extent file
format. That is not enough for mballoc to allocate a single extent on an
extent-based file.

fallocate double-booked the credits for modifying the super block etc.;
this patch fixes that.

This also fixes the credit reservation in the migration and defrag code.

Changes since v2:

1) Fix a writepages() inefficiency. sync() invokes writepages() twice
   (not sure exactly why). The second time all the pages are clean, so
   it wastes CPU time walking through all the pages only to find that
   none of them is dirty. This is simple to work around by skipping
   writepages() if the mapping has no dirty pages.

2) The extent-based credit calculation was quite conservative. It always
   used the maximum possible depth to estimate the credits needed for an
   extent insert/tree split. In fact the depth of each inode's tree is
   easy to get, so we can use this more accurate information in the
   calculation.

3) Limit the maximum number of pages that can be flushed at once from
   ext4_da_writepages(), so that the maximum possible transaction
   credits fit under the credits allowed for starting a new transaction.
   Reduce the number of pages to flush if necessary. Currently, with 4K
   page size and 4K block size, on an extent file, it is possible to
   flush about 1K pages under a single transaction.

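For illustration only (not part of the patch): a rough user-space sketch of
the page limiting described in item 3. It assumes an extent file with a
depth-0 tree (6 credits per extent, following the patched
ext4_ext_calc_credits_for_single_extent()), ordered/writeback mode, no quota
overhead, and a hypothetical journal of 32768 blocks, which gives a limit of
32768 / 4 = 8192 buffers per transaction. The helper names are made up for
this sketch.

```c
#include <assert.h>

/*
 * Hypothetical stand-in for ext4_writepages_trans_blocks() on an extent
 * file: worst case one extent per block, plus 2 credits for the super
 * block and the inode block. Quota credits are assumed to be zero.
 */
static int credits_for_pages(long nrpages, int blocks_per_page, int per_extent)
{
	assert(nrpages >= 0 && blocks_per_page > 0 && per_extent > 0);
	return (int)(nrpages * blocks_per_page * per_extent + 2);
}

/*
 * Mirrors the clamping in the patched ext4_da_writepages(): cap the
 * request at a precomputed maximum, then back off one page at a time
 * until the estimate fits. The "nr_to_write > 0" guard is an addition
 * of this sketch, not something the patch has.
 */
static long pages_per_transaction(long nr_to_write, int max_credit_blocks,
				  int blocks_per_page, int per_extent)
{
	int per_page = credits_for_pages(1, blocks_per_page, per_extent);
	long max_pages = (max_credit_blocks / blocks_per_page) / per_page;

	if (nr_to_write > max_pages)
		nr_to_write = max_pages;
	while (nr_to_write > 0 &&
	       credits_for_pages(nr_to_write, blocks_per_page, per_extent) >
	       max_credit_blocks)
		nr_to_write--;
	return nr_to_write;
}
```

Under these assumptions, pages_per_transaction(4096, 8192, 1, 6) yields 1024,
which is consistent with the "about 1K pages" figure above; a different
journal size, tree depth or quota setup would change the number.
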
Verified with memory pressure case and umount case,

Signed-off-by: Mingming Cao <cmm@xxxxxxxxxx>
---
 fs/ext4/ext4.h         |    4 -
 fs/ext4/ext4_extents.h |    3 -
 fs/ext4/ext4_jbd2.h    |   10 ++++
 fs/ext4/extents.c      |   78 ++++++++++++++++++-------------
 fs/ext4/inode.c        |  120 ++++++++++++++++++++++++++-----------------------
 fs/ext4/migrate.c      |    6 +-
 6 files changed, 129 insertions(+), 92 deletions(-)

Index: linux-2.6.26git6/fs/ext4/ext4.h
===================================================================
--- linux-2.6.26git6.orig/fs/ext4/ext4.h	2008-07-28 22:47:22.000000000 -0700
+++ linux-2.6.26git6/fs/ext4/ext4.h	2008-07-29 17:40:40.000000000 -0700
@@ -1072,7 +1072,7 @@ extern void ext4_truncate (struct inode
 extern void ext4_set_inode_flags(struct inode *);
 extern void ext4_get_inode_flags(struct ext4_inode_info *);
 extern void ext4_set_aops(struct inode *inode);
-extern int ext4_writepage_trans_blocks(struct inode *);
+extern int ext4_writepages_trans_blocks(struct inode *, int nrpages);
 extern int ext4_block_truncate_page(handle_t *handle,
 		struct address_space *mapping, loff_t from);
 extern int ext4_page_mkwrite(struct vm_area_struct *vma, struct page *page);
@@ -1227,7 +1227,7 @@ extern const struct inode_operations ext
 
 /* extents.c */
 extern int ext4_ext_tree_init(handle_t *handle, struct inode *);
-extern int ext4_ext_writepage_trans_blocks(struct inode *, int);
+extern int ext4_ext_writeblocks_trans_credits(struct inode *inode, int);
 extern int ext4_ext_get_blocks(handle_t *handle, struct inode *inode,
 			ext4_lblk_t iblock,
 			unsigned long max_blocks, struct buffer_head *bh_result,
Index: linux-2.6.26git6/fs/ext4/extents.c
===================================================================
--- linux-2.6.26git6.orig/fs/ext4/extents.c	2008-07-28 22:53:20.000000000 -0700
+++ linux-2.6.26git6/fs/ext4/extents.c	2008-07-29 17:40:50.000000000 -0700
@@ -1747,34 +1747,43 @@ static int ext4_ext_rm_idx(handle_t *han
 }
 
 /*
- * ext4_ext_calc_credits_for_insert:
- * This routine returns max. credits that the extent tree can consume.
+ * ext4_ext_calc_credits_for_single_extent:
+ * This routine returns max. credits that needed to insert an extent
+ * to the extent tree.
  * It should be OK for low-performance paths like ->writepage()
  * To allow many writing processes to fit into a single transaction,
- * the caller should calculate credits under i_data_sem and
- * pass the actual path.
+ * When pass the actual path, the caller should calculate credits
+ * under i_data_sem.
+ *
+ * For inserting a single extent, in the worse case extent tree depth is 5
+ * for old tree and new tree, for every level we need to reserve
+ * credits to log the bitmap and block group descriptors
+ *
+ * credit needed for the update of super block + inode block + quota files
+ * are not included here. The caller of this function need to take care of this.
  */
-int ext4_ext_calc_credits_for_insert(struct inode *inode,
+int ext4_ext_calc_credits_for_single_extent(struct inode *inode,
 						struct ext4_ext_path *path)
 {
 	int depth, needed;
 
+	depth = ext_depth(inode);
+
 	if (path) {
 		/* probably there is space in leaf? */
-		depth = ext_depth(inode);
 		if (le16_to_cpu(path[depth].p_hdr->eh_entries)
 				< le16_to_cpu(path[depth].p_hdr->eh_max))
-			return 1;
+			/* 1 for block bitmap, 1 for group descriptor */
+			return 2;
 	}
 
-	/*
-	 * given 32-bit logical block (4294967296 blocks), max. tree
-	 * can be 4 levels in depth -- 4 * 340^4 == 53453440000.
-	 * Let's also add one more level for imbalance.
-	 */
-	depth = 5;
+	/* add one more level in case of tree increase when insert a extent */
+	depth += 1;
 
-	/* allocation of new data block(s) */
+	/*
+	 * bitmap blocks and group descriptor block for
+	 * allocation of new extent
+	 */
 	needed = 2;
 
 	/*
@@ -1791,9 +1800,6 @@ int ext4_ext_calc_credits_for_insert(str
 	 */
 	needed += (depth * 2) + (depth * 2);
 
-	/* any allocation modifies superblock */
-	needed += 1;
-
 	return needed;
 }
 
@@ -1917,9 +1923,7 @@ ext4_ext_rm_leaf(handle_t *handle, struc
 			correct_index = 1;
 			credits += (ext_depth(inode)) + 1;
 		}
-#ifdef CONFIG_QUOTA
 		credits += 2 * EXT4_QUOTA_TRANS_BLOCKS(inode->i_sb);
-#endif
 
 		err = ext4_ext_journal_restart(handle, credits);
 		if (err)
@@ -2801,8 +2805,8 @@ void ext4_ext_truncate(struct inode *ino
 	/*
 	 * probably first extent we're gonna free will be last in block
 	 */
-	err = ext4_writepage_trans_blocks(inode) + 3;
-	handle = ext4_journal_start(inode, err);
+	handle = ext4_journal_start(inode,
+			ext4_writepages_trans_blocks(inode, 1) + 3);
 	if (IS_ERR(handle))
 		return;
 
@@ -2855,22 +2859,32 @@ out_stop:
 }
 
 /*
- * ext4_ext_writepage_trans_blocks:
+ * ext4_ext_writeblocks_trans_credits:
  * calculate max number of blocks we could modify
- * in order to allocate new block for an inode
+ * in order to allocate the required number of new blocks
+ *
+ * In the worse case, one block per extent.
+ *
  */
-int ext4_ext_writepage_trans_blocks(struct inode *inode, int num)
+int ext4_ext_writeblocks_trans_credits(struct inode *inode, int nrblocks)
 {
 	int needed;
 
-	needed = ext4_ext_calc_credits_for_insert(inode, NULL);
-
-	/* caller wants to allocate num blocks, but note it includes sb */
-	needed = needed * num - (num - 1);
+	/* cost of adding a single extent:
+	 * index blocks, leafs, bitmaps,
+	 * groupdescp
+	 */
+	needed = ext4_ext_calc_credits_for_single_extent(inode, NULL);
+	/*
+	 * For data=journalled mode need to account for the data blocks
+	 * Also need to add super block and inode block
+	 */
+	if (ext4_should_journal_data(inode))
+		needed = nrblocks * (needed + 1) + 2;
+	else
+		needed = nrblocks * needed + 2;
 
-#ifdef CONFIG_QUOTA
 	needed += 2 * EXT4_QUOTA_TRANS_BLOCKS(inode->i_sb);
-#endif
 
 	return needed;
 }
@@ -2935,10 +2949,9 @@ long ext4_fallocate(struct inode *inode,
 	max_blocks = (EXT4_BLOCK_ALIGN(len + offset, blkbits) >> blkbits)
 							- block;
 	/*
-	 * credits to insert 1 extent into extent tree + buffers to be able to
-	 * modify 1 super block, 1 block bitmap and 1 group descriptor.
+	 * credits to insert 1 extent into extent tree
 	 */
-	credits = EXT4_DATA_TRANS_BLOCKS(inode->i_sb) + 3;
+	credits = EXT4_DATA_TRANS_BLOCKS(inode->i_sb);
 	mutex_lock(&inode->i_mutex);
 retry:
 	while (ret >= 0 && ret < max_blocks) {
Index: linux-2.6.26git6/fs/ext4/inode.c
===================================================================
--- linux-2.6.26git6.orig/fs/ext4/inode.c	2008-07-28 22:53:21.000000000 -0700
+++ linux-2.6.26git6/fs/ext4/inode.c	2008-07-29 17:45:43.000000000 -0700
@@ -1,5 +1,5 @@
 /*
- * linux/fs/ext4/inode.c
+ * linux/fs/ext4/inode.c
  *
  * Copyright (C) 1992, 1993, 1994, 1995
  * Remy Card (card@xxxxxxxxxxx)
@@ -954,15 +954,6 @@ out:
 
 /* Maximum number of blocks we map for direct IO at once. */
 #define DIO_MAX_BLOCKS 4096
-/*
- * Number of credits we need for writing DIO_MAX_BLOCKS:
- * We need sb + group descriptor + bitmap + inode -> 4
- * For B blocks with A block pointers per block we need:
- * 1 (triple ind.) + (B/A/A + 2) (doubly ind.) + (B/A + 2) (indirect).
- * If we plug in 4096 for B and 256 for A (for 1KB block size), we get 25.
- */
-#define DIO_CREDITS 25
-
 
 /*
  *
@@ -1082,13 +1073,13 @@ static int ext4_get_block(struct inode *
 	handle_t *handle = ext4_journal_current_handle();
 	int ret = 0, started = 0;
 	unsigned max_blocks = bh_result->b_size >> inode->i_blkbits;
+	int dio_credits = EXT4_DATA_TRANS_BLOCKS(inode->i_sb);
 
 	if (create && !handle) {
 		/* Direct IO write... */
 		if (max_blocks > DIO_MAX_BLOCKS)
 			max_blocks = DIO_MAX_BLOCKS;
-		handle = ext4_journal_start(inode, DIO_CREDITS +
-			2 * EXT4_QUOTA_TRANS_BLOCKS(inode->i_sb));
+		handle = ext4_journal_start(inode, dio_credits);
 		if (IS_ERR(handle)) {
 			ret = PTR_ERR(handle);
 			goto out;
@@ -1267,7 +1258,7 @@ static int ext4_write_begin(struct file
 				struct page **pagep, void **fsdata)
 {
 	struct inode *inode = mapping->host;
-	int ret, needed_blocks = ext4_writepage_trans_blocks(inode);
+	int ret, needed_blocks = ext4_writepages_trans_blocks(inode, 1);
 	handle_t *handle;
 	int retries = 0;
 	struct page *page;
@@ -2153,20 +2144,6 @@ static int ext4_da_writepage(struct page
 	return ret;
 }
 
-
-/*
- * For now just follow the DIO way to estimate the max credits
- * needed to write out EXT4_MAX_WRITEBACK_PAGES.
- * todo: need to calculate the max credits need for
- * extent based files, currently the DIO credits is based on
- * indirect-blocks mapping way.
- *
- * Probably should have a generic way to calculate credits
- * for DIO, writepages, and truncate
- */
-#define EXT4_MAX_WRITEBACK_PAGES DIO_MAX_BLOCKS
-#define EXT4_MAX_WRITEBACK_CREDITS DIO_CREDITS
-
 static int ext4_da_writepages(struct address_space *mapping,
 				struct writeback_control *wbc)
 {
@@ -2176,22 +2153,24 @@ static int ext4_da_writepages(struct add
 	int ret = 0;
 	long to_write;
 	loff_t range_start = 0;
+	int blocks_per_page = PAGE_CACHE_SIZE >> inode->i_blkbits;
+	int max_credit_blocks = ext4_journal_max_transaction_buffers(inode);
+	int need_credits_per_page = ext4_writepages_trans_blocks(inode, 1);
+	int max_writeback_pages = (max_credit_blocks / blocks_per_page) / need_credits_per_page;
 
 	/*
 	 * No pages to write? This is mainly a kludge to avoid starting
 	 * a transaction for special inodes like journal inode on last iput()
 	 * because that could violate lock ordering on umount
 	 */
-	if (!mapping->nrpages)
+	if (!mapping->nrpages || !mapping_tagged(mapping, PAGECACHE_TAG_DIRTY))
 		return 0;
 
-	/*
-	 * Estimate the worse case needed credits to write out
-	 * EXT4_MAX_BUF_BLOCKS pages
-	 */
-	needed_blocks = EXT4_MAX_WRITEBACK_CREDITS;
+	if (wbc->nr_to_write > mapping->nrpages)
+		wbc->nr_to_write = mapping->nrpages;
 
 	to_write = wbc->nr_to_write;
+
 	if (!wbc->range_cyclic) {
 		/*
 		 * If range_cyclic is not set force range_cont
@@ -2202,10 +2181,31 @@ static int ext4_da_writepages(struct add
 	}
 
 	while (!ret && to_write) {
+		/*
+		 * set the max dirty pages could be write at a time
+		 * to fit into the reserved transaction credits
+		 */
+		if (wbc->nr_to_write > max_writeback_pages)
+			wbc->nr_to_write = max_writeback_pages;
+
+		/*
+		 * Estimate the worse case needed credits to write out
+		 * to_write pages
+		 */
+		needed_blocks = ext4_writepages_trans_blocks(inode,
+						wbc->nr_to_write);
+		while (needed_blocks > max_credit_blocks) {
+			wbc->nr_to_write --;
+			needed_blocks = ext4_writepages_trans_blocks(inode,
+						wbc->nr_to_write);
+		}
 		/* start a new transaction*/
 		handle = ext4_journal_start(inode, needed_blocks);
 		if (IS_ERR(handle)) {
 			ret = PTR_ERR(handle);
+			printk(KERN_EMERG "%s: Not enough credits to flush %ld pages\n", __func__,
+				wbc->nr_to_write);
+			dump_stack();
 			goto out_writepages;
 		}
 		if (ext4_should_order_data(inode)) {
@@ -2221,12 +2221,6 @@ static int ext4_da_writepages(struct add
 			}
 		}
 
-		/*
-		 * set the max dirty pages could be write at a time
-		 * to fit into the reserved transaction credits
-		 */
-		if (wbc->nr_to_write > EXT4_MAX_WRITEBACK_PAGES)
-			wbc->nr_to_write = EXT4_MAX_WRITEBACK_PAGES;
 		to_write -= wbc->nr_to_write;
 
 		ret = mpage_da_writepages(mapping, wbc,
@@ -2587,7 +2581,8 @@ static int __ext4_journalled_writepage(s
 	 * references to buffers so we are safe
 	 */
 	unlock_page(page);
-	handle = ext4_journal_start(inode, ext4_writepage_trans_blocks(inode));
+	handle = ext4_journal_start(inode,
+			ext4_writepages_trans_blocks(inode, 1));
 	if (IS_ERR(handle)) {
 		ret = PTR_ERR(handle);
 		goto out;
@@ -4271,20 +4266,20 @@ int ext4_getattr(struct vfsmount *mnt, s
 /*
  * How many blocks doth make a writepage()?
  *
- * With N blocks per page, it may be:
- * N data blocks
+ * With N blocks per page, and P pages, it may be:
+ * N*P data blocks
  * 2 indirect block
  * 2 dindirect
  * 1 tindirect
- * N+5 bitmap blocks (from the above)
- * N+5 group descriptor summary blocks
+ * N*P+5 bitmap blocks (from the above)
+ * N*P+5 group descriptor summary blocks
  * 1 inode block
  * 1 superblock.
  * 2 * EXT4_SINGLEDATA_TRANS_BLOCKS for the quote files
  *
- * 3 * (N + 5) + 2 + 2 * EXT4_SINGLEDATA_TRANS_BLOCKS
+ * 3 * (N*P + 5) + 2 + 2 * EXT4_SINGLEDATA_TRANS_BLOCKS
  *
- * With ordered or writeback data it's the same, less the N data blocks.
+ * With ordered or writeback data it's the same, less the N*P data blocks.
  *
  * If the inode's direct blocks can hold an integral number of pages then a
  * page cannot straddle two indirect blocks, and we can only touch one indirect
@@ -4295,30 +4290,49 @@ int ext4_getattr(struct vfsmount *mnt, s
  * block and work out the exact number of indirects which are touched. Pah.
  */
 
-int ext4_writepage_trans_blocks(struct inode *inode)
+static int ext4_writeblocks_trans_credits_old(struct inode *inode, int nrblocks)
 {
-	int bpp = ext4_journal_blocks_per_page(inode);
-	int indirects = (EXT4_NDIR_BLOCKS % bpp) ? 5 : 3;
+	int indirects = (EXT4_NDIR_BLOCKS % nrblocks) ? 5 : 3;
 	int ret;
 
-	if (EXT4_I(inode)->i_flags & EXT4_EXTENTS_FL)
-		return ext4_ext_writepage_trans_blocks(inode, bpp);
-
 	if (ext4_should_journal_data(inode))
-		ret = 3 * (bpp + indirects) + 2;
+		ret = 3 * (nrblocks + indirects) + 2;
 	else
-		ret = 2 * (bpp + indirects) + 2;
+		ret = 2 * nrblocks + 3* indirects + 2;
 
-#ifdef CONFIG_QUOTA
 	/* We know that structure was already allocated during DQUOT_INIT so
 	 * we will be updating only the data blocks + inodes */
 	ret += 2*EXT4_QUOTA_TRANS_BLOCKS(inode->i_sb);
-#endif
 
 	return ret;
 }
 
 /*
+ * Calulate the total number of credits to reserve to fit
+ * the modification of @num pages into a single transaction
+ *
+ * This could be called via ext4_write_begin() or later
+ * ext4_da_writepages() in delalyed allocation case.
+ *
+ * In both case it's possible that we could allocating multiple
+ * chunks of blocks. We need to consider the worse case, when
+ * one new block per extent.
+ *
+ * For Direct IO and fallocate, the journal credits reservation
+ * is based on one single extent allocation, so they could use
+ * EXT4_DATA_TRANS_BLOCKS to get the needed credit to log a single
+ * chunk of allocation needs.
+ */
+int ext4_writepages_trans_blocks(struct inode *inode, int nrpages)
+{
+	int bpp = ext4_journal_blocks_per_page(inode);
+	int nrblocks = nrpages * bpp;
+
+	if (!(EXT4_I(inode)->i_flags & EXT4_EXTENTS_FL))
+		return ext4_writeblocks_trans_credits_old(inode, nrblocks);
+	return ext4_ext_writeblocks_trans_credits(inode, nrblocks);
+}
+/*
  * The caller must have previously called ext4_reserve_inode_write().
  * Give this, we know that the caller already has write access to iloc->bh.
  */
Index: linux-2.6.26git6/fs/ext4/migrate.c
===================================================================
--- linux-2.6.26git6.orig/fs/ext4/migrate.c	2008-07-13 14:51:29.000000000 -0700
+++ linux-2.6.26git6/fs/ext4/migrate.c	2008-07-28 22:53:21.000000000 -0700
@@ -52,9 +52,11 @@ static int finish_range(handle_t *handle
 	 * Since we are doing this in loop we may accumalate extra
 	 * credit. But below we try to not accumalate too much
 	 * of them by restarting the journal.
+	 *
+	 * extra 4 credits for: 1 superblock, 1 inode block, 2 quotas
 	 */
-	needed = ext4_ext_calc_credits_for_insert(inode, path);
-
+	needed = ext4_ext_calc_credits_for_single_extent(inode, path) + 2 +
+		2 * EXT4_QUOTA_TRANS_BLOCKS(inode->i_sb);
 	/*
 	 * Make sure the credit we accumalated is not really high
 	 */
Index: linux-2.6.26git6/fs/ext4/ext4_extents.h
===================================================================
--- linux-2.6.26git6.orig/fs/ext4/ext4_extents.h	2008-07-28 22:47:22.000000000 -0700
+++ linux-2.6.26git6/fs/ext4/ext4_extents.h	2008-07-28 22:55:40.000000000 -0700
@@ -216,7 +216,8 @@ extern int ext4_ext_calc_metadata_amount
 extern ext4_fsblk_t idx_pblock(struct ext4_extent_idx *);
 extern void ext4_ext_store_pblock(struct ext4_extent *, ext4_fsblk_t);
 extern int ext4_extent_tree_init(handle_t *, struct inode *);
-extern int ext4_ext_calc_credits_for_insert(struct inode *, struct ext4_ext_path *);
+extern int ext4_ext_calc_credits_for_single_extent(struct inode *inode,
+						   struct ext4_ext_path *path);
 extern int ext4_ext_try_to_merge(struct inode *inode,
 				 struct ext4_ext_path *path,
 				 struct ext4_extent *);
Index: linux-2.6.26git6/fs/ext4/ext4_jbd2.h
===================================================================
--- linux-2.6.26git6.orig/fs/ext4/ext4_jbd2.h	2008-07-28 22:47:22.000000000 -0700
+++ linux-2.6.26git6/fs/ext4/ext4_jbd2.h	2008-07-28 22:53:21.000000000 -0700
@@ -231,4 +231,14 @@ static inline int ext4_should_writeback_
 	return 0;
 }
 
+static inline int ext4_journal_max_transaction_buffers(struct inode *inode)
+{
+	/*
+	 * max transaction buffers
+	 * calculation based on
+	 * journal->j_max_transaction_buffers = journal->j_maxlen / 4;
+	 */
+	return (EXT4_JOURNAL(inode))->j_maxlen / 4;
+}
+
 #endif /* _EXT4_JBD2_H */
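
As a reading aid for the credit arithmetic in the extents.c and inode.c
hunks above, here is a small user-space model of the three formulas. It is a
sketch only: the function names and the plain-integer parameters (tree depth,
journal-data flag, quota_blocks standing in for
EXT4_QUOTA_TRANS_BLOCKS(inode->i_sb)) are hypothetical and replace the inode
lookups the kernel code performs. NDIR_BLOCKS mirrors EXT4_NDIR_BLOCKS (12).
Unlike the patched helper, the non-extent model asserts nrblocks > 0, since
the modulo would otherwise divide by zero.

```c
#include <assert.h>

#define NDIR_BLOCKS 12	/* same value as EXT4_NDIR_BLOCKS */

/*
 * Models ext4_ext_calc_credits_for_single_extent() with path == NULL:
 * 2 credits for the bitmap and group descriptor of the new extent, plus
 * the per-level term (depth * 2) + (depth * 2) after allowing the tree
 * to grow by one level.
 */
static int single_extent_credits(int depth)
{
	assert(depth >= 0);
	depth += 1;
	return 2 + (depth * 2) + (depth * 2);
}

/* Models ext4_ext_writeblocks_trans_credits(): worst case one extent per block. */
static int ext_writeblocks_credits(int depth, int nrblocks, int journal_data,
				   int quota_blocks)
{
	int needed = single_extent_credits(depth);

	assert(nrblocks >= 0);
	if (journal_data)
		needed = nrblocks * (needed + 1) + 2;	/* data blocks are logged too */
	else
		needed = nrblocks * needed + 2;		/* + super block, inode block */
	return needed + 2 * quota_blocks;
}

/* Models ext4_writeblocks_trans_credits_old() for indirect-mapped files. */
static int old_writeblocks_credits(int nrblocks, int journal_data,
				   int quota_blocks)
{
	int indirects, ret;

	assert(nrblocks > 0);	/* guard added by this sketch only */
	indirects = (NDIR_BLOCKS % nrblocks) ? 5 : 3;
	if (journal_data)
		ret = 3 * (nrblocks + indirects) + 2;
	else
		ret = 2 * nrblocks + 3 * indirects + 2;
	return ret + 2 * quota_blocks;
}
```

For example, a depth-0 extent file writing one block in ordered mode with no
quota needs ext_writeblocks_credits(0, 1, 0, 0), i.e. 6 + 2 credits, while the
same write on an indirect-mapped file is modelled by
old_writeblocks_credits(1, 0, 0).
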