On Thu, 14 Aug 2008 14:13:40 -0400 Chris Mason <chris.mason@xxxxxxxxxx> wrote: > write_cache_pages uses i_mapping->writeback_index to pick up where it > left off the last time a given inode was found by pdflush or > balance_dirty_pages (or anyone else who sets wbc->range_cyclic) > > alloc_inode should set it to a sane value so that writeback doesn't start > in the middle of a file. It is somewhat difficult to notice the bug since > write_cache_pages will loop around to the start of the file and the > elevator helps hide the resulting seeks. > > For whatever reason, Btrfs hits this often. Unpatched, untarring 30 copies of > the linux kernel in series runs at 47MB/s on a single sata drive. With > this fix, it jumps to 62MB/s. > > Signed-off-by: Chris Mason <chris.mason@xxxxxxxxxx> > > diff -r 67160ae21a58 fs/inode.c > --- a/fs/inode.c Wed Aug 06 19:26:20 2008 -0700 > +++ b/fs/inode.c Thu Aug 14 10:15:49 2008 -0400 > @@ -166,6 +166,7 @@ static struct inode *alloc_inode(struct > mapping_set_gfp_mask(mapping, GFP_HIGHUSER_PAGECACHE); > mapping->assoc_mapping = NULL; > mapping->backing_dev_info = &default_backing_dev_info; > + mapping->writeback_index = 0; > > /* > * If the block_device provides a backing_dev_info for client > inode_cachep's ctor memsets the whole inode* to zero. Does btrfs have an ->alloc_inode handler? If so, perhaps it is btrfs which needs to zap that field. Formally, the way slab works (which is arguably silly from a CPU cache efficiency POV) is that the object should be returned to slab (kmem_cache_free()) in a "constructed" manner. Which means that this zeroing should happen in destroy_inode(). otoh, the fs/inode.c inode creation/destruction code isn't designed that way - it chose to initialise everything synchronously in alloc_inode(), which is arguably wrong. One wonders if that big memset in inode_init_once() actually does anything useful. Ho hum, so what to do? I'm suspecting that we should just zap the kmem_cachep constructor altogether, use a synchronous call to inode_init_once() to initialise all inode fields and then remove all those open-coded zeroings. Something like this? From: Andrew Morton <akpm@xxxxxxxxxxxxxxxxxxxx> Signed-off-by: Andrew Morton <akpm@xxxxxxxxxxxxxxxxxxxx> --- fs/inode.c | 40 +++++++++------------------------------- 1 file changed, 9 insertions(+), 31 deletions(-) diff -puN fs/inode.c~a fs/inode.c --- a/fs/inode.c~a +++ a/fs/inode.c @@ -115,34 +115,24 @@ static struct inode *alloc_inode(struct static const struct file_operations empty_fops; struct inode *inode; - if (sb->s_op->alloc_inode) + if (sb->s_op->alloc_inode) { inode = sb->s_op->alloc_inode(sb); - else - inode = (struct inode *) kmem_cache_alloc(inode_cachep, GFP_KERNEL); + } else { + inode = kmem_cache_alloc(inode_cachep, GFP_KERNEL); + if (inode) + inode_init_once(inode); + } if (inode) { struct address_space * const mapping = &inode->i_data; inode->i_sb = sb; inode->i_blkbits = sb->s_blocksize_bits; - inode->i_flags = 0; atomic_set(&inode->i_count, 1); inode->i_op = &empty_iops; inode->i_fop = &empty_fops; inode->i_nlink = 1; atomic_set(&inode->i_writecount, 0); - inode->i_size = 0; - inode->i_blocks = 0; - inode->i_bytes = 0; - inode->i_generation = 0; -#ifdef CONFIG_QUOTA - memset(&inode->i_dquot, 0, sizeof(inode->i_dquot)); -#endif - inode->i_pipe = NULL; - inode->i_bdev = NULL; - inode->i_cdev = NULL; - inode->i_rdev = 0; - inode->dirtied_when = 0; if (security_inode_alloc(inode)) { if (inode->i_sb->s_op->destroy_inode) inode->i_sb->s_op->destroy_inode(inode); @@ -158,15 +148,13 @@ static struct inode *alloc_inode(struct lockdep_set_class(&inode->i_mutex, &sb->s_type->i_mutex_key); init_rwsem(&inode->i_alloc_sem); - lockdep_set_class(&inode->i_alloc_sem, &sb->s_type->i_alloc_sem_key); + lockdep_set_class(&inode->i_alloc_sem, + &sb->s_type->i_alloc_sem_key); mapping->a_ops = &empty_aops; mapping->host = inode; - mapping->flags = 0; mapping_set_gfp_mask(mapping, GFP_HIGHUSER_PAGECACHE); - mapping->assoc_mapping = NULL; mapping->backing_dev_info = &default_backing_dev_info; - mapping->writeback_index = 0; /* * If the block_device provides a backing_dev_info for client @@ -181,7 +169,6 @@ static struct inode *alloc_inode(struct bdi = sb->s_bdev->bd_inode->i_mapping->backing_dev_info; mapping->backing_dev_info = bdi; } - inode->i_private = NULL; inode->i_mapping = mapping; } return inode; @@ -197,7 +184,6 @@ void destroy_inode(struct inode *inode) kmem_cache_free(inode_cachep, (inode)); } - /* * These are initializations that only need to be done * once, because the fields are idempotent across use @@ -222,16 +208,8 @@ void inode_init_once(struct inode *inode mutex_init(&inode->inotify_mutex); #endif } - EXPORT_SYMBOL(inode_init_once); -static void init_once(void *foo) -{ - struct inode * inode = (struct inode *) foo; - - inode_init_once(inode); -} - /* * inode_lock must be held */ @@ -1400,7 +1378,7 @@ void __init inode_init(void) 0, (SLAB_RECLAIM_ACCOUNT|SLAB_PANIC| SLAB_MEM_SPREAD), - init_once); + NULL); register_shrinker(&icache_shrinker); /* Hash may have been set up in inode_init_early */ _ Note that this assumes that a filesystem which implements ->alloc_inode() will call inode_init_once() within its ->alloc_inode(). Which means, I think, that we can just remove inode_init_once() altogether and move its initialisations into alloc_inode() along with all the existing ones. What do you think? I suppose that's all a fun project but doesn't preclude merging your patchlet. I'll do that for now ;) -- To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in the body of a message to majordomo@xxxxxxxxxxxxxxx More majordomo info at http://vger.kernel.org/majordomo-info.html