Re: Bug in delayed allocation: really bad block layouts!

[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

 



On Tue, Aug 12, 2008 at 10:32:05PM -0400, Theodore Tso wrote:
> On Mon, Aug 11, 2008 at 11:45:24PM +0530, Aneesh Kumar K.V wrote:
> > On Mon, Aug 11, 2008 at 08:09:12PM +0530, Aneesh Kumar K.V wrote:
> > > Can you try this patch ? The patch make group preallocation use the goal
> > > block.
> > > 
> > 
> > Results with and without patch.
> > 
> > http://www.radian.org/~kvaneesh/ext4/lg-fragmentation/
> > 
> 
> My results match yours; seems to be a bit better, but it's not fixing
> the fundamental problem.  With the patch:
> 
>  26524: expecting 638190 actual extent phys 631960 log 1 len 1
>  26527: expecting 638191 actual extent phys 631963 log 1 len 1
>  26533: expecting 638192 actual extent phys 631976 log 1 len 5
>  26534: expecting 638193 actual extent phys 631981 log 1 len 2
>  26536: expecting 638194 actual extent phys 631984 log 1 len 6
>  26538: expecting 638195 actual extent phys 631991 log 1 len 5
>  26540: expecting 638196 actual extent phys 631997 log 1 len 2
>  26545: expecting 638197 actual extent phys 632009 log 1 len 1
>  26546: expecting 638198 actual extent phys 632010 log 1 len 6
>  26604: expecting 638199 actual extent phys 632156 log 1 len 1
> 
> Useing debugfs's stat command to look at the blocks:
> 
> 26524: (0):638189, (1):631960
> 26527: (0):638190, (1):631963
> 26533: (0):638191, (1-5):631976-631980
> 26534: (0):638192, (1-2):631981-631982
> 26536: (0):638193, (1-6):631984-631989
> 26538: (0):638194, (1-5):631991-631995
> 26540: (0):638195, (1-2):631997-631998
> 26545: (0):638196, (1):632009
> 26546: (0):638197, (1-6):632010-632015

I am not sure why we are getting single block request for inodes
26524 etc. With delayed alloc we should have got 2 block request.

> 
> Out of curiosity, I also probed the inode numbers that were out of
> sequence from above.  They seem to be mostly allocating out of the
> numbers used for the second extent, above.  
> 
> 26526: (0):631961
> 26526: (0):631962
> 26528: (0):631964
> 26529: (0):411742
> 26530: (0):631965
> 26531: (0-1):631966-631967
> 26532: (0-7):631968-631975
> 26535: (0):631983
> 26537: (0):631990
> 26541: (0-7):631999-632006
> 26542: (0):632007
> 26543: (0):632008
> 26544: (0):411743
> 26547: (0):632016
> 
> Inode  Pathname
> 26524  /lib/rhythmbox/plugins/lyrics/LyricsConfigureDialog.py
> 26525  /lib/rhythmbox/plugins/lyrics/LyrcParser.py
> 26526  /lib/rhythmbox/plugins/lyrics/LyricsParse.py
> 26527  /lib/rhythmbox/plugins/lyrics/LyricsConfigureDialog.pyc
> 26528  /lib/rhythmbox/plugins/lyrics/WinampcnParser.py
> 26529  /lib/rhythmbox/plugins/magnatune
> 26530  /lib/rhythmbox/plugins/magnatune/magnatune_logo_color_small.png
> 26531  /lib/rhythmbox/plugins/magnatune/magnatune.rb-plugin
> 26532  /lib/rhythmbox/plugins/magnatune/magnatune-prefs.glade
> 26533  /lib/rhythmbox/plugins/magnatune/MagnatuneSource.pyc
> 26534  /lib/rhythmbox/plugins/magnatune/__init__.py
> 26535  /lib/rhythmbox/plugins/magnatune/BuyAlbumHandler.py
> 26536  /lib/rhythmbox/plugins/magnatune/magnatune-purchase.glade
> 26537  /lib/rhythmbox/plugins/magnatune/TrackListHandler.py
> 26538  /lib/rhythmbox/plugins/magnatune/MagnatuneSource.py
> 26539  /lib/rhythmbox/plugins/magnatune/magnatune_logo_color_tiny.png
> 26540  /lib/rhythmbox/plugins/magnatune/__init__.pyc
> 26541  /lib/rhythmbox/plugins/magnatune/magnatune-loading.glade
> 26542  /lib/rhythmbox/plugins/magnatune/TrackListHandler.pyc
> 26543  /lib/rhythmbox/plugins/magnatune/BuyAlbumHandler.pyc
> 26544  /lib/rhythmbox/plugins/audioscrobbler
> 26546  /lib/rhythmbox/plugins/audioscrobbler/audioscrobbler-prefs.glade
> 26547  /lib/rhythmbox/plugins/audioscrobbler/audioscrobbler-ui.xml
> 
> Looks like we still have some problems with the block allocator...

The problem is with delalloc and mballoc locality group. With delalloc
we use pdflush to write the pages. Small file allocation use a per-cpu
prealloc space. In my understanding using Per-CPU prealloc space is
fine without delalloc. Because without delalloc get_block happens in the
process context at write_begin and OS scheduler will not schedule the
task to other CPU unless needed.

With delalloc we have pdflush doing block allocation and using per-cpu
may not really help here. So i tried a small patch as below. But that
didn't help much. Also the patch would increase contention on the
locality group mutex. So i guess the change is not worth. 

But with delalloc we should have got multiple block request together.
That implies we should get a single get_block request for the whole
file. I will have to instrument the kernel to understand why it is not
happening.

Even though the files are fragmented I guess we have those blocks closer
on disk right ?

diff --git a/fs/ext4/ext4_i.h b/fs/ext4/ext4_i.h
index ef7409f..734b6ef 100644
--- a/fs/ext4/ext4_i.h
+++ b/fs/ext4/ext4_i.h
@@ -163,6 +163,8 @@ struct ext4_inode_info {
 	/* mballoc */
 	struct list_head i_prealloc_list;
 	spinlock_t i_prealloc_lock;
+	/* locality group used for block allocation */
+	struct ext4_locality_group *lg;
 
 	/* allocation reservation info for delalloc */
 	unsigned long i_reserved_data_blocks;
diff --git a/fs/ext4/mballoc.c b/fs/ext4/mballoc.c
index 25fe375..293f048 100644
--- a/fs/ext4/mballoc.c
+++ b/fs/ext4/mballoc.c
@@ -4061,9 +4061,10 @@ static inline void ext4_mb_show_ac(struct ext4_allocation_context *ac)
  */
 static void ext4_mb_group_or_file(struct ext4_allocation_context *ac)
 {
-	struct ext4_sb_info *sbi = EXT4_SB(ac->ac_sb);
-	int bsbits = ac->ac_sb->s_blocksize_bits;
 	loff_t size, isize;
+	int bsbits = ac->ac_sb->s_blocksize_bits;
+	struct ext4_sb_info *sbi = EXT4_SB(ac->ac_sb);
+	struct ext4_inode_info *ei = EXT4_I(ac->ac_inode);
 
 	if (!(ac->ac_flags & EXT4_MB_HINT_DATA))
 		return;
@@ -4085,13 +4086,23 @@ static void ext4_mb_group_or_file(struct ext4_allocation_context *ac)
 	 * per cpu locality group is to reduce the contention between block
 	 * request from multiple CPUs.
 	 */
-	ac->ac_lg = &sbi->s_locality_groups[get_cpu()];
-	put_cpu();
+	if (ei->lg)
+		ac->ac_lg = ei->lg;
+	else {
+		ac->ac_lg = &sbi->s_locality_groups[get_cpu()];
+		ei->lg = ac->ac_lg;
+		put_cpu();
+	}
 
 	/* we're going to use group allocation */
 	ac->ac_flags |= EXT4_MB_HINT_GROUP_ALLOC;
 
-	/* serialize all allocations in the group */
+	/*
+	 * serialize all allocations in the group
+	 * If we find lot of contention we may want
+	 * to add waiters count and use other lg if
+	 * we have large number of waiters
+	 */
 	mutex_lock(&ac->ac_lg->lg_mutex);
 }
 
diff --git a/fs/ext4/super.c b/fs/ext4/super.c
index 09e3c56..08bdbf9 100644
--- a/fs/ext4/super.c
+++ b/fs/ext4/super.c
@@ -576,6 +576,7 @@ static struct inode *ext4_alloc_inode(struct super_block *sb)
 	ei->i_reserved_meta_blocks = 0;
 	ei->i_allocated_meta_blocks = 0;
 	ei->i_delalloc_reserved_flag = 0;
+	ei->lg = NULL;
 	spin_lock_init(&(ei->i_block_reservation_lock));
 	return &ei->vfs_inode;
 }
--
To unsubscribe from this list: send the line "unsubscribe linux-ext4" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at  http://vger.kernel.org/majordomo-info.html

[Index of Archives]     [Reiser Filesystem Development]     [Ceph FS]     [Kernel Newbies]     [Security]     [Netfilter]     [Bugtraq]     [Linux FS]     [Yosemite National Park]     [MIPS Linux]     [ARM Linux]     [Linux Security]     [Linux RAID]     [Samba]     [Device Mapper]     [Linux Media]

  Powered by Linux