Re: [PATCH] fix ext4_free_inode vs. ext4_claim_inode race

[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

 



Eric Sandeen wrote:

> I was seeing fsck errors on inode bitmaps after a 4 thread
> dbench run on a 4 cpu machine:
>
> Inode bitmap differences: -50736 -(50752--50753) etc...
>
> I believe that this is because ext4_free_inode() uses atomic
> bitops, and although ext4_new_inode() *used* to also use atomic 
> bitops for synchronization, commit 
> 393418676a7602e1d7d3f6e560159c65c8cbd50e changed this to use
> the sb_bgl_lock, so that we could also synchronize against
> read_inode_bitmap and initialization of uninit inode tables.
>
> However, that change left ext4_free_inode using atomic bitops,
> which I think leaves no synchronization between setting & 
> unsetting bits in the inode table.
>
> The below patch fixes it for me, although I wonder if we're 
> getting at all heavy-handed with this spinlock...
>
> Signed-off-by: Eric Sandeen <sandeen@xxxxxxxxxx>
> ---
>   
This patch may be a bit faster, though a little trickier.

Basically, we have 2 races to worry about, I think:
* set_bit vs. clear_bit
* set_bit vs. itable init

We don't have to worry about clear_bit vs. itable_init because if
we are clearing an inode, by definition the itable can't be uninit.

So I think we can leave ext4_free_inode() as-is with the atomic
bitops clearing the bitmap, and instead just modify ext4_claim_inode():

The idea is to only take the spinlock if the bg is uninit (using
a test/lock/retest scheme) and if it's not really uninit (meaning
there is no race with the uninit->initialization thread) then
just use the atomic bitops at that point.

This did seem to speed up my dbench testing by a percent or two,
though I should probably do an aggregate of a few more runs to be
sure.

If this is too tricky and could use more soak time, we could live
with the original patch for 2.6.29, I think, while we ponder & test
a better solution to this.

(I need to convince myself that there is no window here yet
between the potentially nonatomic set_bit and the atomic
clear_bit in free_inode... but I think it's ok, as the inode
should not be findable to be freed until new_inode is quite
finished with it.)

Thanks,
-Eric

Signed-off-by: Eric Sandeen <sandeen@xxxxxxxxxx>
---

Index: linux-2.6/fs/ext4/ialloc.c
===================================================================
--- linux-2.6.orig/fs/ext4/ialloc.c
+++ linux-2.6/fs/ext4/ialloc.c
@@ -609,26 +609,33 @@ static int ext4_claim_inode(struct super
 			struct buffer_head *inode_bitmap_bh,
 			unsigned long ino, ext4_group_t group, int mode)
 {
-	int free = 0, retval = 0, count;
+	int free = 0, bitset, count;
 	struct ext4_sb_info *sbi = EXT4_SB(sb);
 	struct ext4_group_desc *gdp = ext4_get_group_desc(sb, group, NULL);
 
-	spin_lock(sb_bgl_lock(sbi, group));
-	if (ext4_set_bit(ino, inode_bitmap_bh->b_data)) {
-		/* not a free inode */
-		retval = 1;
-		goto err_ret;
+	/* if uninit, protect against ext4_read_inode_bitmap initialization */
+	bitset = -1;
+	if (gdp->bg_flags & cpu_to_le16(EXT4_BG_INODE_UNINIT)) {
+		spin_lock(sb_bgl_lock(sbi, group));
+		if (gdp->bg_flags & cpu_to_le16(EXT4_BG_INODE_UNINIT))
+			bitset = ext4_set_bit(ino, inode_bitmap_bh->b_data);
+		spin_unlock(sb_bgl_lock(sbi, group));
 	}
+	if (bitset < 0) /* we didn't set it above, so not uninit */
+		bitset = ext4_set_bit_atomic(sb_bgl_lock(sbi, group),
+					ino, inode_bitmap_bh->b_data);
+	if (bitset)	/* this is not a free inode */
+		return 1;
 	ino++;
 	if ((group == 0 && ino < EXT4_FIRST_INO(sb)) ||
 			ino > EXT4_INODES_PER_GROUP(sb)) {
-		spin_unlock(sb_bgl_lock(sbi, group));
 		ext4_error(sb, __func__,
 			   "reserved inode or inode > inodes count - "
 			   "block_group = %u, inode=%lu", group,
 			   ino + group * EXT4_INODES_PER_GROUP(sb));
 		return 1;
 	}
+	spin_lock(sb_bgl_lock(sbi, group));
 	/* If we didn't allocate from within the initialized part of the inode
 	 * table then we need to initialize up to this inode. */
 	if (EXT4_HAS_RO_COMPAT_FEATURE(sb, EXT4_FEATURE_RO_COMPAT_GDT_CSUM)) {
@@ -665,9 +672,8 @@ static int ext4_claim_inode(struct super
 		ext4_used_dirs_set(sb, gdp, count);
 	}
 	gdp->bg_checksum = ext4_group_desc_csum(sbi, group, gdp);
-err_ret:
 	spin_unlock(sb_bgl_lock(sbi, group));
-	return retval;
+	return 0;
 }
 
 /*

--
To unsubscribe from this list: send the line "unsubscribe linux-ext4" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at  http://vger.kernel.org/majordomo-info.html

[Index of Archives]     [Reiser Filesystem Development]     [Ceph FS]     [Kernel Newbies]     [Security]     [Netfilter]     [Bugtraq]     [Linux FS]     [Yosemite National Park]     [MIPS Linux]     [ARM Linux]     [Linux Security]     [Linux RAID]     [Samba]     [Device Mapper]     [Linux Media]

  Powered by Linux