Re: [PATCH] ext4: improve smp scalability for inode generation

[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

 



> On Nov 10, 2017, at 3:39 PM, Andreas Dilger <adilger@xxxxxxxxx> wrote:
> 
> On Nov 8, 2017, at 8:23 PM, Theodore Ts'o <tytso@xxxxxxx> wrote:
>> commit f0e922e7235e1b5ba6fd964e2cf8dafed3248a15
>> Author: Theodore Ts'o <tytso@xxxxxxx>
>> Date:   Wed Nov 8 22:21:58 2017 -0500
>> 
>>   ext4: improve smp scalability for inode generation
>> 
>>   ->s_next_generation is protected by s_next_gen_lock but its usage
>>   pattern is very primitive.  We don't actually need sequentailly
>>   increasing new generation numbers, so let's use prandom_u32() instead.
> 
> We discussed this on the ext4 concall this week, but came to the conclusion
> that a 32-bit random number has an unacceptably high chance of collision
> (about 1/2^16 allocations) due to the birthday paradox.
> 
> Instead, an approach like the following should be OK:
> 1. num_gen_per_cpu = 2^32/num_online_cpus()
> 2. initialize CPU0 generation = prandom_u32()
> 3. initialize CPUn generation = CPU0 + n * num_gen_per_cpu
> 4. use sequential generation numbers for each local CPU
> 5. after num_gen_per_cpu generations per CPU hit on any CPU, goto 2.

Actually, never mind.  I wrote a test program to verify this, and this
pointed out the flaw in my argument, namely that the birthday paradox
only matters if each random number is compared against all other random
numbers in the pool.

In this case, we only care whether i_generation.old == i_generation.new
for each individual inode, and not whether i_generation.new is the same
as _any_ old generation.

That means there is a 1/2^32 chance of collisions for each allocated
inode, rather than 1/2^16.  With only prandom_u32() there is always
a 1/2^32 chance that creating and deleting the same inode in a loop
will result in two identical sequential generation numbers.

There was also a chance of collisions with the old code, though it
wouldn't happen for a long time when the workload was reusing the
same set of inodes repeatedly (e.g. create and delete in a single
directory).  Instead, it would be more likely to hit for old inodes
that were recently deleted and re-allocated, or after a reboot.

This (mostly|only) affects NFS exports, so I'm not sure how likely
it is to happen in real life.

Cheers, Andreas

>> 
>>   Reported-by: Dmitry Monakhov <dmonakhov@xxxxxxxxxx>
>>   Signed-off-by: Theodore Ts'o <tytso@xxxxxxx>
>> 
>> diff --git a/fs/ext4/ext4.h b/fs/ext4/ext4.h
>> index 53ce95b52fd8..5e6d7b6f50c7 100644
>> --- a/fs/ext4/ext4.h
>> +++ b/fs/ext4/ext4.h
>> @@ -1355,8 +1355,6 @@ struct ext4_sb_info {
>> 	int s_first_ino;
>> 	unsigned int s_inode_readahead_blks;
>> 	unsigned int s_inode_goal;
>> -	spinlock_t s_next_gen_lock;
>> -	u32 s_next_generation;
>> 	u32 s_hash_seed[4];
>> 	int s_def_hash_version;
>> 	int s_hash_unsigned;	/* 3 if hash should be signed, 0 if not */
>> diff --git a/fs/ext4/ialloc.c b/fs/ext4/ialloc.c
>> index ee823022aa34..da79eb5dba40 100644
>> --- a/fs/ext4/ialloc.c
>> +++ b/fs/ext4/ialloc.c
>> @@ -1138,9 +1138,7 @@ struct inode *__ext4_new_inode(handle_t *handle, struct inode *dir,
>> 			   inode->i_ino);
>> 		goto out;
>> 	}
>> -	spin_lock(&sbi->s_next_gen_lock);
>> -	inode->i_generation = sbi->s_next_generation++;
>> -	spin_unlock(&sbi->s_next_gen_lock);
>> +	inode->i_generation = prandom_u32();
>> 
>> 	/* Precompute checksum seed for inode metadata */
>> 	if (ext4_has_metadata_csum(sb)) {
>> diff --git a/fs/ext4/ioctl.c b/fs/ext4/ioctl.c
>> index 144bbda2b808..98ad8172dfd3 100644
>> --- a/fs/ext4/ioctl.c
>> +++ b/fs/ext4/ioctl.c
>> @@ -17,6 +17,7 @@
>> #include <linux/uuid.h>
>> #include <linux/uaccess.h>
>> #include <linux/delay.h>
>> +#include <linux/random.h>
>> #include "ext4_jbd2.h"
>> #include "ext4.h"
>> #include <linux/fsmap.h>
>> @@ -157,10 +158,8 @@ static long swap_inode_boot_loader(struct super_block *sb,
>> 
>> 	inode->i_ctime = inode_bl->i_ctime = current_time(inode);
>> 
>> -	spin_lock(&sbi->s_next_gen_lock);
>> -	inode->i_generation = sbi->s_next_generation++;
>> -	inode_bl->i_generation = sbi->s_next_generation++;
>> -	spin_unlock(&sbi->s_next_gen_lock);
>> +	inode->i_generation = prandom_u32();
>> +	inode_bl->i_generation = prandom_u32();
>> 
>> 	ext4_discard_preallocations(inode);
>> 
>> diff --git a/fs/ext4/super.c b/fs/ext4/super.c
>> index 3a278faf5868..9f2e3eb5131f 100644
>> --- a/fs/ext4/super.c
>> +++ b/fs/ext4/super.c
>> @@ -3982,8 +3982,6 @@ static int ext4_fill_super(struct super_block *sb, void *data, int silent)
>> 	}
>> 
>> 	sbi->s_gdb_count = db_count;
>> -	get_random_bytes(&sbi->s_next_generation, sizeof(u32));
>> -	spin_lock_init(&sbi->s_next_gen_lock);
>> 
>> 	timer_setup(&sbi->s_err_report, print_daily_error_info, 0);
>> 
> 
> 
> Cheers, Andreas


Cheers, Andreas





Attachment: signature.asc
Description: Message signed with OpenPGP


[Index of Archives]     [Reiser Filesystem Development]     [Ceph FS]     [Kernel Newbies]     [Security]     [Netfilter]     [Bugtraq]     [Linux FS]     [Yosemite National Park]     [MIPS Linux]     [ARM Linux]     [Linux Security]     [Linux RAID]     [Samba]     [Device Mapper]     [Linux Media]

  Powered by Linux