Re: Recent kernel "mount" slow

Mikulas Patocka <mpatocka@xxxxxxxxxx> · Tue, 27 Nov 2012 22:57:07 -0500 (EST)

On Tue, 27 Nov 2012, Jens Axboe wrote:

> On 2012-11-27 11:06, Jeff Chua wrote:
> > On Tue, Nov 27, 2012 at 3:38 PM, Jens Axboe <axboe@xxxxxxxxx> wrote:
> >> On 2012-11-27 06:57, Jeff Chua wrote:
> >>> On Sun, Nov 25, 2012 at 7:23 AM, Jeff Chua <jeff.chua.linux@xxxxxxxxx> wrote:
> >>>> On Sun, Nov 25, 2012 at 5:09 AM, Mikulas Patocka <mpatocka@xxxxxxxxxx> wrote:
> >>>>> So it's better to slow down mount.
> >>>>
> >>>> I am quite proud of the linux boot time pitting against other OS. Even
> >>>> with 10 partitions. Linux can boot up in just a few seconds, but now
> >>>> you're saying that we need to do this semaphore check at boot up. By
> >>>> doing so, it's inducing additional 4 seconds during boot up.
> >>>
> >>> By the way, I'm using a pretty fast SSD (Samsung PM830) and fast CPU
> >>> (2.8GHz). I wonder if those on slower hard disk or slower CPU, what
> >>> kind of degradation would this cause or just the same?
> >>
> >> It'd likely be the same slow down time wise, but as a percentage it
> >> would appear smaller on a slower disk.
> >>
> >> Could you please test Mikulas' suggestion of changing
> >> synchronize_sched() in include/linux/percpu-rwsem.h to
> >> synchronize_sched_expedited()?
> > 
> > Tested. It seems as fast as before, but may be a "tick" slower. Just
> > perception. I was getting pretty much 0.012s with everything reverted.
> > With synchronize_sched_expedited(), it seems to be 0.012s ~ 0.013s.
> > So, it's good.
> 
> Excellent
> 
> >> linux-next also has a re-write of the per-cpu rw sems, out of Andrews
> >> tree. It would be a good data point it you could test that, too.
> > 
> > Tested. It's slower. 0.350s. But still faster than 0.500s without the patch.
> 
> Makes sense, it's 2 synchronize_sched() instead of 3. So it doesn't fix
> the real issue, which is having to do synchronize_sched() in the first
> place.
> 
> > # time mount /dev/sda1 /mnt; sync; sync; umount /mnt
> > 
> > 
> > So, here's the comparison ...
> > 
> > 0.500s     3.7.0-rc7
> > 0.168s     3.7.0-rc2
> > 0.012s     3.6.0
> > 0.013s     3.7.0-rc7 + synchronize_sched_expedited()
> > 0.350s     3.7.0-rc7 + Oleg's patch.
> 
> I wonder how many of them are due to changing to the same block size.
> Does the below patch make a difference?

This patch is wrong because you must check if the device is mapped while 
holding bdev->bd_block_size_semaphore (because 
bdev->bd_block_size_semaphore prevents new mappings from being created)

I'm sending another patch that has the same effect.

Note that ext[234] filesystems set blocksize to 1024 temporarily during 
mount, so it doesn't help much (it only helps for other filesystems, such 
as jfs). For ext[234], you have a device with default block size 4096, the 
filesystem sets block size to 1024 during mount, reads the super block and 
sets it back to 4096.

If you want, you can fix ext[234] to avoid this behavior.

Mikulas

> diff --git a/fs/block_dev.c b/fs/block_dev.c
> index 1a1e5e3..f041c56 100644
> --- a/fs/block_dev.c
> +++ b/fs/block_dev.c
> @@ -126,29 +126,28 @@ int set_blocksize(struct block_device *bdev, int size)
>  	if (size < bdev_logical_block_size(bdev))
>  		return -EINVAL;
>  
> -	/* Prevent starting I/O or mapping the device */
> -	percpu_down_write(&bdev->bd_block_size_semaphore);
> -
>  	/* Check that the block device is not memory mapped */
>  	mapping = bdev->bd_inode->i_mapping;
>  	mutex_lock(&mapping->i_mmap_mutex);
>  	if (mapping_mapped(mapping)) {
>  		mutex_unlock(&mapping->i_mmap_mutex);
> -		percpu_up_write(&bdev->bd_block_size_semaphore);
>  		return -EBUSY;
>  	}
>  	mutex_unlock(&mapping->i_mmap_mutex);
>  
>  	/* Don't change the size if it is same as current */
>  	if (bdev->bd_block_size != size) {
> -		sync_blockdev(bdev);
> -		bdev->bd_block_size = size;
> -		bdev->bd_inode->i_blkbits = blksize_bits(size);
> -		kill_bdev(bdev);
> +		/* Prevent starting I/O */
> +		percpu_down_write(&bdev->bd_block_size_semaphore);
> +		if (bdev->bd_block_size != size) {
> +			sync_blockdev(bdev);
> +			bdev->bd_block_size = size;
> +			bdev->bd_inode->i_blkbits = blksize_bits(size);
> +			kill_bdev(bdev);
> +		}
> +		percpu_up_write(&bdev->bd_block_size_semaphore);
>  	}
>  
> -	percpu_up_write(&bdev->bd_block_size_semaphore);
> -
>  	return 0;
>  }
>  
> @@ -1649,14 +1648,12 @@ EXPORT_SYMBOL_GPL(blkdev_aio_write);
>  
>  static int blkdev_mmap(struct file *file, struct vm_area_struct *vma)
>  {
> +	struct address_space *mapping = file->f_mapping;
>  	int ret;
> -	struct block_device *bdev = I_BDEV(file->f_mapping->host);
> -
> -	percpu_down_read(&bdev->bd_block_size_semaphore);
>  
> +	mutex_lock(&mapping->i_mmap_mutex);
>  	ret = generic_file_mmap(file, vma);
> -
> -	percpu_up_read(&bdev->bd_block_size_semaphore);
> +	mutex_unlock(&mapping->i_mmap_mutex);
>  
>  	return ret;
>  }
> 
> -- 
> Jens Axboe
> 
--
To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at  http://vger.kernel.org/majordomo-info.html