Re: Reducing the impact of OSD restarts (noout ain't up to snuff)

Hello,

On Sat, 13 Feb 2016 11:14:23 +0100 Lionel Bouton wrote:

> On 13/02/2016 06:31, Christian Balzer wrote:
> > [...]
> > ---
> > So from shutdown to startup about 2 seconds, not that bad.
> >
> > However here is where the cookie crumbles massively:
> > ---
> > 2016-02-12 01:33:50.263152 7f75be4d57c0  0 filestore(/var/lib/ceph/osd/ceph-2) limited size xattrs
> > 2016-02-12 01:35:31.809897 7f75be4d57c0  0 filestore(/var/lib/ceph/osd/ceph-2) mount: enabling WRITEAHEAD journal mode: checkpoint is not enabled
> > ---
> > Nearly 2 minutes to mount things, it probably had to go to disk quite a
> > bit, as not everything was in the various slab caches. And yes, there is
> > 32GB of RAM, most of it pagecache and vfs_cache_pressure is set to 1.
> > During that time, silence of the lambs when it came to ops.
>
>
> Hmm, that's surprisingly long. How much data (size and number of files)
> do you have on this OSD, which FS do you use, what are the mount options,
> and what are the hardware and the kind of access?
> 
I already mentioned the HW: an Areca RAID controller with 2GB of HW cache
and a 7-disk RAID6 per OSD.
Nothing aside from noatime for mount options, and EXT4.
 
2.6TB per OSD, and with 1.4 million objects in the cluster that works out
to a little more than 700k files per OSD.
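
In case anybody wants to check their own OSDs, a rough sketch of how to get
those two figures (assuming the default /var/lib/ceph/osd/ceph-2 mount point
and the filestore "current" directory; adjust the OSD id as needed):
---
# total data on that OSD
df -h /var/lib/ceph/osd/ceph-2
# approximate number of object files (slow with a cold cache, obviously)
find /var/lib/ceph/osd/ceph-2/current -type f | wc -l
---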

And kindly take note that my test cluster has fewer than 120k objects and
thus about 15k files per OSD, and I was still able to reproduce this
behaviour (in spirit at least).
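
If anybody wants to reproduce this with a truly cold cache, a rough sketch of
what I do (assuming osd.2, the default log location and either systemd or
sysvinit; the drop_caches step stands in for a full umount/mount cycle):
---
# stop the OSD with whatever your init system provides
systemctl stop ceph-osd@2      # or: /etc/init.d/ceph stop osd.2
# throw away pagecache, dentries and inodes
sync; echo 3 > /proc/sys/vm/drop_caches
# start it again and watch the mount and load_pgs timestamps in the log
systemctl start ceph-osd@2     # or: /etc/init.d/ceph start osd.2
grep -E 'mount|load_pgs' /var/log/ceph/ceph-osd.2.log | tail
---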

> The only time I saw OSDs take several minutes to reach the point where
> they fully rejoin is with BTRFS with default options/config.
>
There isn't a pole long enough I would touch BTRFS with for production,
especially in conjunction with Ceph.
 
> For reference our last OSD restart only took 6 seconds to complete this
> step. We only have RBD storage, so this OSD with 1TB of data has ~250000
> 4M files. It was created ~ 1 year ago and this is after a complete OS
> umount/mount cycle which drops the cache (from experience Ceph mount
> messages don't actually imply that the FS was not mounted).
>
The "mount" in the ceph logs clearly is not a FS/OS level mount.
This OSD was up for about 2 years.

My other, more "conventional" production cluster has 400GB and 100k files
per OSD and is also very fast to restart.
Alas, it is also nowhere near as busy as this cluster, by roughly two
orders of magnitude.

> > Next this:
> > ---
> > 2016-02-12 01:35:33.915981 7f75be4d57c0  0 osd.2 1788 load_pgs
> > 2016-02-12 01:36:32.989709 7f75be4d57c0  0 osd.2 1788 load_pgs opened 564 pgs
> > ---
> > Another minute to load the PGs.
>
> Same OSD reboot as above: 8 seconds for this.
> 
> This would be way faster if we didn't start with an unmounted OSD.
> 
Again, it was never unmounted from a FS/OS perspective.

Regards,

Christian

> This OSD is still BTRFS but we don't use autodefrag anymore (we replaced
> it with our own defragmentation scheduler) and disabled BTRFS snapshots
> in Ceph to reach this point. Last time I checked an OSD startup was
> still faster with XFS.
> 
> So do you use BTRFS in the default configuration or have a very high
> number of files on this OSD?
> 
> Lionel


-- 
Christian Balzer        Network/Systems Engineer                
chibi@xxxxxxx   	Global OnLine Japan/Rakuten Communications
http://www.gol.com/
_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
