Hi,
# First a short description of our Ceph setup
You can skip to the next section ("Main questions") to save time and
come back to this one if you need more context.
We are currently moving some of our VMs away from DRBD-based storage
backed by RAID arrays to Ceph. Our focus is on resiliency and capacity
(one VM was outgrowing the largest RAID10 we had), not maximum
performance (at least not yet). Our Ceph OSDs are fairly unbalanced:
2 of them live on 2 historic hosts, each with 4 disks in a hardware
RAID10 configuration and no room left for new disks in the chassis.
The 12 other OSDs are on 2 new systems with 6 disk drives each, one
drive dedicated to each OSD (CPU and RAM configurations are nearly
identical on the 4 hosts). All hosts run VMs too, so we took some
precautions to avoid too much interference: each host has CPU and RAM
to spare for the OSDs. CPU usage bursts on occasion, but as we only
have one or two VMs on each host they can't starve the OSDs, which
have between 2 and 8 full-fledged cores (4 to 16 hardware threads)
available depending on the current load. We have at least 4GB of free
RAM per OSD on each host at all times (including room for at least a
4GB OS cache).
To sum up, we have a total of 14 OSDs, and the 2 largest ones on RAID10
are clearly our current bottleneck. That said, until we have additional
hardware they allow us to maintain availability even if 2 servers are
down (default crushmap, pool configured with 3 replicas on 3 different
hosts) and performance is acceptable (backfilling/scrubbing/... of pgs
required some tuning though, and I'm eagerly waiting for 0.80.7 to
begin testing the new io priority tunables; see the sketch below).
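For reference, this is roughly what I have in mind once 0.80.7 is out
(a sketch only: the pool name is a placeholder and, as far as I
understand, the osd_disk_thread_ioprio_* tunables only take effect with
the CFQ disk scheduler):

  ceph osd pool get <pool> size   # should report 3 with our replica setting
  ceph tell osd.* injectargs '--osd_disk_thread_ioprio_class idle'   # lower the io priority of the osd disk thread (used for scrubbing)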
Everything is based on SATA/SAS 7200 rpm disk drives behind P410 RAID
controllers (HP ProLiant systems) with battery-backed write cache to
help with write bursts.
The OSDs are a mix of:
- Btrfs on 3.17.0 kernels on individual disks, 450GB used out of 2TB
(3.17.0 fixes a filesystem lockup we had with earlier kernels,
manifesting itself during concurrent accesses to several Btrfs
filesystems according to recent lkml posts),
- Btrfs on 3.12.21 kernels on the 2 systems with RAID10, 1.5TB used out
of 3TB (no lockup on these yet, but they will migrate to 3.17.0 once we
have enough experience with it),
- XFS for a minority of individual disks (with a dedicated partition for
the journal).
Most of them share the same history (all were created at the same
time); only two were created later (following Btrfs corruption and/or
conversion to XFS) and are left out when comparing behaviours.
All Btrfs volumes use these mount options:
rw,noatime,nodiratime,compress=lzo,space_cache,autodefrag,recovery
All OSDs use a 5GB journal.
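For reference, this roughly maps to the following ceph.conf entries (a
sketch using the FileStore option names as I understand them; our
actual file may differ slightly):

  [osd]
      osd mount options btrfs = rw,noatime,nodiratime,compress=lzo,space_cache,autodefrag,recovery
      osd journal size = 5120    ; value in MB, i.e. the 5GB journal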
We are slowly adding monitoring to the setup to see what the benefits
of Btrfs are in our case (ceph osd perf, kernel io wait per device, osd
CPU usage, ...). One long term objective is to slowly raise performance,
both by migrating to/adding more suitable hardware and by tuning the
software side. Detailed monitoring should help us study the behaviour
of isolated OSDs with different settings, and warn us early if they
generate performance problems so we can take them out with next to no
impact on the whole storage network (we are strong believers in slow,
incremental and continuous change, and distributed storage with
redundancy makes it easy to implement).
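Concretely, the current data points come from commands along these
lines (a sketch; the collection scripts around them are still evolving):

  ceph osd perf               # per-OSD commit/apply latencies
  iostat -x 5                 # per-device utilisation and await times
  pidstat -u -C ceph-osd 5    # CPU usage of the ceph-osd processes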
# Main questions
The system works well, but I just realised when restarting one of the 2
large Btrfs OSDs that it was very slow to rejoin the cluster ("ceph osd
set noout" was used for the restart). I stopped the OSD init after 5
minutes to investigate what was going on and didn't find any obvious
problem (filesystem sane, no swapping, no CPU hogs, concurrent IO not
able to starve the system by itself, ...). Subsequent restarts took
between 43s (nearly no concurrent disk access and warm caches after an
earlier restart without unmounting the filesystem) and 3min 57s (one VM
still on DRBD doing ~30 IO/s on the same volume and cold caches after a
filesystem mount).
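For context, the restarts were done roughly like this (the osd id is
just an example and the service wrapper depends on the distribution):

  ceph osd set noout            # avoid triggering backfills during the restart
  service ceph restart osd.1    # hypothetical osd id
  ceph osd unset noout          # once the osd is back up and in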
It seems that the startup time is getting longer on the 2 large Btrfs
filesystems (the other one gives similar results: 3min 48s on the first
try, for example). I noticed that it was a bit slow a week ago, but not
as much (there was ~half as much data on them at the time). OSDs on
individual disks don't exhibit this problem (with warm caches, init
finishes in ~4s on the small Btrfs volumes and ~3s on the XFS volumes),
but they are on dedicated disks with less data.
With warm caches most of the time is spent between:
"osd.<n> <osdmap> load_pgs"
"osd.<n> <osdmap> load_pgs opened <m> pgs"
log lines in /var/log/ceph/ceph-osd.<n>.log (m is ~650 for both OSDs),
so it seems most of the time is spent opening pgs.
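This is how I extracted the timings (osd.1 as an example; the delta
between the two timestamps is the time spent opening pgs):

  grep 'load_pgs' /var/log/ceph/ceph-osd.1.log | tail -n 2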
What could explain such long startup times? Is the OSD init doing a lot
of random disk accesses? Does it depend on the volume of data or on the
history of the OSD (fragmentation?)? Maybe Btrfs on 3.12.21 has known
performance problems or a suboptimal autodefrag (on 3.17.0, with 1/3 of
the data and a similar history of disk accesses, we see 1/10 of the
init time when the disks are idle in both cases)?
Something like this is my guess; we've historically seen btrfs
performance rapidly degrade under our workloads. And I imagine that
your single-disk OSDs are only seeing 100 or so PGs each?
You could perhaps turn up OSD and FileStore debugging on one of your
big nodes and on one of the little ones, do a restart, and compare the
syscall wait times between them to check.
-Greg
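For reference, turning up that debugging could look roughly like this
(a sketch: the osd id is hypothetical, the debug levels can be lowered
again the same way, and strace only prints its summary once detached
with ^C):

  ceph tell osd.2 injectargs '--debug_osd 20 --debug_filestore 20'
  strace -c -f -p <osd-pid>    # summarise syscall counts/times while pgs are loading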
# Init snap destroy errors on Btrfs question
During each init of a Btrfs-backed OSD we get this kind of error in the
ceph-osd logs (they always come in pairs like this at the very
beginning of the phase where the OSD opens the pgs):
2014-10-13 23:54:44.143039 7fd4267fc700 0
btrfsfilestorebackend(/var/lib/ceph/osd/ceph-1) destroy_checkpoint:
ioctl SNAP_DESTROY got (2) No such file or directory
2014-10-13 23:54:44.143087 7fd4267fc700 -1
filestore(/var/lib/ceph/osd/ceph-1) unable to destroy snap
'snap_21161231' got (2) No such file or directory
2014-10-13 23:54:44.266149 7fd4267fc700 0
btrfsfilestorebackend(/var/lib/ceph/osd/ceph-1) destroy_checkpoint:
ioctl SNAP_DESTROY got (2) No such file or directory
2014-10-13 23:54:44.266189 7fd4267fc700 -1
filestore(/var/lib/ceph/osd/ceph-1) unable to destroy snap
'snap_21161268' got (2) No such file or directory
I suppose it is harmless (at least these OSDs don't show any other
error/warning and have been restarted, and their filesystems remounted,
on numerous occasions), but I'd like to be sure: is it?
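For what it's worth, this is how I check which FileStore snapshots
actually exist on an OSD data directory (osd.1 as an example):

  btrfs subvolume list /var/lib/ceph/osd/ceph-1 | grep snap_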
Best regards,
Lionel Bouton