Ceph OSD very slow startup

Hi,

# First a short description of our Ceph setup

You can skip to the next section ("Main questions") to save time and
come back to this one if you need more context.

We are currently moving some of our VMs away from DRBD-based storage
backed by RAID arrays to Ceph. Our focus is on resiliency and capacity
(one VM was outgrowing the largest RAID10 we had), not maximum
performance (at least not yet). Our Ceph OSDs are fairly unbalanced:
2 of them are on 2 historic hosts, each with 4 disks in a hardware
RAID10 configuration and no room left in the chassis for new disks.
The 12 other OSDs are on 2 new systems with 6 disk drives dedicated to
one OSD each (CPU and RAM configurations are nearly identical on the
4 hosts). All hosts run VMs too, but we took some precautions to limit
interference: each host has CPU and RAM to spare for the OSDs. CPU
usage bursts on occasion, but as we only have one or two VMs on each
host they can't starve the OSDs, which have between 2 and 8
full-fledged cores (4 to 16 hardware threads) available depending on
the current load. We have at least 4GB of free RAM per OSD on each
host at all times (including room for at least a 4GB OS cache).
To sum up, we have a total of 14 OSDs; the 2 largest ones on RAID10
are clearly our current bottleneck. That said, until we have
additional hardware they allow us to maintain availability even if 2
servers are down (default crushmap, pools configured with 3 replicas
on 3 different hosts, see the example below) and performance is
acceptable (backfilling/scrubbing/... of pgs required some tuning
though, and I'm eagerly waiting for 0.80.7 to begin testing the new
IO priority tunables). Everything is based on SATA/SAS 7200rpm disk
drives behind P410 RAID controllers (HP ProLiant systems) with
battery-backed cache to help with write bursts.

The OSDs are a mix of:
- Btrfs on 3.17.0 kernels on individual disks, 450GB used of 2TB
(3.17.0 fixes a filesystem lockup we hit with earlier kernels which,
according to recent lkml posts, is triggered by concurrent accesses to
several Btrfs filesystems),
- Btrfs on 3.12.21 kernels on the 2 systems with RAID10, 1.5TB used of
3TB (no lockup on these yet, but they will migrate to 3.17.0 once we
have enough experience with it),
- XFS for a minority of individual disks (with a dedicated partition
for the journal).
Most of them share the same history (all were created at the same
time); only two were created later (following Btrfs corruption and/or
conversion to XFS) and are left out when comparing behaviours.

All Btrfs volumes use these mount options:
rw,noatime,nodiratime,compress=lzo,space_cache,autodefrag,recovery
All OSDs use a 5GB journal.
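
For completeness, this is roughly how the same settings can be passed
through ceph.conf (a sketch, values copied from above):

  [osd]
      # mount options applied when the Btrfs OSD volumes are mounted
      osd mount options btrfs = rw,noatime,nodiratime,compress=lzo,space_cache,autodefrag,recovery
      # 5GB journal (value in MB)
      osd journal size = 5120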

We are slowly adding monitoring to the setup to see what the benefits
of Btrfs are in our case (ceph osd perf, kernel IO wait per device,
OSD CPU usage, ...). One long-term objective is to slowly raise
performance, both by migrating to/adding more suitable hardware and by
tuning the software side. Detailed monitoring should help us study the
behaviour of isolated OSDs with different settings and warn us early
if they cause performance problems, so that we can take them out with
next to no impact on the whole storage network (we are strong
believers in slow, incremental and continuous change, and distributed
storage with redundancy makes that easy to implement).
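
The data collected boils down to periodic snapshots of things along
these lines (intervals are only examples):

  ceph osd perf            # per-OSD commit/apply latencies
  iostat -x 5              # per-device utilisation and await
  pidstat -C ceph-osd 5    # CPU usage of the ceph-osd processes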

# Main questions

The system works well, but when restarting one of the 2 large Btrfs
OSDs I just realised it was very slow to rejoin the network ("ceph osd
set noout" was used for the restart). I stopped the OSD init after 5
minutes to investigate what was going on and didn't find any obvious
problem (filesystem sane, no swapping, no CPU hogs, concurrent IO not
able to starve the system by itself, ...). Subsequent restarts took
between 43s (nearly no concurrent disk access and warm caches after an
earlier restart without unmounting the filesystem) and 3min57s (one VM
still on DRBD doing ~30 IO/s on the same volume and cold caches after
a fresh filesystem mount).
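
For clarity, the restart procedure itself is the usual noout dance,
roughly as follows (the osd id is an example and the init invocation
depends on the distribution):

  ceph osd set noout
  /etc/init.d/ceph restart osd.2
  # wait for the OSD to rejoin and the PGs to go active+clean, then:
  ceph osd unset noout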

It seems that the startup time is getting longer on the 2 large Btrfs
filesystems (the other one gives similar results: 3min48s on the first
try, for example). I noticed it was a bit slow a week ago, but not
this slow (there was ~half as much data on them at the time). OSDs on
individual disks don't exhibit this problem (with warm caches, init
finishes in ~4s on the small Btrfs volumes and ~3s on the XFS
volumes), but they are on dedicated disks with less data.

With warm caches most of the time is spent between the
"osd.<n> <osdmap> load_pgs"
and
"osd.<n> <osdmap> load_pgs opened <m> pgs"
log lines in /var/log/ceph/ceph-osd.<n>.log (m is ~650 for both OSDs),
so it seems most of the time is spent opening pgs.
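
The timings above simply come from comparing the timestamps of those
two lines, e.g. (with osd.2 as an example id):

  grep load_pgs /var/log/ceph/ceph-osd.2.log | tail -n 2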

What could explain such long startup times? Is the OSD init doing a
lot of random disk accesses? Is it dependent on the volume of data or
on the history of the OSD (fragmentation?)? Maybe Btrfs on 3.12.21 has
known performance problems or suboptimal autodefrag (on 3.17.0, with
1/3 of the data and a similar history of disk accesses, we see 1/10 of
the init time when the disks are idle in both cases)?
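
For what it's worth, the only rough fragmentation check I can think of
is something like the following (the osd path is an example, and note
that filefrag over-reports extents on compressed Btrfs files):

  find /var/lib/ceph/osd/ceph-2/current -type f -size +4M \
    -exec filefrag {} + 2>/dev/null | sort -t: -k2 -n | tail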

# Init snap destroy errors on Btrfs question

During each init on Btrfs backed OSD we get this kind of errors in
ceph-osd logs (always come in pairs like that at the very beginning of
the phase where the OSD opens the pgs):

2014-10-13 23:54:44.143039 7fd4267fc700  0
btrfsfilestorebackend(/var/lib/ceph/osd/ceph-1) destroy_checkpoint:
ioctl SNAP_DESTROY got (2) No such file or directory
2014-10-13 23:54:44.143087 7fd4267fc700 -1
filestore(/var/lib/ceph/osd/ceph-1) unable to destroy snap
'snap_21161231' got (2) No such file or directory
2014-10-13 23:54:44.266149 7fd4267fc700  0
btrfsfilestorebackend(/var/lib/ceph/osd/ceph-1) destroy_checkpoint:
ioctl SNAP_DESTROY got (2) No such file or directory
2014-10-13 23:54:44.266189 7fd4267fc700 -1
filestore(/var/lib/ceph/osd/ceph-1) unable to destroy snap
'snap_21161268' got (2) No such file or directory
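
In case it helps, whether the snap_* names from these errors still
exist on the filesystem can be checked with something like (path taken
from the log above):

  btrfs subvolume list /var/lib/ceph/osd/ceph-1 | grep snap_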

I suppose this is harmless (at least these OSDs don't show any other
error/warning and have been restarted, and their filesystems
remounted, on numerous occasions), but I'd like to be sure: is it?

Best regards,

Lionel Bouton