ceph/systemd startup bug (was Re: Some OSDs are down after Server reboot)

Hi,

TL;DR - the timeout setting in ceph-disk@.service is (far) too small - it needs increasing and/or removing entirely. Should I copy this to ceph-devel?

On 15/09/17 16:48, Matthew Vernon wrote:
> On 14/09/17 16:26, Götz Reinicke wrote:
>> After that, 10 OSDs did not come up like the others. The disks did not
>> get mounted and the OSD processes did nothing … even after a couple of
>> minutes no more disks/OSDs showed up.
>
> I'm still digging, but AFAICT it's a race condition in startup - in our
> case, we're only seeing it if some of the filesystems aren't clean. This
> may be related to the thread "Very slow start of osds after reboot" from
> August, but I don't think any conclusion was reached there.

This annoyed me enough that I went off to find the problem :-)

On systemd-enabled machines[0], ceph disks are activated by systemd's ceph-disk@.service, which calls (%f here is systemd's unescaped-instance specifier, i.e. the device path):

/bin/sh -c 'timeout 120 flock /var/lock/ceph-disk-$(basename %f) /usr/sbin/ceph-disk --verbose --log-stdout trigger --sync %f'

ceph-disk trigger --sync calls ceph-disk activate, which (among other things) mounts the osd fs (first in a temporary location, then in /var/lib/ceph/osd/ once it's extracted the osd number from the fs). If the fs is unclean, XFS auto-recovers before mounting, which takes time - 2-25s for our 6TB disks. Importantly, there is a single global lock file[1], so only one ceph-disk activate can be doing this at once.
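Spelled out per OSD, the serialised section looks roughly like this (my reading of it; device names and mount points are illustrative only):

  # all of this happens under the single global activate lock [1]
  mount /dev/sdX1 /var/lib/ceph/tmp/mnt.XXXX   # XFS log replay happens here if the fs is dirty
  # ...extract the osd number NN from the mounted fs...
  umount /var/lib/ceph/tmp/mnt.XXXX
  mount /dev/sdX1 /var/lib/ceph/osd/ceph-NN
  # ...after which ceph-osd@NN can start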

So, each fs is auto-recovering one at a time (rather than in parallel), and once the elapsed time gets past 120s, timeout kills the flock, systemd kills the cgroup, and no more OSDs start up - we typically find a few filesystems left mounted in /var/lib/ceph/tmp/mnt.XXXX. systemd keeps trying to start the remaining osds (via ceph-osd@.service), but their fs isn't in the correct place, so this never works.
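If you want to check whether you're in this state, something along these lines does it (unit/device names are just examples):

  systemctl --failed | grep ceph-disk       # activations that hit the timeout
  mount | grep /var/lib/ceph/tmp/mnt        # OSD filesystems stranded at the temporary mount point
  systemctl list-units 'ceph-osd@*'         # the corresponding osd units failing and being retried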

The fix/workaround is to adjust the timeout value: either edit the service file directly, or (for style points) write an override under /etc/systemd/system, remembering that you need a blank ExecStart= line before your revised one - a sketch follows.
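Something like this, say as /etc/systemd/system/ceph-disk@.service.d/10-timeout.conf - the 10800s value is only an example, pick something comfortably bigger than your worst case (or drop the timeout wrapper entirely):

  [Service]
  ExecStart=
  ExecStart=/bin/sh -c 'timeout 10800 flock /var/lock/ceph-disk-$(basename %f) /usr/sbin/ceph-disk --verbose --log-stdout trigger --sync %f'

followed by a systemctl daemon-reload.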

Experimenting, one of our storage nodes with 60 6TB disks took 17m35s to start all its osds when booted with all the osd filesystems dirty. So the current 120s is far too small (it's just about OK when all the osd filesystems are clean).
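To put numbers on that: 17m35s is ~1055s across 60 OSDs, i.e. roughly 17-18s of serialised recovery-and-mount per OSD on average, so a 120s budget only gets you through the first six or seven dirty filesystems before everything behind them is killed off.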

I think, though, that having the timeout at all is a bug - if something really does need to time out in some circumstances, shouldn't that live at a lower layer?

A few final points/asides, if I may:

ceph-disk trigger uses subprocess.communicate (via the command() function), which means it swallows the log output from ceph-disk activate and only emits it once that process finishes. As well as producing confusing timestamps, this means that when systemd kills the cgroup, all the output from the ceph-disk activate command vanishes into the void - which made debugging needlessly hard. Better to let called processes like that produce output immediately? (A rough illustration of the difference follows.)
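Not ceph-disk's actual code, just the shape of the problem (command line illustrative only):

  import subprocess

  cmd = ["/usr/sbin/ceph-disk", "--verbose", "activate", "/dev/sdX1"]  # illustrative only

  # Buffered, as with communicate(): nothing is seen until the child exits,
  # so if systemd kills us first, the child's output is lost along with us.
  p = subprocess.Popen(cmd, stdout=subprocess.PIPE, stderr=subprocess.STDOUT)
  out, _ = p.communicate()
  print(out.decode(), end="")

  # Streamed: each line is passed on (and gets a sensible timestamp in the
  # journal) as soon as the child writes it, so it survives a later kill.
  p = subprocess.Popen(cmd, stdout=subprocess.PIPE, stderr=subprocess.STDOUT)
  for line in p.stdout:
      print(line.decode(), end="", flush=True)
  p.wait()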

Does each fs really need mounting twice? Could the osd number be encoded in the partition label or similar instead?

Is a single global activation lock necessary? It slows startup down quite a bit; I see no reason why (at least in the one-osd-per-disk case) you couldn't be activating all the osds at once...

Regards,

Matthew

[0] I note, for instance, that /etc/init/ceph-disk.conf doesn't have the timeout, so presumably upstart systems aren't affected
[1] /var/lib/ceph/tmp/ceph-disk.activate.lock at least on Ubuntu


--
The Wellcome Trust Sanger Institute is operated by Genome Research Limited, a charity registered in England with number 1021457 and a company registered in England with number 2742969, whose registered office is 215 Euston Road, London, NW1 2BE.



