Re: ceph/systemd startup bug (was Re: Some OSDs are down after Server reboot)

This looks similar to
https://bugzilla.redhat.com/show_bug.cgi?id=1458007 or one of the
bugs/trackers attached to that.

On Thu, Sep 28, 2017 at 11:14 PM, Sean Purdy <s.purdy@xxxxxxxxxxxxxxxx> wrote:
> On Thu, 28 Sep 2017, Matthew Vernon said:
>> Hi,
>>
>> TL;DR - the timeout setting in ceph-disk@.service is (far) too small - it
>> needs increasing and/or removing entirely. Should I copy this to ceph-devel?
>
> Just a note.  Looks like the Debian stretch luminous packages have a 10,000-second timeout:
>
> from /lib/systemd/system/ceph-disk@.service
>
> Environment=CEPH_DISK_TIMEOUT=10000
> ExecStart=/bin/sh -c 'timeout $CEPH_DISK_TIMEOUT flock /var/lock/ceph-disk-$(basename %f) /usr/sbin/ceph-disk --verbose --log-stdout trigger --sync %f'
>
>
> Sean
>
>> On 15/09/17 16:48, Matthew Vernon wrote:
>> >On 14/09/17 16:26, Götz Reinicke wrote:
>> >>After that, 10 OSDs did not come up like the others. The disks did not
>> >>get mounted and the OSD processes did nothing … even after a couple of
>> >>minutes no more disks/OSDs showed up.
>> >
>> >I'm still digging, but AFAICT it's a race condition in startup - in our
>> >case, we're only seeing it if some of the filesystems aren't clean. This
>> >may be related to the thread "Very slow start of osds after reboot" from
>> >August, but I don't think any conclusion was reached there.
>>
>> This annoyed me enough that I went off to find the problem :-)
>>
>> On systemd-enabled machines[0] ceph disks are activated by systemd's
>> ceph-disk@.service, which calls:
>>
>> /bin/sh -c 'timeout 120 flock /var/lock/ceph-disk-$(basename %f)
>> /usr/sbin/ceph-disk --verbose --log-stdout trigger --sync %f'
>>
>> ceph-disk trigger --sync calls ceph-disk activate which (among other things)
>> mounts the osd fs (first in a temporary location, then in /var/lib/ceph/osd/
>> once it's extracted the osd number from the fs). If the fs is unclean, XFS
>> auto-recovers before mounting, which takes time (2-25s for our 6TB
>> disks). Importantly, there is a single global lock file[1], so only one
>> ceph-disk activate can be doing this at once.
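>>
>> (A toy model of that interaction, with sleep standing in for XFS log
>> recovery and one shared lock standing in for the global activation lock -
>> not ceph code, just to show the shape of the failure:
>>
>>   for dev in sdb sdc sdd; do
>>       timeout 5 flock /tmp/activate.lock \
>>           sh -c "echo $dev start; sleep 3; echo $dev done" &
>>   done; wait
>>
>> The third job spends ~6s just waiting for the lock, so its 5s timeout
>> fires before it does any useful work; scale that up to 120s and 2-25s of
>> recovery per disk and you get the behaviour below.)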
>>
>> So, each fs is auto-recovering one at a time (rather than in parallel), and
>> once the elapsed time gets past 120s, timeout kills the flock, systemd kills
>> the cgroup, and no more OSDs start up - we typically find a few fs mounted
>> in /var/lib/ceph/tmp/mnt.XXXX. systemd keeps trying to start the remaining
>> osds (via ceph-osd@.service), but their fs isn't in the correct place, so
>> this never works.
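>>
>> For anyone checking whether a node is in this state, the usual tools show
>> it clearly enough (nothing ceph-specific here):
>>
>>   mount | grep /var/lib/ceph/tmp/mnt          # leftover temporary mounts
>>   systemctl list-units --failed 'ceph-disk@*' # activation units killed off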
>>
>> The fix/workaround is to adjust the timeout value: either edit the service
>> file directly, or (for style points) write a drop-in override under
>> /etc/systemd/system, remembering that you need a blank ExecStart= line
>> before your revised one.
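>>
>> For example, something along these lines in
>> /etc/systemd/system/ceph-disk@.service.d/timeout.conf (an untested sketch;
>> the file name and the 10000s value are just illustrative, pick a value
>> comfortably larger than your worst-case recovery time):
>>
>> [Service]
>> ExecStart=
>> ExecStart=/bin/sh -c 'timeout 10000 flock /var/lock/ceph-disk-$(basename %f) /usr/sbin/ceph-disk --verbose --log-stdout trigger --sync %f'
>>
>> followed by a "systemctl daemon-reload".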
>>
>> Experimenting, one of our storage nodes with 60 6TB disks took 17m35s to
>> start all its osds when booted with every osd filesystem dirty. So the
>> current 120s is far too small (it's just about OK when they're all clean).
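>>
>> (For scale: 17m35s is about 1055s, i.e. roughly 17-18s of recovery per osd
>> once they're serialised behind the one lock, so a 120s budget only covers
>> the first handful of dirty filesystems.)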
>>
>> I think, though, that having the timeout at all is a bug - if something
>> needs to time out under some circumstances, should it be at a lower layer,
>> perhaps?
>>
>> A couple of final points/asides, if I may:
>>
>> ceph-disk trigger uses subprocess.communicate (via the command() function),
>> which means it buffers the log output from ceph-disk activate and only
>> emits it after that process finishes. As well as producing confusing
>> timestamps, this means that when systemd kills the cgroup, all the output
>> from the ceph-disk activate command vanishes into the void, which made
>> debugging needlessly hard. Better to let called processes like that write
>> their output immediately?
>>
>> Does each fs need mounting twice? Could the osd number be encoded in the
>> partition label or similar instead?
>>
>> Is a single global activation lock necessary? It slows startup down quite a
>> bit; I see no reason why (at least in the one-osd-per-disk case) you
>> couldn't be activating all the osds at once...
>>
>> Regards,
>>
>> Matthew
>>
>> [0] I note, for instance, that /etc/init/ceph-disk.conf doesn't have the
>> timeout, so presumably upstart systems aren't affected
>> [1] /var/lib/ceph/tmp/ceph-disk.activate.lock at least on Ubuntu
>>
>>
>> --
>> The Wellcome Trust Sanger Institute is operated by Genome Research Limited,
>> a charity registered in England with number 1021457 and a company registered
>> in England with number 2742969, whose registered office is 215 Euston Road,
>> London, NW1 2BE.



-- 
Cheers,
Brad
_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
