Re: OSDs failing to start after host reboot

On 2018/01/29 1:45 pm, Alfredo Deza wrote:
On Mon, Jan 29, 2018 at 1:37 PM, Andre Goree <andre@xxxxxxxxxx> wrote:
On 2018/01/29 12:28 pm, Alfredo Deza wrote:

On Mon, Jan 29, 2018 at 10:55 AM, Andre Goree <andre@xxxxxxxxxx> wrote:

On my OSD node that I built with ceph-ansible, the OSDs are failing to
start after a reboot.


This is not uncommon for ceph-disk, unfortunately, and it is one of the
reasons we introduced ceph-volume. There are a few components that can
cause this; you may find that rebooting your node yields different
results, and sometimes other OSDs will come up (or even all of them!).

If you search the tracker, or even this mailing list, you will see
this is nothing new.

ceph-ansible has the ability to deploy using ceph-volume, which
doesn't suffer from the same caveats; you might want to try it out (if
possible).
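ceph-ansible selects this through its group_vars. A rough sketch, assuming
a ceph-ansible 3.x style layout; the LV/VG names below are hypothetical and
the exact variable names depend on your release:

  # group_vars/osds.yml (illustrative)
  osd_scenario: lvm
  lvm_volumes:
    - data: data-lv1      # hypothetical logical volume name
      data_vg: vg-osd-1   # hypothetical volume group name

Check the group_vars samples shipped with your ceph-ansible version for the
authoritative variable names.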



Thank you. Yes, after many hours of internet searching I did see that this
apparently happens (happened?) often. Very unfortunate.

The only issue I see with ceph-volume (at least with ceph-ansible) is that it MUST use LVM, which we'd like to avoid. But if we cannot reboot our OSD hosts for fear of them not being able to come back online, that is perhaps
something we'll have to reconsider.

Does ceph-volume work without LVM when manually creating things?

Yes it does! It even accepts previously created OSDs (either manual or
via ceph-disk) and can manage them for you.

That means: it will disable the problematic ceph-disk/udev interaction
by overriding the systemd units, and will map the newly captured OSD
details to ceph-volume systemd units.

You will need to perform a 'scan' of the running OSD (there is also
functionality to scan a partition that is not mounted), so that the
details needed to manage it get persisted.

Make sure that the JSON output looks correct, so that the systemd units
can have correct data.

More details at:

http://docs.ceph.com/docs/master/ceph-volume/simple/
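
For reference, a rough sketch of that workflow (the OSD id, fsid, and device
names below are placeholders; see the document above for the exact options in
your release):

  # scan a running, mounted OSD and persist its metadata as JSON
  ceph-volume simple scan /var/lib/ceph/osd/ceph-0
  # or scan an unmounted data partition directly
  ceph-volume simple scan /dev/sdb1
  # review the captured JSON before activating
  cat /etc/ceph/osd/0-<osd-fsid>.json
  # disable the ceph-disk/udev units and enable ceph-volume's systemd units
  ceph-volume simple activate 0 <osd-fsid>
  # or activate everything that has been scanned
  ceph-volume simple activate --all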



Thanks, I was actually just reading that document and came back to respond: no
support for encrypted OSDs, which we definitely need.

And, as it turns out, the encryption might actually be my whole issue, since the
logs finally started spitting out errors some 2-3 hours later:

Jan 29 13:00:59 osd-08 sh[3101]: Traceback (most recent call last):
Jan 29 13:00:59 osd-08 sh[3101]:   File "/usr/sbin/ceph-disk", line 9, in <module>
Jan 29 13:00:59 osd-08 sh[3101]:     load_entry_point('ceph-disk==1.0.0', 'console_scripts', 'ceph-disk')()
Jan 29 13:00:59 osd-08 sh[3101]:   File "/usr/lib/python2.7/dist-packages/ceph_disk/main.py", line 5736, in run
Jan 29 13:00:59 osd-08 sh[3101]:     main(sys.argv[1:])
Jan 29 13:00:59 osd-08 sh[3101]:   File "/usr/lib/python2.7/dist-packages/ceph_disk/main.py", line 5674, in main
Jan 29 13:00:59 osd-08 sh[3101]:     args.func(args)
Jan 29 13:00:59 osd-08 sh[3101]:   File "/usr/lib/python2.7/dist-packages/ceph_disk/main.py", line 5421, in <lambda>
Jan 29 13:00:59 osd-08 sh[3101]:     func=lambda args: main_activate_space(name, args),
Jan 29 13:00:59 osd-08 sh[3101]:   File "/usr/lib/python2.7/dist-packages/ceph_disk/main.py", line 4136, in main_activate_space
Jan 29 13:00:59 osd-08 sh[3101]:     dev = dmcrypt_map(args.dev, args.dmcrypt_key_dir)
Jan 29 13:00:59 osd-08 sh[3101]:   File "/usr/lib/python2.7/dist-packages/ceph_disk/main.py", line 3464, in dmcrypt_map
Jan 29 13:00:59 osd-08 sh[3101]:     dmcrypt_key = get_dmcrypt_key(part_uuid, dmcrypt_key_dir, luks)
Jan 29 13:00:59 osd-08 sh[3101]:   File "/usr/lib/python2.7/dist-packages/ceph_disk/main.py", line 1325, in get_dmcrypt_key
Jan 29 13:00:59 osd-08 sh[3101]:     raise Error('unknown key-management-mode ' + str(mode))
Jan 29 13:00:59 osd-08 sh[3101]: ceph_disk.main.Error: Error: unknown key-management-mode None
Jan 29 13:00:59 osd-08 sh[3101]: /usr/lib/python2.7/dist-packages/ceph_disk/main.py:5677: UserWarning:
Jan 29 13:00:59 osd-08 sh[3101]: *******************************************************************************
Jan 29 13:00:59 osd-08 sh[3101]: This tool is now deprecated in favor of ceph-volume.
Jan 29 13:00:59 osd-08 sh[3101]: It is recommended to use ceph-volume for OSD deployments. For details see:
Jan 29 13:00:59 osd-08 sh[3101]: http://docs.ceph.com/docs/master/ceph-volume/#migrating
Jan 29 13:00:59 osd-08 sh[3101]: *******************************************************************************
Jan 29 13:00:59 osd-08 sh[3101]:   warnings.warn(DEPRECATION_WARNING)
Jan 29 13:00:59 osd-08 sh[3101]: Traceback (most recent call last):
Jan 29 13:00:59 osd-08 sh[3101]:   File "/usr/sbin/ceph-disk", line 9, in <module>
Jan 29 13:00:59 osd-08 sh[3101]:     load_entry_point('ceph-disk==1.0.0', 'console_scripts', 'ceph-disk')()
Jan 29 13:00:59 osd-08 sh[3101]:   File "/usr/lib/python2.7/dist-packages/ceph_disk/main.py", line 5736, in run
Jan 29 13:00:59 osd-08 sh[3101]:     main(sys.argv[1:])
Jan 29 13:00:59 osd-08 sh[3101]:   File "/usr/lib/python2.7/dist-packages/ceph_disk/main.py", line 5674, in main
Jan 29 13:00:59 osd-08 sh[3101]:     args.func(args)
Jan 29 13:00:59 osd-08 sh[3101]:   File "/usr/lib/python2.7/dist-packages/ceph_disk/main.py", line 4874, in main_trigger
Jan 29 13:00:59 osd-08 sh[3101]:     raise Error('return code ' + str(ret))
Jan 29 13:00:59 osd-08 sh[3101]: ceph_disk.main.Error: Error: return code 1


So I'm wondering what my options are at this point. Perhaps rebuild this OSD
node using ceph-volume and 'simple', though then we would not be able to use
encryption?

And I should probably be wary of any of the other current OSD nodes going down,
because they will likely experience the same issue? Given all this, we'll
probably need to rebuild all the OSD nodes in the cluster to make sure they can
be rebooted reliably? That's really unfortunate :(


--
Andre Goree
-=-=-=-=-=-
Email     - andre at drenet.net
Website   - http://blog.drenet.net
PGP key   - http://www.drenet.net/pubkey.html
-=-=-=-=-=-
_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


