On 2018/01/29 1:45 pm, Alfredo Deza wrote:
On Mon, Jan 29, 2018 at 1:37 PM, Andre Goree <andre@xxxxxxxxxx> wrote:
On 2018/01/29 12:28 pm, Alfredo Deza wrote:
On Mon, Jan 29, 2018 at 10:55 AM, Andre Goree <andre@xxxxxxxxxx> wrote:
On my OSD node that I built with ceph-ansible, the OSDs are failing to
start after a reboot.
This is not uncommon for ceph-disk, unfortunately, and it is one of the
reasons we introduced ceph-volume. There are a few components that can
cause this; you may find that rebooting your node yields different
results, and sometimes other OSDs will come up (or even all of them!).
If you search the tracker, or even this mailing list, you will see this
is nothing new.

ceph-ansible has the ability to deploy using ceph-volume, which doesn't
suffer from the same caveats, so you might want to try it out (if
possible).
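
For reference, deploying an OSD directly with ceph-volume (the same tool
ceph-ansible drives for its lvm scenario) looks roughly like the sketch
below; /dev/sdb is only a placeholder device, and --dmcrypt is only
relevant if your ceph-volume release already supports encryption:

  # create (prepare + activate) a new LVM-backed OSD on a spare device
  ceph-volume lvm create --data /dev/sdb

  # if your release supports encryption, it can be requested with:
  # ceph-volume lvm create --data /dev/sdb --dmcrypt

  # afterwards, list what ceph-volume knows about
  ceph-volume lvm list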
Thank you, yes, after many hours of internet searching I did see that
this apparently happens (happened?) often. Very unfortunate.

The only issue I see with ceph-volume (at least with ceph-ansible) is
that it MUST use LVM, which we'd like to avoid. But if we cannot reboot
our OSD hosts for fear of them not coming back online, that is perhaps
something we'll have to reconsider.

Does ceph-volume work without LVM when manually creating things?
Yes it does! It even accepts previously created OSDs (either manual or
via ceph-disk) and can manage them for you.

That means it will disable the problematic ceph-disk/udev interaction
by overriding the systemd units, and will map the newly captured OSD
details to ceph-volume systemd units.

You will need to perform a 'scan' of the running OSD (although there is
functionality to scan an unmounted partition as well), so that the
details needed to manage it get persisted. Make sure that the JSON
output looks correct, so that the systemd units get correct data.

More details at:

http://docs.ceph.com/docs/master/ceph-volume/simple/
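
As a rough sketch of that workflow (the OSD id 0 below is a placeholder,
and <osd-fsid> comes from the JSON file that the scan writes):

  # capture a running, mounted OSD into /etc/ceph/osd/<id>-<fsid>.json
  ceph-volume simple scan /var/lib/ceph/osd/ceph-0

  # review the generated metadata before trusting it
  cat /etc/ceph/osd/0-*.json

  # enable the ceph-volume systemd units for that OSD; the fsid is taken
  # from the JSON file name (or its contents)
  ceph-volume simple activate 0 <osd-fsid>

The activate step is what actually overrides the ceph-disk units
mentioned above, so it is worth double-checking the JSON before running
it.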
Thanks, I was actually reading that document and came back to respond.
No support for encrypted OSDs, which we definitely need.

And, as it turns out, the encryption might actually be my whole issue,
since the logs are finally showing errors some 2-3 hours later:
Jan 29 13:00:59 osd-08 sh[3101]: Traceback (most recent call last):
Jan 29 13:00:59 osd-08 sh[3101]:   File "/usr/sbin/ceph-disk", line 9, in <module>
Jan 29 13:00:59 osd-08 sh[3101]:     load_entry_point('ceph-disk==1.0.0', 'console_scripts', 'ceph-disk')()
Jan 29 13:00:59 osd-08 sh[3101]:   File "/usr/lib/python2.7/dist-packages/ceph_disk/main.py", line 5736, in run
Jan 29 13:00:59 osd-08 sh[3101]:     main(sys.argv[1:])
Jan 29 13:00:59 osd-08 sh[3101]:   File "/usr/lib/python2.7/dist-packages/ceph_disk/main.py", line 5674, in main
Jan 29 13:00:59 osd-08 sh[3101]:     args.func(args)
Jan 29 13:00:59 osd-08 sh[3101]:   File "/usr/lib/python2.7/dist-packages/ceph_disk/main.py", line 5421, in <lambda>
Jan 29 13:00:59 osd-08 sh[3101]:     func=lambda args: main_activate_space(name, args),
Jan 29 13:00:59 osd-08 sh[3101]:   File "/usr/lib/python2.7/dist-packages/ceph_disk/main.py", line 4136, in main_activate_space
Jan 29 13:00:59 osd-08 sh[3101]:     dev = dmcrypt_map(args.dev, args.dmcrypt_key_dir)
Jan 29 13:00:59 osd-08 sh[3101]:   File "/usr/lib/python2.7/dist-packages/ceph_disk/main.py", line 3464, in dmcrypt_map
Jan 29 13:00:59 osd-08 sh[3101]:     dmcrypt_key = get_dmcrypt_key(part_uuid, dmcrypt_key_dir, luks)
Jan 29 13:00:59 osd-08 sh[3101]:   File "/usr/lib/python2.7/dist-packages/ceph_disk/main.py", line 1325, in get_dmcrypt_key
Jan 29 13:00:59 osd-08 sh[3101]:     raise Error('unknown key-management-mode ' + str(mode))
Jan 29 13:00:59 osd-08 sh[3101]: ceph_disk.main.Error: Error: unknown key-management-mode None
Jan 29 13:00:59 osd-08 sh[3101]: /usr/lib/python2.7/dist-packages/ceph_disk/main.py:5677: UserWarning:
Jan 29 13:00:59 osd-08 sh[3101]: *******************************************************************************
Jan 29 13:00:59 osd-08 sh[3101]: This tool is now deprecated in favor of ceph-volume.
Jan 29 13:00:59 osd-08 sh[3101]: It is recommended to use ceph-volume for OSD deployments. For details see:
Jan 29 13:00:59 osd-08 sh[3101]: http://docs.ceph.com/docs/master/ceph-volume/#migrating
Jan 29 13:00:59 osd-08 sh[3101]: *******************************************************************************
Jan 29 13:00:59 osd-08 sh[3101]:   warnings.warn(DEPRECATION_WARNING)
Jan 29 13:00:59 osd-08 sh[3101]: Traceback (most recent call last):
Jan 29 13:00:59 osd-08 sh[3101]:   File "/usr/sbin/ceph-disk", line 9, in <module>
Jan 29 13:00:59 osd-08 sh[3101]:     load_entry_point('ceph-disk==1.0.0', 'console_scripts', 'ceph-disk')()
Jan 29 13:00:59 osd-08 sh[3101]:   File "/usr/lib/python2.7/dist-packages/ceph_disk/main.py", line 5736, in run
Jan 29 13:00:59 osd-08 sh[3101]:     main(sys.argv[1:])
Jan 29 13:00:59 osd-08 sh[3101]:   File "/usr/lib/python2.7/dist-packages/ceph_disk/main.py", line 5674, in main
Jan 29 13:00:59 osd-08 sh[3101]:     args.func(args)
Jan 29 13:00:59 osd-08 sh[3101]:   File "/usr/lib/python2.7/dist-packages/ceph_disk/main.py", line 4874, in main_trigger
Jan 29 13:00:59 osd-08 sh[3101]:     raise Error('return code ' + str(ret))
Jan 29 13:00:59 osd-08 sh[3101]: ceph_disk.main.Error: Error: return code 1
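
For what it's worth, a quick, non-destructive way to see how ceph-disk
currently classifies those encrypted partitions (assuming the deprecated
tool is still installed on the node) would be something like:

  # show how ceph-disk interprets each partition (data, journal, dmcrypt, ...)
  ceph-disk list

  # and how the kernel sees the block devices and any dm-crypt mappings
  lsblk -o NAME,TYPE,FSTYPE,MOUNTPOINT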
So I'm wondering what my options are at this point. Perhaps rebuild this
OSD node using ceph-volume and 'simple', but then we would not be able
to use encryption?

And I should probably be wary of any of the other current OSD nodes
going down, because they will likely experience the same issue? Given
all this, we'll probably need to rebuild all the OSD nodes in the
cluster to make sure they can be rebooted reliably? That's really
unfortunate :(
--
Andre Goree
-=-=-=-=-=-
Email - andre at drenet.net
Website - http://blog.drenet.net
PGP key - http://www.drenet.net/pubkey.html
-=-=-=-=-=-
_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com