Re: OSDs failing to start after host reboot

On Mon, Jan 29, 2018 at 1:56 PM, Andre Goree <andre@xxxxxxxxxx> wrote:
> On 2018/01/29 1:45 pm, Alfredo Deza wrote:
>>
>> On Mon, Jan 29, 2018 at 1:37 PM, Andre Goree <andre@xxxxxxxxxx> wrote:
>>>
>>> On 2018/01/29 12:28 pm, Alfredo Deza wrote:
>>>>
>>>>
>>>> On Mon, Jan 29, 2018 at 10:55 AM, Andre Goree <andre@xxxxxxxxxx> wrote:
>>>>>
>>>>>
>>>>> On my OSD node that I built with ceph-ansible, the OSDs are failing to
>>>>> start
>>>>> after a reboot.
>>>>
>>>>
>>>>
>>>> Unfortunately, this is not uncommon with ceph-disk, and it is one of the
>>>> reasons we introduced ceph-volume. A few components can cause this.
>>>> You may find that rebooting your node yields different results:
>>>> sometimes other OSDs will come up (or even all of them!)
>>>>
>>>> If you search the tracker, or even this mailing list, you will see
>>>> this is nothing new.
>>>>
>>>> ceph-ansible can deploy using ceph-volume, which doesn't suffer from
>>>> the same caveats. You might want to try it out (if possible).
>>>>
>>>>
>>>
>>> Thank you. Yes, after many hours of internet searching I did see that
>>> this apparently happens (happened?) often.  Very unfortunate.
>>>
>>> The only issue I see with ceph-volume (at least with ceph-ansible) is
>>> that it MUST use LVM, which we'd like to avoid.  But if we cannot reboot
>>> our OSD hosts for fear of them not being able to come back online, that
>>> is perhaps something we'll have to reconsider.
>>>
>>> Does ceph-volume work without LVM when manually creating things?
>>
>>
>> Yes it does! It even accepts previously created OSDs (either manual or
>> via ceph-disk) and can manage them for you.
>>
>> That means it will disable the problematic ceph-disk/udev interaction
>> by overriding the systemd units, and will map the newly captured OSD
>> details to ceph-volume systemd units.
>>
>> You will need to perform a 'scan' of the running OSD (although there
>> is functionality to scan a partition that is not mounted as well), so
>> that the details needed to manage it get persisted.
>>
>> Make sure that the JSON output looks correct, so that the systemd
>> units can have correct data.
>>
>> More details at:
>>
>> http://docs.ceph.com/docs/master/ceph-volume/simple/
>>
>
>
> Thanks, I was actually reading that document and came to respond again.  No
> support for encrypted OSDs, which we definitely need.
>
> And, as it turns out, the encryption might actually be my whole issue since
> I'm seeing things finally spit out errors to the logs some 2-3 hours later:
>
> Jan 29 13:00:59 osd-08 sh[3101]: Traceback (most recent call last):
> Jan 29 13:00:59 osd-08 sh[3101]:   File "/usr/sbin/ceph-disk", line 9, in <module>
> Jan 29 13:00:59 osd-08 sh[3101]:     load_entry_point('ceph-disk==1.0.0', 'console_scripts', 'ceph-disk')()
> Jan 29 13:00:59 osd-08 sh[3101]:   File "/usr/lib/python2.7/dist-packages/ceph_disk/main.py", line 5736, in run
> Jan 29 13:00:59 osd-08 sh[3101]:     main(sys.argv[1:])
> Jan 29 13:00:59 osd-08 sh[3101]:   File "/usr/lib/python2.7/dist-packages/ceph_disk/main.py", line 5674, in main
> Jan 29 13:00:59 osd-08 sh[3101]:     args.func(args)
> Jan 29 13:00:59 osd-08 sh[3101]:   File "/usr/lib/python2.7/dist-packages/ceph_disk/main.py", line 5421, in <lambda>
> Jan 29 13:00:59 osd-08 sh[3101]:     func=lambda args: main_activate_space(name, args),
> Jan 29 13:00:59 osd-08 sh[3101]:   File "/usr/lib/python2.7/dist-packages/ceph_disk/main.py", line 4136, in main_activate_space
> Jan 29 13:00:59 osd-08 sh[3101]:     dev = dmcrypt_map(args.dev, args.dmcrypt_key_dir)
> Jan 29 13:00:59 osd-08 sh[3101]:   File "/usr/lib/python2.7/dist-packages/ceph_disk/main.py", line 3464, in dmcrypt_map
> Jan 29 13:00:59 osd-08 sh[3101]:     dmcrypt_key = get_dmcrypt_key(part_uuid, dmcrypt_key_dir, luks)
> Jan 29 13:00:59 osd-08 sh[3101]:   File "/usr/lib/python2.7/dist-packages/ceph_disk/main.py", line 1325, in get_dmcrypt_key
> Jan 29 13:00:59 osd-08 sh[3101]:     raise Error('unknown key-management-mode ' + str(mode))
> Jan 29 13:00:59 osd-08 sh[3101]: ceph_disk.main.Error: Error: unknown key-management-mode None
> Jan 29 13:00:59 osd-08 sh[3101]: /usr/lib/python2.7/dist-packages/ceph_disk/main.py:5677: UserWarning:
> Jan 29 13:00:59 osd-08 sh[3101]: *******************************************************************************
> Jan 29 13:00:59 osd-08 sh[3101]: This tool is now deprecated in favor of ceph-volume.
> Jan 29 13:00:59 osd-08 sh[3101]: It is recommended to use ceph-volume for OSD deployments. For details see:
> Jan 29 13:00:59 osd-08 sh[3101]: http://docs.ceph.com/docs/master/ceph-volume/#migrating
> Jan 29 13:00:59 osd-08 sh[3101]: *******************************************************************************
> Jan 29 13:00:59 osd-08 sh[3101]:   warnings.warn(DEPRECATION_WARNING)
> Jan 29 13:00:59 osd-08 sh[3101]: Traceback (most recent call last):
> Jan 29 13:00:59 osd-08 sh[3101]:   File "/usr/sbin/ceph-disk", line 9, in <module>
> Jan 29 13:00:59 osd-08 sh[3101]:     load_entry_point('ceph-disk==1.0.0', 'console_scripts', 'ceph-disk')()
> Jan 29 13:00:59 osd-08 sh[3101]:   File "/usr/lib/python2.7/dist-packages/ceph_disk/main.py", line 5736, in run
> Jan 29 13:00:59 osd-08 sh[3101]:     main(sys.argv[1:])
> Jan 29 13:00:59 osd-08 sh[3101]:   File "/usr/lib/python2.7/dist-packages/ceph_disk/main.py", line 5674, in main
> Jan 29 13:00:59 osd-08 sh[3101]:     args.func(args)
> Jan 29 13:00:59 osd-08 sh[3101]:   File "/usr/lib/python2.7/dist-packages/ceph_disk/main.py", line 4874, in main_trigger
> Jan 29 13:00:59 osd-08 sh[3101]:     raise Error('return code ' + str(ret))
> Jan 29 13:00:59 osd-08 sh[3101]: ceph_disk.main.Error: Error: return code 1
>
>
> So I'm wondering what my options are at this point.  Perhaps I could
> rebuild this OSD node using ceph-volume and 'simple', but then I would not
> be able to use encryption?

Ungh, I forgot to mention that there is no encryption support.

However, ceph-volume lvm gained encryption support last week
(available in master), and we are almost done with encryption
support for `simple`.

These features will probably end up in Mimic, not in Luminous. If
encryption is a must, I am not sure there is any way other than
relying on ceph-disk.


>
> And I should probably be wary of any of the other current OSD nodes going
> down because they will likely experience the same issue?  Given all this,
> we'll probably need to rebuild all the OSD nodes in the cluster to make
> sure they can be rebooted reliably?  That's really unfortunate :(

The process is convoluted mostly at system startup (I believe). The
workaround I've seen is for users to trigger the activation manually,
repeatedly, until the OSD comes up.
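
For illustration only, the "retry the activation until the OSD comes up"
workaround could be scripted roughly as below. This is a minimal sketch and
not an official procedure. The helper names, the retry counts, and the
choice of `ceph-disk activate-all` as the command to retry are all
assumptions; encrypted OSDs may well need a different, per-device
activation command, so it is passed in as a parameter.

```python
import subprocess
import time


def osd_is_active(osd_id, run=subprocess.call):
    """Return True if systemd reports the ceph-osd@<id> unit as active."""
    # `systemctl is-active --quiet` exits 0 only when the unit is active.
    return run(["systemctl", "is-active", "--quiet", "ceph-osd@%d" % osd_id]) == 0


def activate_until_up(osd_ids, attempts=5, delay=10.0,
                      activate_cmd=("ceph-disk", "activate-all"),
                      run=subprocess.call, sleep=time.sleep):
    """Re-run the activation command until every listed OSD is active.

    `activate_cmd` is an assumption; substitute whatever activation
    command works for your OSDs. Returns the OSD ids that are still
    down after the last attempt (an empty list means all came up).
    """
    pending = [i for i in osd_ids if not osd_is_active(i, run)]
    for _ in range(attempts):
        if not pending:
            break
        run(list(activate_cmd))   # poke the activation once more
        sleep(delay)              # give the units time to settle
        pending = [i for i in pending if not osd_is_active(i, run)]
    return pending
```

Calling `activate_until_up([0, 1, 2])` as root on the OSD host would return
the ids that never came up. The `run` and `sleep` parameters exist only so
the loop can be exercised without touching a real system.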

In short: there is no encryption support in Luminous for ceph-volume.
Encryption will be available in Mimic (for both `simple` and `lvm`).
There is currently no other way to fully guarantee OSDs are up and
running after a reboot.

>
>
>
> --
> Andre Goree
> -=-=-=-=-=-
> Email     - andre at drenet.net
> Website   - http://blog.drenet.net
> PGP key   - http://www.drenet.net/pubkey.html
> -=-=-=-=-=-
_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


