Re: A couple OSDs not starting after host reboot

Hi Alison,

I have observed exactly this with OSDs "converted" from ceph-disk to ceph-volume. Someone thought it would be a great idea to store the kernel device name (/dev/sdX) in the config instead of the UUID or any other stable device path:

# cat /etc/ceph/osd/287-2eaf591b-bced-4097-9499-5fda071c6161.json
{
...
    "block": {
        "path": "/dev/disk/by-partuuid/0c8a9f89-efa7-4c75-87ad-2f0d5aa2d649",
        "uuid": "0c8a9f89-efa7-4c75-87ad-2f0d5aa2d649"
    },
...
    "data": {
        "path": "/dev/sdm1",
        "uuid": "2eaf591b-bced-4097-9499-5fda071c6161"
    },
...
}

Funnily enough, the UUID is stored as well (and "block" even uses a by-partuuid path), but for "data" the /dev path is what is actually used during activation. My "fix" is to regenerate the OSD JSON just before every start of a ceph-disk OSD.
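
To spot affected OSDs before a reboot, the stored JSON can be checked for bare kernel device names. The following is only a minimal sketch: the function name unstable_paths is made up for illustration, and treating everything outside /dev/disk/by-* as unstable is a heuristic of mine, not something ceph-volume does. The sample data is the excerpt shown above with the elided parts left out.

```python
import json


def unstable_paths(osd_meta):
    """Return {section: path} for sections whose "path" is a bare /dev name.

    Heuristic (an assumption for this sketch): a path is considered stable
    only if it lives under /dev/disk/by-*, e.g. by-partuuid.
    """
    flagged = {}
    for section, value in osd_meta.items():
        if not isinstance(value, dict):
            continue  # skip scalar entries such as IDs or flags
        path = value.get("path", "")
        if not isinstance(path, str):
            continue
        if path.startswith("/dev/") and not path.startswith("/dev/disk/by-"):
            flagged[section] = path
    return flagged


# Excerpt of the JSON from this message (elided keys omitted).
sample = json.loads("""
{
    "block": {
        "path": "/dev/disk/by-partuuid/0c8a9f89-efa7-4c75-87ad-2f0d5aa2d649",
        "uuid": "0c8a9f89-efa7-4c75-87ad-2f0d5aa2d649"
    },
    "data": {
        "path": "/dev/sdm1",
        "uuid": "2eaf591b-bced-4097-9499-5fda071c6161"
    }
}
""")

print(unstable_paths(sample))  # {'data': '/dev/sdm1'}
```

On a real node one would load each file under /etc/ceph/osd/ with json.load() and pass the result to the same function.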

You seem to be using LVM OSDs already, so this is a bit weird (it can't be the exact same issue). Still, I would not be surprised if you are bitten by something similar: some stored config (cache) overriding the actual drive location. It is really a blessing that the developers implemented a check that a partition actually holds the data for the correct OSD ID; otherwise our cluster would be wrecked by now.

I would start by using the low-level commands (ceph-volume) directly to see whether the issue is low-level or sits in some higher-level interface. Log in to the OSD node, check what "ceph-volume inventory" says, and see whether you can manually activate/deactivate the OSD on disk (be careful to include the --no-systemd option everywhere to avoid unintended changes to persistent configuration).
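
Before activating anything by hand, it may help to confirm which device names the cluster believes are shared. The sketch below groups lines of "ceph device ls-by-host" output by device name and reports names claimed by more than one physical device. It assumes the three-column layout (device ID, device name, daemon) seen in the quoted message below; the function name shared_device_names is hypothetical.

```python
from collections import defaultdict


def shared_device_names(listing):
    """Map each device name to its entries if more than one device ID claims it.

    Assumes whitespace-separated columns: device ID, device name, daemons.
    """
    by_dev = defaultdict(list)
    for line in listing.splitlines():
        fields = line.split()
        if len(fields) < 3 or fields[0] == "DEVICE":
            continue  # skip blank lines and a header row, if present
        device_id, dev = fields[0], fields[1]
        daemons = [f for f in fields[2:] if f.startswith("osd.")]
        by_dev[dev].append((device_id, daemons))
    return {dev: entries for dev, entries in by_dev.items() if len(entries) > 1}


# The two lines from the quoted message.
listing = """\
ATA_HGST_HUH728080ALN600_VJH4GLUX sdu  osd.665
ATA_HGST_HUH728080ALN600_VJH60MAX sdu  osd.657
"""

for dev, entries in shared_device_names(listing).items():
    for device_id, daemons in entries:
        print(dev, device_id, " ".join(daemons))
```

Two different drive serials reported under the same device name, as here, would suggest stale device metadata rather than two OSDs really sharing one disk.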

Best regards,
=================
Frank Schilder
AIT Risø Campus
Bygning 109, rum S14

________________________________________
From: apeisker@xxxxxxxx <apeisker@xxxxxxxx>
Sent: Friday, August 25, 2023 10:29 PM
To: ceph-users@xxxxxxx
Subject:  Re: A couple OSDs not starting after host reboot

Hi,

Thank you for your reply. I don’t think the device names changed, but Ceph seems to be confused about which device the OSD is on. It’s reporting that there are two OSDs on the same device, although this is not true.

ceph device ls-by-host <osd-node> | grep sdu
ATA_HGST_HUH728080ALN600_VJH4GLUX sdu  osd.665
ATA_HGST_HUH728080ALN600_VJH60MAX sdu  osd.657

osd.665 is actually on device sdm. Could this be the cause of the issue? Is there a way to correct it?
Thanks,
Alison
_______________________________________________
ceph-users mailing list -- ceph-users@xxxxxxx
To unsubscribe send an email to ceph-users-leave@xxxxxxx