Re: OSD id 241 != my id 248: conversion from "ceph-disk" to "ceph-volume simple" destroys OSDs


Hi Chris,

thanks for looking at this issue in more detail.

I sent two messages on this issue and I'm afraid you didn't get all the information. There seem to be at least two occurrences of the same bug. Yes, I'm pretty sure data.path should also be a stable device path instead of /dev/sdq1. But this is the second occurrence of this bug; the other one is for block.path, which is not visible in the message I sent to you but has more dramatic consequences.

Please find below the full story. Unless you can do it, I will file a ticket. To me this looks like a general occurrence of using unstable device paths by accident that should be tracked down everywhere. If you can fix the code, you might want to add a comment to it to make sure the same mistake is not repeated.

Problems:

- ceph-volume simple scan|activate use unstable device paths like "/dev/sd??" instead of stable device paths like "/dev/disk/by-partuuid/UUID", which leads to OSD boot failures when the kernel renames devices at reboot

- ceph-volume simple activate modifies (!!!) OSD metadata from a stable device path to an unstable device path, which not only leads to boot failures but also makes it impossible to move an OSD to a different host, because ceph-volume simple scan will now produce a corrupted json config file

Setup and observation:

I observed this in the situation where after a reboot all disks were re-named. I have a work-flow that deploys containers per physical disk slot and performs a full OSD discovery at every container start to accommodate exchanging OSDs. The basic sequence executed every time is:

ceph-volume simple scan
ceph-volume simple activate

Unfortunately, this sequence is not idempotent, because ceph-volume simple activate modifies (!!!) the symbolic link "block" on the OSD data partition to point to an unstable device path, for example (note the first occurrence of the unstable device path /dev/sdq1 in data.path):

# mount /dev/sdq1 mnt
# ls -l mnt
[...]
lrwxrwxrwx. 1 root root  58 Mar 11 16:17 block -> /dev/disk/by-partuuid/a1e5ef7d-9bab-4911-abe5-9075b91d88a4
[...]
# umount mnt
# ceph-volume simple scan --stdout /dev/sdq1                                                             
Running command: /usr/sbin/cryptsetup status /dev/sdq1
Running command: /usr/bin/mount -v /dev/sdq1 /tmp/tmpmfitNx
 stdout: mount: /dev/sdq1 mounted on /tmp/tmpmfitNx.
Running command: /usr/bin/umount -v /tmp/tmpmfitNx
 stderr: umount: /tmp/tmpmfitNx (/dev/sdq1) unmounted
{
    "active": "ok", 
    "block": {
        "path": "/dev/disk/by-partuuid/a1e5ef7d-9bab-4911-abe5-9075b91d88a4", 
        "uuid": "a1e5ef7d-9bab-4911-abe5-9075b91d88a4"
    }, 
    "block_uuid": "a1e5ef7d-9bab-4911-abe5-9075b91d88a4", 
    "bluefs": 1, 
    "ceph_fsid": "e4ece518-f2cb-4708-b00f-b6bf511e91d9", 
    "cluster_name": "ceph", 
    "data": {
        "path": "/dev/sdq1", 
        "uuid": "9b88d6ec-87a4-4640-b80e-81d3d56fac15"
    }, 
    "fsid": "9b88d6ec-87a4-4640-b80e-81d3d56fac15", 
    "keyring": "AQBP4opcBeCYOxAA4sOpTthNE6T28WUf4Bgm3w==", 
    "kv_backend": "rocksdb", 
    "magic": "ceph osd volume v026", 
    "mkfs_done": "yes", 
    "none": "", 
    "ready": "ready", 
    "require_osd_release": "", 
    "type": "bluestore", 
    "whoami": 59
}
# ceph-volume simple activate --file "/etc/ceph/osd/59-9b88d6ec-87a4-4640-b80e-81d3d56fac15.json" --no-systemd
Running command: /usr/bin/mount -v /dev/sdq1 /var/lib/ceph/osd/ceph-59
 stdout: mount: /dev/sdq1 mounted on /var/lib/ceph/osd/ceph-59.
Running command: /usr/bin/ln -snf /dev/sdq2 /var/lib/ceph/osd/ceph-59/block       <<<--- Oh no !!!
Running command: /usr/bin/chown -R ceph:ceph /dev/sdq2
--> Skipping enabling of `simple` systemd unit
--> Skipping masking of ceph-disk systemd units
--> Skipping enabling and starting OSD simple systemd unit because --no-systemd was used
--> Successfully activated OSD 59 with FSID 9b88d6ec-87a4-4640-b80e-81d3d56fac15

# !!! Note the command "/usr/bin/ln -snf /dev/sdq2 /var/lib/ceph/osd/ceph-59/block" in the output,
# which corrupts the OSD's metadata!

# ls -l /var/lib/ceph/osd/ceph-59
[...]
lrwxrwxrwx. 1 root root   9 Mar 12 13:06 block -> /dev/sdq2
[...]

# This OSD now holds corrupted metadata in the form of a symbolic link with an unstable device path
# as its link target. Subsequent discoveries now produce corrupt .json config files, and moving this disk
# to another host has turned into a real pain:

# umount /var/lib/ceph/osd/ceph-59
# ceph-volume simple scan --stdout /dev/sdq1
Running command: /usr/sbin/cryptsetup status /dev/sdq1
Running command: /usr/bin/mount -v /dev/sdq1 /tmp/tmpABkQsj
 stdout: mount: /dev/sdq1 mounted on /tmp/tmpABkQsj.
Running command: /usr/bin/umount -v /tmp/tmpABkQsj
 stderr: umount: /tmp/tmpABkQsj (/dev/sdq1) unmounted
{
    "active": "ok", 
    "block": {
        "path": "/dev/sdq2", 
        "uuid": "a1e5ef7d-9bab-4911-abe5-9075b91d88a4"
    }, 
    "block_uuid": "a1e5ef7d-9bab-4911-abe5-9075b91d88a4", 
    "bluefs": 1, 
    "ceph_fsid": "e4ece518-f2cb-4708-b00f-b6bf511e91d9", 
    "cluster_name": "ceph", 
    "data": {
        "path": "/dev/sdq1", 
        "uuid": "9b88d6ec-87a4-4640-b80e-81d3d56fac15"
    }, 
    "fsid": "9b88d6ec-87a4-4640-b80e-81d3d56fac15", 
    "keyring": "AQBP4opcBeCYOxAA4sOpTthNE6T28WUf4Bgm3w==", 
    "kv_backend": "rocksdb", 
    "magic": "ceph osd volume v026", 
    "mkfs_done": "yes", 
    "none": "", 
    "ready": "ready", 
    "require_osd_release": "", 
    "type": "bluestore", 
    "whoami": 59
}

In this example, the disk names didn't change, which implies that this OSD will still start as long as the disk is named /dev/sdq. However, if the disk names change, ceph-volume simple scan unfortunately follows the broken symlink instead of using block_uuid for discovery, which leads to a completely corrupted .json file similar to this one:

# ceph-volume simple scan --stdout /dev/sdb1
Running command: /usr/sbin/cryptsetup status /dev/sdb1
{
    "active": "ok",
    "block": {
        "path": "/dev/sda2",
        "uuid": "b5ac1462-510a-4483-8f42-604e6adc5c9d"
    },
    "block_uuid": "1d9d89a2-18c7-4610-9dcd-167d44ce1879",
    "bluefs": 1,
    "ceph_fsid": "e4ece518-f2cb-4708-b00f-b6bf511e91d9",
    "cluster_name": "ceph",
    "data": {
        "path": "/dev/sdb1",
        "uuid": "c35a7efb-8c1c-42a1-8027-cf422d7e7ecb"
    },
    "fsid": "c35a7efb-8c1c-42a1-8027-cf422d7e7ecb",
    "keyring": "AQAZJ6ddedALDxAAJI7NLJ2CRFoQWK5STRpHuw==",
    "kv_backend": "rocksdb",
    "magic": "ceph osd volume v026",
    "mkfs_done": "yes",
    "none": "",
    "ready": "ready",
    "require_osd_release": "",
    "type": "bluestore",
    "whoami": 241
}

Notice that now block_uuid and block.uuid do not match any more. This corruption requires manual repair and I had to do this for an entire cluster.
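
A check like the following could flag such files before an OSD is activated. This is only a minimal sketch: find_problems and the list of device keys are my own illustrative names, not part of ceph-volume, and it assumes the JSON layout shown in the scan output above.

```python
STABLE_PREFIX = "/dev/disk/by-partuuid/"
# Device entries that may appear in a scan result (assumed list, for illustration).
DEVICE_KEYS = ("data", "block", "block.db", "block.wal", "journal")


def find_problems(metadata):
    """Return a list of human-readable problems found in one scan result."""
    problems = []

    # The UUID recorded for the block device must match the OSD's own block_uuid.
    block = metadata.get("block")
    block_uuid = metadata.get("block_uuid")
    if isinstance(block, dict) and block_uuid and block.get("uuid") != block_uuid:
        problems.append(
            "block.uuid %s != block_uuid %s" % (block.get("uuid"), block_uuid)
        )

    # Every recorded device path should be a stable by-partuuid path.
    for key in DEVICE_KEYS:
        entry = metadata.get(key)
        if not isinstance(entry, dict):
            continue
        path = entry.get("path")
        if path and not path.startswith(STABLE_PREFIX):
            problems.append("%s.path %s is not a by-partuuid path" % (key, path))

    return problems
```

Applied to the corrupted file above, this reports the UUID mismatch as well as the two /dev/sd* paths; an empty list means the file passed both checks.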

Resolution:

All OSDs I converted from "ceph-disk" to "ceph-volume simple" failed to boot after a server reboot that shifted the device names and thereby invalidated all symbolic links to the block devices. Fortunately, the OSDs recognised that the block device partition belonged to another OSD ID and exited with an error; otherwise I would probably have lost data. To fix this, I needed to write a script that resets the link target of the symlink "block" to the correct by-partuuid path.
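
The core of such a repair can be sketched roughly as follows. This is not the script I actually ran, just a minimal illustration: reset_block_link is a made-up name, and it assumes that the mounted data partition contains a file "block_uuid" (which is where the block_uuid key in the scan output appears to come from) and that the OSD is stopped while the link is changed.

```python
import os


def reset_block_link(osd_dir, partuuid_dir="/dev/disk/by-partuuid"):
    """Point <osd_dir>/block back at the by-partuuid path of the block device.

    Assumes <osd_dir>/block_uuid holds the partition UUID of the block device.
    Returns the new link target.
    """
    with open(os.path.join(osd_dir, "block_uuid")) as handle:
        block_uuid = handle.read().strip()
    if not block_uuid:
        raise ValueError("empty block_uuid in %s" % osd_dir)

    target = os.path.join(partuuid_dir, block_uuid)
    link = os.path.join(osd_dir, "block")

    # Create the new symlink under a temporary name and rename it over the
    # old one, so that a valid "block" entry exists at every point in time.
    tmp = link + ".tmp"
    if os.path.lexists(tmp):
        os.remove(tmp)
    os.symlink(target, tmp)
    os.replace(tmp, link)
    return target
```

For the OSD above, one would mount the data partition and call reset_block_link("/var/lib/ceph/osd/ceph-59"). The function does not check that the target device exists, so the result should be verified with ls -l before the OSD is started.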

Using unstable device paths is one thing that can happen by accident. However, what I really do not understand is why "ceph-volume simple activate" *modifies* metadata that should be considered read-only. I found this in the code at src/ceph-volume/ceph_volume/devices/simple/activate.py:200-203:

            # always re-do the symlink regardless if it exists, so that the journal
            # device path that may have changed can be mapped correctly every time
            destination = os.path.join(osd_dir, name)
            process.run(['ln', '-snf', device, destination])

Maybe the intention is correct, I don't know. However, the execution is not. At this point, a dictionary of UUIDs should be used with explicit link targets as in "/dev/disk/by-partuuid/"+uuid instead of "device" to make absolutely sure nothing gets rigged here. I think a correct version of the code in src/ceph-volume/ceph_volume/devices/simple/activate.py:190-206 would look something like this

        uuid_map = {
            'journal': osd_metadata.get('journal', {}).get('uuid'),
            'block': osd_metadata.get('block', {}).get('uuid'),
            'block.db': osd_metadata.get('block.db', {}).get('uuid'),
            'block.wal': osd_metadata.get('block.wal', {}).get('uuid')
        }

        for name, uuid in uuid_map.items():
            if not uuid:
                continue
            # always re-do the symlink regardless if it exists, so that the journal
            # device path that may have changed can be mapped correctly every time
            destination = os.path.join(osd_dir, name)
            process.run(['ln', '-snf', '/dev/disk/by-partuuid/'+uuid, destination])

            # make sure that the journal has proper permissions
            system.chown(self.get_device(uuid))

This will be very explicit about using stable device paths. Needless to say that other occurrences as in src/ceph-volume/ceph_volume/devices/simple/scan.py:89-90 should be addressed as well, for example:

        device_metadata['uuid'] = device_uuid
        device_metadata['path'] = device

could be corrected in a similar way:

        device_metadata['uuid'] = device_uuid
        device_metadata['path'] = '/dev/disk/by-partuuid/'+device_uuid

There are probably more locations that deserve a close look.

Hope that explains the calamities I found myself in.

Best regards,
=================
Frank Schilder
AIT Risø Campus
Bygning 109, rum S14

________________________________________
From: Chris Dunlop <chris@xxxxxxxxxxxx>
Sent: 11 March 2021 23:46:08
To: Frank Schilder
Cc: ceph-users@xxxxxxx
Subject: Re:  OSD id 241 != my id 248: conversion from "ceph-disk" to "ceph-volume simple" destroys OSDs

Hi Frank,

I agree there's a problem there. However, to clarify: the json file
already contains the /dev/sdq1 path (at data:path) and the "simple activate"
is just reading the file. I.e. the problem lies with the json file creator,
which was the "ceph-volume simple scan" step.

To fix your immediate issue I'd suggest fixing the existing json files to
point data:path to the by-partuuid path. That should allow your "simple
activate" to work with the stable paths.
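
A rough sketch of such a fix-up, in case it helps. The function name and the list of device keys are made up for illustration, and it assumes each device entry carries a "uuid" that is the partition UUID. Where a top-level "<name>_uuid" key exists (such as block_uuid) it is preferred, on the assumption that it comes from the OSD itself rather than from a resolved path.

```python
STABLE_PREFIX = "/dev/disk/by-partuuid/"
# Device entries that may appear in a json file (assumed list, for illustration).
DEVICE_KEYS = ("data", "block", "block.db", "block.wal", "journal")


def stabilize(metadata):
    """Return a copy of the metadata with device paths derived from UUIDs."""
    fixed = dict(metadata)
    for key in DEVICE_KEYS:
        entry = metadata.get(key)
        if not isinstance(entry, dict):
            continue
        # Prefer e.g. "block_uuid" over "block.uuid" when both are present.
        uuid = metadata.get(key + "_uuid") or entry.get("uuid")
        if not uuid:
            continue
        fixed[key] = dict(entry, path=STABLE_PREFIX + uuid, uuid=uuid)
    return fixed
```

The result can be written back with json.dump after loading the original with json.load; keeping a copy of the original file seems prudent.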

In general it seems something in the "ceph-volume simple scan" is doing a
"realpath()" or "readlink()" on the by-partuuid path to get to the /dev/sdq1
path.

Oh, it's likely this is the culprit:

src/ceph-volume/ceph_volume/devices/simple/scan.py
class Scan(object):
     ...
     def scan_device(self, path):
         ...
         if os.path.islink(path):
             device = os.readlink(path)
         else:
             device = path

I'm not sure what the general fix might be - there may be good reason to
prefer the symlink destination path, e.g. the next steps use the 'device'
var to look for lvm info which may not work with the original symlink path.
I'll leave it up to the developers to work out a proper solution to this!
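
One possible direction, purely as a sketch and not what ceph-volume does: keep the original (stable) path for the metadata and resolve it to the current kernel name only where a lookup needs it. The helper name describe_device is hypothetical.

```python
import os


def describe_device(path):
    """Return the path to record and the device it currently resolves to."""
    return {
        # What should be written to the json file: the path as stored on the OSD.
        "path": path,
        # Current kernel device name, for lookups only; never to be persisted.
        "resolved": os.path.realpath(path),
    }
```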

If you haven't already it's probably worth opening a ticket.

Cheers,

Chris

On Thu, Mar 11, 2021 at 02:19:57PM +0000, Frank Schilder wrote:
> Hi Chris,
>
> I found the problem. "ceph-volume simple activate" modifies the OSD's meta data in an invalid way.
>
> On a pre lvm-converted ceph-disk OSD I had in my cupboard:
>
> [root@ceph-adm:ceph-20 ~]# mount /dev/sdq1 mnt
> [root@ceph-adm:ceph-20 ~]# ls -l mnt
> [...]
> lrwxrwxrwx. 1 ceph ceph  58 Mar 15  2019 block -> /dev/disk/by-partuuid/a1e5ef7d-9bab-4911-abe5-9075b91d88a4
> [..]
> [root@ceph-adm:ceph-20 ~]# umount mnt
>
> [root@ceph-adm:ceph-20 ~]# cat /etc/ceph/osd/59-9b88d6ec-87a4-4640-b80e-81d3d56fac15.json
> {
>    "active": "ok",
>    "block": {
>        "path": "/dev/disk/by-partuuid/a1e5ef7d-9bab-4911-abe5-9075b91d88a4",
>        "uuid": "a1e5ef7d-9bab-4911-abe5-9075b91d88a4"
>    },
>    "block_uuid": "a1e5ef7d-9bab-4911-abe5-9075b91d88a4",
>    "bluefs": 1,
>    "ceph_fsid": "e4ece518-f2cb-4708-b00f-b6bf511e91d9",
>    "cluster_name": "ceph",
>    "data": {
>        "path": "/dev/sdq1",
>        "uuid": "9b88d6ec-87a4-4640-b80e-81d3d56fac15"
>    },
>    "fsid": "9b88d6ec-87a4-4640-b80e-81d3d56fac15",
>    "keyring": "AQBP4opcBeCYOxAA4sOpTthNE6T28WUf4Bgm3w==",
>    "kv_backend": "rocksdb",
>    "magic": "ceph osd volume v026",
>    "mkfs_done": "yes",
>    "none": "",
>    "ready": "ready",
>    "require_osd_release": "",
>    "type": "bluestore",
>    "whoami": 59
> }
>
> Now, "ceph-volume simple activate" modifies the symlink "block" to point to an unstable path:
>
> [root@ceph-adm:ceph-20 ~]# ceph-volume simple activate --file "/etc/ceph/osd/59-9b88d6ec-87a4-4640-b80e-81d3d56fac15.json" --no-systemd
> Running command: /usr/bin/mount -v /dev/sdq1 /var/lib/ceph/osd/ceph-59
> stdout: mount: /dev/sdq1 mounted on /var/lib/ceph/osd/ceph-59.
> Running command: /usr/bin/ln -snf /dev/sdq2 /var/lib/ceph/osd/ceph-59/block
> Running command: /usr/bin/chown -R ceph:ceph /dev/sdq2
> --> Skipping enabling of `simple` systemd unit
> --> Skipping masking of ceph-disk systemd units
> --> Skipping enabling and starting OSD simple systemd unit because --no-systemd was used
> --> Successfully activated OSD 59 with FSID 9b88d6ec-87a4-4640-b80e-81d3d56fac15
>
> It's the command "/usr/bin/ln -snf /dev/sdq2 /var/lib/ceph/osd/ceph-59/block" that destroys the integrity of the OSD. If you reboot the machine and the devices get different names, the next execution of "ceph-volume simple scan" will produce a corrupted meta data file. This will also happen if you move a converted OSD to another host and try to scan+start it.
>
> The change of the symbolic link to an unstable device path is a critical bug and I don't even understand why it happens in the first place. There is no point, and the only valid link target would be "/dev/disk/by-partuuid/a1e5ef7d-9bab-4911-abe5-9075b91d88a4" anyway.
>
> I can work around that by resetting the link to its correct value after activation. However, this should really be fixed.
>
> Best regards,
> =================
> Frank Schilder
> AIT Risø Campus
> Bygning 109, rum S14