Re: Shared WAL/DB device partition for multiple OSDs?

On Thu, Aug 23, 2018 at 9:12 AM, Hervé Ballans
<herve.ballans@xxxxxxxxxxxxx> wrote:
> On 23/08/2018 at 12:51, Alfredo Deza wrote:
>>
>> On Thu, Aug 23, 2018 at 5:42 AM, Hervé Ballans
>> <herve.ballans@xxxxxxxxxxxxx> wrote:
>>>
>>> Hello all,
>>>
>>> I would like to continue a thread that dates back to last May (sorry if
>>> this is not good practice...)
>>>
>>> Thanks David for your useful tips on this thread.
>>> On my side, I created my OSDs with ceph-deploy (rather than ceph-volume
>>> directly) [1], but the context is exactly the same as the one mentioned
>>> in that thread (HDD drives for the OSDs and WAL/DB partitions on an NVMe
>>> device).
>>>
>>> The problem I encounter is that the script that pins the block.db
>>> partitions by their UUID works very well on a running system but does
>>> not survive a reboot of the OSD node. If I restart the server, the
>>> block.db symbolic links come back up pointing at the device name
>>> /dev/nvme...
>>> The problem gets worse when there are 2 NVMe devices on the same node,
>>> because in that case the paths to the block.db partitions can end up
>>> swapped and obviously the OSDs don't start!
>>
>> You didn't mention which versions of ceph-deploy and Ceph you are
>> using. Since you brought up partitions and OSDs that are not coming
>> up, it sounds like this is related to using ceph-disk and ceph-deploy
>> 1.5.X.
>>
>> I would suggest trying out the newer version of ceph-deploy (2.0.X)
>> and using ceph-volume. The one caveat is that if you need a separate
>> block.db on the NVMe device, you would need to create the LV yourself.
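
For reference, creating the LV by hand looks roughly like this. This is only
a sketch: the VG/LV names, the 30G size and the device names are placeholders
to adapt to your own layout.

# Example only: carve a VG/LV out of the NVMe device for block.db
pvcreate /dev/nvme0n1
vgcreate ceph-db-0 /dev/nvme0n1
lvcreate -L 30G -n db-0 ceph-db-0

# Then point ceph-volume at the HDD for data and at the LV for block.db
ceph-volume lvm create --bluestore --data /dev/sdb --block.db ceph-db-0/db-0
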
>
>
> Thanks Alfredo for your reply. I'm using the latest version of Luminous
> (12.2.7) and of ceph-deploy (2.0.1).
> I have no problem creating my OSDs, that works perfectly.
> My issue only concerns the mount names of the NVMe partitions, which
> change after a reboot when there is more than one NVMe device on the
> OSD node.

ceph-volume is pretty resilient to partition name changes because it stores
the PARTUUID of the partition in LVM and queries it each time at boot.
Note that for bluestore there is no mounting whatsoever. Have you created
partitions with a PARTUUID on the NVMe devices for block.db?
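
For example, a quick way to check that (using the device names from the
earlier example in this thread, so adjust them to your own layout):

# List the partitions on an NVMe device along with their PARTUUIDs
lsblk -o NAME,PARTUUID /dev/nvme0n1

# Or query a single partition
blkid -o value -s PARTUUID /dev/nvme0n1p2

# These are the stable paths a block.db symlink should ultimately resolve to
ls -l /dev/disk/by-partuuid/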

>
> For instance, if I have two NVMe devices, the first device initially comes
> up with the name /dev/nvme0n1 and the second with the name /dev/nvme1n1.
> After a node restart, these names can be swapped, that is, the first
> device is named /dev/nvme1n1 and the second one /dev/nvme0n1! The result
> is that the OSDs no longer find their metadata and do not start up...

This sounds very odd. Could you clarify where block and block.db are?
It would also be useful to take a look at
/var/log/ceph/ceph-volume-systemd.log and ceph-volume.log to see how
ceph-volume is trying to get this OSD up and running.

It would also be useful to check `ceph-volume lvm list` to verify that,
regardless of the name change, it recognizes the correct partition
mapped to the OSD.
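
Roughly along these lines (the grep patterns are only a starting point):

# Show how each OSD maps to its data device and its block.db partition
ceph-volume lvm list

# See what ceph-volume resolved at activation time
grep -i 'osd' /var/log/ceph/ceph-volume-systemd.log
grep -i 'block.db' /var/log/ceph/ceph-volume.log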
>
>
>> Some of the manual steps are covered in the bluestore config
>> reference:
>> http://docs.ceph.com/docs/master/rados/configuration/bluestore-config-ref/#block-and-block-db
>>>
>>> As I'm not yet in production, I can probably recreate all my OSDs and
>>> force the paths to the block.db partitions to use UUIDs, but I would
>>> like to know whether there is a way to "freeze" the configuration of
>>> the block.db paths by their UUID after the fact ("a posteriori")?
>>>
>>> Or maybe (but this is more of a system administration issue) there is a
>>> way on a Linux system to force an NVMe disk to always appear with a
>>> fixed device name? (I should note that my NVMe partitions do not have a
>>> filesystem.)
>>>
>>> Thanks for your help,
>>> Hervé
>>>
>>> [1] from the admin node:
>>> ceph-deploy osd create --debug --bluestore --data $hdd --block-db $db
>>> $osdnode
>>>
>>> On 11/05/2018 at 18:46, David Turner wrote:
>>>
>>> # Create the OSD
>>> echo "/dev/sdb /dev/nvme0n1p2
>>> /dev/sdc /dev/nvme0n1p3" | while read hdd db; do
>>>    ceph-volume lvm create --bluestore --data $hdd --block.db $db
>>> done
>>>
>>> # Fix the OSDs to look for the block.db partition by UUID instead of its
>>> device name.
>>> for db in /var/lib/ceph/osd/*/block.db; do
>>>    dev=$(readlink $db | grep -Eo
>>> nvme[[:digit:]]+n[[:digit:]]+p[[:digit:]]+
>>> || echo false)
>>>    if [[ "$dev" != false ]]; then
>>>      uuid=$(ls -l /dev/disk/by-partuuid/ | awk '/'${dev}'$/ {print $9}')
>>>      ln -sf /dev/disk/by-partuuid/$uuid $db
>>>    fi
>>> done
>>> systemctl restart ceph-osd.target
_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com



