Re: Shared WAL/DB device partition for multiple OSDs?

On 23/08/2018 at 12:51, Alfredo Deza wrote:
On Thu, Aug 23, 2018 at 5:42 AM, Hervé Ballans
<herve.ballans@xxxxxxxxxxxxx> wrote:
Hello all,

I would like to continue a thread that dates back to last May (sorry if this
is not good practice...).

Thanks David for your useful tips on that thread.
On my side, I created my OSDs with ceph-deploy (instead of ceph-volume) [1],
but the context is exactly the same as the one discussed in that thread
(HDD drives for the OSDs and WAL/DB partitions on an NVMe device).

The problem I encounter is that the script that fixes the block.db partitions
by their UUID works very well while the node is up, but does not survive a
reboot of the OSD node. If I restart the server, the block.db symbolic links
come back up pointing at the device name /dev/nvme...
The problem gets worse when there are 2 NVMe devices on the same node, because
in that case the paths to the block.db partitions can end up swapped and the
OSDs obviously don't start!
You didn't mention what versions of ceph-deploy and Ceph you are
using. Since you brought up partitions and OSDs that are not coming
up, it seems that this is related to using ceph-disk and ceph-deploy 1.5.X.

I would suggest trying out the newer version of ceph-deploy (2.0.X)
and using ceph-volume; the one caveat is that if you need a separate
block.db on the NVMe device, you would need to create the LV yourself.

Thanks Alfredo for your reply. I'm using the latest version of Luminous (12.2.7) and of ceph-deploy (2.0.1).
I have no problem creating my OSDs; that works perfectly.
My issue only concerns the device names of the NVMe partitions, which change after a reboot when there is more than one NVMe device on the OSD node.

For instance, if I have two NVMe devices, the first device initially comes up as /dev/nvme0n1 and the second as /dev/nvme1n1. After a node restart these names can be swapped, i.e. the first device becomes /dev/nvme1n1 and the second one /dev/nvme0n1! The result is that the OSDs no longer find their metadata and do not start up...
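
For what it's worth, the kernel-assigned /dev/nvmeXnY names are indeed not guaranteed to be stable across reboots, while the udev-maintained links under /dev/disk/ are. A quick way to check the persistent aliases on a node, and to see what a given block.db currently resolves to, is sketched below (the OSD id is only an example):

# Persistent aliases that survive reboots even when nvme0n1 and nvme1n1 swap:
ls -l /dev/disk/by-partuuid/ | grep nvme
ls -l /dev/disk/by-id/ | grep nvme

# Which partition a given OSD's block.db currently points at (osd.0 as an example):
readlink -f /var/lib/ceph/osd/ceph-0/block.db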

Some of the manual steps are covered in the bluestore config
reference: http://docs.ceph.com/docs/master/rados/configuration/bluestore-config-ref/#block-and-block-db
As I'm not yet in production, I can probably recreate all my OSDs and force
the paths to the block.db partitions to use UUIDs, but I would like to know if
there is a way to "freeze" the configuration of the block.db paths by their
UUID after the fact?

Or maybe (but this is more of a system administration issue) there is a way
on a Linux system to force an NVMe disk to always come up under a fixed device
name? (I should point out that my NVMe partitions do not have a filesystem.)

Thanks for your help,
Hervé
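
On the fixed-device-name question, one common approach is not to rename the kernel device but to add a udev rule that creates an extra stable symlink keyed on the drive's serial number. This is only a rough sketch: the rule file name, the alias names and the serial values are placeholders (read the real serials from /sys/class/nvme/nvme*/serial), and the by-partuuid links mentioned above may already make it unnecessary:

# /etc/udev/rules.d/99-nvme-aliases.rules  (serials below are placeholders)
KERNEL=="nvme?n?", ATTRS{serial}=="SERIAL_OF_FIRST_NVME",  SYMLINK+="nvme_db0"
KERNEL=="nvme?n?", ATTRS{serial}=="SERIAL_OF_SECOND_NVME", SYMLINK+="nvme_db1"

# then reload the rules and re-trigger udev:
udevadm control --reload
udevadm trigger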

[1] from the admin node:
ceph-deploy osd create --debug --bluestore --data $hdd --block-db $db $osdnode
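
For reference, the "create the LV yourself" route Alfredo mentions would look roughly like the sketch below; the VG/LV names and the 30G size are placeholders, and it assumes ceph-deploy 2.0 passes the vg/lv value straight through to ceph-volume's --block.db. Since LVM identifies physical volumes by UUID, the vg/lv reference stays valid even if the kernel device name changes after a reboot.

# On the OSD node: carve one LV per OSD out of the NVMe device for block.db
pvcreate /dev/nvme0n1
vgcreate nvme-db /dev/nvme0n1
lvcreate -L 30G -n db-sdb nvme-db

# From the admin node: same command as [1], but pointing --block-db at the LV
ceph-deploy osd create --debug --bluestore --data /dev/sdb --block-db nvme-db/db-sdb $osdnode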

On 11/05/2018 at 18:46, David Turner wrote:

# Create the OSD
echo "/dev/sdb /dev/nvme0n1p2
/dev/sdc /dev/nvme0n1p3" | while read hdd db; do
   ceph-volume lvm create --bluestore --data $hdd --block.db $db
done

# Fix the OSDs to look for the block.db partition by UUID instead of its
# device name.
for db in /var/lib/ceph/osd/*/block.db; do
   dev=$(readlink $db | grep -Eo nvme[[:digit:]]+n[[:digit:]]+p[[:digit:]]+ || echo false)
   if [[ "$dev" != false ]]; then
     uuid=$(ls -l /dev/disk/by-partuuid/ | awk '/'${dev}'$/ {print $9}')
     ln -sf /dev/disk/by-partuuid/$uuid $db
   fi
done
systemctl restart ceph-osd.target
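
After the loop above, a quick sanity check (again only a sketch) is to confirm that every block.db now resolves through the stable by-partuuid path and that the OSDs came back up:

# Each link should now go through /dev/disk/by-partuuid rather than /dev/nvmeXnYpZ
ls -l /var/lib/ceph/osd/*/block.db
# and each one should still resolve to a real block device
readlink -f /var/lib/ceph/osd/*/block.db
ceph osd tree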



_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com





