Re: Shared WAL/DB device partition for multiple OSDs?

On 23/08/2018 at 15:20, Alfredo Deza wrote:
Thanks Alfredo for your reply. I'm using the latest version of Luminous
(12.2.7) and ceph-deploy (2.0.1).
I have no problem creating my OSDs; that works perfectly.
My issue only concerns the mount names of the NVMe partitions, which change
after a reboot when there is more than one NVMe device on the OSD node.
ceph-volume is pretty resilient to partition changes because it stores
the PARTUUID of the partition in LVM, and it queries
it each time at boot. Note that for bluestore there is no mounting
whatsoever. Have you created partitions with a PARTUUID on the NVMe
devices for block.db?
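
If I'm not mistaken, parted with a GPT label gives every partition its own
partition GUID, which is what shows up as the PARTUUID. A quick way to check
on the node would be something like this (the device name below is only an
example, and I haven't pasted the output here):

# blkid -s PARTUUID -o value /dev/nvme0n1p1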

Here is how I created my BlueStore OSDs (on the first OSD node):
 
1) On the OSD node node-osd0, I first created the block.db partitions on the NVMe device (PM1725a 800GB), like this:

# parted /dev/nvme0n1 mklabel gpt

# echo "1 0 10
2 10 20
3 20 30
4 30 40
5 40 50
6 50 60
7 60 70
8 70 80
9 80 90
10 90 100" | while read num beg end; do parted /dev/nvme0n1 mkpart $num $beg% $end%; done


Extract of cat /proc/partitions:

 259        2  781412184 nvme1n1
 259        3  781412184 nvme0n1
 259        5   78140416 nvme0n1p1
 259        6   78141440 nvme0n1p2
 259        7   78140416 nvme0n1p3
 259        8   78141440 nvme0n1p4
 259        9   78141440 nvme0n1p5
 259       10   78141440 nvme0n1p6
 259       11   78140416 nvme0n1p7
 259       12   78141440 nvme0n1p8
 259       13   78141440 nvme0n1p9
 259       15   78140416 nvme0n1p10
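
(The PARTUUIDs don't appear in /proc/partitions; if it's useful, they can be
listed with something like the following, output not included here:)

# lsblk -o NAME,SIZE,PARTUUID /dev/nvme0n1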


2) Then, from the admin node, I created my first 10 OSDs like this:

echo "/dev/sda /dev/nvme0n1p1
/dev/sdb /dev/nvme0n1p2
/dev/sdc /dev/nvme0n1p3
/dev/sdd /dev/nvme0n1p4
/dev/sde /dev/nvme0n1p5
/dev/sdf /dev/nvme0n1p6
/dev/sdg /dev/nvme0n1p7
/dev/sdh /dev/nvme0n1p8
/dev/sdi /dev/nvme0n1p9
/dev/sdj /dev/nvme0n1p10" | while read hdd db; do ceph-deploy osd create --debug --bluestore --data $hdd --block-db $db node-osd0; done


Do you mean that, at this stage, I should directly use the PARTUUID paths as the value of --block-db (i.e. replace /dev/nvme0n1p1 with its PARTUUID), is that right?

So far I have created 60 OSDs this way. The Ceph cluster is HEALTH_OK and all OSDs are up and in. But I'm not yet in production and there is only test data on it, so I can destroy everything and rebuild my OSDs.
Is that what you advise me to do, taking care to specify the PARTUUID for the block.db instead of the device names?
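
For example (this is only my understanding of your suggestion, with a
placeholder instead of a real UUID), I would look up each partition's PARTUUID
and pass the stable /dev/disk/by-partuuid path to --block-db:

# blkid -s PARTUUID -o value /dev/nvme0n1p1
# ceph-deploy osd create --debug --bluestore --data /dev/sda --block-db /dev/disk/by-partuuid/<PARTUUID-of-nvme0n1p1> node-osd0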


For instance, if I have two NVMe devices, the first time the first device
is mounted with the name /dev/nvme0n1 and the second device with the name
/dev/nvme1n1. After a node restart, these names can be swapped, that is, the
first device is named /dev/nvme1n1 and the second one /dev/nvme0n1! The result
is that the OSDs no longer find their metadata and do not start up...
This sounds very odd. Could you clarify where block and block.db are?
Also useful here would be to take a look at
/var/log/ceph/ceph-volume-systemd.log and ceph-volume.log to
see how ceph-volume is trying to get this OSD up and running.

Also useful would be to check `ceph-volume lvm list` to verify that,
regardless of the name change, it recognizes the correct partition
mapped to the OSD.

Oops!

# ceph-volume lvm list
-->  KeyError: 'devices'
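
(In the meantime, as far as I understand, ceph-volume also records this
metadata as LVM tags on the data LV, e.g. ceph.db_device / ceph.db_uuid, so I
suppose it can be inspected directly with lvs; just a sketch, output not
included:)

# lvs -o lv_name,vg_name,lv_tags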

Thank you again,
Hervé

_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
