Re: bluefs-bdev-expand experience

Hi Yuri,

Both issues from Round 2 relate to the unsupported expansion of the main (block) device.

In fact it doesn't work and silently bypasses the operation in your case.

Please try with a different device...
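
To double-check which devices an OSD actually exposes to BlueFS, you can look at the symlinks in the OSD data directory (a quick sketch; the path assumes the default layout and osd.2 from your mail):

    # ls -l /var/lib/ceph/osd/ceph-2/block*

An OSD created on the main device only will have just "block" here, with no block.db or block.wal symlinks next to it.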


Also, I've just submitted a PR for Mimic to indicate the bypass; I will backport it to Luminous once the Mimic patch is approved.

See https://github.com/ceph/ceph/pull/27447


Thanks,

Igor

On 4/5/2019 4:07 PM, Yury Shevchuk wrote:
On Fri, Apr 05, 2019 at 02:42:53PM +0300, Igor Fedotov wrote:
wrt Round 1 - the ability to expand the block(main) device has been added to
Nautilus,

see: https://github.com/ceph/ceph/pull/25308
Oh, that's good.  Still, separate WAL and DB volumes may be worth having for
studying the load on each volume (blktrace) or for moving the DB/WAL to
another physical disk with LVM, transparently to Ceph.
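
For example, a dedicated DB LV can be migrated to a new physical disk online
with pvmove, while the OSD keeps running (a rough sketch; /dev/sdb, /dev/sdc
and the osd2db LV name are placeholders for illustration - pvmove -n moves
only the extents belonging to the named LV):

    # pvcreate /dev/sdc
    # vgextend vg0 /dev/sdc
    # pvmove -n osd2db /dev/sdb /dev/sdc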

wrt Round 2:

- not setting the 'size' label looks like a bug, although I recall I fixed it...
Will double-check.

- the wrong stats output is probably related to the lack of a monitor restart -
could you please try that and report back whether it helps? Or even restart the
whole cluster... (I understand that's a bad approach for production, but it
would help verify my hypothesis)
Mon restart didn't help:

node0:~# systemctl restart ceph-mon@0
node1:~# systemctl restart ceph-mon@1
node2:~# systemctl restart ceph-mon@2
node2:~# ceph osd df
ID CLASS WEIGHT  REWEIGHT SIZE    USE     AVAIL  %USE  VAR  PGS
  0   hdd 0.22739  1.00000  233GiB 9.44GiB 223GiB  4.06 0.12 128
  1   hdd 0.22739  1.00000  233GiB 9.44GiB 223GiB  4.06 0.12 128
  3   hdd 0.22739        0      0B      0B     0B     0    0   0
  2   hdd 0.22739  1.00000  800GiB  409GiB 391GiB 51.18 1.51 128
                     TOTAL 1.24TiB  428GiB 837GiB 33.84
MIN/MAX VAR: 0.12/1.51  STDDEV: 26.30

Restarting mgrs and then all ceph daemons on all three nodes didn't
help either:

node2:~# ceph osd df
ID CLASS WEIGHT  REWEIGHT SIZE    USE     AVAIL  %USE  VAR  PGS
  0   hdd 0.22739  1.00000  233GiB 9.43GiB 223GiB  4.05 0.12 128
  1   hdd 0.22739  1.00000  233GiB 9.43GiB 223GiB  4.05 0.12 128
  3   hdd 0.22739        0      0B      0B     0B     0    0   0
  2   hdd 0.22739  1.00000  800GiB  409GiB 391GiB 51.18 1.51 128
                     TOTAL 1.24TiB  428GiB 837GiB 33.84
MIN/MAX VAR: 0.12/1.51  STDDEV: 26.30

Maybe we should upgrade to v14.2.0 Nautilus instead of studying old
bugs... after all, this is a toy cluster for now.

Thank you for responding,


-- Yury

On 4/5/2019 2:06 PM, Yury Shevchuk wrote:
Hello all!

We have a toy 3-node Ceph cluster running Luminous 12.2.11 with one
bluestore osd per node.  We started with pretty small OSDs and would
like to be able to expand OSDs whenever needed.  We had two issues
with the expansion: one turned out user-serviceable while the other
probably needs developers' look.  I will describe both shortly.

Round 1
~~~~~~~
Trying to expand osd.2 by 1TB:

    # lvextend -L+1T /dev/vg0/osd2
      Size of logical volume vg0/osd2 changed from 232.88 GiB (59618 extents) to 1.23 TiB (321762 extents).
      Logical volume vg0/osd2 successfully resized.

    # ceph-bluestore-tool bluefs-bdev-expand --path /var/lib/ceph/osd/ceph-2
    inferring bluefs devices from bluestore path
     slot 1 /var/lib/ceph/osd/ceph-2//block
    1 : size 0x13a38800000 : own 0x[1bf2200000~254300000]
    Expanding...
    1 : can't be expanded. Bypassing...
    #

It didn't work.  The explanation can be found in
ceph/src/os/bluestore/BlueFS.cc at line 310:

    // returns true if specified device is under full bluefs control
    // and hence can be expanded
    bool BlueFS::is_device_expandable(unsigned id)
    {
      if (id >= MAX_BDEV || bdev[id] == nullptr) {
        return false;
      }
      switch(id) {
      case BDEV_WAL:
        return true;

      case BDEV_DB:
        // true if DB volume is non-shared
        return bdev[BDEV_SLOW] != nullptr;
      }
      return false;
    }

So we have to use separate block.db and block.wal devices for the OSD to be
expandable.  Indeed, our OSDs were created without separate block.db
and block.wal, like this:

    ceph-volume lvm create --bluestore --data /dev/vg0/osd2

Recreating osd.2 with separate block.db and block.wal:

    # ceph-volume lvm zap --destroy --osd-id 2
    # lvcreate -L1G -n osd2wal vg0
      Logical volume "osd2wal" created.
    # lvcreate -L40G -n osd2db vg0
      Logical volume "osd2db" created.
    # lvcreate -L400G -n osd2 vg0
      Logical volume "osd2" created.
    # ceph-volume lvm create --osd-id 2 --bluestore --data vg0/osd2 --block.db vg0/osd2db --block.wal vg0/osd2wal

Resync takes some time, and then we have an expandable osd.2.
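
The new layout can be verified before expanding: bluefs-bdev-sizes prints one
line per BlueFS slot, and on the rebuilt OSD all three slots (wal, db, main)
should show up (a quick check, not strictly required):

    # ceph-bluestore-tool bluefs-bdev-sizes --path /var/lib/ceph/osd/ceph-2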


Round 2
~~~~~~~
Trying to expand osd.2 from 400G to 700G:

    # lvextend -L+300G /dev/vg0/osd2
      Size of logical volume vg0/osd2 changed from 400.00 GiB (102400 extents) to 700.00 GiB (179200 extents).
      Logical volume vg0/osd2 successfully resized.

    # ceph-bluestore-tool bluefs-bdev-expand --path /var/lib/ceph/osd/ceph-2/
    inferring bluefs devices from bluestore path
     slot 0 /var/lib/ceph/osd/ceph-2//block.wal
     slot 1 /var/lib/ceph/osd/ceph-2//block.db
     slot 2 /var/lib/ceph/osd/ceph-2//block
    0 : size 0x40000000 : own 0x[1000~3ffff000]
    1 : size 0xa00000000 : own 0x[2000~9ffffe000]
    2 : size 0xaf00000000 : own 0x[3000000000~400000000]
    Expanding...
    #


This time expansion appears to work: 0xaf00000000 = 700GiB.

However, the size in the block device label has not changed:

    # ceph-bluestore-tool show-label --dev /dev/vg0/osd2
    {
        "/dev/vg0/osd2": {
            "osd_uuid": "a18ff7f7-0de1-4791-ba4b-f3b6d2221f44",
            "size": 429496729600,

429496729600 = 400GiB
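
If only the label went stale, it could presumably be rewritten by hand with
set-label-key (an untested sketch - the subcommand may not be present in every
ceph-bluestore-tool build, the OSD should be stopped first, and
751619276800 = 700GiB):

    # systemctl stop ceph-osd@2
    # ceph-bluestore-tool set-label-key --dev /dev/vg0/osd2 -k size -v 751619276800
    # ceph-bluestore-tool show-label --dev /dev/vg0/osd2 | grep size
    # systemctl start ceph-osd@2

That would only touch the label, though, and may not fix the accounting below.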

Worse, ceph osd df shows the added space as used, not available:

# ceph osd df
ID CLASS WEIGHT  REWEIGHT SIZE    USE     AVAIL  %USE  VAR  PGS
   0   hdd 0.22739  1.00000  233GiB 8.06GiB 225GiB  3.46 0.13 128
   1   hdd 0.22739  1.00000  233GiB 8.06GiB 225GiB  3.46 0.13 128
   2   hdd 0.22739  1.00000  700GiB  301GiB 399GiB 43.02 1.58  64
                      TOTAL 1.14TiB  317GiB 849GiB 27.21
MIN/MAX VAR: 0.13/1.58  STDDEV: 21.43

If I expand osd.2 by another 100G, the added space also goes to the
"USE" column:

node2:~# ceph osd df
ID CLASS WEIGHT  REWEIGHT SIZE    USE     AVAIL  %USE  VAR  PGS
   0   hdd 0.22739  1.00000  233GiB 8.05GiB 225GiB  3.46 0.10 128
   1   hdd 0.22739  1.00000  233GiB 8.05GiB 225GiB  3.46 0.10 128
   3   hdd 0.22739        0      0B      0B     0B     0    0   0
   2   hdd 0.22739  1.00000  800GiB  408GiB 392GiB 51.01 1.52 128
                      TOTAL 1.24TiB  424GiB 842GiB 33.51
MIN/MAX VAR: 0.10/1.52  STDDEV: 26.54
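
For what it's worth, the numbers the OSD itself reports can be cross-checked
against what ceph osd df prints (a sketch; I haven't dug into where exactly
ceph osd df takes its figures from):

node2:~# ceph osd metadata 2 | grep -i size
node2:~# ceph pg dump osds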

So OSD expansion almost works, but not quite.  If you had better luck
with bluefs-bdev-expand, could you please share your story?


-- Yury
_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


