Re: bluefs-bdev-expand experience

On Fri, Apr 05, 2019 at 02:42:53PM +0300, Igor Fedotov wrote:
> wrt Round 1 - the ability to expand the block (main) device has been
> added to Nautilus,
>
> see: https://github.com/ceph/ceph/pull/25308

Oh, that's good.  But keeping separate WAL and DB volumes may still be
useful for studying the load on each volume (with blktrace) or for
moving the DB/WAL to another physical disk via LVM, transparently to
Ceph.
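
For instance, to trace the i/o on the DB volume alone:

    blktrace -d /dev/vg0/osd2db -o - | blkparse -i -

and to move just the DB LV to another physical disk (a sketch only:
/dev/sdNEW1 and /dev/sdOLD1 are placeholder PV names, I have not run
this on our cluster):

    pvcreate /dev/sdNEW1
    vgextend vg0 /dev/sdNEW1
    pvmove -n osd2db /dev/sdOLD1 /dev/sdNEW1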

> wrt Round 2:
>
> - not setting the 'size' label looks like a bug, although I recall I fixed
> it... Will double-check.
>
> - the wrong stats output is probably related to the lack of a monitor
> restart - could you please try that and report back whether it helps? Or
> even restart the whole cluster... (I understand that's a bad approach for
> production, but just to verify my hypothesis)

Mon restart didn't help:

node0:~# systemctl restart ceph-mon@0
node1:~# systemctl restart ceph-mon@1
node2:~# systemctl restart ceph-mon@2
node2:~# ceph osd df
ID CLASS WEIGHT  REWEIGHT SIZE    USE     AVAIL  %USE  VAR  PGS
 0   hdd 0.22739  1.00000  233GiB 9.44GiB 223GiB  4.06 0.12 128
 1   hdd 0.22739  1.00000  233GiB 9.44GiB 223GiB  4.06 0.12 128
 3   hdd 0.22739        0      0B      0B     0B     0    0   0
 2   hdd 0.22739  1.00000  800GiB  409GiB 391GiB 51.18 1.51 128
                    TOTAL 1.24TiB  428GiB 837GiB 33.84
MIN/MAX VAR: 0.12/1.51  STDDEV: 26.30

Restarting the mgrs and then all Ceph daemons on all three nodes didn't
help either:

node2:~# ceph osd df
ID CLASS WEIGHT  REWEIGHT SIZE    USE     AVAIL  %USE  VAR  PGS
 0   hdd 0.22739  1.00000  233GiB 9.43GiB 223GiB  4.05 0.12 128
 1   hdd 0.22739  1.00000  233GiB 9.43GiB 223GiB  4.05 0.12 128
 3   hdd 0.22739        0      0B      0B     0B     0    0   0
 2   hdd 0.22739  1.00000  800GiB  409GiB 391GiB 51.18 1.51 128
                    TOTAL 1.24TiB  428GiB 837GiB 33.84
MIN/MAX VAR: 0.12/1.51  STDDEV: 26.30
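
If it would help, I can also cross-check what the OSD itself reports;
something along these lines (the bluestore-tool command with the osd
stopped, the admin socket query with it running; output not captured
here):

node2:~# ceph-bluestore-tool bluefs-bdev-sizes --path /var/lib/ceph/osd/ceph-2
node2:~# ceph daemon osd.2 perf dump bluefs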

Maybe we should upgrade to v14.2.0 Nautilus instead of studying old
bugs... after all, this is a toy cluster for now.

Thank you for responding,


-- Yury

> On 4/5/2019 2:06 PM, Yury Shevchuk wrote:
> > Hello all!
> >
> > We have a toy 3-node Ceph cluster running Luminous 12.2.11 with one
> > bluestore osd per node.  We started with pretty small OSDs and would
> > like to be able to expand OSDs whenever needed.  We ran into two issues
> > with the expansion: one turned out to be user-serviceable, while the
> > other probably needs a developer's look.  I will describe both below.
> >
> > Round 1
> > ~~~~~~~
> > Trying to expand osd.2 by 1TB:
> >
> >    # lvextend -L+1T /dev/vg0/osd2
> >      Size of logical volume vg0/osd2 changed from 232.88 GiB (59618 extents) to 1.23 TiB (321762 extents).
> >      Logical volume vg0/osd2 successfully resized.
> >
> >    # ceph-bluestore-tool bluefs-bdev-expand --path /var/lib/ceph/osd/ceph-2
> >    inferring bluefs devices from bluestore path
> >     slot 1 /var/lib/ceph/osd/ceph-2//block
> >    1 : size 0x13a38800000 : own 0x[1bf2200000~254300000]
> >    Expanding...
> >    1 : can't be expanded. Bypassing...
> >    #
> >
> > It didn't work.  The explanation can be found in
> > ceph/src/os/bluestore/BlueFS.cc at line 310:
> >
> >    // returns true if specified device is under full bluefs control
> >    // and hence can be expanded
> >    bool BlueFS::is_device_expandable(unsigned id)
> >    {
> >      if (id >= MAX_BDEV || bdev[id] == nullptr) {
> >        return false;
> >      }
> >      switch(id) {
> >      case BDEV_WAL:
> >        return true;
> >
> >      case BDEV_DB:
> >        // true if DB volume is non-shared
> >        return bdev[BDEV_SLOW] != nullptr;
> >      }
> >      return false;
> >    }
> >
> > So we have to use a separate block.db and block.wal for the OSD to be
> > expandable.  Indeed, our OSDs were created without separate block.db
> > and block.wal, like this:
> >
> >    ceph-volume lvm create --bluestore --data /dev/vg0/osd2
> >
> > Recreating osd.2 with separate block.db and block.wal:
> >
> >    # ceph-volume lvm zap --destroy --osd-id 2
> >    # lvcreate -L1G -n osd2wal vg0
> >      Logical volume "osd2wal" created.
> >    # lvcreate -L40G -n osd2db vg0
> >      Logical volume "osd2db" created.
> >    # lvcreate -L400G -n osd2 vg0
> >      Logical volume "osd2" created.
> >    # ceph-volume lvm create --osd-id 2 --bluestore --data vg0/osd2 --block.db vg0/osd2db --block.wal vg0/osd2wal
> >
> > Resync takes some time, and then we have expandable osd.2.
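> >
> > The new layout can be double-checked before relying on it (output
> > omitted here):
> >
> >    # ceph-volume lvm list
> >    # ceph-bluestore-tool show-label --dev /dev/vg0/osd2db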
> >
> >
> > Round 2
> > ~~~~~~~
> > Trying to expand osd.2 from 400G to 700G:
> >
> >    # lvextend -L+300G /dev/vg0/osd2
> >      Size of logical volume vg0/osd2 changed from 400.00 GiB (102400 extents) to 700.00 GiB (179200 extents).
> >      Logical volume vg0/osd2 successfully resized.
> >
> >    # ceph-bluestore-tool bluefs-bdev-expand --path /var/lib/ceph/osd/ceph-2/
> >    inferring bluefs devices from bluestore path
> >     slot 0 /var/lib/ceph/osd/ceph-2//block.wal
> >     slot 1 /var/lib/ceph/osd/ceph-2//block.db
> >     slot 2 /var/lib/ceph/osd/ceph-2//block
> >    0 : size 0x40000000 : own 0x[1000~3ffff000]
> >    1 : size 0xa00000000 : own 0x[2000~9ffffe000]
> >    2 : size 0xaf00000000 : own 0x[3000000000~400000000]
> >    Expanding...
> >    #
> >
> >
> > This time expansion appears to work: 0xaf00000000 = 700GiB.
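> >
> > BlueFS's own view of the device sizes can also be printed, with the
> > osd stopped (I did not save that output here):
> >
> >    # ceph-bluestore-tool bluefs-bdev-sizes --path /var/lib/ceph/osd/ceph-2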
> >
> > However, the size in the block device label has not changed:
> >
> >    # ceph-bluestore-tool show-label --dev /dev/vg0/osd2
> >    {
> >        "/dev/vg0/osd2": {
> >            "osd_uuid": "a18ff7f7-0de1-4791-ba4b-f3b6d2221f44",
> >            "size": 429496729600,
> >
> > 429496729600 = 400GiB
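> >
> > If it comes to a manual workaround, I suppose the label could be
> > rewritten by hand, assuming set-label-key is available in this
> > version (untested):
> >
> >    # ceph-bluestore-tool set-label-key --dev /dev/vg0/osd2 -k size -v 751619276800
> >
> > where 751619276800 = 700GiB.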
> >
> > Worse, ceph osd df shows the added space as used, not available:
> >
> > # ceph osd df
> > ID CLASS WEIGHT  REWEIGHT SIZE    USE     AVAIL  %USE  VAR  PGS
> >   0   hdd 0.22739  1.00000  233GiB 8.06GiB 225GiB  3.46 0.13 128
> >   1   hdd 0.22739  1.00000  233GiB 8.06GiB 225GiB  3.46 0.13 128
> >   2   hdd 0.22739  1.00000  700GiB  301GiB 399GiB 43.02 1.58  64
> >                      TOTAL 1.14TiB  317GiB 849GiB 27.21
> > MIN/MAX VAR: 0.13/1.58  STDDEV: 21.43
> >
> > If I expand osd.2 by another 100G, the added space also goes to the
> > "USE" column:
> >
> > node2:~# ceph osd df
> > ID CLASS WEIGHT  REWEIGHT SIZE    USE     AVAIL  %USE  VAR  PGS
> >   0   hdd 0.22739  1.00000  233GiB 8.05GiB 225GiB  3.46 0.10 128
> >   1   hdd 0.22739  1.00000  233GiB 8.05GiB 225GiB  3.46 0.10 128
> >   3   hdd 0.22739        0      0B      0B     0B     0    0   0
> >   2   hdd 0.22739  1.00000  800GiB  408GiB 392GiB 51.01 1.52 128
> >                      TOTAL 1.24TiB  424GiB 842GiB 33.51
> > MIN/MAX VAR: 0.10/1.52  STDDEV: 26.54
> >
> > So OSD expansion almost works, but not quite.  If you have had better
> > luck with bluefs-bdev-expand, could you please share your story?
> >
> >
> > -- Yury
_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com