Re: bluefs-bdev-expand experience

Hi Igor!

I have upgraded from Luminous to Nautilus, and slow device expansion
now works.  The steps are shown below to round off the topic.

node2# ceph osd df
ID CLASS WEIGHT  REWEIGHT SIZE    RAW USE DATA    OMAP    META    AVAIL   %USE  VAR  PGS STATUS
 0   hdd 0.22739  1.00000 233 GiB  91 GiB  90 GiB 208 MiB 816 MiB 142 GiB 38.92 1.04 128     up
 1   hdd 0.22739  1.00000 233 GiB  91 GiB  90 GiB 200 MiB 824 MiB 142 GiB 38.92 1.04 128     up
 3   hdd 0.22739        0     0 B     0 B     0 B     0 B     0 B     0 B     0    0   0   down
 2   hdd 0.22739  1.00000 481 GiB 172 GiB  90 GiB 201 MiB 823 MiB 309 GiB 35.70 0.96 128     up
                    TOTAL 947 GiB 353 GiB 269 GiB 610 MiB 2.4 GiB 594 GiB 37.28
MIN/MAX VAR: 0.96/1.04  STDDEV: 1.62

node2# lvextend -L+50G /dev/vg0/osd2
  Size of logical volume vg0/osd2 changed from 400.00 GiB (102400 extents) to 450.00 GiB (115200 extents).
  Logical volume vg0/osd2 successfully resized.

node2# ceph-bluestore-tool bluefs-bdev-expand --path /var/lib/ceph/osd/ceph-2/
inferring bluefs devices from bluestore path
2019-04-11 22:28:00.240 7f2e24e190c0 -1 bluestore(/var/lib/ceph/osd/ceph-2) _lock_fsid failed to lock /var/lib/ceph/osd/ceph-2/fsid (is another ceph-osd still running?)(11) Resource temporarily unavailable
...
*** Caught signal (Aborted) **
[two pages of stack dump stripped]

My mistake in the first place: I again tried to expand an OSD that was
still running.
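
The ordering constraint behind that failure can be written down as a small dry-run plan. This is only a sketch: `expansion_plan` is a hypothetical helper, it prints commands rather than running them, and it assumes the per-OSD unit name `ceph-osd@<id>` and the paths used elsewhere in this thread.

```python
def expansion_plan(osd_id, lv_path, grow_by):
    """Return the shell commands, in order, for growing one BlueStore volume.

    Hypothetical helper for illustration; nothing is executed here.
    """
    osd_path = f"/var/lib/ceph/osd/ceph-{osd_id}/"
    return [
        # Grow the logical volume first; LVM can do this with the OSD online.
        f"lvextend -L+{grow_by} {lv_path}",
        # The tool needs the fsid lock, so the OSD must be stopped beforehand.
        f"systemctl stop ceph-osd@{osd_id}",
        f"ceph-bluestore-tool bluefs-bdev-expand --path {osd_path}",
        # Confirm that the size label was rewritten.
        f"ceph-bluestore-tool show-label --dev {lv_path}",
        f"systemctl start ceph-osd@{osd_id}",
    ]


for command in expansion_plan(2, "/dev/vg0/osd2", "50G"):
    print(command)
```

The point of the list form is only that the stop precedes the expand; the commands themselves are the ones shown in this thread.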

node2# systemctl stop ceph-osd.target

node2# ceph-bluestore-tool bluefs-bdev-expand --path /var/lib/ceph/osd/ceph-2/
inferring bluefs devices from bluestore path
0 : device size 0x40000000 : own 0x[1000~3ffff000] = 0x3ffff000 : using 0x8ff000
1 : device size 0x1440000000 : own 0x[2000~143fffe000] = 0x143fffe000 : using 0x24dfe000
2 : device size 0x7080000000 : own 0x[3000000000~400000000] = 0x400000000 : using 0x0
Expanding...
2 : expanding  from 0x6400000000 to 0x7080000000
2 : size label updated to 483183820800

node2# ceph-bluestore-tool show-label --dev /dev/vg0/osd2 | grep size
        "size": 483183820800,
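
The hex figures in the tool output are easier to check once converted to GiB. The sketch below parses the three report lines quoted above and compares the new size with the label. Two things are assumptions on my part: the regular expression is written against this particular output only, and `own 0x[a~b]` is read as start offset `a` and length `b`, which agrees with the `= 0x...` totals the tool prints.

```python
import re

GIB = 1024 ** 3

# Report lines copied from the bluefs-bdev-expand output above.
REPORT = """\
0 : device size 0x40000000 : own 0x[1000~3ffff000] = 0x3ffff000 : using 0x8ff000
1 : device size 0x1440000000 : own 0x[2000~143fffe000] = 0x143fffe000 : using 0x24dfe000
2 : device size 0x7080000000 : own 0x[3000000000~400000000] = 0x400000000 : using 0x0
"""

LINE = re.compile(
    r"^(?P<slot>\d+) : device size 0x(?P<size>[0-9a-f]+)"
    r" : own 0x\[(?P<off>[0-9a-f]+)~(?P<len>[0-9a-f]+)\]"
)


def parse_report(text):
    """Map slot number to device size and BlueFS-owned extent, in bytes."""
    devices = {}
    for line in text.splitlines():
        m = LINE.match(line)
        if m:
            devices[int(m["slot"])] = {
                "size": int(m["size"], 16),
                "own_offset": int(m["off"], 16),
                "own_length": int(m["len"], 16),
            }
    return devices


devices = parse_report(REPORT)
for slot, dev in sorted(devices.items()):
    print(slot, dev["size"] / GIB, "GiB; BlueFS owns",
          round(dev["own_length"] / GIB, 2), "GiB")

# "expanding from 0x6400000000 to 0x7080000000" and the label value.
old_size, new_size, label = 0x6400000000, 0x7080000000, 483183820800
print(old_size // GIB, "->", new_size // GIB, "GiB; label matches:",
      label == new_size)
```

So the expansion went from 400 GiB to 450 GiB, and the label value 483183820800 is exactly 450 GiB, matching the lvextend above.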

node2# ceph osd df
ID CLASS WEIGHT  REWEIGHT SIZE    RAW USE DATA    OMAP    META    AVAIL   %USE  VAR  PGS STATUS
 0   hdd 0.22739  1.00000 233 GiB  91 GiB  90 GiB 208 MiB 816 MiB 142 GiB 38.92 1.10 128     up
 1   hdd 0.22739  1.00000 233 GiB  91 GiB  90 GiB 200 MiB 824 MiB 142 GiB 38.92 1.10 128     up
 3   hdd 0.22739        0     0 B     0 B     0 B     0 B     0 B     0 B     0    0   0   down
 2   hdd 0.22739  1.00000 531 GiB 172 GiB  90 GiB 185 MiB 839 MiB 359 GiB 32.33 0.91 128     up
                    TOTAL 997 GiB 353 GiB 269 GiB 593 MiB 2.4 GiB 644 GiB 35.41
MIN/MAX VAR: 0.91/1.10  STDDEV: 3.37

It worked: AVAIL = 594+50 = 644.  Great!
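
The same check can be made for every affected column, not just the total AVAIL. A minimal sketch, with the values copied by hand from the two `ceph osd df` listings above (GiB as displayed, so already rounded):

```python
# Values from the osd.2 row and the TOTAL row, before and after the expansion.
before = {"osd2_size": 481, "osd2_avail": 309, "total_size": 947, "total_avail": 594}
after = {"osd2_size": 531, "osd2_avail": 359, "total_size": 997, "total_avail": 644}
added_gib = 50  # lvextend -L+50G

deltas = {name: after[name] - before[name] for name in before}
for name, delta in deltas.items():
    print(f"{name}: +{delta} GiB")

print("all columns grew by the added space:",
      all(delta == added_gib for delta in deltas.values()))
```

RAW USE for osd.2 stays at 172 GiB in both listings, so the new space landed in AVAIL rather than in the used column as it did on Luminous.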
Thanks a lot for your help.

And one more question regarding your last remark is inline below.

On Wed, Apr 10, 2019 at 09:54:35PM +0300, Igor Fedotov wrote:
>
> On 4/9/2019 1:59 PM, Yury Shevchuk wrote:
> > Igor, thank you, Round 2 is explained now.
> >
> > Main aka block aka slow device cannot be expanded in Luminous, this
> > functionality will be available after upgrade to Nautilus.
> > Wal and db devices can be expanded in Luminous.
> >
> > Now I have recreated osd2 once again to get rid of the paradoxical
> > ceph osd df output and tried to test db expansion, 40G -> 60G:
> >
> > node2:/# ceph-volume lvm zap --destroy --osd-id 2
> > node2:/# ceph osd lost 2 --yes-i-really-mean-it
> > node2:/# ceph osd destroy 2 --yes-i-really-mean-it
> > node2:/# lvcreate -L1G -n osd2wal vg0
> > node2:/# lvcreate -L40G -n osd2db vg0
> > node2:/# lvcreate -L400G -n osd2 vg0
> > node2:/# ceph-volume lvm create --osd-id 2 --bluestore --data vg0/osd2 --block.db vg0/osd2db --block.wal vg0/osd2wal
> >
> > node2:/# ceph osd df
> > ID CLASS WEIGHT  REWEIGHT SIZE   USE     AVAIL  %USE VAR  PGS
> >   0   hdd 0.22739  1.00000 233GiB 9.49GiB 223GiB 4.08 1.24 128
> >   1   hdd 0.22739  1.00000 233GiB 9.49GiB 223GiB 4.08 1.24 128
> >   3   hdd 0.22739        0     0B      0B     0B    0    0   0
> >   2   hdd 0.22739  1.00000 400GiB 9.49GiB 391GiB 2.37 0.72 128
> >                      TOTAL 866GiB 28.5GiB 837GiB 3.29
> > MIN/MAX VAR: 0.72/1.24  STDDEV: 0.83
> >
> > node2:/# lvextend -L+20G /dev/vg0/osd2db
> >    Size of logical volume vg0/osd2db changed from 40.00 GiB (10240 extents) to 60.00 GiB (15360 extents).
> >    Logical volume vg0/osd2db successfully resized.
> >
> > node2:/# ceph-bluestore-tool bluefs-bdev-expand --path /var/lib/ceph/osd/ceph-2/
> > inferring bluefs devices from bluestore path
> >   slot 0 /var/lib/ceph/osd/ceph-2//block.wal
> >   slot 1 /var/lib/ceph/osd/ceph-2//block.db
> >   slot 2 /var/lib/ceph/osd/ceph-2//block
> > 0 : size 0x40000000 : own 0x[1000~3ffff000]
> > 1 : size 0xf00000000 : own 0x[2000~9ffffe000]
> > 2 : size 0x6400000000 : own 0x[3000000000~400000000]
> > Expanding...
> > 1 : expanding  from 0xa00000000 to 0xf00000000
> > 1 : size label updated to 64424509440
> >
> > node2:/# ceph-bluestore-tool show-label --dev /dev/vg0/osd2db | grep size
> >          "size": 64424509440,
> >
> > The label updated correctly, but ceph osd df has not changed.
> > I expected to see 391 + 20 = 411GiB in AVAIL column, but it stays at 391:
> >
> > node2:/# ceph osd df
> > ID CLASS WEIGHT  REWEIGHT SIZE   USE     AVAIL  %USE VAR  PGS
> >   0   hdd 0.22739  1.00000 233GiB 9.50GiB 223GiB 4.08 1.24 128
> >   1   hdd 0.22739  1.00000 233GiB 9.50GiB 223GiB 4.08 1.24 128
> >   3   hdd 0.22739        0     0B      0B     0B    0    0   0
> >   2   hdd 0.22739  1.00000 400GiB 9.49GiB 391GiB 2.37 0.72 128
> >                      TOTAL 866GiB 28.5GiB 837GiB 3.29
> > MIN/MAX VAR: 0.72/1.24  STDDEV: 0.83
> >
> > I have restarted monitors on all three nodes, 391GiB stays intact.
> >
> > OK, but I used bluefs-bdev-expand on a running OSD... probably not good,
> > it seems to work by opening bluefs directly... trying once again:
> >
> > node2:/# systemctl stop ceph-osd@2
> >
> > node2:/# lvextend -L+20G /dev/vg0/osd2db
> >    Size of logical volume vg0/osd2db changed from 60.00 GiB (15360 extents) to 80.00 GiB (20480 extents).
> >    Logical volume vg0/osd2db successfully resized.
> >
> > node2:/# ceph-bluestore-tool bluefs-bdev-expand --path /var/lib/ceph/osd/ceph-2/
> > inferring bluefs devices from bluestore path
> >   slot 0 /var/lib/ceph/osd/ceph-2//block.wal
> >   slot 1 /var/lib/ceph/osd/ceph-2//block.db
> >   slot 2 /var/lib/ceph/osd/ceph-2//block
> > 0 : size 0x40000000 : own 0x[1000~3ffff000]
> > 1 : size 0x1400000000 : own 0x[2000~9ffffe000]
> > 2 : size 0x6400000000 : own 0x[3000000000~400000000]
> > Expanding...
> > 1 : expanding  from 0xa00000000 to 0x1400000000
> > 1 : size label updated to 85899345920
> >
> > node2:/# systemctl start ceph-osd@2
> > node2:/# systemctl restart ceph-mon@pier42
> >
> > node2:/# ceph osd df
> > ID CLASS WEIGHT  REWEIGHT SIZE   USE     AVAIL  %USE VAR  PGS
> >   0   hdd 0.22739  1.00000 233GiB 9.49GiB 223GiB 4.08 1.24 128
> >   1   hdd 0.22739  1.00000 233GiB 9.50GiB 223GiB 4.08 1.24 128
> >   3   hdd 0.22739        0     0B      0B     0B    0    0   0
> >   2   hdd 0.22739  1.00000 400GiB 9.50GiB 391GiB 2.37 0.72   0
> >                      TOTAL 866GiB 28.5GiB 837GiB 3.29
> > MIN/MAX VAR: 0.72/1.24  STDDEV: 0.83
> >
> > Something is wrong.  Maybe I was wrong to expect the db change to
> > appear in the AVAIL column?  From the BlueStore description I
> > understood that db and slow should sum up, no?
>
> It was a while ago when db and slow were summed to provide total store size.
> In the latest Luminous releases that's not true anymore. Ceph uses only the
> slow device space to report SIZE/AVAIL. There is some adjustment for the
> BlueFS part residing on the slow device, but the DB device is definitely out
> of the calculation here for now.
>
> You can also note that the reported SIZE for osd.2 is 400GiB in your case,
> which is fully in line with the slow device capacity.

You are absolutely right, I could have noticed that myself...

> Hence no DB involved.

I used this (admittedly not very recent) message as a guide for
volume sizing:

  https://www.spinics.net/lists/ceph-devel/msg37804.html

It reads: "1GB for block.wal.  For block.db, as much as you have."

I take it as follows:

1. block.db should be large enough to contain the active set, so the
slow aka block device is not touched often;

2. block.db + block should be large enough to contain all data
destined to this OSD by CRUSH.

That is, setups with block.db > block, or even block = 0 are possible
and useful.  If that is correct, I cannot understand why only block
size is shown in ceph osd df AVAIL, while block.db is shown nowhere but
ceph-bluestore-tool show-label --dev /var/lib/ceph/osd/*/block.db.
But understanding "how it is" is precious too :)
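
For what it is worth, the Nautilus figures at the top of this message can be cross-checked against the device sizes the tool printed. The sketch below is arithmetic on numbers already shown, nothing more. It assumes slot 1 is block.db and slot 2 is block, as in the slot listing of the Luminous output quoted earlier, and it is an observation about these particular numbers, not a statement of how Ceph computes the columns.

```python
GIB = 1024 ** 3

# Device sizes printed by bluefs-bdev-expand
# (assumed mapping: slot 1 = block.db, slot 2 = block).
db_gib = 0x1440000000 // GIB
block_before_gib = 0x6400000000 // GIB
block_after_gib = 0x7080000000 // GIB

# SIZE column for osd.2 in the two Nautilus `ceph osd df` listings.
size_before_gib, size_after_gib = 481, 531

print("block + db before:", block_before_gib + db_gib,
      "reported SIZE:", size_before_gib)
print("block + db after: ", block_after_gib + db_gib,
      "reported SIZE:", size_after_gib)

# RAW USE: osd.2 reports 172 GiB, its peers with the same DATA report 91 GiB.
print("peer RAW USE + db:", 91 + db_gib, "osd.2 RAW USE:", 172)
```

On these numbers, SIZE for osd.2 equals block plus block.db both before and after the expansion, and its RAW USE exceeds that of its peers by exactly the block.db size. That pattern would be consistent with Nautilus counting the DB volume in SIZE and treating it as used, though the thread itself does not confirm it.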

Regards,


-- Yury

PS I have written too much, out of letters :)  Will follow up no more.


> > On Mon, Apr 08, 2019 at 10:17:24PM +0300, Igor Fedotov wrote:
> > > Hi Yuri,
> > >
> > > both issues from Round 2 relate to unsupported expansion for main device.
> > >
> > > In fact it doesn't work and silently bypasses the operation in your case.
> > >
> > > Please try with a different device...
> > >
> > >
> > > Also I've just submitted a PR for mimic to indicate the bypass, will
> > > backport to Luminous once mimic patch is approved.
> > >
> > > See https://github.com/ceph/ceph/pull/27447
> > >
> > >
> > > Thanks,
> > >
> > > Igor
> > >
> > > On 4/5/2019 4:07 PM, Yury Shevchuk wrote:
> > > > On Fri, Apr 05, 2019 at 02:42:53PM +0300, Igor Fedotov wrote:
> > > > > wrt Round 1 - an ability to expand block(main) device has been added to
> > > > > Nautilus,
> > > > >
> > > > > see: https://github.com/ceph/ceph/pull/25308
> > > > Oh, that's good.  But still separate wal&db may be good for studying
> > > > load on each volume (blktrace) or moving db/wal to another physical
> > > > disk by means of LVM transparently to ceph.
> > > >
> > > > > wrt Round 2:
> > > > >
> > > > > - not setting 'size' label looks like a bug although I recall I fixed it...
> > > > > Will double check.
> > > > >
> > > > > - wrong stats output is probably related to the lack of monitor restart -
> > > > > could you please try that and report back if it helps? Or even restart the
> > > > > whole cluster.. (well I understand that's a bad approach for production but
> > > > > just to verify my hypothesis)
> > > > Mon restart didn't help:
> > > >
> > > > node0:~# systemctl restart ceph-mon@0
> > > > node1:~# systemctl restart ceph-mon@1
> > > > node2:~# systemctl restart ceph-mon@2
> > > > node2:~# ceph osd df
> > > > ID CLASS WEIGHT  REWEIGHT SIZE    USE     AVAIL  %USE  VAR  PGS
> > > >    0   hdd 0.22739  1.00000  233GiB 9.44GiB 223GiB  4.06 0.12 128
> > > >    1   hdd 0.22739  1.00000  233GiB 9.44GiB 223GiB  4.06 0.12 128
> > > >    3   hdd 0.22739        0      0B      0B     0B     0    0   0
> > > >    2   hdd 0.22739  1.00000  800GiB  409GiB 391GiB 51.18 1.51 128
> > > >                       TOTAL 1.24TiB  428GiB 837GiB 33.84
> > > > MIN/MAX VAR: 0.12/1.51  STDDEV: 26.30
> > > >
> > > > Restarting mgrs and then all ceph daemons on all three nodes didn't
> > > > help either:
> > > >
> > > > node2:~# ceph osd df
> > > > ID CLASS WEIGHT  REWEIGHT SIZE    USE     AVAIL  %USE  VAR  PGS
> > > >    0   hdd 0.22739  1.00000  233GiB 9.43GiB 223GiB  4.05 0.12 128
> > > >    1   hdd 0.22739  1.00000  233GiB 9.43GiB 223GiB  4.05 0.12 128
> > > >    3   hdd 0.22739        0      0B      0B     0B     0    0   0
> > > >    2   hdd 0.22739  1.00000  800GiB  409GiB 391GiB 51.18 1.51 128
> > > >                       TOTAL 1.24TiB  428GiB 837GiB 33.84
> > > > MIN/MAX VAR: 0.12/1.51  STDDEV: 26.30
> > > >
> > > > Maybe we should upgrade to v14.2.0 Nautilus instead of studying old
> > > > bugs... after all, this is a toy cluster for now.
> > > >
> > > > Thank you for responding,
> > > >
> > > >
> > > > -- Yury
> > > >
> > > > > On 4/5/2019 2:06 PM, Yury Shevchuk wrote:
> > > > > > Hello all!
> > > > > >
> > > > > > We have a toy 3-node Ceph cluster running Luminous 12.2.11 with one
> > > > > > bluestore osd per node.  We started with pretty small OSDs and would
> > > > > > like to be able to expand OSDs whenever needed.  We had two issues
> > > > > > with the expansion: one turned out user-serviceable while the other
> > > > > > probably needs developers' look.  I will describe both shortly.
> > > > > >
> > > > > > Round 1
> > > > > > ~~~~~~~
> > > > > > Trying to expand osd.2 by 1TB:
> > > > > >
> > > > > >      # lvextend -L+1T /dev/vg0/osd2
> > > > > >        Size of logical volume vg0/osd2 changed from 232.88 GiB (59618 extents) to 1.23 TiB (321762 extents).
> > > > > >        Logical volume vg0/osd2 successfully resized.
> > > > > >
> > > > > >      # ceph-bluestore-tool bluefs-bdev-expand --path /var/lib/ceph/osd/ceph-2
> > > > > >      inferring bluefs devices from bluestore path
> > > > > >       slot 1 /var/lib/ceph/osd/ceph-2//block
> > > > > >      1 : size 0x13a38800000 : own 0x[1bf2200000~254300000]
> > > > > >      Expanding...
> > > > > >      1 : can't be expanded. Bypassing...
> > > > > >      #
> > > > > >
> > > > > > It didn't work.  The explanation can be found in
> > > > > > ceph/src/os/bluestore/BlueFS.cc at line 310:
> > > > > >
> > > > > >      // returns true if specified device is under full bluefs control
> > > > > >      // and hence can be expanded
> > > > > >      bool BlueFS::is_device_expandable(unsigned id)
> > > > > >      {
> > > > > >        if (id >= MAX_BDEV || bdev[id] == nullptr) {
> > > > > >          return false;
> > > > > >        }
> > > > > >        switch(id) {
> > > > > >        case BDEV_WAL:
> > > > > >          return true;
> > > > > >
> > > > > >        case BDEV_DB:
> > > > > >          // true if DB volume is non-shared
> > > > > >          return bdev[BDEV_SLOW] != nullptr;
> > > > > >        }
> > > > > >        return false;
> > > > > >      }
> > > > > >
> > > > > > So we have to use separate block.db and block.wal for OSD to be
> > > > > > expandable.  Indeed, our OSDs were created without separate block.db
> > > > > > and block.wal, like this:
> > > > > >
> > > > > >      ceph-volume lvm create --bluestore --data /dev/vg0/osd2
> > > > > >
> > > > > > Recreating osd.2 with separate block.db and block.wal:
> > > > > >
> > > > > >      # ceph-volume lvm zap --destroy --osd-id 2
> > > > > >      # lvcreate -L1G -n osd2wal vg0
> > > > > >        Logical volume "osd2wal" created.
> > > > > >      # lvcreate -L40G -n osd2db vg0
> > > > > >        Logical volume "osd2db" created.
> > > > > >      # lvcreate -L400G -n osd2 vg0
> > > > > >        Logical volume "osd2" created.
> > > > > >      # ceph-volume lvm create --osd-id 2 --bluestore --data vg0/osd2 --block.db vg0/osd2db --block.wal vg0/osd2wal
> > > > > >
> > > > > > Resync takes some time, and then we have expandable osd.2.
> > > > > >
> > > > > >
> > > > > > Round 2
> > > > > > ~~~~~~~
> > > > > > Trying to expand osd.2 from 400G to 700G:
> > > > > >
> > > > > >      # lvextend -L+300G /dev/vg0/osd2
> > > > > >        Size of logical volume vg0/osd2 changed from 400.00 GiB (102400 extents) to 700.00 GiB (179200 extents).
> > > > > >        Logical volume vg0/osd2 successfully resized.
> > > > > >
> > > > > >      # ceph-bluestore-tool bluefs-bdev-expand --path /var/lib/ceph/osd/ceph-2/
> > > > > >      inferring bluefs devices from bluestore path
> > > > > >       slot 0 /var/lib/ceph/osd/ceph-2//block.wal
> > > > > >       slot 1 /var/lib/ceph/osd/ceph-2//block.db
> > > > > >       slot 2 /var/lib/ceph/osd/ceph-2//block
> > > > > >      0 : size 0x40000000 : own 0x[1000~3ffff000]
> > > > > >      1 : size 0xa00000000 : own 0x[2000~9ffffe000]
> > > > > >      2 : size 0xaf00000000 : own 0x[3000000000~400000000]
> > > > > >      Expanding...
> > > > > >      #
> > > > > >
> > > > > >
> > > > > > This time expansion appears to work: 0xaf00000000 = 700GiB.
> > > > > >
> > > > > > However, the size in the block device label has not changed:
> > > > > >
> > > > > >      # ceph-bluestore-tool show-label --dev /dev/vg0/osd2
> > > > > >      {
> > > > > >          "/dev/vg0/osd2": {
> > > > > >              "osd_uuid": "a18ff7f7-0de1-4791-ba4b-f3b6d2221f44",
> > > > > >              "size": 429496729600,
> > > > > >
> > > > > > 429496729600 = 400GiB
> > > > > >
> > > > > > Worse, ceph osd df shows the added space as used, not available:
> > > > > >
> > > > > > # ceph osd df
> > > > > > ID CLASS WEIGHT  REWEIGHT SIZE    USE     AVAIL  %USE  VAR  PGS
> > > > > >     0   hdd 0.22739  1.00000  233GiB 8.06GiB 225GiB  3.46 0.13 128
> > > > > >     1   hdd 0.22739  1.00000  233GiB 8.06GiB 225GiB  3.46 0.13 128
> > > > > >     2   hdd 0.22739  1.00000  700GiB  301GiB 399GiB 43.02 1.58  64
> > > > > >                        TOTAL 1.14TiB  317GiB 849GiB 27.21
> > > > > > MIN/MAX VAR: 0.13/1.58  STDDEV: 21.43
> > > > > >
> > > > > > If I expand osd.2 by another 100G, the space also goes to
> > > > > > "USE" column:
> > > > > >
> > > > > > node2:~# ceph osd df
> > > > > > ID CLASS WEIGHT  REWEIGHT SIZE    USE     AVAIL  %USE  VAR  PGS
> > > > > >     0   hdd 0.22739  1.00000  233GiB 8.05GiB 225GiB  3.46 0.10 128
> > > > > >     1   hdd 0.22739  1.00000  233GiB 8.05GiB 225GiB  3.46 0.10 128
> > > > > >     3   hdd 0.22739        0      0B      0B     0B     0    0   0
> > > > > >     2   hdd 0.22739  1.00000  800GiB  408GiB 392GiB 51.01 1.52 128
> > > > > >                        TOTAL 1.24TiB  424GiB 842GiB 33.51
> > > > > > MIN/MAX VAR: 0.10/1.52  STDDEV: 26.54
> > > > > >
> > > > > > So OSD expansion almost works, but not quite.  If you had better luck
> > > > > > with bluefs-bdev-expand, could you please share your story?
> > > > > >
> > > > > >
> > > > > > -- Yury
> > > > > > _______________________________________________
> > > > > > ceph-users mailing list
> > > > > > ceph-users@xxxxxxxxxxxxxx
> > > > > > http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


