Re: `ceph-bluestore-tool bluefs-bdev-expand` corrupts OSDs

Sorry for the late reply,

Here's what I did this time around. osd.0 and osd.1 should be identical, except that osd.0 was recreated (it's the first one that failed), while osd.1 is the one I'm trying to expand from its original size.

# ceph-bluestore-tool show-label --path /var/lib/ceph/osd/ceph-0 | grep size
        "size": 4000780910592,
# ceph-bluestore-tool show-label --path /var/lib/ceph/osd/ceph-1 | grep size
        "size": 3957831237632,
# blockdev --getsize64 /var/lib/ceph/osd/ceph-0/block
4000780910592
# blockdev --getsize64 /var/lib/ceph/osd/ceph-1/block
4000780910592

As you can see, the osd.1 block device has already been resized (though its label still shows the original size).

# ceph-bluestore-tool bluefs-bdev-sizes --path /var/lib/ceph/osd/ceph-0
inferring bluefs devices from bluestore path
 slot 1 /var/lib/ceph/osd/ceph-0/block
1 : size 0x3a381200000 : own 0x[1bf1f400000~2542a00000]
# ceph-bluestore-tool bluefs-bdev-sizes --path /var/lib/ceph/osd/ceph-1
inferring bluefs devices from bluestore path
 slot 1 /var/lib/ceph/osd/ceph-1/block
1 : size 0x3a381200000 : own 0x[1ba52700000~24dc400000]
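
That hex size is exactly the new device size reported by blockdev above (just shell arithmetic, not a Ceph command):

# echo $((0x3a381200000))
4000780910592

So both OSDs already see the expanded device; only the extents bluefs owns differ.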

# ceph-bluestore-tool bluefs-bdev-expand --path /var/lib/ceph/osd/ceph-1
inferring bluefs devices from bluestore path
 slot 1 /var/lib/ceph/osd/ceph-1/block
start:
1 : size 0x3a381200000 : own 0x[1ba52700000~24dc400000]
expanding dev 1 from 0x1df2eb00000 to 0x3a381200000
Can't find device path for dev 1

Unfortunately I forgot to run this with debugging enabled.

It seems the command didn't touch the first 8K (the label), so unfortunately I cannot undo it that way; I guess this information is stored elsewhere.
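
For reference, the first-8K backup Igor suggested would look something like this (the output path is just an example, and the same would be done for the db/wal devices if present):

# dd if=/var/lib/ceph/osd/ceph-1/block of=/root/osd1-block-first8k.bak bs=4096 count=2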

I did notice that the size label was not updated, so I updated it manually:

# ceph-bluestore-tool set-label-key --dev /var/lib/ceph/osd/ceph-1/block --key size --value 4000780910592

# ceph-bluestore-tool show-label --path /var/lib/ceph/osd/ceph-1 | grep size
        "size": 4000780910592,

This is what bluefs-bdev-sizes says:

# ceph-bluestore-tool bluefs-bdev-sizes --path /var/lib/ceph/osd/ceph-1
inferring bluefs devices from bluestore path
 slot 1 /var/lib/ceph/osd/ceph-1/block
1 : size 0x3a381200000 : own 0x[1ba52700000~1e92eb00000]

fsck reported "fsck success". The log is massive; I can host it somewhere if needed.

Starting the OSD fails with:

# ceph-osd --id 1
2019-01-11 18:51:00.745 7f06a72c62c0 -1 Public network was set, but cluster network was not set
2019-01-11 18:51:00.745 7f06a72c62c0 -1 Using public network also for cluster network
starting osd.1 at - osd_data /var/lib/ceph/osd/ceph-1 /var/lib/ceph/osd/ceph-1/journal
2019-01-11 18:51:08.902 7f06a72c62c0 -1 bluestore(/var/lib/ceph/osd/ceph-1) _reconcile_bluefs_freespace bluefs extra 0x[1df2eb00000~1c452700000]
2019-01-11 18:51:09.301 7f06a72c62c0 -1 osd.1 0 OSD:init: unable to mount object store
2019-01-11 18:51:09.301 7f06a72c62c0 -1 ** ERROR: osd init failed: (5) Input/output error

That "bluefs extra" line seems to be the issue. From a full log:

2019-01-11 18:56:00.135 7fb74a8272c0 10 bluestore(/var/lib/ceph/osd/ceph-1) _reconcile_bluefs_freespace
2019-01-11 18:56:00.135 7fb74a8272c0 10 bluefs get_block_extents bdev 1
2019-01-11 18:56:00.135 7fb74a8272c0 10 bluestore(/var/lib/ceph/osd/ceph-1) _reconcile_bluefs_freespace bluefs says 0x[1ba52700000~1e92eb00000]
2019-01-11 18:56:00.135 7fb74a8272c0 10 bluestore(/var/lib/ceph/osd/ceph-1) _reconcile_bluefs_freespace super says 0x[1ba52700000~24dc400000]
2019-01-11 18:56:00.135 7fb74a8272c0 -1 bluestore(/var/lib/ceph/osd/ceph-1) _reconcile_bluefs_freespace bluefs extra 0x[1df2eb00000~1c452700000]
2019-01-11 18:56:00.135 7fb74a8272c0 10 bluestore(/var/lib/ceph/osd/ceph-1) _flush_cache

And that is where the -EIO is coming from:
https://github.com/ceph/ceph/blob/master/src/os/bluestore/BlueStore.cc#L5305

So I guess there is an inconsistency here between what bluefs thinks it owns and what the bluestore superblock records?
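
For what it's worth, the extents in those lines add up (plain hex arithmetic, not a Ceph command):

# printf '0x%x\n' $((0x1ba52700000 + 0x24dc400000))
0x1df2eb00000
# printf '0x%x\n' $((0x1df2eb00000 + 0x1c452700000))
0x3a381200000

The extent the superblock knows about ends at 0x1df2eb00000 (exactly where bluefs-bdev-expand said it was expanding from), and the "bluefs extra" region runs from there to 0x3a381200000, the new end of the device. So bluefs now claims space that the superblock has no record of.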

On 27/12/2018 20:46, Igor Fedotov wrote:
Hector,

One more thing to mention: after expansion, please run fsck using
ceph-bluestore-tool prior to running the OSD daemon, and collect another
log using the CEPH_ARGS variable.


Thanks,

Igor

On 12/27/2018 2:41 PM, Igor Fedotov wrote:
Hi Hector,

I've never tried bluefs-bdev-expand over encrypted volumes, but it
works absolutely fine for me in other cases.

So it would be nice to troubleshoot this a bit.

I suggest doing the following:

1) Back up the first 8K of all osd.1 devices (block, db and wal) using dd.
This will probably allow you to recover from a failed expansion and
repeat it multiple times.

2) Collect the current volume sizes with the bluefs-bdev-sizes command
and the actual device sizes using 'lsblk --bytes'.

3) Do the LVM volume expansion and then collect the device sizes with
'lsblk --bytes' once again.

4) Invoke bluefs-bdev-expand for osd.1 with
CEPH_ARGS="--debug-bluestore 20 --debug-bluefs 20 --log-file
bluefs-bdev-expand.log" (a rough sketch of all four steps follows below).
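
Roughly, something like this (the backup location, LV path and size are
only examples to be adapted to your layout, and the dd would be repeated
for the db and wal devices if present):

# dd if=/var/lib/ceph/osd/ceph-1/block of=/root/osd1-block-first8k.bak bs=4096 count=2
# ceph-bluestore-tool bluefs-bdev-sizes --path /var/lib/ceph/osd/ceph-1
# lsblk --bytes
# lvextend -L +40G /dev/<vg>/<osd1-lv>
# lsblk --bytes
# CEPH_ARGS="--debug-bluestore 20 --debug-bluefs 20 --log-file bluefs-bdev-expand.log" \
    ceph-bluestore-tool bluefs-bdev-expand --path /var/lib/ceph/osd/ceph-1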

Perhaps it makes sense to open a ticket in the Ceph bug tracker to proceed...


Thanks,

Igor




On 12/27/2018 12:19 PM, Hector Martin wrote:
Hi list,

I'm slightly expanding the underlying LV for two OSDs and figured I
could use ceph-bluestore-tool to avoid having to re-create them from
scratch.

I first shut down the OSD, expanded the LV, and then ran:
ceph-bluestore-tool bluefs-bdev-expand --path /var/lib/ceph/osd/ceph-0

I forgot I was using encryption, so the dm-crypt mapping on top of the
LV stayed the same size when I resized the underlying LV. I was
surprised by the output of ceph-bluestore-tool, which suggested the size
had changed by a significant amount (I was only changing the LV size by
a few percent). I then checked the underlying `block` device and
realized its size had not changed, so the command should have been a
no-op. I then tried to restart the OSD, and it failed with an I/O
error. I ended up re-creating that OSD and letting it recover.
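
With dm-crypt in the mix, the mapping on top of the LV has to be grown
separately after the LV itself, something along these lines (the LV path
and mapping name are just examples, and cryptsetup may ask for the key
depending on the setup):

# lvextend -L +40G /dev/<vg>/<osd0-lv>
# cryptsetup resize <osd0-dmcrypt-mapping>
# blockdev --getsize64 /var/lib/ceph/osd/ceph-0/block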

I have another OSD (osd.1) in the original state where I could run
this test again if needed. Unfortunately I don't have the output of
the first test any more.

Is `ceph-bluestore-tool bluefs-bdev-expand` supposed to work? I get
the feeling it gets the size wrong and corrupts OSDs by expanding them
too much. If this is indeed supposed to work, I would be happy to test
it again with osd.1 and see if I can get it fixed. Otherwise I'll just
re-create it and move on.

# ceph --version
ceph version 13.2.1 (5533ecdc0fda920179d7ad84e0aa65a127b20d77) mimic
(stable)



--
Hector Martin (hector@xxxxxxxxxxxxxx)
Public Key: https://marcan.st/marcan.asc
_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


