Re: ceph-volume: failed to activate some bluestore osds

On Thu, Jun 7, 2018 at 12:09 PM, Sage Weil <sweil@xxxxxxxxxx> wrote:
> On Thu, 7 Jun 2018, Dan van der Ster wrote:
>> On Thu, Jun 7, 2018 at 5:36 PM Dan van der Ster <dan@xxxxxxxxxxxxxx> wrote:
>> >
>> > On Thu, Jun 7, 2018 at 5:34 PM Sage Weil <sweil@xxxxxxxxxx> wrote:
>> > >
>> > > On Thu, 7 Jun 2018, Dan van der Ster wrote:
>> > > > On Thu, Jun 7, 2018 at 4:41 PM Sage Weil <sweil@xxxxxxxxxx> wrote:
>> > > > >
>> > > > > On Thu, 7 Jun 2018, Dan van der Ster wrote:
>> > > > > > On Thu, Jun 7, 2018 at 4:33 PM Sage Weil <sweil@xxxxxxxxxx> wrote:
>> > > > > > >
>> > > > > > > On Thu, 7 Jun 2018, Dan van der Ster wrote:
>> > > > > > > > Hi all,
>> > > > > > > >
>> > > > > > > > We have an intermittent issue where bluestore osds sometimes fail to
>> > > > > > > > start after a reboot.
>> > > > > > > > The osds all fail the same way [see 2], failing to open the superblock.
>> > > > > > > > On one particular host, there are 24 osds and 4 SSDs partitioned for
>> > > > > > > > the block.db's. The affected non-starting OSDs all have block.db on
>> > > > > > > > the same ssd (/dev/sdaa).
>> > > > > > > >
>> > > > > > > > The osds are all running 12.2.5 on latest centos 7.5 and were created
>> > > > > > > > by ceph-volume lvm, e.g. see [1].
>> > > > > > > >
>> > > > > > > > This seems like a permissions or similar issue related to the
>> > > > > > > > ceph-volume tooling.
>> > > > > > > > Any clues how to debug this further?
>> > > > > > >
>> > > > > > > I take it the OSDs start up if you try again?
>> > > > > >
>> > > > > > Hey.
>> > > > > > No, they don't. For example, we do this `ceph-volume lvm activate 48
>> > > > > > 99fd8e36-fc4d-4bbc-83d9-f5e611cde4b5` several times and it's the same
>> > > > > > mount failure every time.
>> > > > >
>> > > > > That sounds like a bluefs bug then, not a ceph-volume issue.  Can you
>> > > > > try to start the OSD with logging enabled?  (debug bluefs = 20,
>> > > > > debug bluestore = 20)
>> > > > >
>> > > >
>> > > > Here: https://pastebin.com/TJXZhfcY
>> > > >
>> > > > Is it supposed to print something about the block.db at some point????
>> > >
>> > > Can you dump the bluefs superblock for me?
>> > >
>> > > dd if=/dev/sdaa1 of=/tmp/foo bs=4K skip=1 count=1
>> > > hexdump -C /tmp/foo
>> > >
>> >
>> > [17:35][root@p06253939y61826 (qa:ceph/dwight/osd*18) ~]# dd
>> > if=/dev/sdaa1 of=/tmp/foo bs=4K skip=1 count=1
>> > 1+0 records in
>> > 1+0 records out
>> > 4096 bytes (4.1 kB) copied, 0.000320003 s, 12.8 MB/s
>> > [17:35][root@p06253939y61826 (qa:ceph/dwight/osd*18) ~]# hexdump -C /tmp/foo
>> > 00000000  01 01 5d 00 00 00 11 fb  be 4d 43 31 4a b5 a4 cb  |..]......MC1J...|
>> > 00000010  99 be b7 da 72 ca 99 fd  8e 36 fc 4d 4b bc 83 d9  |....r....6.MK...|
>> > 00000020  f5 e6 11 cd e4 b5 1d 00  00 00 00 00 00 00 00 10  |................|
>> > 00000030  00 00 01 01 2b 00 00 00  01 80 80 40 00 00 00 00  |....+......@....|
>> > 00000040  00 00 00 00 00 02 00 00  00 01 01 07 00 00 00 eb  |................|
>> > 00000050  b2 00 00 83 08 01 01 01  07 00 00 00 cb b2 00 00  |................|
>> > 00000060  83 20 01 61 6d 07 be 00  00 00 00 00 00 00 00 00  |. .am...........|
>> > 00000070  00 00 00 00 00 00 00 00  00 00 00 00 00 00 00 00  |................|
>> > *
>> > 00001000
>> >
>> >
>>
>> Wait, we found something!!!
>>
>> In the 1st 4k on the block we found the block.db pointing at the wrong
>> device (/dev/sdc1 instead of /dev/sdaa1)
>>
>> 00000130  6b 35 79 2b 67 3d 3d 0d  00 00 00 70 61 74 68 5f  |k5y+g==....path_|
>> 00000140  62 6c 6f 63 6b 2e 64 62  09 00 00 00 2f 64 65 76  |block.db..../dev|
>> 00000150  2f 73 64 63 31 05 00 00  00 72 65 61 64 79 05 00  |/sdc1....ready..|
>> 00000160  00 00 72 65 61 64 79 06  00 00 00 77 68 6f 61 6d  |..ready....whoam|
>> 00000170  69 02 00 00 00 34 38 eb  c2 d7 d6 00 00 00 00 00  |i....48.........|
>> 00000180  00 00 00 00 00 00 00 00  00 00 00 00 00 00 00 00  |................|
>>
>> It is similarly wrong for another broken osd.53 (block.db is /dev/sdc2
>> instead of /dev/sdaa2).
>> And for the osds that are running, that block.db is correct!
>>
>> So.... the block.db device is persisted in the block header? But after
>> a reboot it gets a new name. (sd* naming is famously chaotic).
>> ceph-volume creates a symlink to the correct db dev, but it seems it is not used?
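For anyone following along: the key/value region in the dump quoted above looks like a sequence of 32-bit little-endian length-prefixed strings. A rough decoding sketch of that region (the layout is inferred from the hexdump, not taken from the authoritative bluestore label code):

```python
import struct

def parse_len_prefixed_strings(buf):
    """Parse consecutive u32-LE length-prefixed strings from buf."""
    out, off = [], 0
    while off + 4 <= len(buf):
        (n,) = struct.unpack_from("<I", buf, off)  # 4-byte little-endian length
        off += 4
        if off + n > len(buf):
            break
        out.append(buf[off:off + n].decode("ascii", "replace"))
        off += n
    return out

# Bytes transcribed from the key/value region of the hexdump above:
sample = (
    b"\x0d\x00\x00\x00path_block.db"
    b"\x09\x00\x00\x00/dev/sdc1"
    b"\x05\x00\x00\x00ready"
    b"\x05\x00\x00\x00ready"
    b"\x06\x00\x00\x00whoami"
    b"\x02\x00\x00\x0048"
)
fields = parse_len_prefixed_strings(sample)
meta = dict(zip(fields[0::2], fields[1::2]))  # alternating key, value
print(meta["path_block.db"])  # -> /dev/sdc1
```

Decoding the dump this way makes the stale `/dev/sdc1` value jump out immediately.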
>
> Aha, yes.. the bluestore startup code looks for the value in the
> superblock before the one in the directory.
>
> We can either (1) reverse that order, (and/)or (2) make ceph-volume use a
> stable path for the device name when creating the bluestore.  And/or (3)
> use ceph-bluestore-tool set-label-key to fix it if it doesn't match (this
> would repair old superblocks... permanently if we use the stable path
> name).
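As an illustration of option (3), a repair could point the persisted block.db path at the stable by-partuuid name. This is only a sketch: the OSD must be stopped first, the uuid and paths below are the osd.48 values from this thread, and the command is printed (dry run) rather than executed so the sketch is safe to run anywhere:

```shell
#!/bin/sh
# Sketch: repair the block.db path persisted in the bluestore label so it
# uses a stable /dev/disk/by-partuuid/ name instead of an unstable /dev/sdX.
OSD_DIR=/var/lib/ceph/osd/ceph-48
DB_PARTUUID=3381a121-1c1b-4e45-a986-c1871c363edc
STABLE_DB="/dev/disk/by-partuuid/$DB_PARTUUID"

# Printed instead of executed; drop the echo to actually apply it.
echo ceph-bluestore-tool set-label-key \
    --dev "$OSD_DIR/block" -k path_block.db -v "$STABLE_DB"
```

After that, a fresh `ceph-bluestore-tool show-label --dev $OSD_DIR/block` should show the by-partuuid path, which survives device renaming across reboots.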

ceph-volume does not require a stable/persistent path name at all. For
partitions we store the partuuid, and *always* make sure that we have
the right device, because we query blkid.

In addition, in the case of bluestore, we go over each device
and ensure that whatever link was created by bluestore is corrected
before attempting to start the OSD [0].
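As a rough illustration of that lookup (a simplified sketch, not the actual ceph-volume implementation): udev maintains `/dev/disk/by-partuuid/` symlinks, so a stored partuuid can always be resolved to whatever `/dev/sdX` name the kernel assigned on this boot:

```python
import os
import tempfile

def resolve_partuuid(partuuid, byid_dir="/dev/disk/by-partuuid"):
    """Resolve a partition uuid to the current kernel device node.

    Follows the udev-maintained symlink, so the answer stays correct
    even when /dev/sdX names shuffle across reboots.
    """
    link = os.path.join(byid_dir, partuuid)
    return os.path.realpath(link)

# Demo with a fake by-partuuid directory (no real devices needed);
# the uuid is the osd.48 db uuid from this thread:
tmp = tempfile.mkdtemp()
os.symlink("/dev/sdaa1",
           os.path.join(tmp, "3381a121-1c1b-4e45-a986-c1871c363edc"))
print(resolve_partuuid("3381a121-1c1b-4e45-a986-c1871c363edc",
                       byid_dir=tmp))  # -> /dev/sdaa1
```

The same resolution works no matter what letter the kernel happens to give the SSD after a reboot, which is why ceph-volume never needs the raw name to be stable.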

IMO bluestore should do #1, because this is already solved in the
ceph-volume code (we knew dev names could change), but #2 and #3 are
OK to help with this issue today.

Another option would be to avoid using a partition for block.db
altogether and use an LV instead.

[0] https://github.com/ceph/ceph/blob/master/src/ceph-volume/ceph_volume/devices/lvm/activate.py#L155-L168


>
> sage
>
>
>>
>> ...
>> Dan & Teo
>>
>> >
>> > -- dan
>> > > Thanks!
>> > > sage
>> > >
>> > > >
>> > > > Here's the osd dir:
>> > > >
>> > > > # ls -l /var/lib/ceph/osd/ceph-48/
>> > > > total 24
>> > > > lrwxrwxrwx. 1 ceph ceph 93 Jun  7 16:46 block ->
>> > > > /dev/ceph-34f24306-d90c-49ff-bafb-2657a6a18010/osd-block-99fd8e36-fc4d-4bbc-83d9-f5e611cde4b5
>> > > > lrwxrwxrwx. 1 root root 10 Jun  7 16:46 block.db -> /dev/sdaa1
>> > > > -rw-------. 1 ceph ceph 37 Jun  7 16:46 ceph_fsid
>> > > > -rw-------. 1 ceph ceph 37 Jun  7 16:46 fsid
>> > > > -rw-------. 1 ceph ceph 56 Jun  7 16:46 keyring
>> > > > -rw-------. 1 ceph ceph  6 Jun  7 16:46 ready
>> > > > -rw-------. 1 ceph ceph 10 Jun  7 16:46 type
>> > > > -rw-------. 1 ceph ceph  3 Jun  7 16:46 whoami
>> > > >
>> > > > # ls -l /dev/ceph-34f24306-d90c-49ff-bafb-2657a6a18010/osd-block-99fd8e36-fc4d-4bbc-83d9-f5e611cde4b5
>> > > > lrwxrwxrwx. 1 root root 7 Jun  7 16:46
>> > > > /dev/ceph-34f24306-d90c-49ff-bafb-2657a6a18010/osd-block-99fd8e36-fc4d-4bbc-83d9-f5e611cde4b5
>> > > > -> ../dm-4
>> > > >
>> > > > # ls -l /dev/dm-4
>> > > > brw-rw----. 1 ceph ceph 253, 4 Jun  7 16:46 /dev/dm-4
>> > > >
>> > > >
>> > > >   --- Logical volume ---
>> > > >   LV Path
>> > > > /dev/ceph-34f24306-d90c-49ff-bafb-2657a6a18010/osd-block-99fd8e36-fc4d-4bbc-83d9-f5e611cde4b5
>> > > >   LV Name                osd-block-99fd8e36-fc4d-4bbc-83d9-f5e611cde4b5
>> > > >   VG Name                ceph-34f24306-d90c-49ff-bafb-2657a6a18010
>> > > >   LV UUID                FQkRxS-No7X-ajkP-5L3N-K22a-IXg6-QLceZC
>> > > >   LV Write Access        read/write
>> > > >   LV Creation host, time p06253939y61826.cern.ch, 2018-03-15 10:57:37 +0100
>> > > >   LV Status              available
>> > > >   # open                 0
>> > > >   LV Size                <5.46 TiB
>> > > >   Current LE             1430791
>> > > >   Segments               1
>> > > >   Allocation             inherit
>> > > >   Read ahead sectors     auto
>> > > >   - currently set to     256
>> > > >   Block device           253:4
>> > > >
>> > > >   --- Physical volume ---
>> > > >   PV Name               /dev/sda
>> > > >   VG Name               ceph-34f24306-d90c-49ff-bafb-2657a6a18010
>> > > >   PV Size               <5.46 TiB / not usable <2.59 MiB
>> > > >   Allocatable           yes (but full)
>> > > >   PE Size               4.00 MiB
>> > > >   Total PE              1430791
>> > > >   Free PE               0
>> > > >   Allocated PE          1430791
>> > > >   PV UUID               WP0Z7C-ejSh-fpSa-a73N-H2Hz-yC78-qBezcI
>> > > >
>> > > > (sorry for wall o' lvm)
>> > > >
>> > > > -- dan
>> > > >
>> > > > > Thanks!
>> > > > > sage
>> > > > >
>> > > > >
>> > > > > > -- dan
>> > > > > >
>> > > > > >
>> > > > > > >
>> > > > > > > sage
>> > > > > > >
>> > > > > > >
>> > > > > > > >
>> > > > > > > > Thanks!
>> > > > > > > >
>> > > > > > > > Dan
>> > > > > > > >
>> > > > > > > > [1]
>> > > > > > > >
>> > > > > > > > ====== osd.48 ======
>> > > > > > > >
>> > > > > > > >   [block]    /dev/ceph-34f24306-d90c-49ff-bafb-2657a6a18010/osd-block-99fd8e36-fc4d-4bbc-83d9-f5e611cde4b5
>> > > > > > > >
>> > > > > > > >       type                      block
>> > > > > > > >       osd id                    48
>> > > > > > > >       cluster fsid              dd535a7e-4647-4bee-853d-f34112615f81
>> > > > > > > >       cluster name              ceph
>> > > > > > > >       osd fsid                  99fd8e36-fc4d-4bbc-83d9-f5e611cde4b5
>> > > > > > > >       db device                 /dev/sdaa1
>> > > > > > > >       encrypted                 0
>> > > > > > > >       db uuid                   3381a121-1c1b-4e45-a986-c1871c363edc
>> > > > > > > >       cephx lockbox secret
>> > > > > > > >       block uuid                FQkRxS-No7X-ajkP-5L3N-K22a-IXg6-QLceZC
>> > > > > > > >       block device
>> > > > > > > > /dev/ceph-34f24306-d90c-49ff-bafb-2657a6a18010/osd-block-99fd8e36-fc4d-4bbc-83d9-f5e611cde4b5
>> > > > > > > >       crush device class        None
>> > > > > > > >
>> > > > > > > >   [  db]    /dev/sdaa1
>> > > > > > > >
>> > > > > > > >       PARTUUID                  3381a121-1c1b-4e45-a986-c1871c363edc
>> > > > > > > >
>> > > > > > > >
>> > > > > > > >
>> > > > > > > > [2]
>> > > > > > > >    -11> 2018-06-07 16:12:16.138407 7fba30fb4d80  1 -- - start start
>> > > > > > > >    -10> 2018-06-07 16:12:16.138516 7fba30fb4d80  1
>> > > > > > > > bluestore(/var/lib/ceph/osd/ceph-48) _mount path /var/lib/ceph/os
>> > > > > > > > d/ceph-48
>> > > > > > > >     -9> 2018-06-07 16:12:16.138801 7fba30fb4d80  1 bdev create path
>> > > > > > > > /var/lib/ceph/osd/ceph-48/block type kernel
>> > > > > > > >     -8> 2018-06-07 16:12:16.138808 7fba30fb4d80  1 bdev(0x55eb46433a00
>> > > > > > > > /var/lib/ceph/osd/ceph-48/block) open path /v
>> > > > > > > > ar/lib/ceph/osd/ceph-48/block
>> > > > > > > >     -7> 2018-06-07 16:12:16.138999 7fba30fb4d80  1 bdev(0x55eb46433a00
>> > > > > > > > /var/lib/ceph/osd/ceph-48/block) open size 60
>> > > > > > > > 01172414464 (0x57541c00000, 5589 GB) block_size 4096 (4096 B) rotational
>> > > > > > > >     -6> 2018-06-07 16:12:16.139188 7fba30fb4d80  1
>> > > > > > > > bluestore(/var/lib/ceph/osd/ceph-48) _set_cache_sizes cache_size
>> > > > > > > > 134217728 meta 0.01 kv 0.99 data 0
>> > > > > > > >     -5> 2018-06-07 16:12:16.139275 7fba30fb4d80  1 bdev create path
>> > > > > > > > /var/lib/ceph/osd/ceph-48/block type kernel
>> > > > > > > >     -4> 2018-06-07 16:12:16.139281 7fba30fb4d80  1 bdev(0x55eb46433c00
>> > > > > > > > /var/lib/ceph/osd/ceph-48/block) open path /v
>> > > > > > > > ar/lib/ceph/osd/ceph-48/block
>> > > > > > > >     -3> 2018-06-07 16:12:16.139454 7fba30fb4d80  1 bdev(0x55eb46433c00
>> > > > > > > > /var/lib/ceph/osd/ceph-48/block) open size 60
>> > > > > > > > 01172414464 (0x57541c00000, 5589 GB) block_size 4096 (4096 B) rotational
>> > > > > > > >     -2> 2018-06-07 16:12:16.139464 7fba30fb4d80  1 bluefs
>> > > > > > > > add_block_device bdev 1 path /var/lib/ceph/osd/ceph-48/blo
>> > > > > > > > ck size 5589 GB
>> > > > > > > >     -1> 2018-06-07 16:12:16.139510 7fba30fb4d80  1 bluefs mount
>> > > > > > > >      0> 2018-06-07 16:12:16.142930 7fba30fb4d80 -1
>> > > > > > > > /home/jenkins-build/build/workspace/ceph-build/ARCH/x86_64/AVAILA
>> > > > > > > > BLE_ARCH/x86_64/AVAILABLE_DIST/centos7/DIST/centos7/MACHINE_SIZE/huge/release/12.2.5/rpm/el7/BUILD/ceph-12.2.5/src/o
>> > > > > > > > s/bluestore/bluefs_types.h: In function 'static void
>> > > > > > > > bluefs_fnode_t::_denc_finish(ceph::buffer::ptr::iterator&, __u8
>> > > > > > > > *, __u8*, char**, uint32_t*)' thread 7fba30fb4d80 time 2018-06-07
>> > > > > > > > 16:12:16.139666
>> > > > > > > > /home/jenkins-build/build/workspace/ceph-build/ARCH/x86_64/AVAILABLE_ARCH/x86_64/AVAILABLE_DIST/centos7/DIST/centos7/MACHINE_SIZE/huge/release/12.2.5/rpm/el7/BUILD/ceph-12.2.5/src/os/bluestore/bluefs_types.h:
>> > > > > > > > 54: FAILED assert(pos <= end)
>> > > > > > > >
>> > > > > > > >  ceph version 12.2.5 (cad919881333ac92274171586c827e01f554a70a)
>> > > > > > > > luminous (stable)
>> > > > > > > >  1: (ceph::__ceph_assert_fail(char const*, char const*, int, char
>> > > > > > > > const*)+0x110) [0x55eb3b597780]
>> > > > > > > >  2: (bluefs_super_t::decode(ceph::buffer::list::iterator&)+0x776)
>> > > > > > > > [0x55eb3b52db36]
>> > > > > > > >  3: (BlueFS::_open_super()+0xfe) [0x55eb3b50cede]
>> > > > > > > >  4: (BlueFS::mount()+0xe3) [0x55eb3b5250c3]
>> > > > > > > >  5: (BlueStore::_open_db(bool)+0x173d) [0x55eb3b43ebcd]
>> > > > > > > >  6: (BlueStore::_mount(bool)+0x40e) [0x55eb3b47025e]
>> > > > > > > >  7: (OSD::init()+0x3bd) [0x55eb3b02a1cd]
>> > > > > > > >  8: (main()+0x2d07) [0x55eb3af2f977]
>> > > > > > > >  9: (__libc_start_main()+0xf5) [0x7fba2d47b445]
>> > > > > > > >  10: (()+0x4b7033) [0x55eb3afce033]
>> > > > > > > >  NOTE: a copy of the executable, or `objdump -rdS <executable>` is
>> > > > > > > > needed to interpret this.
_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


