On Thu, Jun 7, 2018 at 2:45 PM, Dan van der Ster <dan@xxxxxxxxxxxxxx> wrote:
> On Thu, Jun 7, 2018 at 6:58 PM Alfredo Deza <adeza@xxxxxxxxxx> wrote:
>>
>> On Thu, Jun 7, 2018 at 12:09 PM, Sage Weil <sweil@xxxxxxxxxx> wrote:
>> > On Thu, 7 Jun 2018, Dan van der Ster wrote:
>> >> On Thu, Jun 7, 2018 at 5:36 PM Dan van der Ster <dan@xxxxxxxxxxxxxx> wrote:
>> >> > On Thu, Jun 7, 2018 at 5:34 PM Sage Weil <sweil@xxxxxxxxxx> wrote:
>> >> > > On Thu, 7 Jun 2018, Dan van der Ster wrote:
>> >> > > > On Thu, Jun 7, 2018 at 4:41 PM Sage Weil <sweil@xxxxxxxxxx> wrote:
>> >> > > > > On Thu, 7 Jun 2018, Dan van der Ster wrote:
>> >> > > > > > On Thu, Jun 7, 2018 at 4:33 PM Sage Weil <sweil@xxxxxxxxxx> wrote:
>> >> > > > > > > On Thu, 7 Jun 2018, Dan van der Ster wrote:
>> >> > > > > > > > Hi all,
>> >> > > > > > > >
>> >> > > > > > > > We have an intermittent issue where bluestore osds sometimes fail to
>> >> > > > > > > > start after a reboot.
>> >> > > > > > > > The osds all fail the same way [see 2], failing to open the superblock.
>> >> > > > > > > > On one particular host, there are 24 osds and 4 SSDs partitioned for
>> >> > > > > > > > the block.db's. The affected non-starting OSDs all have block.db on
>> >> > > > > > > > the same ssd (/dev/sdaa).
>> >> > > > > > > >
>> >> > > > > > > > The osds are all running 12.2.5 on latest centos 7.5 and were created
>> >> > > > > > > > by ceph-volume lvm, e.g. see [1].
>> >> > > > > > > >
>> >> > > > > > > > This seems like a permissions or similar issue related to the
>> >> > > > > > > > ceph-volume tooling.
>> >> > > > > > > > Any clues how to debug this further?
>> >> > > > > > >
>> >> > > > > > > I take it the OSDs start up if you try again?
>> >> > > > > >
>> >> > > > > > Hey.
>> >> > > > > > No, they don't. For example, we do this `ceph-volume lvm activate 48
>> >> > > > > > 99fd8e36-fc4d-4bbc-83d9-f5e611cde4b5` several times and it's the same
>> >> > > > > > mount failure every time.
>> >> > > > >
>> >> > > > > That sounds like a bluefs bug then, not a ceph-volume issue. Can you
>> >> > > > > try to start the OSD with logging enabled? (debug bluefs = 20,
>> >> > > > > debug bluestore = 20)
>> >> > > >
>> >> > > > Here: https://pastebin.com/TJXZhfcY
>> >> > > >
>> >> > > > Is it supposed to print something about the block.db at some point????
>> >> > >
>> >> > > Can you dump the bluefs superblock for me?
>> >> > >
>> >> > > dd if=/dev/sdaa1 of=/tmp/foo bs=4K skip=1 count=1
>> >> > > hexdump -C /tmp/foo
>> >> >
>> >> > [17:35][root@p06253939y61826 (qa:ceph/dwight/osd*18) ~]# dd if=/dev/sdaa1 of=/tmp/foo bs=4K skip=1 count=1
>> >> > 1+0 records in
>> >> > 1+0 records out
>> >> > 4096 bytes (4.1 kB) copied, 0.000320003 s, 12.8 MB/s
>> >> > [17:35][root@p06253939y61826 (qa:ceph/dwight/osd*18) ~]# hexdump -C /tmp/foo
>> >> > 00000000 01 01 5d 00 00 00 11 fb be 4d 43 31 4a b5 a4 cb |..]......MC1J...|
>> >> > 00000010 99 be b7 da 72 ca 99 fd 8e 36 fc 4d 4b bc 83 d9 |....r....6.MK...|
>> >> > 00000020 f5 e6 11 cd e4 b5 1d 00 00 00 00 00 00 00 00 10 |................|
>> >> > 00000030 00 00 01 01 2b 00 00 00 01 80 80 40 00 00 00 00 |....+......@....|
>> >> > 00000040 00 00 00 00 00 02 00 00 00 01 01 07 00 00 00 eb |................|
>> >> > 00000050 b2 00 00 83 08 01 01 01 07 00 00 00 cb b2 00 00 |................|
>> >> > 00000060 83 20 01 61 6d 07 be 00 00 00 00 00 00 00 00 00 |. .am...........|
>> >> > 00000070 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 |................|
>> >> > *
>> >> > 00001000
>> >>
>> >> Wait, we found something!!!
>> >>
>> >> In the 1st 4k on the block we found the block.db pointing at the wrong
>> >> device (/dev/sdc1 instead of /dev/sdaa1)
>> >>
>> >> 00000130 6b 35 79 2b 67 3d 3d 0d 00 00 00 70 61 74 68 5f |k5y+g==....path_|
>> >> 00000140 62 6c 6f 63 6b 2e 64 62 09 00 00 00 2f 64 65 76 |block.db..../dev|
>> >> 00000150 2f 73 64 63 31 05 00 00 00 72 65 61 64 79 05 00 |/sdc1....ready..|
>> >> 00000160 00 00 72 65 61 64 79 06 00 00 00 77 68 6f 61 6d |..ready....whoam|
>> >> 00000170 69 02 00 00 00 34 38 eb c2 d7 d6 00 00 00 00 00 |i....48.........|
>> >> 00000180 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 |................|
>> >>
>> >> It is similarly wrong for another broken osd.53 (block.db is /dev/sdc2
>> >> instead of /dev/sdaa2).
>> >> And for the osds that are running, that block.db is correct!
>> >>
>> >> So.... the block.db device is persisted in the block header? But after
>> >> a reboot it gets a new name. (sd* naming is famously chaotic.)
>> >> ceph-volume creates a softlink to the correct db dev, but it seems not used?
>> >
>> > Aha, yes.. the bluestore startup code looks for the value in the
>> > superblock before the one in the directory.
>> >
>> > We can either (1) reverse that order, (and/)or (2) make ceph-volume use a
>> > stable path for the device name when creating the bluestore. And/or (3)
>> > use ceph-bluestore-tool set-label-key to fix it if it doesn't match (this
>> > would repair old superblocks... permanently if we use the stable path
>> > name).
>>
>> ceph-volume does not require a stable/persistent path name at all. For
>> partitions we store the partuuid, and *always* make sure that we have
>> the right device, because we query blkid.
>
> ceph-disk didn't require stable path names either, so it's good to
> hear that this feature is still there in ceph-volume lvm.

The only requirement is that the PARTUUID has to be present (if you want to use a partition).

> It's been a bit unclear how to map the concept of a partitioned SSD
> for filestore journals to the bluestore world.
> As I understand it now, what we did is correct and fully supported?
> (see the parted routine earlier in the thread...)

It is fully supported, and we do test this; we just weren't able to hit this case. Again, another option that would prevent this sort of issue would be to use an LV instead of a partition.

> Thanks!
>
> Dan

>> In addition to that, in the case of bluestore, we go by each device
>> and ensure that whatever link was done by bluestore is corrected
>> before attempting to start the OSD [0]
>>
>> IMO bluestore should do #1, because this is already solved in the
>> ceph-volume code (we knew dev names could change), but #2 and #3 are
>> OK to help with this issue today.
>>
>> Another option would be to just avoid using the partition for block.db
>> and just use an LV.
>>
>> [0] https://github.com/ceph/ceph/blob/master/src/ceph-volume/ceph_volume/devices/lvm/activate.py#L155-L168
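A note for anyone else who lands on this thread with the same FAILED assert: with the OSD stopped, Sage's option (3) might look roughly like the command below. This is an untested sketch, not something verified in this thread; it assumes the label key is literally path_block.db (as the hexdump above suggests) and reuses the db partition's PARTUUID from [1] further down to build a device name that survives reboots.

    # rewrite the stale path_block.db entry in the label to a stable by-partuuid path
    ceph-bluestore-tool set-label-key --dev /var/lib/ceph/osd/ceph-48/block \
        -k path_block.db -v /dev/disk/by-partuuid/3381a121-1c1b-4e45-a986-c1871c363edc

One would then retry `ceph-volume lvm activate 48 99fd8e36-fc4d-4bbc-83d9-f5e611cde4b5` to see whether the mount succeeds.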
>> >
>> > sage
>> >
>> >>
>> >> ...
>> >> Dan & Teo
>> >>
>> >> > -- dan
>> >> >
>> >> > > Thanks!
>> >> > > sage
>> >> > >
>> >> > > > Here's the osd dir:
>> >> > > >
>> >> > > > # ls -l /var/lib/ceph/osd/ceph-48/
>> >> > > > total 24
>> >> > > > lrwxrwxrwx. 1 ceph ceph 93 Jun 7 16:46 block ->
>> >> > > > /dev/ceph-34f24306-d90c-49ff-bafb-2657a6a18010/osd-block-99fd8e36-fc4d-4bbc-83d9-f5e611cde4b5
>> >> > > > lrwxrwxrwx. 1 root root 10 Jun 7 16:46 block.db -> /dev/sdaa1
>> >> > > > -rw-------. 1 ceph ceph 37 Jun 7 16:46 ceph_fsid
>> >> > > > -rw-------. 1 ceph ceph 37 Jun 7 16:46 fsid
>> >> > > > -rw-------. 1 ceph ceph 56 Jun 7 16:46 keyring
>> >> > > > -rw-------. 1 ceph ceph 6 Jun 7 16:46 ready
>> >> > > > -rw-------. 1 ceph ceph 10 Jun 7 16:46 type
>> >> > > > -rw-------. 1 ceph ceph 3 Jun 7 16:46 whoami
>> >> > > >
>> >> > > > # ls -l /dev/ceph-34f24306-d90c-49ff-bafb-2657a6a18010/osd-block-99fd8e36-fc4d-4bbc-83d9-f5e611cde4b5
>> >> > > > lrwxrwxrwx. 1 root root 7 Jun 7 16:46
>> >> > > > /dev/ceph-34f24306-d90c-49ff-bafb-2657a6a18010/osd-block-99fd8e36-fc4d-4bbc-83d9-f5e611cde4b5
>> >> > > > -> ../dm-4
>> >> > > >
>> >> > > > # ls -l /dev/dm-4
>> >> > > > brw-rw----. 1 ceph ceph 253, 4 Jun 7 16:46 /dev/dm-4
>> >> > > >
>> >> > > > --- Logical volume ---
>> >> > > > LV Path                /dev/ceph-34f24306-d90c-49ff-bafb-2657a6a18010/osd-block-99fd8e36-fc4d-4bbc-83d9-f5e611cde4b5
>> >> > > > LV Name                osd-block-99fd8e36-fc4d-4bbc-83d9-f5e611cde4b5
>> >> > > > VG Name                ceph-34f24306-d90c-49ff-bafb-2657a6a18010
>> >> > > > LV UUID                FQkRxS-No7X-ajkP-5L3N-K22a-IXg6-QLceZC
>> >> > > > LV Write Access        read/write
>> >> > > > LV Creation host, time p06253939y61826.cern.ch, 2018-03-15 10:57:37 +0100
>> >> > > > LV Status              available
>> >> > > > # open                 0
>> >> > > > LV Size                <5.46 TiB
>> >> > > > Current LE             1430791
>> >> > > > Segments               1
>> >> > > > Allocation             inherit
>> >> > > > Read ahead sectors     auto
>> >> > > > - currently set to     256
>> >> > > > Block device           253:4
>> >> > > >
>> >> > > > --- Physical volume ---
>> >> > > > PV Name                /dev/sda
>> >> > > > VG Name                ceph-34f24306-d90c-49ff-bafb-2657a6a18010
>> >> > > > PV Size                <5.46 TiB / not usable <2.59 MiB
>> >> > > > Allocatable            yes (but full)
>> >> > > > PE Size                4.00 MiB
>> >> > > > Total PE               1430791
>> >> > > > Free PE                0
>> >> > > > Allocated PE           1430791
>> >> > > > PV UUID                WP0Z7C-ejSh-fpSa-a73N-H2Hz-yC78-qBezcI
>> >> > > >
>> >> > > > (sorry for wall o' lvm)
>> >> > > >
>> >> > > > -- dan
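As a side note, not something used in the debugging above but possibly an easier way to look at the same data: the luminous ceph-bluestore-tool can print that first-4K label as JSON, which makes a wrong path_block.db easy to spot without dd and hexdump. Assuming the show-label command is present in the 12.2.5 build on these hosts:

    # dump the bluestore label (osd_uuid, whoami, path_block.db, ...) for osd.48
    ceph-bluestore-tool show-label --dev /var/lib/ceph/osd/ceph-48/block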
>> >> > > > >
>> >> > > > > Thanks!
>> >> > > > > sage
>> >> > > > >
>> >> > > > > > -- dan
>> >> > > > > >
>> >> > > > > > > sage
>> >> > > > > > >
>> >> > > > > > > > Thanks!
>> >> > > > > > > >
>> >> > > > > > > > Dan
>> >> > > > > > > >
>> >> > > > > > > > [1]
>> >> > > > > > > >
>> >> > > > > > > > ====== osd.48 ======
>> >> > > > > > > >
>> >> > > > > > > >   [block]    /dev/ceph-34f24306-d90c-49ff-bafb-2657a6a18010/osd-block-99fd8e36-fc4d-4bbc-83d9-f5e611cde4b5
>> >> > > > > > > >
>> >> > > > > > > >       type                      block
>> >> > > > > > > >       osd id                    48
>> >> > > > > > > >       cluster fsid              dd535a7e-4647-4bee-853d-f34112615f81
>> >> > > > > > > >       cluster name              ceph
>> >> > > > > > > >       osd fsid                  99fd8e36-fc4d-4bbc-83d9-f5e611cde4b5
>> >> > > > > > > >       db device                 /dev/sdaa1
>> >> > > > > > > >       encrypted                 0
>> >> > > > > > > >       db uuid                   3381a121-1c1b-4e45-a986-c1871c363edc
>> >> > > > > > > >       cephx lockbox secret
>> >> > > > > > > >       block uuid                FQkRxS-No7X-ajkP-5L3N-K22a-IXg6-QLceZC
>> >> > > > > > > >       block device              /dev/ceph-34f24306-d90c-49ff-bafb-2657a6a18010/osd-block-99fd8e36-fc4d-4bbc-83d9-f5e611cde4b5
>> >> > > > > > > >       crush device class        None
>> >> > > > > > > >
>> >> > > > > > > >   [  db]     /dev/sdaa1
>> >> > > > > > > >
>> >> > > > > > > >       PARTUUID                  3381a121-1c1b-4e45-a986-c1871c363edc
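Since sd* names can move around between boots, a quick way to check which device node currently backs that db partition is to resolve the PARTUUID shown above. This is just a generic udev/blkid check, nothing ceph-specific, and either command may need root:

    # both of these should point at whatever /dev/sdXN currently carries the db partition
    ls -l /dev/disk/by-partuuid/3381a121-1c1b-4e45-a986-c1871c363edc
    blkid -t PARTUUID=3381a121-1c1b-4e45-a986-c1871c363edc -o device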
>> >> > > > > > > >
>> >> > > > > > > > [2]
>> >> > > > > > > >    -11> 2018-06-07 16:12:16.138407 7fba30fb4d80  1 -- - start start
>> >> > > > > > > >    -10> 2018-06-07 16:12:16.138516 7fba30fb4d80  1 bluestore(/var/lib/ceph/osd/ceph-48) _mount path /var/lib/ceph/osd/ceph-48
>> >> > > > > > > >     -9> 2018-06-07 16:12:16.138801 7fba30fb4d80  1 bdev create path /var/lib/ceph/osd/ceph-48/block type kernel
>> >> > > > > > > >     -8> 2018-06-07 16:12:16.138808 7fba30fb4d80  1 bdev(0x55eb46433a00 /var/lib/ceph/osd/ceph-48/block) open path /var/lib/ceph/osd/ceph-48/block
>> >> > > > > > > >     -7> 2018-06-07 16:12:16.138999 7fba30fb4d80  1 bdev(0x55eb46433a00 /var/lib/ceph/osd/ceph-48/block) open size 6001172414464 (0x57541c00000, 5589 GB) block_size 4096 (4096 B) rotational
>> >> > > > > > > >     -6> 2018-06-07 16:12:16.139188 7fba30fb4d80  1 bluestore(/var/lib/ceph/osd/ceph-48) _set_cache_sizes cache_size 134217728 meta 0.01 kv 0.99 data 0
>> >> > > > > > > >     -5> 2018-06-07 16:12:16.139275 7fba30fb4d80  1 bdev create path /var/lib/ceph/osd/ceph-48/block type kernel
>> >> > > > > > > >     -4> 2018-06-07 16:12:16.139281 7fba30fb4d80  1 bdev(0x55eb46433c00 /var/lib/ceph/osd/ceph-48/block) open path /var/lib/ceph/osd/ceph-48/block
>> >> > > > > > > >     -3> 2018-06-07 16:12:16.139454 7fba30fb4d80  1 bdev(0x55eb46433c00 /var/lib/ceph/osd/ceph-48/block) open size 6001172414464 (0x57541c00000, 5589 GB) block_size 4096 (4096 B) rotational
>> >> > > > > > > >     -2> 2018-06-07 16:12:16.139464 7fba30fb4d80  1 bluefs add_block_device bdev 1 path /var/lib/ceph/osd/ceph-48/block size 5589 GB
>> >> > > > > > > >     -1> 2018-06-07 16:12:16.139510 7fba30fb4d80  1 bluefs mount
>> >> > > > > > > >      0> 2018-06-07 16:12:16.142930 7fba30fb4d80 -1 /home/jenkins-build/build/workspace/ceph-build/ARCH/x86_64/AVAILABLE_ARCH/x86_64/AVAILABLE_DIST/centos7/DIST/centos7/MACHINE_SIZE/huge/release/12.2.5/rpm/el7/BUILD/ceph-12.2.5/src/os/bluestore/bluefs_types.h: In function 'static void bluefs_fnode_t::_denc_finish(ceph::buffer::ptr::iterator&, __u8 *, __u8*, char**, uint32_t*)' thread 7fba30fb4d80 time 2018-06-07 16:12:16.139666
>> >> > > > > > > > /home/jenkins-build/build/workspace/ceph-build/ARCH/x86_64/AVAILABLE_ARCH/x86_64/AVAILABLE_DIST/centos7/DIST/centos7/MACHINE_SIZE/huge/release/12.2.5/rpm/el7/BUILD/ceph-12.2.5/src/os/bluestore/bluefs_types.h: 54: FAILED assert(pos <= end)
>> >> > > > > > > >
>> >> > > > > > > >  ceph version 12.2.5 (cad919881333ac92274171586c827e01f554a70a) luminous (stable)
>> >> > > > > > > >  1: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x110) [0x55eb3b597780]
>> >> > > > > > > >  2: (bluefs_super_t::decode(ceph::buffer::list::iterator&)+0x776) [0x55eb3b52db36]
>> >> > > > > > > >  3: (BlueFS::_open_super()+0xfe) [0x55eb3b50cede]
>> >> > > > > > > >  4: (BlueFS::mount()+0xe3) [0x55eb3b5250c3]
>> >> > > > > > > >  5: (BlueStore::_open_db(bool)+0x173d) [0x55eb3b43ebcd]
>> >> > > > > > > >  6: (BlueStore::_mount(bool)+0x40e) [0x55eb3b47025e]
>> >> > > > > > > >  7: (OSD::init()+0x3bd) [0x55eb3b02a1cd]
>> >> > > > > > > >  8: (main()+0x2d07) [0x55eb3af2f977]
>> >> > > > > > > >  9: (__libc_start_main()+0xf5) [0x7fba2d47b445]
>> >> > > > > > > >  10: (()+0x4b7033) [0x55eb3afce033]
>> >> > > > > > > >  NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.
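Finally, for completeness, the LV-based block.db layout that Alfredo suggests would be set up along these lines when creating a new OSD. The VG/LV names and the 30G size are made up for illustration, and this is only a sketch of the approach, not a conversion procedure for the existing OSDs:

    # carve the SSD into LVs for DB volumes instead of GPT partitions
    vgcreate ceph-db-sdaa /dev/sdaa
    lvcreate -L 30G -n db-48 ceph-db-sdaa

    # let ceph-volume build the OSD with an LV-backed block.db
    ceph-volume lvm prepare --bluestore --data /dev/sda --block.db ceph-db-sdaa/db-48
    ceph-volume lvm activate --all

Because LVM tracks its volumes by UUID, the resulting block.db reference does not depend on which /dev/sdX name the SSD gets after a reboot.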