On Thu, Jun 7, 2018 at 3:04 PM, Dan van der Ster <dan@xxxxxxxxxxxxxx> wrote:
> On Thu, Jun 7, 2018 at 8:58 PM Alfredo Deza <adeza@xxxxxxxxxx> wrote:
>> On Thu, Jun 7, 2018 at 2:45 PM, Dan van der Ster <dan@xxxxxxxxxxxxxx> wrote:
>> > On Thu, Jun 7, 2018 at 6:58 PM Alfredo Deza <adeza@xxxxxxxxxx> wrote:
>> >> On Thu, Jun 7, 2018 at 12:09 PM, Sage Weil <sweil@xxxxxxxxxx> wrote:
>> >> > On Thu, 7 Jun 2018, Dan van der Ster wrote:
>> >> >> On Thu, Jun 7, 2018 at 5:36 PM Dan van der Ster <dan@xxxxxxxxxxxxxx> wrote:
>> >> >> > On Thu, Jun 7, 2018 at 5:34 PM Sage Weil <sweil@xxxxxxxxxx> wrote:
>> >> >> > > On Thu, 7 Jun 2018, Dan van der Ster wrote:
>> >> >> > > > On Thu, Jun 7, 2018 at 4:41 PM Sage Weil <sweil@xxxxxxxxxx> wrote:
>> >> >> > > > > On Thu, 7 Jun 2018, Dan van der Ster wrote:
>> >> >> > > > > > On Thu, Jun 7, 2018 at 4:33 PM Sage Weil <sweil@xxxxxxxxxx> wrote:
>> >> >> > > > > > > On Thu, 7 Jun 2018, Dan van der Ster wrote:
>> >> >> > > > > > > > Hi all,
>> >> >> > > > > > > >
>> >> >> > > > > > > > We have an intermittent issue where bluestore OSDs sometimes fail to
>> >> >> > > > > > > > start after a reboot. The OSDs all fail the same way [see 2], failing
>> >> >> > > > > > > > to open the superblock. On one particular host there are 24 OSDs and
>> >> >> > > > > > > > 4 SSDs partitioned for the block.db's. The affected non-starting OSDs
>> >> >> > > > > > > > all have their block.db on the same SSD (/dev/sdaa).
>> >> >> > > > > > > >
>> >> >> > > > > > > > The OSDs are all running 12.2.5 on the latest CentOS 7.5 and were
>> >> >> > > > > > > > created by ceph-volume lvm, e.g. see [1].
>> >> >> > > > > > > >
>> >> >> > > > > > > > This seems like a permissions or similar issue related to the
>> >> >> > > > > > > > ceph-volume tooling. Any clues how to debug this further?
>> >> >> > > > > > >
>> >> >> > > > > > > I take it the OSDs start up if you try again?
>> >> >> > > > > >
>> >> >> > > > > > Hey. No, they don't. For example, we do `ceph-volume lvm activate 48
>> >> >> > > > > > 99fd8e36-fc4d-4bbc-83d9-f5e611cde4b5` several times and it's the same
>> >> >> > > > > > mount failure every time.
>> >> >> > > > >
>> >> >> > > > > That sounds like a bluefs bug then, not a ceph-volume issue. Can you
>> >> >> > > > > try to start the OSD with logging enabled? (debug bluefs = 20,
>> >> >> > > > > debug bluestore = 20)
>> >> >> > > >
>> >> >> > > > Here: https://pastebin.com/TJXZhfcY
>> >> >> > > >
>> >> >> > > > Is it supposed to print something about the block.db at some point?
>> >> >> > >
>> >> >> > > Can you dump the bluefs superblock for me?
>> >> >> > >
>> >> >> > > dd if=/dev/sdaa1 of=/tmp/foo bs=4K skip=1 count=1
>> >> >> > > hexdump -C /tmp/foo
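A note on the offsets in that dd, for anyone following along (this assumes the usual bluestore on-disk layout, which the rest of the thread bears out): the first 4K of each bluestore device holds a small key/value label, and the copy on the main block device carries keys such as whoami and path_block.db, while the bluefs superblock sits in the second 4K of the DB device, hence skip=1. A quick way to grab both regions for inspection is something like:

  dd if=/var/lib/ceph/osd/ceph-48/block of=/tmp/label bs=4K count=1
  dd if=/dev/sdaa1 of=/tmp/bluefs-super bs=4K skip=1 count=1
  hexdump -C /tmp/label | less

The /tmp file names are arbitrary, and the device paths are the ones used for osd.48 in this thread.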
>> >> >> >
>> >> >> > [17:35][root@p06253939y61826 (qa:ceph/dwight/osd*18) ~]# dd if=/dev/sdaa1 of=/tmp/foo bs=4K skip=1 count=1
>> >> >> > 1+0 records in
>> >> >> > 1+0 records out
>> >> >> > 4096 bytes (4.1 kB) copied, 0.000320003 s, 12.8 MB/s
>> >> >> > [17:35][root@p06253939y61826 (qa:ceph/dwight/osd*18) ~]# hexdump -C /tmp/foo
>> >> >> > 00000000  01 01 5d 00 00 00 11 fb  be 4d 43 31 4a b5 a4 cb  |..]......MC1J...|
>> >> >> > 00000010  99 be b7 da 72 ca 99 fd  8e 36 fc 4d 4b bc 83 d9  |....r....6.MK...|
>> >> >> > 00000020  f5 e6 11 cd e4 b5 1d 00  00 00 00 00 00 00 00 10  |................|
>> >> >> > 00000030  00 00 01 01 2b 00 00 00  01 80 80 40 00 00 00 00  |....+......@....|
>> >> >> > 00000040  00 00 00 00 00 02 00 00  00 01 01 07 00 00 00 eb  |................|
>> >> >> > 00000050  b2 00 00 83 08 01 01 01  07 00 00 00 cb b2 00 00  |................|
>> >> >> > 00000060  83 20 01 61 6d 07 be 00  00 00 00 00 00 00 00 00  |. .am...........|
>> >> >> > 00000070  00 00 00 00 00 00 00 00  00 00 00 00 00 00 00 00  |................|
>> >> >> > *
>> >> >> > 00001000
>> >> >>
>> >> >> Wait, we found something!!!
>> >> >>
>> >> >> In the first 4K on the block device we found the block.db pointing at the
>> >> >> wrong device (/dev/sdc1 instead of /dev/sdaa1):
>> >> >>
>> >> >> 00000130  6b 35 79 2b 67 3d 3d 0d  00 00 00 70 61 74 68 5f  |k5y+g==....path_|
>> >> >> 00000140  62 6c 6f 63 6b 2e 64 62  09 00 00 00 2f 64 65 76  |block.db..../dev|
>> >> >> 00000150  2f 73 64 63 31 05 00 00  00 72 65 61 64 79 05 00  |/sdc1....ready..|
>> >> >> 00000160  00 00 72 65 61 64 79 06  00 00 00 77 68 6f 61 6d  |..ready....whoam|
>> >> >> 00000170  69 02 00 00 00 34 38 eb  c2 d7 d6 00 00 00 00 00  |i....48.........|
>> >> >> 00000180  00 00 00 00 00 00 00 00  00 00 00 00 00 00 00 00  |................|
>> >> >>
>> >> >> It is similarly wrong for another broken OSD, osd.53 (block.db is /dev/sdc2
>> >> >> instead of /dev/sdaa2). And for the OSDs that are running, that block.db is
>> >> >> correct!
>> >> >>
>> >> >> So... the block.db device is persisted in the block header? But after a
>> >> >> reboot it gets a new name (sd* naming is famously chaotic). ceph-volume
>> >> >> creates a symlink to the correct db device, but it seems it is not used?
>> >> >
>> >> > Aha, yes.. the bluestore startup code looks for the value in the
>> >> > superblock before the one in the directory.
>> >> >
>> >> > We can either (1) reverse that order, and/or (2) make ceph-volume use a
>> >> > stable path for the device name when creating the bluestore, and/or (3)
>> >> > use ceph-bluestore-tool set-label-key to fix it if it doesn't match (this
>> >> > would repair old superblocks... permanently if we use the stable path
>> >> > name).
>> >>
>> >> ceph-volume does not require a stable/persistent path name at all. For
>> >> partitions we store the PARTUUID, and *always* make sure that we have
>> >> the right device, because we query blkid.
>> >
>> > ceph-disk didn't require stable path names either, so it's good to
>> > hear that this feature is still there in ceph-volume lvm.
>>
>> The only requirement is that the PARTUUID has to be present (if you want
>> to use a partition).
>>
>> > It's been a bit unclear how to map the concept of a partitioned SSD
>> > for filestore journals to the bluestore world. As I understand it now,
>> > what we did is correct and fully supported?
>> > (see the parted routine earlier in the thread...)
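If option (3) is the route taken, it would presumably look something like the following, run with the OSD stopped (a sketch only: set-label-key is the subcommand Sage names above, the --dev/-k/-v spelling follows the ceph-bluestore-tool help, and the by-partuuid path is built from the db uuid that ceph-volume lvm list reports for osd.48 in [1] below):

  ceph-bluestore-tool show-label --dev /var/lib/ceph/osd/ceph-48/block
  ceph-bluestore-tool set-label-key --dev /var/lib/ceph/osd/ceph-48/block \
      -k path_block.db -v /dev/disk/by-partuuid/3381a121-1c1b-4e45-a986-c1871c363edc

show-label prints the same key/value pairs the hexdump above exposes, so it doubles as a check that the rewrite took; pointing the value at the by-partuuid symlink rather than /dev/sdaa1 sidesteps the sd* renaming entirely.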
>>
>> It is fully supported, and we do test this; we just weren't able to hit
>> this case. Again, another option that would prevent this sort of issue
>> would be to use an LV instead of a partition.
>
> Sage mentioned earlier that normally ceph-volume creates an LV for the
> db or wal device.

No, ceph-volume doesn't do that. You have three options (as described
here [0]):

* create an LV for data, and partition(s) for db/wal, and pass them to
  ceph-volume create
* create an LV for data, and LVs for db/wal, and pass them to
  ceph-volume create
* or pass a whole device, so that ceph-volume creates the vg/lv for you,
  with the caveat that no separate wal or db is done; basically a 1:1
  mapping of the device to the OSD

[0] http://docs.ceph.com/docs/master/ceph-volume/lvm/prepare/#bluestore

> How do we trigger that feature?
> I've understood from the docs that an LV for db or wal would be up to
> the operator to pre-create and pass to c-v lvm create...

That is correct. Only if you pass a raw device as the *single* input will
you get an LV created.

> Cheers, Dan
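Concretely, the three forms listed above look roughly like this (a sketch following [0]; the vg/lv names are invented for illustration, and ceph-volume lvm prepare takes the same flags if you prefer to split prepare and activate):

  data on an LV, block.db on a GPT partition (the layout used in this thread):
    ceph-volume lvm create --bluestore --data ceph-vg/osd-block-48 --block.db /dev/sdaa1

  data and db both on pre-created LVs:
    ceph-volume lvm create --bluestore --data ceph-vg/osd-block-48 --block.db ceph-db/db-48

  a whole raw device, where ceph-volume creates the vg/lv itself (no separate db/wal):
    ceph-volume lvm create --bluestore --data /dev/sda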
>> >
>> > Thanks!
>> >
>> > Dan
>>
>> >> In addition to that, in the case of bluestore, we go by each device
>> >> and ensure that whatever link was done by bluestore is corrected
>> >> before attempting to start the OSD [0].
>> >>
>> >> IMO bluestore should do #1, because this is already solved in the
>> >> ceph-volume code (we knew dev names could change), but #2 and #3 are
>> >> OK to help with this issue today.
>> >>
>> >> Another option would be to just avoid using the partition for block.db
>> >> and just use an LV.
>> >>
>> >> [0] https://github.com/ceph/ceph/blob/master/src/ceph-volume/ceph_volume/devices/lvm/activate.py#L155-L168
>> >>
>> >> > sage
>> >>
>> >> >> ...
>> >> >> Dan & Teo
>> >> >>
>> >> >> > -- dan
>> >> >> >
>> >> >> > > Thanks!
>> >> >> > > sage
>> >> >> > >
>> >> >> > > > Here's the osd dir:
>> >> >> > > >
>> >> >> > > > # ls -l /var/lib/ceph/osd/ceph-48/
>> >> >> > > > total 24
>> >> >> > > > lrwxrwxrwx. 1 ceph ceph 93 Jun 7 16:46 block ->
>> >> >> > > >   /dev/ceph-34f24306-d90c-49ff-bafb-2657a6a18010/osd-block-99fd8e36-fc4d-4bbc-83d9-f5e611cde4b5
>> >> >> > > > lrwxrwxrwx. 1 root root 10 Jun 7 16:46 block.db -> /dev/sdaa1
>> >> >> > > > -rw-------. 1 ceph ceph 37 Jun 7 16:46 ceph_fsid
>> >> >> > > > -rw-------. 1 ceph ceph 37 Jun 7 16:46 fsid
>> >> >> > > > -rw-------. 1 ceph ceph 56 Jun 7 16:46 keyring
>> >> >> > > > -rw-------. 1 ceph ceph  6 Jun 7 16:46 ready
>> >> >> > > > -rw-------. 1 ceph ceph 10 Jun 7 16:46 type
>> >> >> > > > -rw-------. 1 ceph ceph  3 Jun 7 16:46 whoami
>> >> >> > > >
>> >> >> > > > # ls -l /dev/ceph-34f24306-d90c-49ff-bafb-2657a6a18010/osd-block-99fd8e36-fc4d-4bbc-83d9-f5e611cde4b5
>> >> >> > > > lrwxrwxrwx. 1 root root 7 Jun 7 16:46
>> >> >> > > >   /dev/ceph-34f24306-d90c-49ff-bafb-2657a6a18010/osd-block-99fd8e36-fc4d-4bbc-83d9-f5e611cde4b5 -> ../dm-4
>> >> >> > > >
>> >> >> > > > # ls -l /dev/dm-4
>> >> >> > > > brw-rw----. 1 ceph ceph 253, 4 Jun 7 16:46 /dev/dm-4
>> >> >> > > >
>> >> >> > > > --- Logical volume ---
>> >> >> > > > LV Path                /dev/ceph-34f24306-d90c-49ff-bafb-2657a6a18010/osd-block-99fd8e36-fc4d-4bbc-83d9-f5e611cde4b5
>> >> >> > > > LV Name                osd-block-99fd8e36-fc4d-4bbc-83d9-f5e611cde4b5
>> >> >> > > > VG Name                ceph-34f24306-d90c-49ff-bafb-2657a6a18010
>> >> >> > > > LV UUID                FQkRxS-No7X-ajkP-5L3N-K22a-IXg6-QLceZC
>> >> >> > > > LV Write Access        read/write
>> >> >> > > > LV Creation host, time p06253939y61826.cern.ch, 2018-03-15 10:57:37 +0100
>> >> >> > > > LV Status              available
>> >> >> > > > # open                 0
>> >> >> > > > LV Size                <5.46 TiB
>> >> >> > > > Current LE             1430791
>> >> >> > > > Segments               1
>> >> >> > > > Allocation             inherit
>> >> >> > > > Read ahead sectors     auto
>> >> >> > > > - currently set to     256
>> >> >> > > > Block device           253:4
>> >> >> > > >
>> >> >> > > > --- Physical volume ---
>> >> >> > > > PV Name                /dev/sda
>> >> >> > > > VG Name                ceph-34f24306-d90c-49ff-bafb-2657a6a18010
>> >> >> > > > PV Size                <5.46 TiB / not usable <2.59 MiB
>> >> >> > > > Allocatable            yes (but full)
>> >> >> > > > PE Size                4.00 MiB
>> >> >> > > > Total PE               1430791
>> >> >> > > > Free PE                0
>> >> >> > > > Allocated PE           1430791
>> >> >> > > > PV UUID                WP0Z7C-ejSh-fpSa-a73N-H2Hz-yC78-qBezcI
>> >> >> > > >
>> >> >> > > > (sorry for wall o' lvm)
>> >> >> > > >
>> >> >> > > > -- dan
>> >> >> > > >
>> >> >> > > > > Thanks!
>> >> >> > > > > sage
>> >> >> > > > >
>> >> >> > > > > > -- dan
>> >> >> > > > > >
>> >> >> > > > > > > sage
>> >> >> > > > > > >
>> >> >> > > > > > > > Thanks!
>> >> >> > > > > > > > Dan
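A footnote to Alfredo's point about PARTUUIDs: the db uuid recorded for osd.48 (visible in the ceph-volume lvm list output under [1] just below) can always be resolved to whatever kernel name the partition was given after the reboot, for example via the udev symlinks; the blkid form is roughly what ceph-volume queries, though the exact invocation here is an assumption:

  readlink -f /dev/disk/by-partuuid/3381a121-1c1b-4e45-a986-c1871c363edc
  blkid -o device -t PARTUUID=3381a121-1c1b-4e45-a986-c1871c363edc

Either answer is the device that block.db should really point at, regardless of which /dev/sdX name it happens to carry today.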
>> >> >> > > > > > > > [1]
>> >> >> > > > > > > >
>> >> >> > > > > > > > ====== osd.48 ======
>> >> >> > > > > > > >
>> >> >> > > > > > > >   [block]    /dev/ceph-34f24306-d90c-49ff-bafb-2657a6a18010/osd-block-99fd8e36-fc4d-4bbc-83d9-f5e611cde4b5
>> >> >> > > > > > > >
>> >> >> > > > > > > >       type                      block
>> >> >> > > > > > > >       osd id                    48
>> >> >> > > > > > > >       cluster fsid              dd535a7e-4647-4bee-853d-f34112615f81
>> >> >> > > > > > > >       cluster name              ceph
>> >> >> > > > > > > >       osd fsid                  99fd8e36-fc4d-4bbc-83d9-f5e611cde4b5
>> >> >> > > > > > > >       db device                 /dev/sdaa1
>> >> >> > > > > > > >       encrypted                 0
>> >> >> > > > > > > >       db uuid                   3381a121-1c1b-4e45-a986-c1871c363edc
>> >> >> > > > > > > >       cephx lockbox secret
>> >> >> > > > > > > >       block uuid                FQkRxS-No7X-ajkP-5L3N-K22a-IXg6-QLceZC
>> >> >> > > > > > > >       block device              /dev/ceph-34f24306-d90c-49ff-bafb-2657a6a18010/osd-block-99fd8e36-fc4d-4bbc-83d9-f5e611cde4b5
>> >> >> > > > > > > >       crush device class        None
>> >> >> > > > > > > >
>> >> >> > > > > > > >   [  db]     /dev/sdaa1
>> >> >> > > > > > > >
>> >> >> > > > > > > >       PARTUUID                  3381a121-1c1b-4e45-a986-c1871c363edc
>> >> >> > > > > > > >
>> >> >> > > > > > > > [2]
>> >> >> > > > > > > >    -11> 2018-06-07 16:12:16.138407 7fba30fb4d80  1 -- - start start
>> >> >> > > > > > > >    -10> 2018-06-07 16:12:16.138516 7fba30fb4d80  1 bluestore(/var/lib/ceph/osd/ceph-48) _mount path /var/lib/ceph/osd/ceph-48
>> >> >> > > > > > > >     -9> 2018-06-07 16:12:16.138801 7fba30fb4d80  1 bdev create path /var/lib/ceph/osd/ceph-48/block type kernel
>> >> >> > > > > > > >     -8> 2018-06-07 16:12:16.138808 7fba30fb4d80  1 bdev(0x55eb46433a00 /var/lib/ceph/osd/ceph-48/block) open path /var/lib/ceph/osd/ceph-48/block
>> >> >> > > > > > > >     -7> 2018-06-07 16:12:16.138999 7fba30fb4d80  1 bdev(0x55eb46433a00 /var/lib/ceph/osd/ceph-48/block) open size 6001172414464 (0x57541c00000, 5589 GB) block_size 4096 (4096 B) rotational
>> >> >> > > > > > > >     -6> 2018-06-07 16:12:16.139188 7fba30fb4d80  1 bluestore(/var/lib/ceph/osd/ceph-48) _set_cache_sizes cache_size 134217728 meta 0.01 kv 0.99 data 0
>> >> >> > > > > > > >     -5> 2018-06-07 16:12:16.139275 7fba30fb4d80  1 bdev create path /var/lib/ceph/osd/ceph-48/block type kernel
>> >> >> > > > > > > >     -4> 2018-06-07 16:12:16.139281 7fba30fb4d80  1 bdev(0x55eb46433c00 /var/lib/ceph/osd/ceph-48/block) open path /var/lib/ceph/osd/ceph-48/block
>> >> >> > > > > > > >     -3> 2018-06-07 16:12:16.139454 7fba30fb4d80  1 bdev(0x55eb46433c00 /var/lib/ceph/osd/ceph-48/block) open size 6001172414464 (0x57541c00000, 5589 GB) block_size 4096 (4096 B) rotational
>> >> >> > > > > > > >     -2> 2018-06-07 16:12:16.139464 7fba30fb4d80  1 bluefs add_block_device bdev 1 path /var/lib/ceph/osd/ceph-48/block size 5589 GB
>> >> >> > > > > > > >     -1> 2018-06-07 16:12:16.139510 7fba30fb4d80  1 bluefs mount
>> >> >> > > > > > > >      0> 2018-06-07 16:12:16.142930 7fba30fb4d80 -1 /home/jenkins-build/build/workspace/ceph-build/ARCH/x86_64/AVAILABLE_ARCH/x86_64/AVAILABLE_DIST/centos7/DIST/centos7/MACHINE_SIZE/huge/release/12.2.5/rpm/el7/BUILD/ceph-12.2.5/src/os/bluestore/bluefs_types.h: In function 'static void bluefs_fnode_t::_denc_finish(ceph::buffer::ptr::iterator&, __u8*, __u8*, char**, uint32_t*)' thread 7fba30fb4d80 time 2018-06-07 16:12:16.139666
>> >> >> > > > > > > > /home/jenkins-build/build/workspace/ceph-build/ARCH/x86_64/AVAILABLE_ARCH/x86_64/AVAILABLE_DIST/centos7/DIST/centos7/MACHINE_SIZE/huge/release/12.2.5/rpm/el7/BUILD/ceph-12.2.5/src/os/bluestore/bluefs_types.h: 54: FAILED assert(pos <= end)
>> >> >> > > > > > > >
>> >> >> > > > > > > >  ceph version 12.2.5 (cad919881333ac92274171586c827e01f554a70a) luminous (stable)
>> >> >> > > > > > > >  1: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x110) [0x55eb3b597780]
>> >> >> > > > > > > >  2: (bluefs_super_t::decode(ceph::buffer::list::iterator&)+0x776) [0x55eb3b52db36]
>> >> >> > > > > > > >  3: (BlueFS::_open_super()+0xfe) [0x55eb3b50cede]
>> >> >> > > > > > > >  4: (BlueFS::mount()+0xe3) [0x55eb3b5250c3]
>> >> >> > > > > > > >  5: (BlueStore::_open_db(bool)+0x173d) [0x55eb3b43ebcd]
>> >> >> > > > > > > >  6: (BlueStore::_mount(bool)+0x40e) [0x55eb3b47025e]
>> >> >> > > > > > > >  7: (OSD::init()+0x3bd) [0x55eb3b02a1cd]
>> >> >> > > > > > > >  8: (main()+0x2d07) [0x55eb3af2f977]
>> >> >> > > > > > > >  9: (__libc_start_main()+0xf5) [0x7fba2d47b445]
>> >> >> > > > > > > >  10: (()+0x4b7033) [0x55eb3afce033]
>> >> >> > > > > > > >  NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.
_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com