Re: Core dump blue store luminous 12.2.7

Hi,

The missing "ln -snf ..." is probably related to missing LV tags. When we had to migrate OSD journals to another SSD because of a failed SSD, we noticed the same difference compared to new (healthy) OSDs. Compare the tags of your Logical Volumes to their actual UUIDs, and check that all the other information matches your actual setup.

lvs --noheadings -o +lv_tags [YOUR_LV]

shows you the tags of one or all of your LVs. Probably there's a wrong UUID on the respective wal/db LV; the tag is something like ceph.wal_uuid (I don't have access to our cluster right now).
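For example (just a sketch; the VG name "inaugurator" and the osd path ceph-5 are taken from the paths in your log, and the exact tag names may differ on your cluster), you can list the real LV UUIDs next to the stored tags and compare them:

# list the LV name, its actual UUID and the ceph.* tags stored on it
lvs --noheadings -o lv_name,lv_uuid,lv_tags inaugurator

# the block/db/wal symlinks the OSD expects should point at those LVs
ls -l /var/lib/ceph/osd/ceph-5/block*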

If you find the wrong UUID, just delete the existing tag with

lvchange --deltag ...

and add the correct tag with

lvchange --addtag ...
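
As a sketch (the device path is taken from your log, the UUIDs are placeholders, and the exact tag name should be double-checked against the lvs output above):

# remove the stale tag from the wal LV
lvchange --deltag "ceph.wal_uuid=OLD-UUID" /dev/inaugurator/ed5a15e8-20b9-4312-991c-1a4d91b284bd-wal

# add the tag with the UUID that matches the actual LV
lvchange --addtag "ceph.wal_uuid=CORRECT-UUID" /dev/inaugurator/ed5a15e8-20b9-4312-991c-1a4d91b284bd-wal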

I hope this helps to resolve your issue.

Regards,
Eugen


Quoting Benoit Hudzia <benoit@xxxxxxxxxxxxxxx>:

Hi, I got another failure and this time I was able to investigate a bit.

1. If I delete the OSD and recreate it with the exact same setup, the OSD
boots up successfully.
2. However, diffing the log between the failed run and the successful one,
I noticed something odd: https://www.diffchecker.com/sSHrxwC9

In every successful OSD startup, the following lines are executed:

Running command: ln -snf
/dev/inaugurator/ed5a15e8-20b9-4312-991c-1a4d91b284bd-wal
/var/lib/ceph/osd/ceph-5/block.wal
Running command: chown -h ceph:ceph
/dev/inaugurator/ed5a15e8-20b9-4312-991c-1a4d91b284bd-wal


However, in every failed run these two lines are missing. Any idea why this
would occur?


Last but not least: I have set the log level to 20; however, it seems that
BlueStore crashes before even getting to the point where things are logged.

Regards
Benoit



On Mon, 6 Aug 2018 at 13:07, Benoit Hudzia <benoit@xxxxxxxxxxxxxxx> wrote:

Thanks, I'll try to check if I can reproduce it. It's really sporadic and
occurs every 20-30 runs. I might check whether it always occurs on the same
server; maybe it's a hardware issue.
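If it does turn out to be one server, something like the following is a quick sketch of what I'd check on it (the device name is a placeholder):

# look for I/O, memory or filesystem errors around the time of the crash
dmesg -T | grep -iE 'error|fault|i/o'

# check SMART data on the drive backing the failing OSD (adjust the device)
smartctl -a /dev/sdX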

On Mon, 6 Aug 2018 at 06:12, Gregory Farnum <gfarnum@xxxxxxxxxx> wrote:

This isn't very complete as it just indicates that something went wrong
with a read. Since I presume it happens on every startup, it may help if
you set "debug bluestore = 20" in the OSD's config and provide that log
(perhaps with ceph-post-file if it's large).
I also went through my email and found
https://tracker.ceph.com/issues/24639, in case you have anything in common
with that deployment. (But you probably don't; an error on read generally
is about bad state on disk that was created somewhere else.)
-Greg
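
(For reference, a minimal sketch of that suggestion, assuming the failing OSD is osd.7 as in the log below and the default log location:)

# in ceph.conf on the OSD host, under the [osd] section, then restart the OSD
debug bluestore = 20

# upload the (potentially large) log for the developers
ceph-post-file /var/log/ceph/ceph-osd.7.log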

On Sun, Aug 5, 2018 at 3:19 PM Benoit Hudzia <benoit@xxxxxxxxxxxxxxx>
wrote:

Hi,

We have started to see core dumps occurring with luminous 12.2.7. Any idea
where this is coming from? We started having issues with BlueStore core
dumping when we moved to 12.2.6 and hoped that 12.2.7 would have fixed it.
We might need to revert to 12.2.5, as it seems a lot more stable.

Pastebin link for full log: https://pastebin.com/na4E3m3N


Core dump :

starting osd.7 at - osd_data /var/lib/ceph/osd/ceph-7 /var/lib/ceph/osd/ceph-7/journal
*** Caught signal (Segmentation fault) **
 in thread 7fa8830cfd80 thread_name:ceph-osd
ceph version 12.2.7 (3ec878d1e53e1aeb47a9f619c49d9e7c0aa384d5) luminous (stable)
 1: (()+0xa48ec1) [0x55e010afcec1]
 2: (()+0xf6d0) [0x7fa8807966d0]
3: (BlueFS::_read(BlueFS::FileReader*, BlueFS::FileReaderBuffer*, unsigned long, unsigned long, ceph::buffer::list*, char*)+0x452) [0x55e010ab1e72]
 4: (BlueFS::_replay(bool)+0x2ef) [0x55e010ac526f]
 5: (BlueFS::mount()+0x1d4) [0x55e010ac8fd4]
 6: (BlueStore::_open_db(bool)+0x1847) [0x55e0109e2da7]
 7: (BlueStore::_mount(bool)+0x40e) [0x55e010a1406e]
 8: (OSD::init()+0x3bd) [0x55e0105c934d]
 9: (main()+0x2d07) [0x55e0104ce947]
 10: (__libc_start_main()+0xf5) [0x7fa87f7a3445]
 11: (()+0x4b9003) [0x55e01056d003]
2018-08-03 21:58:12.248736 7fa8830cfd80 -1 *** Caught signal (Segmentation fault) **
 in thread 7fa8830cfd80 thread_name:ceph-osd

ceph version 12.2.7 (3ec878d1e53e1aeb47a9f619c49d9e7c0aa384d5) luminous (stable)
 1: (()+0xa48ec1) [0x55e010afcec1]
 2: (()+0xf6d0) [0x7fa8807966d0]
3: (BlueFS::_read(BlueFS::FileReader*, BlueFS::FileReaderBuffer*, unsigned long, unsigned long, ceph::buffer::list*, char*)+0x452) [0x55e010ab1e72]
 4: (BlueFS::_replay(bool)+0x2ef) [0x55e010ac526f]
 5: (BlueFS::mount()+0x1d4) [0x55e010ac8fd4]
 6: (BlueStore::_open_db(bool)+0x1847) [0x55e0109e2da7]
 7: (BlueStore::_mount(bool)+0x40e) [0x55e010a1406e]
 8: (OSD::init()+0x3bd) [0x55e0105c934d]
 9: (main()+0x2d07) [0x55e0104ce947]
 10: (__libc_start_main()+0xf5) [0x7fa87f7a3445]
 11: (()+0x4b9003) [0x55e01056d003]
NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.

0> 2018-08-03 21:58:12.248736 7fa8830cfd80 -1 *** Caught signal (Segmentation fault) **
 in thread 7fa8830cfd80 thread_name:ceph-osd

ceph version 12.2.7 (3ec878d1e53e1aeb47a9f619c49d9e7c0aa384d5) luminous (stable)
 1: (()+0xa48ec1) [0x55e010afcec1]
 2: (()+0xf6d0) [0x7fa8807966d0]
3: (BlueFS::_read(BlueFS::FileReader*, BlueFS::FileReaderBuffer*, unsigned long, unsigned long, ceph::buffer::list*, char*)+0x452) [0x55e010ab1e72]
 4: (BlueFS::_replay(bool)+0x2ef) [0x55e010ac526f]
 5: (BlueFS::mount()+0x1d4) [0x55e010ac8fd4]
 6: (BlueStore::_open_db(bool)+0x1847) [0x55e0109e2da7]
 7: (BlueStore::_mount(bool)+0x40e) [0x55e010a1406e]
 8: (OSD::init()+0x3bd) [0x55e0105c934d]
 9: (main()+0x2d07) [0x55e0104ce947]
 10: (__libc_start_main()+0xf5) [0x7fa87f7a3445]
 11: (()+0x4b9003) [0x55e01056d003]
NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.

/osd_entrypoint: line 98: 119388 Segmentation fault (core dumped) /usr/bin/ceph-osd -f --cluster "${CEPH_CLUSTERNAME}" --id "${OSD_ID}" --setuser root --setgroup root




--
Dr. Benoit Hudzia

Mobile (UK): +44 (0) 75 346 78673
Mobile (IE):  +353 (0) 89 219 3675
Email: benoit@xxxxxxxxxxxxxxx



Web <http://www.stratoscale.com/> | Blog
<http://www.stratoscale.com/blog/> | Twitter
<https://twitter.com/Stratoscale> | Google+
<https://plus.google.com/u/1/b/108421603458396133912/108421603458396133912/posts>
 | Linkedin <https://www.linkedin.com/company/stratoscale>







_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


