Re: SSD OSDs crashing after upgrade to 12.2.7

Hello,

got new logs - if this snip is not sufficient, I can provide the full log.

https://pastebin.com/dKBzL9AW
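
for reference, verbosity on the affected OSDs was bumped with something
like this (osd.2 is just an example id - adjust to the crashing OSD):

    ceph tell osd.2 injectargs '--debug_bluestore 20 --debug_bdev 20'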

br+thx wolfgang


On 2018-09-05 01:55, Radoslaw Zarzynski wrote:
> In the log, the following trace can be found:
>
>      0> 2018-08-30 13:11:01.014708 7ff2dd344700 -1 *** Caught signal
> (Segmentation fault) **
>  in thread 7ff2dd344700 thread_name:osd_srv_agent
>
>  ceph version 12.2.7 (3ec878d1e53e1aeb47a9f619c49d9e7c0aa384d5)
> luminous (stable)
>  1: (()+0xa48ec1) [0x5652900ffec1]
>  2: (()+0xf6d0) [0x7ff2f7c206d0]
>  3: (BlueStore::_wctx_finish(BlueStore::TransContext*,
> boost::intrusive_ptr<BlueStore::Collection>&,
> boost::intrusive_ptr<BlueStore::Onode>, BlueStore::WriteContext*,
> std::set<BlueStore::SharedBlob*, std::less<BlueStore::SharedBlob*>,
> std::allocator<BlueStore::SharedBlob*> >*)+0xb4) [0x56528ffe3954]
>  4: (BlueStore::_do_truncate(BlueStore::TransContext*,
> boost::intrusive_ptr<BlueStore::Collection>&,
> boost::intrusive_ptr<BlueStore::Onode>, unsigned long,
> std::set<BlueStore::SharedBlob*, std::less<BlueStore::SharedBlob*>,
> std::allocator<BlueStore::SharedBlob*> >*)+0x2c2) [0x56528fffd642]
>  5: (BlueStore::_do_remove(BlueStore::TransContext*,
> boost::intrusive_ptr<BlueStore::Collection>&,
> boost::intrusive_ptr<BlueStore::Onode>)+0xc6) [0x56528fffdf86]
>  6: (BlueStore::_remove(BlueStore::TransContext*,
> boost::intrusive_ptr<BlueStore::Collection>&,
> boost::intrusive_ptr<BlueStore::Onode>&)+0x94) [0x56528ffff9f4]
>  7: (BlueStore::_txc_add_transaction(BlueStore::TransContext*,
> ObjectStore::Transaction*)+0x15af) [0x56529001280f]
>  8: ...
>
> This looks quite similar to #25001 [1]. The corruption *might* be caused by
> the racy SharedBlob::put() [2] that was fixed in 12.2.6. However, more logs
> (debug_bluestore=20, debug_bdev=20) would be useful. You might also
> want to use fsck carefully -- please take a look at Igor's (CCed) post [3]
> and Troy's response.
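>
> A minimal fsck invocation would look something like this (with the OSD
> stopped first; the path below is just an example for osd.2):
>
>     systemctl stop ceph-osd@2
>     ceph-bluestore-tool fsck --path /var/lib/ceph/osd/ceph-2
>
> There is also a --deep option for a more thorough (and slower) check.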
>
> Best regards,
> Radoslaw Zarzynski
>
> [1] http://tracker.ceph.com/issues/25001
> [2] http://tracker.ceph.com/issues/24211
> [3] http://tracker.ceph.com/issues/25001#note-6
>
> On Tue, Sep 4, 2018 at 12:54 PM, Alfredo Deza <adeza@xxxxxxxxxx> wrote:
>> On Tue, Sep 4, 2018 at 3:59 AM, Wolfgang Lendl
>> <wolfgang.lendl@xxxxxxxxxxxxxxxx> wrote:
>>> is downgrading from 12.2.7 to 12.2.5 an option? I'm still suffering
>>> from frequent osd crashes.
>>> my hopes are with 12.2.8 - but hope wasn't always my best strategy
>> 12.2.8 just went out. I think that Adam or Radoslaw might have some
>> time to check those logs now
>>
>>> br
>>> wolfgang
>>>
>>> On 2018-08-30 19:18, Alfredo Deza wrote:
>>>> On Thu, Aug 30, 2018 at 5:24 AM, Wolfgang Lendl
>>>> <wolfgang.lendl@xxxxxxxxxxxxxxxx> wrote:
>>>>> Hi Alfredo,
>>>>>
>>>>>
>>>>> caught some logs:
>>>>> https://pastebin.com/b3URiA7p
>>>> That looks like an issue with bluestore. Radoslaw or Adam might
>>>> know a bit more.
>>>>
>>>>
>>>>> br
>>>>> wolfgang
>>>>>
>>>>> On 2018-08-29 15:51, Alfredo Deza wrote:
>>>>>> On Wed, Aug 29, 2018 at 2:06 AM, Wolfgang Lendl
>>>>>> <wolfgang.lendl@xxxxxxxxxxxxxxxx> wrote:
>>>>>>> Hi,
>>>>>>>
>>>>>>> after upgrading my ceph clusters from 12.2.5 to 12.2.7, I'm experiencing random crashes of SSD OSDs (bluestore) - HDD OSDs don't seem to be affected.
>>>>>>> I destroyed and recreated some of the SSD OSDs, which seemed to help.
>>>>>>>
>>>>>>> this happens on CentOS 7.5 (different kernels tested)
>>>>>>>
>>>>>>> /var/log/messages:
>>>>>>> Aug 29 10:24:08  ceph-osd: *** Caught signal (Segmentation fault) **
>>>>>>> Aug 29 10:24:08  ceph-osd: in thread 7f8a8e69e700 thread_name:bstore_kv_final
>>>>>>> Aug 29 10:24:08  kernel: traps: bstore_kv_final[187470] general protection ip:7f8a997cf42b sp:7f8a8e69abc0 error:0 in libtcmalloc.so.4.4.5[7f8a997a8000+46000]
>>>>>>> Aug 29 10:24:08  systemd: ceph-osd@2.service: main process exited, code=killed, status=11/SEGV
>>>>>>> Aug 29 10:24:08  systemd: Unit ceph-osd@2.service entered failed state.
>>>>>>> Aug 29 10:24:08  systemd: ceph-osd@2.service failed.
>>>>>>> Aug 29 10:24:28  systemd: ceph-osd@2.service holdoff time over, scheduling restart.
>>>>>>> Aug 29 10:24:28  systemd: Starting Ceph object storage daemon osd.2...
>>>>>>> Aug 29 10:24:28  systemd: Started Ceph object storage daemon osd.2.
>>>>>>> Aug 29 10:24:28  ceph-osd: starting osd.2 at - osd_data /var/lib/ceph/osd/ceph-2 /var/lib/ceph/osd/ceph-2/journal
>>>>>>> Aug 29 10:24:35  ceph-osd: *** Caught signal (Segmentation fault) **
>>>>>>> Aug 29 10:24:35  ceph-osd: in thread 7f5f1e790700 thread_name:tp_osd_tp
>>>>>>> Aug 29 10:24:35  kernel: traps: tp_osd_tp[186933] general protection ip:7f5f43103e63 sp:7f5f1e78a1c8 error:0 in libtcmalloc.so.4.4.5[7f5f430cd000+46000]
>>>>>>> Aug 29 10:24:35  systemd: ceph-osd@0.service: main process exited, code=killed, status=11/SEGV
>>>>>>> Aug 29 10:24:35  systemd: Unit ceph-osd@0.service entered failed state.
>>>>>>> Aug 29 10:24:35  systemd: ceph-osd@0.service failed
>>>>>> These systemd messages aren't usually helpful; try poking around
>>>>>> /var/log/ceph/ for the output of that one OSD.
>>>>>>
>>>>>> If those logs aren't useful either, try bumping up the verbosity (see
>>>>>> http://docs.ceph.com/docs/master/rados/troubleshooting/log-and-debug/#boot-time
>>>>>> )
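>>>>>>
>>>>>> For boot-time logging that would be something like this in
>>>>>> /etc/ceph/ceph.conf on the OSD host (example values):
>>>>>>
>>>>>>   [osd]
>>>>>>   debug osd = 20
>>>>>>   debug bluestore = 20
>>>>>>   debug bdev = 20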
>>>>>>> did I hit a known issue?
>>>>>>> any suggestions are highly appreciated
>>>>>>>
>>>>>>>
>>>>>>> br
>>>>>>> wolfgang
>>>>>>>
>>>>>>>
>>>>>>>
>>>>> --
>>>>> Wolfgang Lendl
>>>>> IT Systems & Communications
>>>>> Medizinische Universität Wien
>>>>> Spitalgasse 23 / BT 88 /Ebene 00
>>>>> A-1090 Wien
>>>>> Tel: +43 1 40160-21231
>>>>> Fax: +43 1 40160-921200
>>>>>
>>>>>
>>> --
>>> Wolfgang Lendl
>>> IT Systems & Communications
>>> Medizinische Universität Wien
>>> Spitalgasse 23 / BT 88 /Ebene 00
>>> A-1090 Wien
>>> Tel: +43 1 40160-21231
>>> Fax: +43 1 40160-921200
>>>
>>>

_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com



