Re: SSD OSDs crashing after upgrade to 12.2.7

Hi,
the problem still exists. For me this happens to SSD OSDs only - I recreated all of them running 12.2.8,
and this is what I got even on newly created OSDs after some time and crashes:

ceph-bluestore-tool fsck -l /root/fsck-osd.0.log --log-level=20 --path /var/lib/ceph/osd/ceph-0 --deep on

2018-09-05 10:15:42.784873 7f609a311ec0 -1 bluestore(/var/lib/ceph/osd/ceph-137) fsck error: found stray shared blob data for sbid 0x34dbe4
2018-09-05 10:15:42.818239 7f609a311ec0 -1 bluestore(/var/lib/ceph/osd/ceph-137) fsck error: found stray shared blob data for sbid 0x376ccf
2018-09-05 10:15:42.863419 7f609a311ec0 -1 bluestore(/var/lib/ceph/osd/ceph-137) fsck error: found stray shared blob data for sbid 0x3a4e58
2018-09-05 10:15:42.887404 7f609a311ec0 -1 bluestore(/var/lib/ceph/osd/ceph-137) fsck error: found stray shared blob data for sbid 0x3b7f29
2018-09-05 10:15:42.958417 7f609a311ec0 -1 bluestore(/var/lib/ceph/osd/ceph-137) fsck error: found stray shared blob data for sbid 0x3df760
2018-09-05 10:15:42.961275 7f609a311ec0 -1 bluestore(/var/lib/ceph/osd/ceph-137) fsck error: found stray shared blob data for sbid 0x3e076f
2018-09-05 10:15:43.038658 7f609a311ec0 -1 bluestore(/var/lib/ceph/osd/ceph-137) fsck error: found stray shared blob data for sbid 0x3ff156
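For anyone triaging similar output, a quick way to tally these errors from an fsck log looks like this (a sketch: the sample lines are copied from the output above, and `/tmp/fsck-osd.log` is just an illustrative scratch path):

```shell
# Two sample lines copied from the fsck output above, written to a
# scratch file so the greps below have something to run against:
cat > /tmp/fsck-osd.log <<'EOF'
2018-09-05 10:15:42.784873 7f609a311ec0 -1 bluestore(/var/lib/ceph/osd/ceph-137) fsck error: found stray shared blob data for sbid 0x34dbe4
2018-09-05 10:15:42.818239 7f609a311ec0 -1 bluestore(/var/lib/ceph/osd/ceph-137) fsck error: found stray shared blob data for sbid 0x376ccf
EOF

# Count the stray shared blob errors:
grep -c 'fsck error: found stray shared blob data' /tmp/fsck-osd.log
# -> 2

# List the affected shared blob ids:
grep -o 'sbid 0x[0-9a-f]*' /tmp/fsck-osd.log
# -> sbid 0x34dbe4
#    sbid 0x376ccf
```

Against the real log file this gives a quick count of affected shared blobs per OSD without reading the whole log.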

I don't know whether these errors are the cause of the OSD crashes or a result of them.
Currently I'm trying to capture some verbose logs.

See also Radoslaw's reply below:

>This looks quite similar to #25001 [1]. The corruption *might* be caused by
>the racy SharedBlob::put() [2] that was fixed in 12.2.6. However, more logs
>(debug_bluestore=20, debug_bdev=20) would be useful. Also, you might
>want to use fsck carefully -- please take a look at Igor's (CCed) post [3]
>and Troy's response.
>
>Best regards,
>Radoslaw Zarzynski
>
>[1] http://tracker.ceph.com/issues/25001
>[2] http://tracker.ceph.com/issues/24211
>[3] http://tracker.ceph.com/issues/25001#note-6
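For reference, the debug levels Radoslaw asks for can be set in ceph.conf before restarting the affected OSD; a minimal sketch (the section values below are just the levels named in his reply, nothing else is implied):

```
# Sketch of an /etc/ceph/ceph.conf fragment on the affected host,
# raising BlueStore and block-device verbosity to the requested levels;
# restart the OSD afterwards so the settings take effect:
[osd]
    debug bluestore = 20
    debug bdev = 20
```

Expect very large log files at these levels, so it's best to limit this to one crashing OSD.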

I'll keep you updated
br wolfgang


On 2018-09-06 09:27, Caspar Smit wrote:
Hi,

These reports are kind of worrying, since we have a 12.2.5 cluster waiting to be upgraded too. Did you have any luck with upgrading to 12.2.8, or do you still see the same behavior?
Is there a bug tracker entry for this issue?

Kind regards,
Caspar

On Tue, Sep 4, 2018 at 09:59 Wolfgang Lendl <wolfgang.lendl@xxxxxxxxxxxxxxxx> wrote:
Is downgrading from 12.2.7 to 12.2.5 an option? I'm still suffering
from highly frequent OSD crashes.
My hopes are with 12.2.9 - but hope hasn't always been my best strategy.

br
wolfgang

On 2018-08-30 19:18, Alfredo Deza wrote:
> On Thu, Aug 30, 2018 at 5:24 AM, Wolfgang Lendl
> <wolfgang.lendl@xxxxxxxxxxxxxxxx> wrote:
>> Hi Alfredo,
>>
>>
>> caught some logs:
>> https://pastebin.com/b3URiA7p
> That looks like there is an issue with bluestore. Maybe Radoslaw or
> Adam might know a bit more.
>
>
>> br
>> wolfgang
>>
>> On 2018-08-29 15:51, Alfredo Deza wrote:
>>> On Wed, Aug 29, 2018 at 2:06 AM, Wolfgang Lendl
>>> <wolfgang.lendl@xxxxxxxxxxxxxxxx> wrote:
>>>> Hi,
>>>>
>>>> after upgrading my ceph clusters from 12.2.5 to 12.2.7 I'm experiencing random crashes from SSD OSDs (bluestore) - it seems that HDD OSDs are not affected.
>>>> I destroyed and recreated some of the SSD OSDs, which seemed to help.
>>>>
>>>> this happens on centos 7.5 (different kernels tested)
>>>>
>>>> /var/log/messages:
>>>> Aug 29 10:24:08  ceph-osd: *** Caught signal (Segmentation fault) **
>>>> Aug 29 10:24:08  ceph-osd: in thread 7f8a8e69e700 thread_name:bstore_kv_final
>>>> Aug 29 10:24:08  kernel: traps: bstore_kv_final[187470] general protection ip:7f8a997cf42b sp:7f8a8e69abc0 error:0 in libtcmalloc.so.4.4.5[7f8a997a8000+46000]
>>>> Aug 29 10:24:08  systemd: ceph-osd@2.service: main process exited, code=killed, status=11/SEGV
>>>> Aug 29 10:24:08  systemd: Unit ceph-osd@2.service entered failed state.
>>>> Aug 29 10:24:08  systemd: ceph-osd@2.service failed.
>>>> Aug 29 10:24:28  systemd: ceph-osd@2.service holdoff time over, scheduling restart.
>>>> Aug 29 10:24:28  systemd: Starting Ceph object storage daemon osd.2...
>>>> Aug 29 10:24:28  systemd: Started Ceph object storage daemon osd.2.
>>>> Aug 29 10:24:28  ceph-osd: starting osd.2 at - osd_data /var/lib/ceph/osd/ceph-2 /var/lib/ceph/osd/ceph-2/journal
>>>> Aug 29 10:24:35  ceph-osd: *** Caught signal (Segmentation fault) **
>>>> Aug 29 10:24:35  ceph-osd: in thread 7f5f1e790700 thread_name:tp_osd_tp
>>>> Aug 29 10:24:35  kernel: traps: tp_osd_tp[186933] general protection ip:7f5f43103e63 sp:7f5f1e78a1c8 error:0 in libtcmalloc.so.4.4.5[7f5f430cd000+46000]
>>>> Aug 29 10:24:35  systemd: ceph-osd@0.service: main process exited, code=killed, status=11/SEGV
>>>> Aug 29 10:24:35  systemd: Unit ceph-osd@0.service entered failed state.
>>>> Aug 29 10:24:35  systemd: ceph-osd@0.service failed
>>> These systemd messages aren't usually helpful, try poking around
>>> /var/log/ceph/ for the output on that one OSD.
>>>
>>> If those logs aren't useful either, try bumping up the verbosity (see
>>> http://docs.ceph.com/docs/master/rados/troubleshooting/log-and-debug/#boot-time
>>> )
>>>> Did I hit a known issue?
>>>> Any suggestions are highly appreciated.
>>>>
>>>>
>>>> br
>>>> wolfgang
>>>>
>>>>
>>>>
>>>> _______________________________________________
>>>> ceph-users mailing list
>>>> ceph-users@xxxxxxxxxxxxxx
>>>> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>>>>

--
Wolfgang Lendl
IT Systems & Communications
Medizinische Universität Wien
Spitalgasse 23 / BT 88 /Ebene 00
A-1090 Wien
Tel: +43 1 40160-21231
Fax: +43 1 40160-921200

