Hello Henrik,
Thanks for the update. I applied the two changes below and will monitor the cluster.
~~~
bluestore_deferred_throttle_bytes = 0
bluestore_throttle_bytes = 0
~~~
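For anyone following along, a sketch of how these two options could be applied; this assumes the standard /etc/ceph/ceph.conf path and that the cluster keyring allows `ceph tell` (both are assumptions, not confirmed by the thread):

```shell
# Persist the settings across restarts (assumed standard config path):
cat >> /etc/ceph/ceph.conf <<'EOF'
[osd]
bluestore_deferred_throttle_bytes = 0
bluestore_throttle_bytes = 0
EOF

# Or inject them into the running OSDs without a restart:
ceph tell osd.* injectargs \
    '--bluestore_deferred_throttle_bytes 0 --bluestore_throttle_bytes 0'
```

Setting both throttles to 0 disables BlueStore's transaction throttling, which is the workaround discussed in the tracker issue mentioned below.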
Attached is the gdb trace taken after installing the debug RPMs.
Thanks
On Wed, Sep 20, 2017 at 12:18 PM, Henrik Korkuc <lists@xxxxxxxxx> wrote:
On 17-09-20 08:06, nokia ceph wrote:
Hello,
Env: RHEL 7.2, kernel 3.10.0-327.el7.x86_64, EC 4+1, bluestore.
We are writing to ceph via the librados C API. Testing with rados shows no issues, and the same setup tested fine with Jewel/Kraken. Could you advise on how to debug this issue?
Maybe similar to http://tracker.ceph.com/issues/21180 ? It seems it was resolved for me with the mentioned fix. You could apply the mentioned config options and see if it helps (and build a newer version, if able). I think you may be missing debug packages, as there are no thread names.
OSD.log:
~~~
2017-09-18 14:51:59.895746 7f1e744e0700 0 log_channel(cluster) log [WRN] : slow request 60.068824 seconds old, received at 2017-09-18 14:50:59.826849: MOSDECSubOpWriteReply(1.132s0 1350/1344 ECSubWriteReply(tid=971, last_complete=1350'153, committed=1, applied=0)) currently queued_for_pg
2017-09-18 14:51:59.895749 7f1e744e0700 0 log_channel(cluster) log [WRN] : slow request 60.068737 seconds old, received at 2017-09-18 14:50:59.826936: MOSDECSubOpWriteReply(1.132s0 1350/1344 ECSubWriteReply(tid=971, last_complete=0'0, committed=0, applied=1)) currently queued_for_pg
2017-09-18 14:51:59.895754 7f1e744e0700 0 log_channel(cluster) log [WRN] : slow request 60.067539 seconds old, received at 2017-09-18 14:50:59.828134: MOSDECSubOpWriteReply(1.132s0 1350/1344 ECSubWriteReply(tid=971, last_complete=1350'153, committed=1, applied=0)) currently queued_for_pg
2017-09-18 14:51:59.923825 7f1e71cdb700 10 trim shard target 102 M meta/data ratios 0.5 + 0 (52428 k + 0 ), current 1359 k (1083 k + 276 k)
2017-09-18 14:51:59.923835 7f1e71cdb700 10 trim shard target 102 M meta/data ratios 0.5 + 0 (52428 k + 0 ), current 1066 k (1066 k + 0 )
2017-09-18 14:51:59.923837 7f1e71cdb700 10 trim shard target 102 M meta/data ratios 0.5 + 0 (52428 k + 0 ), current 643 k (643 k + 0 )
2017-09-18 14:51:59.923840 7f1e71cdb700 10 trim shard target 102 M meta/data ratios 0.5 + 0 (52428 k + 0 ), current 1049 k (1049 k + 0 )
2017-09-18 14:51:59.923842 7f1e71cdb700 10 trim shard target 102 M meta/data ratios 0.5 + 0 (52428 k + 0 ), current 896 k (896 k + 0 )
2017-09-18 14:51:59.940780 7f1e77ca5700 20 osd.181 1350 share_map_peer 0x7f1e8dbf2800 already has epoch 1350
2017-09-18 14:51:59.940855 7f1e78ca7700 20 osd.181 1350 share_map_peer 0x7f1e8dbf2800 already has epoch 1350
2017-09-18 14:52:00.081390 7f1e6f572700 20 osd.181 1350 OSD::ms_dispatch: ping magic: 0 v1
2017-09-18 14:52:00.081393 7f1e6f572700 10 osd.181 1350 do_waiters -- start
2017-09-18 14:52:00.081394 7f1e6f572700 10 osd.181 1350 do_waiters -- finish
2017-09-18 14:52:00.081395 7f1e6f572700 20 osd.181 1350 _dispatch 0x7f1e90923a40 ping magic: 0 v1
2017-09-18 14:52:00.081397 7f1e6f572700 10 osd.181 1350 ping from client.414556
2017-09-18 14:52:00.123908 7f1e71cdb700 10 trim shard target 102 M meta/data ratios 0.5 + 0 (52428 k + 0 ), current 1359 k (1083 k + 276 k)
2017-09-18 14:52:00.123926 7f1e71cdb700 10 trim shard target 102 M meta/data ratios 0.5 + 0 (52428 k + 0 ), current 1066 k (1066 k + 0 )
2017-09-18 14:52:00.123932 7f1e71cdb700 10 trim shard target 102 M meta/data ratios 0.5 + 0 (52428 k + 0 ), current 643 k (643 k + 0 )
2017-09-18 14:52:00.123937 7f1e71cdb700 10 trim shard target 102 M meta/data ratios 0.5 + 0 (52428 k + 0 ), current 1049 k (1049 k + 0 )
2017-09-18 14:52:00.123942 7f1e71cdb700 10 trim shard target 102 M meta/data ratios 0.5 + 0 (52428 k + 0 ), current 896 k (896 k + 0 )
2017-09-18 14:52:00.145445 7f1e784a6700 1 heartbeat_map is_healthy 'OSD::osd_op_tp thread 0x7f1e61cbb700' had timed out after 60
2017-09-18 14:52:00.145450 7f1e784a6700 1 heartbeat_map is_healthy 'OSD::osd_op_tp thread 0x7f1e624bc700' had timed out after 60
2017-09-18 14:52:00.145496 7f1e784a6700 1 heartbeat_map is_healthy 'OSD::osd_op_tp thread 0x7f1e63cbf700' had timed out after 60
2017-09-18 14:52:00.145534 7f1e784a6700 10 osd.181 1350 internal heartbeat not healthy, dropping ping request
2017-09-18 14:52:00.146224 7f1e78ca7700 1 heartbeat_map is_healthy 'OSD::osd_op_tp thread 0x7f1e61cbb700' had timed out after 60
2017-09-18 14:52:00.146226 7f1e78ca7700 1 heartbeat_map is_healthy 'OSD::osd_op_tp thread 0x7f1e624bc700' had timed out after 60
~~~
Output of `thread apply all bt`:
~~~
Thread 54 (LWP 479360):
#0  0x00007f1e7b5606d5 in ?? ()
#1  0x0000000000000000 in ?? ()

Thread 53 (LWP 484888):
#0  0x00007f1e7a644b7d in ?? ()
#1  0x0000000000000000 in ?? ()

Thread 52 (LWP 484177):
#0  0x00007f1e7b5606d5 in ?? ()
#1  0x0000000a00000000 in ?? ()
#2  0x00007f1e88d8df98 in ?? ()
#3  0x00007f1e88d8df48 in ?? ()
#4  0x000000000000000a in ?? ()
#5  0x00007f1e5ccaf7f8 in ?? ()
#6  0x00007f1e7e45b9ee in ?? ()
#7  0x00007f1e88d8d860 in ?? ()
#8  0x00007f1e8e6e5500 in ?? ()
#9  0x00007f1e889881c0 in ?? ()
#10 0x00007f1e7e3e9ea0 in ?? ()
#11 0x0000000000000000 in ?? ()

Thread 51 (LWP 484176):
#0  0x00007f1e7b5606d5 in ?? ()
#1  0x0000000000000000 in ?? ()

Thread 50 (LWP 484175):
#0  0x00007f1e7b5606d5 in ?? ()
#1  0x0000000000000000 in ?? ()

Thread 49 (LWP 484174):
#0  0x00007f1e7b5606d5 in ?? ()
#1  0x0000000000000000 in ?? ()
~~~
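Since the frames above still resolve to `?? ()`, the debuginfo packages likely were not picked up by gdb. A capture procedure along these lines could be retried on the OSD host (a sketch; `debuginfo-install` comes from yum-utils on RHEL, and the exact package set is an assumption):

```shell
# Install matching debug symbols for the running ceph build (RHEL/yum-utils):
debuginfo-install -y ceph

# Attach to the first ceph-osd PID, dump all thread stacks, then detach:
gdb -p "$(pidof ceph-osd | awk '{print $1}')" \
    -batch -ex 'thread apply all bt' > gdb.txt

# Compress for the mailing list attachment:
gzip gdb.txt
```

If the versions of the debuginfo RPMs and the running binary differ, gdb will still print raw addresses, so it is worth checking that they match before re-capturing.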
Thanks
_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
Attachment: gdb.txt.gz (GNU Zip compressed data)