Re: osd is immediately down and uses full CPU.

Hi.

This is the cluster information.


-------------- /var/log/ceph/ceph.osd.1.log ---------------------------
2020-02-01 03:47:20.635504 7f86f4e40700  1 heartbeat_map is_healthy 'OSD::osd_op_tp thread 0x7f86fe35e700' had timed out after 15
2020-02-01 03:47:20.635521 7f86f4f41700  1 heartbeat_map is_healthy 'OSD::osd_op_tp thread 0x7f86fe35e700' had timed out after 15
2020-02-01 03:47:20.747876 7f86f3b2d700  1 heartbeat_map is_healthy 'OSD::osd_op_tp thread 0x7f86fe35e700' had timed out after 15
2020-02-01 03:47:20.747903 7f86f3c2e700  1 heartbeat_map is_healthy 'OSD::osd_op_tp thread 0x7f86fe35e700' had timed out after 15
2020-02-01 03:47:21.152436 7f86e2a2e700  1 heartbeat_map is_healthy 'OSD::osd_op_tp thread 0x7f86fe35e700' had timed out after 15
2020-02-01 03:47:21.152441 7f86e2a2e700  1 heartbeat_map is_healthy 'OSD::osd_op_tp thread 0x7f86fe35e700' had suicide timed out after 150
2020-02-01 03:47:21.157963 7f86e2a2e700 -1 common/HeartbeatMap.cc: In function 'bool ceph::HeartbeatMap::_check(const ceph::heartbeat_handle_d*, const char*, time_t)' thread 7f86e2a2e700 time 2020-02-01 03:47:21.152463
common/HeartbeatMap.cc: 86: FAILED assert(0 == "hit suicide timeout")

 ceph version 10.2.5 (c461ee19ecbc0c5c330aca20f7392c9a00730367)
 1: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x85) [0x7f875259f425]
 2: (ceph::HeartbeatMap::_check(ceph::heartbeat_handle_d const*, char const*, long)+0x2e1) [0x7f87524dede1]
 3: (ceph::HeartbeatMap::is_healthy()+0xde) [0x7f87524df63e]
 4: (OSD::handle_osd_ping(MOSDPing*)+0x93f) [0x7f8751f141df]
 5: (OSD::heartbeat_dispatch(Message*)+0x3cb) [0x7f8751f1540b]
 6: (DispatchQueue::fast_dispatch(Message*)+0x76) [0x7f875265f9d6]
 7: (Pipe::reader()+0x1dff) [0x7f875269c68f]
 8: (Pipe::Reader::entry()+0xd) [0x7f87526a41ad]
 9: (()+0x7e25) [0x7f87502b8e25]
 10: (clone()+0x6d) [0x7f874e94234d]
 NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.

--- begin dump of recent events ---
-10000> 2020-02-01 03:45:29.491110 7f86f9253700  1 -- 10.1.201.201:0/3892596 --> 10.1.202.203:6805/5717 -- osd_ping(ping e126357 stamp 2020-02-01 03:45:29.490356) v2 -- ?+0 0x7f877512a200 con 0x7f8774eb6e80
 -9999> 2020-02-01 03:45:29.491109 7f86e8245700  1 -- 10.1.201.201:0/3892596 <== osd.9 10.1.201.202:6811/3006646 12 ==== osd_ping(ping_reply e126359 stamp 2020-02-01 03:45:29.490356) v2 ==== 47+0+0 (4038678352 0 0) 0x7f8781383400 con 0x7f8774eaf480
 -9998> 2020-02-01 03:45:29.491123 7f86f9253700  1 -- 10.1.201.201:0/3892596 --> 10.1.201.203:6803/5220 -- osd_ping(ping e126357 stamp 2020-02-01 03:45:29.490356) v2 -- ?+0 0x7f877512bc00 con 0x7f8774eae100
 -9997> 2020-02-01 03:45:29.491137 7f86f9253700  1 -- 10.1.201.201:0/3892596 --> 10.1.202.203:6803/5220 -- osd_ping(ping e126357 stamp 2020-02-01 03:45:29.490356) v2 -- ?+0 0x7f877512aa00 con 0x7f8774eae280
 -9996> 2020-02-01 03:45:29.491150 7f86f9253700  1 -- 10.1.201.201:0/3892596 --> 10.1.201.203:6811/15624 -- osd_ping(ping e126357 stamp 2020-02-01 03:45:29.490356) v2 -- ?+0 0x7f877512ac00 con 0x7f8774eae700
 -9995> 2020-02-01 03:45:29.491148 7f86ec1a4700  1 -- 10.1.201.201:0/3892596 <== osd.8 10.1.202.202:6812/2131331 12 ==== osd_ping(ping_reply e126359 stamp 2020-02-01 03:45:29.490356) v2 ==== 47+0+0 (4038678352 0 0) 0x7f8781380200 con 0x7f8774eb5200
 -9994> 2020-02-01 03:45:29.491163 7f86f9253700  1 -- 10.1.201.201:0/3892596 --> 10.1.202.203:6811/15624 -- osd_ping(ping e126357 stamp 2020-02-01 03:45:29.490356) v2 -- ?+0 0x7f877512be00 con 0x7f8774eae880
 -9993> 2020-02-01 03:45:29.491176 7f86f9253700  1 -- 10.1.201.201:0/3892596 --> 10.1.201.203:6801/4089 -- osd_ping(ping e126357 stamp 2020-02-01 03:45:29.490356) v2 -- ?+0 0x7f8775128a00 con 0x7f8774eaf780
 -9992> 2020-02-01 03:45:29.491169 7f86e8447700  1 -- 10.1.201.201:0/3892596 <== osd.9 10.1.202.202:6811/3006646 12 ==== osd_ping(ping_reply e126359 stamp 2020-02-01 03:45:29.490356) v2 ==== 47+0+0 (4038678352 0 0) 0x7f8781380400 con 0x7f8774eaf600
 -9991> 2020-02-01 03:45:29.491167 7f86e975a700  1 -- 10.1.201.201:0/3892596 <== osd.10 10.1.201.202:6801/3005449 12 ==== osd_ping(ping_reply e126359 stamp 2020-02-01 03:45:29.490356) v2 ==== 47+0+0 (4038678352 0 0) 0x7f8781380000 con 0x7f8774eb7c00
 -9990> 2020-02-01 03:45:29.491192 7f86f9253700  1 -- 10.1.201.201:0/3892596 --> 10.1.202.203:6801/4089 -- osd_ping(ping e126357 stamp 2020-02-01 03:45:29.490356) v2 -- ?+0 0x7f8775128c00 con 0x7f8774eaf900
 -9989> 2020-02-01 03:45:29.491206 7f86f9253700  1 -- 10.1.201.201:0/3892596 --> 10.1.201.203:6813/17285 -- osd_ping(ping e126357 stamp 2020-02-01 03:45:29.490356) v2 -- ?+0 0x7f8775451e00 con 0x7f8774ee3180
 -9988> 2020-02-01 03:45:29.491184 7f86e9255700  1 -- 10.1.201.201:0/3892596 <== osd.11 10.1.201.202:6809/2004102 12 ==== osd_ping(ping_reply e126359 stamp 2020-02-01 03:45:29.490356) v2 ==== 47+0+0 (4038678352 0 0) 0x7f8781385800 con 0x7f8774eade00
 -9987> 2020-02-01 03:45:29.491202 7f86e9558700  1 -- 10.1.201.201:0/3892596 <== osd.10 10.1.202.202:6801/3005449 12 ==== osd_ping(ping_reply e126359 stamp 2020-02-01 03:45:29.490356) v2 ==== 47+0+0 (4038678352 0 0) 0x7f8781385400 con 0x7f8774eb7d80
 -9986> 2020-02-01 03:45:29.491233 7f86f9253700  1 -- 10.1.201.201:0/3892596 --> 10.1.202.203:6813/17285 -- osd_ping(ping e126357 stamp 2020-02-01 03:45:29.490356) v2 -- ?+0 0x7f877544fe00 con 0x7f8774ee3300
 -9985> 2020-02-01 03:45:29.491238 7f86e9053700  1 -- 10.1.201.201:0/3892596 <== osd.11 10.1.202.202:6813/2004102 12 ==== osd_ping(ping_reply e126359 stamp 2020-02-01 03:45:29.490356) v2 ==== 47+0+0 (4038678352 0 0) 0x7f8781383800 con 0x7f8774eadf80
 -9984> 2020-02-01 03:45:29.491234 7f86e874a700  1 -- 10.1.201.201:0/3892596 <== osd.12 10.1.201.202:6803/4953 12 ==== osd_ping(ping_reply e126359 stamp 2020-02-01 03:45:29.490356) v2 ==== 47+0+0 (4038678352 0 0) 0x7f8781383a00 con 0x7f8774eae400
 -9983> 2020-02-01 03:45:29.491247 7f86f9253700  1 -- 10.1.201.201:0/3892596 --> 10.1.201.204:6807/13468 -- osd_ping(ping e126357 stamp 2020-02-01 03:45:29.490356) v2 -- ?+0 0x7f8775576000 con 0x7f8774eb5e00
 -9982> 2020-02-01 03:45:29.491258 7f86ecbae700  1 -- 10.1.201.201:0/3892596 <== osd.13 10.1.201.202:6805/2390794 12 ==== osd_ping(ping_reply e126359 stamp 2020-02-01 03:45:29.490356) v2 ==== 47+0+0 (4038678352 0 0) 0x7f87813a6400 con 0x7f8774ee1980
 -9981> 2020-02-01 03:45:29.491266 7f86f9253700  1 -- 10.1.201.201:0/3892596 --> 10.1.202.204:6807/13468 -- osd_ping(ping e126357 stamp 2020-02-01 03:45:29.490356) v2 -- ?+0 0x7f8775576e00 con 0x7f8774eb5f80
 -9980> 2020-02-01 03:45:29.491291 7f86f9253700  1 -- 10.1.201.201:0/3892596 --> 10.1.201.204:6809/410784 -- osd_ping(ping e126357 stamp 2020-02-01 03:45:29.490356) v2 -- ?+0 0x7f8774ffe400 con 0x7f8774eac900

---------------- ceph -s ------------------------------------------------------

    cluster 57d2e172-0306-4f8a-946c-a4ae1e95b26b
     health HEALTH_WARN
            7 pgs backfill_wait
            11 pgs backfilling
            9 pgs degraded
            4 pgs recovery_wait
            21 pgs stuck unclean
            6 pgs undersized
            100 requests are blocked > 32 sec
            recovery 649692/75245727 objects degraded (0.863%)
            recovery 1027694/75245727 objects misplaced (1.366%)
            recovery 1/24998284 unfound (0.000%)
            pool default.rgw.buckets.data has many more objects per pg than average (too few pgs?)
     monmap e1: 3 mons at {cephmon01=10.1.202.199:6789/0,cephmon02=10.1.202.198:6789/0,cephmon03=10.1.202.197:6789/0}
            election epoch 562, quorum 0,1,2 cephmon03,cephmon02,cephmon01
     osdmap e127248: 42 osds: 37 up, 37 in; 20 remapped pgs
            flags sortbitwise,require_jewel_osds
      pgmap v92257695: 1568 pgs, 14 pools, 18271 GB data, 24412 kobjects
            55189 GB used, 45363 GB / 100553 GB avail
            649692/75245727 objects degraded (0.863%)
            1027694/75245727 objects misplaced (1.366%)
            1/24998284 unfound (0.000%)
                1540 active+clean
                   8 active+remapped+backfilling
                   5 active+remapped+wait_backfill
                   3 active+clean+scrubbing
                   3 active+clean+scrubbing+deep
                   3 active+undersized+degraded+remapped+backfilling
                   2 active+undersized+degraded+remapped+wait_backfill
                   2 active+recovery_wait+degraded
                   1 active+recovery_wait+undersized+degraded+remapped
                   1 active+recovery_wait+degraded+remapped
recovery io 239 MB/s, 187 objects/s
  client io 575 kB/s wr, 0 op/s rd, 37 op/s wr

---------------- ceph osd tree ------------------------------------------------------

[root@ceph01 ceph]# ceph osd tree
ID WEIGHT    TYPE NAME       UP/DOWN REWEIGHT PRIMARY-AFFINITY
-1 108.19864 root default
-2  19.09381     host ceph01
 0   2.72769         osd.0        up  1.00000          1.00000
 1   2.72769         osd.1      down        0          1.00000 <-- now down
 2   2.72769         osd.2        up  1.00000          1.00000
 5   2.72769         osd.5        up  1.00000          1.00000
 6   2.72768         osd.6        up  1.00000          1.00000
 3   2.72768         osd.3        up  1.00000          1.00000
 4   2.72769         osd.4        up  1.00000          1.00000
-3  19.09383     host ceph02
 8   2.72769         osd.8        up  1.00000          1.00000
 9   2.72769         osd.9        up  1.00000          1.00000
10   2.72769         osd.10       up  1.00000          1.00000
12   2.72769         osd.12       up  1.00000          1.00000
11   2.72769         osd.11       up  1.00000          1.00000
 7   2.72768         osd.7        up  1.00000          1.00000
13   2.72769         osd.13       up  1.00000          1.00000
-4  16.36626     host ceph03
14   2.72769         osd.14       up  1.00000          1.00000
16   2.72769         osd.16       up  1.00000          1.00000
17   2.72769         osd.17       up  1.00000          1.00000
19   2.72769         osd.19       up  1.00000          1.00000
15   1.81850         osd.15       up  1.00000          1.00000
18   1.81850         osd.18       up  1.00000          1.00000
20   1.81850         osd.20       up  1.00000          1.00000
-5  15.45706     host ceph04
23   2.72769         osd.23       up  1.00000          1.00000
24   2.72769         osd.24       up  1.00000          1.00000
27   2.72769         osd.27     down        0          1.00000 <-- down for more than 3 months
21   1.81850         osd.21       up  1.00000          1.00000
22   1.81850         osd.22       up  1.00000          1.00000
25   1.81850         osd.25       up  1.00000          1.00000
26   1.81850         osd.26       up  1.00000          1.00000
-6  19.09384     host ceph05
28   2.72769         osd.28       up  1.00000          1.00000
29   2.72769         osd.29       up  1.00000          1.00000
30   2.72769         osd.30       up  1.00000          1.00000
31   2.72769         osd.31     down        0          1.00000 <-- down for more than 3 months
32   2.72769         osd.32       up  1.00000          1.00000
34   2.72769         osd.34       up  1.00000          1.00000
33   2.72769         osd.33     down        0          1.00000 <-- down for more than 3 months
-7  19.09384     host ceph06
35   2.72769         osd.35       up  1.00000          1.00000
36   2.72769         osd.36       up  1.00000          1.00000
37   2.72769         osd.37       up  1.00000          1.00000
39   2.72769         osd.39       up  1.00000          1.00000
40   2.72769         osd.40       up  1.00000          1.00000
41   2.72769         osd.41       up  1.00000          1.00000
38   2.72769         osd.38     down        0          1.00000 <-- down for more than 3 months


------------------------------

On 2020/02/02 11:20, 西宮 牧人 wrote:
Servers: 6 (7 OSDs each), 42 OSDs in total
OS: CentOS 7
Ceph: 10.2.5

Hi, everyone

The cluster is used for VM image storage and object storage.
One of the buckets holds more than 20 million objects.

Now I have a problem: the cluster is blocking operations.

Suddenly the cluster blocked operations, and the VMs could not read their disks.
After a few hours, osd.1 was down.

There are no disk failure messages in dmesg,
and smartctl -a /dev/sde reports no errors.

I tried to bring osd.1 back up, but it goes down again soon.
Just after restarting osd.1, the VMs can access their disks again.
But osd.1 always uses 100% CPU, so the cluster marks osd.1 down and the daemon eventually dies on the suicide timeout.
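
For reference, the 15 s and 150 s in the osd.1 log above are the Jewel defaults for
osd_op_thread_timeout and osd_op_thread_suicide_timeout. A minimal sketch of
temporarily raising them on osd.1, only to keep the daemon alive long enough to catch
up on maps or to collect debug output (the values here are arbitrary examples, not a fix):

  # Persist the higher limits in ceph.conf on ceph01, then restart the daemon:
  #   [osd.1]
  #   osd op thread timeout = 60
  #   osd op thread suicide timeout = 600
  systemctl restart ceph-osd@1

  # Or inject them at runtime, if the daemon still answers:
  ceph tell osd.1 injectargs '--osd-op-thread-timeout 60 --osd-op-thread-suicide-timeout 600'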

I found that the osdmap epoch of osd.1 differs from the other OSDs'.
So I think osd.1 is effectively dead.


Questions:
(1) Why does the epoch of osd.1 differ from the other OSDs' epochs?

  I checked oldest_map and newest_map on every OSD with "ceph daemon osd.X status".
  All OSDs report the same epoch numbers except osd.1.
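
  A rough way to gather those numbers, e.g. (assuming jq is installed; run on each
  OSD host, since the admin socket is only reachable locally):

    # Print oldest_map / newest_map for every OSD whose admin socket is on this host.
    for sock in /var/run/ceph/ceph-osd.*.asok; do
        id=${sock##*/ceph-osd.}; id=${id%.asok}
        printf 'osd.%s: ' "$id"
        ceph daemon "$sock" status | jq -c '{oldest_map, newest_map}'
    done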

(2) Why does osd.1 use 100% CPU?

  Even after the cluster marks osd.1 down, the osd.1 process stays busy.
  When I execute "ceph tell osd.1 injectargs --debug-ms 5/1", osd.1 doesn't answer.
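
  In case it helps: when "ceph tell" gets no reply from a busy OSD, the local admin
  socket on ceph01 sometimes still responds. A sketch (command names as in Jewel):

    # Talk to osd.1 directly through its admin socket instead of over the network.
    ceph daemon osd.1 config set debug_ms 5/1
    ceph daemon osd.1 dump_ops_in_flight     # what the op threads are currently working on
    ceph daemon osd.1 dump_historic_ops      # recent slow ops with per-step timestamps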


Thank you.

--
#######################################################
##
## 忍者ツールズ株式会社
## 西宮 牧人
##
## 〒155-0033
## 東京都世田谷区代田6-6-1 TOKYU REIT下北沢スクエア B1F
## Tel: 03-4405-9825
## Tel: 050-1745-6318 (direct)
## http://corp.ninja.co.jp/
## nishimiya@xxxxxxxxxxx
##
#######################################################
_______________________________________________
ceph-users mailing list -- ceph-users@xxxxxxx
To unsubscribe send an email to ceph-users-leave@xxxxxxx



