Servers: 6 (7 OSDs each, 42 OSDs total)
OS: CentOS 7
Ceph: 10.2.5
Hi, everyone
The cluster is used for VM image storage and object storage.
I also have a bucket with more than 20 million objects.
Now I have a problem where the cluster blocks operations.
The cluster suddenly blocked operations, and the VMs could no longer read their disks.
After a few hours, osd.1 went down.
There are no disk failure messages in dmesg,
and no errors in smartctl -a /dev/sde.
I tried to bring osd.1 back up, but it goes down again soon after.
Just after restarting osd.1, the VMs can access their disks again.
But osd.1 always runs at 100% CPU, then the cluster marks osd.1 down and the OSD dies from the suicide timeout.
I found that the osdmap epoch of osd.1 differs from the other OSDs'.
So I suspect that is why osd.1 died.
Questions:
(1) Why does the epoch of osd.1 differ from the other OSDs' epochs?
I checked every OSD's oldest_map and newest_map with "ceph daemon osd.X status".
All OSDs report the same epoch numbers except osd.1; the check is sketched below.
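The check was roughly like the following (a sketch; it assumes the default admin socket path /var/run/ceph/ceph-osd.<id>.asok on each OSD host):

    # Run on each OSD host; prints oldest_map/newest_map for every local OSD.
    for sock in /var/run/ceph/ceph-osd.*.asok; do
        id=$(basename "$sock" .asok | cut -d. -f2)
        echo "osd.$id:"
        ceph daemon "osd.$id" status | grep -E 'oldest_map|newest_map'
    done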
(2) Why does osd.1 use 100% CPU?
Even after the cluster marks osd.1 down, the osd.1 process stays busy.
When I execute "ceph tell osd.1 injectargs --debug-ms 5/1", osd.1 does not respond.
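Would raising the debug level through the local admin socket instead be a reasonable way to see what osd.1 is doing? Something like this (a sketch; it assumes the default admin socket path on the host running osd.1):

    # On the host running osd.1, via the local admin socket.
    ceph daemon osd.1 config set debug_ms 5
    ceph daemon osd.1 config set debug_osd 10
    # Inspect operations currently in flight and recent slow ops:
    ceph daemon osd.1 dump_ops_in_flight
    ceph daemon osd.1 dump_historic_ops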
Thank you.
--
Makito