Servers: 6 (7 OSDs each, 42 OSDs total)
OS: CentOS 7
Ceph: 10.2.5
Hi, everyone
The cluster is used for VM image storage and object storage.
I also have a bucket with more than 20 million objects.
Now I have a problem where the cluster blocks operations.
The cluster suddenly blocked operations, and the VMs could no longer read their disks.
After a few hours, osd.1 went down.
There are no disk failure messages in dmesg,
and no errors in smartctl -a /dev/sde.
I tried to bring osd.1 back up, but it goes down again soon after.
Just after restarting osd.1, the VMs can access their disks again.
But osd.1 always runs at 100% CPU, then the cluster marks osd.1 down and the OSD dies from the suicide timeout.
I found that the osdmap epoch of osd.1 differs from the other OSDs'.
So I suspect that is why osd.1 died.
Questions:
(1) Why does the epoch of osd.1 differ from the other OSDs' epochs?
I checked every OSD's oldest_map and newest_map with "ceph daemon osd.X status".
All OSDs report the same epoch numbers except osd.1; the check is sketched below.
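The check was roughly like the following (a sketch; it assumes the default admin socket path /var/run/ceph/ceph-osd.<id>.asok on each OSD host):

    # Run on each OSD host; prints oldest_map/newest_map for every local OSD.
    for sock in /var/run/ceph/ceph-osd.*.asok; do
        id=$(basename "$sock" .asok | cut -d. -f2)
        echo "osd.$id:"
        ceph daemon "osd.$id" status | grep -E 'oldest_map|newest_map'
    done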
(2) Why does osd.1 use 100% CPU?
Even after the cluster marks osd.1 down, the osd.1 process stays busy.
When I execute "ceph tell osd.1 injectargs --debug-ms 5/1", osd.1 does not respond.
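Would raising the debug level through the local admin socket instead be a reasonable way to see what osd.1 is doing? Something like this (a sketch; it assumes the default admin socket path on the host running osd.1):

    # On the host running osd.1, via the local admin socket.
    ceph daemon osd.1 config set debug_ms 5
    ceph daemon osd.1 config set debug_osd 10
    # Inspect operations currently in flight and recent slow ops:
    ceph daemon osd.1 dump_ops_in_flight
    ceph daemon osd.1 dump_historic_ops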
Thank you.
--
Makito