Hello,

I am writing about an incident that started last weekend. There seems to be something wrong with my e-mail and some of my messages did not reach the list, so I decided to start a new thread here and begin from the beginning. The earlier related thread is here: http://lists.ceph.com/pipermail/ceph-community-ceph.com/2018-September/000292.html

We have a cluster with 28 servers and 168 OSDs. The OSDs are BlueStore on NL-SAS (non-SMR) disks, with WAL+DB on NVMe. My distro is Arch Linux. Last weekend I upgraded from 12.2.4 to 13.2.1, and the cluster did not start because the OSDs were stuck in the booting state. Sage helped me with this (thanks!) by rebuilding the MON store.db from the OSDs via ceph-objectstore-tool. At first everything was perfect. However, two days later I had a most unfortunate accident: 7 of my servers crashed at the same time. When they came back up, the cluster was in HEALTH_ERR state. 2 of those servers were MONs (I have 3 in total).

I have been working for 3 days collecting data and testing, but I could not make any progress. First of all, I double-checked OS health, network health and disk health; they show no problems. My further findings are these:

I have an rbd pool with 33 TB of VM data. As soon as an OSD starts, it generates a lot of I/O on the BlueStore disks (NL-SAS). This makes the OSD nearly unresponsive; you can't even injectargs. The cluster does not settle. I left it alone for 24 hours, but the OSD up count dropped to ~50. The OSDs log lots of slow requests and lots of heartbeat messages, and eventually they are marked down.

Latest cluster status: https://paste.ubuntu.com/p/BhCHmVNZsX/
ceph.conf: https://paste.ubuntu.com/p/FtY9gfpncN/
Sample OSD log: https://paste.ubuntu.com/p/ZsqpcQVRsj/
MON log: https://paste.ubuntu.com/p/9T8QtMYZWT/
I/O utilization on disks: https://paste.ubuntu.com/p/mrCTKYpBZR/

So I think my problem is really weird. Somehow the pool cannot heal itself. The OSDs run at 95% disk I/O utilization and peering is far too slow. The OSD I/O still had not stopped after 72 hours. Because of the high I/O, the OSDs can't get answers from other OSDs and complain to the monitors. The monitors mark them down, but I can see the OSD processes still running. For example, "ceph -s" says 50 OSDs are up, yet I see 153 ceph-osd processes running in the background and trying to reach the other OSDs. So it is very confusing and certainly not progressing.

We're trying every possible strategy. We stopped all the OSDs and then started them one at a time, server by server: start one OSD, wait for its I/O to finish, then move on to the next OSD on the same server. We found that even when the first OSD's I/O had finished, starting the second OSD triggered it again. So when we started the sixth and final OSD on that server, the other five OSDs went to 95% I/O as well. The first OSD's I/O finished in 8 minutes, but the sixth OSD's I/O took 34 minutes! Then we moved on to the next server, and as soon as we started that server's OSDs, the previously finished OSDs started doing I/O again. So we gained nothing.

Now we are planning to set noup, then start all 168 OSDs, and then unset noup (a rough sketch of the commands we intend to run is at the end of this mail). Maybe this will stop the OSDs from repeating that I/O over and over again.

After 72 hours I believe we may have hit a bug. Any help would be greatly appreciated. We're on IRC 24/7.

Thanks to: Be:El, peetaur2, degreaser, Ti and IcePic.
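
P.S. For clarity, here is a minimal sketch of the noup plan mentioned above. It assumes the OSDs are managed by the usual systemd units (ceph-osd@ / ceph-osd.target); adjust the unit names to however your OSDs are actually started:

    # Prevent booting OSDs from being marked "up" while they read maps and peer
    ceph osd set noup

    # On each of the 28 servers, start all local OSDs
    # (assumes the standard ceph-osd.target systemd unit is in use)
    systemctl start ceph-osd.target

    # Watch until all 168 ceph-osd processes are running and disk I/O has calmed down
    ceph -s
    ceph osd stat

    # Then let them all come up together
    ceph osd unset noup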