Re: Another cluster completely hang

This time, at the end of the recovery procedure you described, the result was roughly: most pgs active+clean, 20 pgs incomplete.
After that, when trying to use the cluster, I got "request blocked more than" warnings and no VM could start.
I know that something happened after the disk broke, probably a server reboot. I am investigating.
But even if I find the origin of the problem, it will not help me find a solution now.
So I am spending my time repairing the pool only to save the production data; I will throw away the rest.
Now, after marking all pgs as complete with ceph_objectstore_tool, I see that:

1) Ceph has put out three HDDs (I suppose due to scrub, but that is only my idea; I will check the logs) BAD
2) it is recovering the degraded and misplaced objects GOOD
3) VMs are not usable yet BAD
4) I see some pgs in state down+peering (I hope that is not BAD; see the commands below)
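
For reference, these are the kinds of commands I am using to check the blocked requests and the down+peering pgs (the pg id 2.3f is only a placeholder, not a real one from my cluster):

# overall health, including blocked requests and stuck pgs
ceph health detail
# list pgs stuck in inactive / unclean states
ceph pg dump_stuck inactive
ceph pg dump_stuck unclean
# detailed peering and recovery state of a single pg
ceph pg 2.3f query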

Regarding 1): how can I put those three HDDs back in the cluster? Should I remove them from CRUSH and start again?
Can I tell Ceph that they are not bad?
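
To make the question concrete, this is the procedure I would try, assuming the OSDs were only marked down/out and were never removed from CRUSH (osd.5 is a placeholder id); please correct me if this is wrong:

# see which OSDs are down/out and where they sit in the CRUSH tree
ceph osd tree
# restart the OSD daemon on the host that owns the disk
# (systemd hosts; on sysvinit hosts: service ceph start osd.5)
systemctl start ceph-osd@5
# once the daemon is up again, mark it back "in" so it takes data
ceph osd in 5
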
Mario

On Wed, 29 Jun 2016 at 15:34, Lionel Bouton <lionel+ceph@xxxxxxxxxxx> wrote:
Hi,

On 29/06/2016 12:00, Mario Giammarco wrote:
> Now the problem is that ceph has put out two disks because scrub has
> failed (I think it is not a disk fault but due to mark-complete)

There is something odd going on. I've only seen deep-scrubs failing (i.e.
detecting one inconsistency and marking the pg accordingly), so I'm not sure
what happens in the case of a "simple" scrub failure, but what should not
happen is the whole OSD going down on a scrub or deep-scrub failure, which
you seem to imply did happen.
Do you have logs for these two failures giving a hint at what happened
(probably /var/log/ceph/ceph-osd.<n>.log)? Any kernel log pointing to
hardware failure(s) around the time these events happened?
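
For example, something along these lines might help (the grep patterns and log paths are only suggestions, adjust the OSD ids and paths to your distribution):

# look for scrub errors, failed asserts or suicides in the OSD logs
grep -iE 'scrub|suicide|assert|abort' /var/log/ceph/ceph-osd.*.log
# look for disk or controller errors in the kernel log around the same time
dmesg -T | grep -iE 'i/o error|medium error|ata[0-9]'
grep -iE 'i/o error|medium error' /var/log/kern.log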

Another point: you said that you had one disk "broken". Usually Ceph
handles this case in the following manner:
- the OSD detects the problem and commits suicide (unless it's configured
to ignore IO errors, which is not the default),
- your cluster is then in a degraded state with one OSD down/in,
- after a timeout (several minutes), Ceph decides that the OSD won't
come up again soon and marks the OSD "out" (so one OSD down/out),
- as the OSD is out, CRUSH adapts pg positions based on the remaining
available OSDs and brings all degraded pgs back to a clean state by creating
the missing replicas while moving pgs around. You see a lot of IO and many
pgs in wait_backfill/backfilling states at this point,
- when all is done, the cluster is back to HEALTH_OK (a few commands to
watch this are sketched below).
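
A minimal way to watch that process while it runs (the timeout above is controlled by mon_osd_down_out_interval, 600 seconds by default if I remember correctly):

# follow cluster events and recovery progress in real time
ceph -w
# overall state: degraded/misplaced objects, pg states, OSDs up/in
ceph -s
# which OSDs are down/out and how CRUSH sees them
ceph osd tree
# read the down->out timeout on a monitor host via the admin socket
# (assuming the mon id is the short hostname, which is common but not guaranteed)
ceph daemon mon.$(hostname -s) config get mon_osd_down_out_interval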

When your disk broke and you waited 24 hours, how far along this
process was your cluster?

Best regards,

Lionel