Hi,

On 29/06/2016 12:00, Mario Giammarco wrote:
> Now the problem is that ceph has put out two disks because scrub has
> failed (I think it is not a disk fault but due to mark-complete)

There is something odd going on. I've only ever seen deep-scrub failures (i.e. an inconsistency is detected and the PG is marked accordingly), so I'm not sure what happens on a "simple" scrub failure, but what should not happen is the whole OSD going down on a scrub or deep-scrub failure, which is what you seem to imply happened.

Do you have logs for these two failures giving a hint at what happened (probably /var/log/ceph/ceph-osd.<n>.log)? Any kernel log pointing to hardware failure(s) around the time these events happened?

Another point: you said that you had one disk "broken". Usually Ceph handles this case in the following manner:
- the OSD detects the problem and commits suicide (unless it's configured to ignore I/O errors, which is not the default),
- your cluster is then in a degraded state with one OSD down/in,
- after a timeout (several minutes), Ceph decides that the OSD won't come back up soon and marks it "out" (so one OSD down/out),
- as the OSD is now out, CRUSH adapts PG placement based on the remaining available OSDs and brings all degraded PGs back to a clean state by creating the missing replicas while moving PGs around. You see a lot of I/O and many PGs in wait_backfill/backfilling states at this point,
- when all of this is done, the cluster is back to HEALTH_OK.

When your disk was broken and you waited 24 hours, how far along this process was your cluster? (See the P.S. below for how I would check.)

Best regards,

Lionel
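P.S. In case it helps, here is a minimal sketch of the commands I would run to answer the questions above. It assumes a standard Ceph install with the usual CLI available on a monitor/admin node; replace <n> with the id of the affected OSD, nothing here is specific to your setup:

  # overall cluster health and a summary of degraded/misplaced data
  ceph -s
  ceph health detail

  # which OSDs are up/down and in/out (tells you whether the broken OSD
  # was ever marked out; the down -> out timeout is controlled by
  # mon_osd_down_out_interval, several minutes by default)
  ceph osd tree

  # PGs that never went back to active+clean (wait_backfill, backfilling,
  # degraded, ...)
  ceph pg dump_stuck unclean

  # scrub / deep-scrub errors in the OSD log mentioned above
  grep -i scrub /var/log/ceph/ceph-osd.<n>.log | tail -n 50

  # kernel-side evidence of a failing disk around the same time
  dmesg | grep -iE 'ata|sd[a-z]|i/o error' | tail -n 50

The first three commands should be enough to tell how far the down -> out -> backfill -> HEALTH_OK sequence described above actually got on your cluster.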