Hi Mario,

Perhaps it's covered under Proxmox support. Do you have support on your
Proxmox install from the guys at Proxmox?

Otherwise you can always buy it from Red Hat:
https://www.redhat.com/en/technologies/storage/ceph

On Thu, Jun 30, 2016 at 7:37 AM, Mario Giammarco <mgiammarco@xxxxxxxxx> wrote:
> Last two questions:
> 1) I have used other systems in the past. In case of split brain or
> serious problems they let me choose which copy is "good" and then worked
> again. Is there a way to tell Ceph that all is ok? This morning I again
> have 19 incomplete pgs after recovery.
> 2) Where can I find paid support? I mean someone who logs in to my
> cluster and tells Ceph that all is active+clean.
>
> Thanks,
> Mario
>
> On Wed, 29 Jun 2016 at 16:08, Mario Giammarco
> <mgiammarco@xxxxxxxxx> wrote:
>>
>> This time, at the end of the recovery procedure you described, it was
>> mostly pgs active+clean, with 20 pgs incomplete.
>> After that, when trying to use the cluster, I got "request blocked more
>> than ..." and no VM can start.
>> I know that something happened after the broken disk, probably a
>> server reboot. I am investigating.
>> But even if I find the origin of the problem, it will not help in
>> finding a solution now.
>> So I am using my time to repair the pool only to save the production
>> data, and I will throw away the rest.
>> Now, after marking all pgs as complete with ceph_objectstore_tool, I
>> see that:
>>
>> 1) Ceph has put out three HDDs (I suppose due to scrub, but that is
>> only my idea; I will check the logs) BAD
>> 2) it is recovering the degraded and misplaced objects GOOD
>> 3) VMs are not usable yet BAD
>> 4) I see some pgs in state down+peering (I hope that is not BAD)
>>
>> Regarding 1), how can I put those three HDDs back in the cluster?
>> Should I remove them from CRUSH and start again?
>> Can I tell Ceph that they are not bad?
>>
>> Mario
>>
>> On Wed, 29 Jun 2016 at 15:34, Lionel Bouton
>> <lionel+ceph@xxxxxxxxxxx> wrote:
>>>
>>> Hi,
>>>
>>> On 29/06/2016 12:00, Mario Giammarco wrote:
>>> > Now the problem is that Ceph has put out two disks because scrub
>>> > has failed (I think it is not a disk fault but due to mark-complete)
>>>
>>> There is something odd going on. I've only seen deep-scrub failing
>>> (i.e. detecting one inconsistency and marking the pg accordingly), so
>>> I'm not sure what happens in the case of a "simple" scrub failure, but
>>> what should not happen is the whole OSD going down on a scrub or
>>> deep-scrub failure, which you seem to imply did happen.
>>> Do you have logs for these two failures giving a hint at what happened
>>> (probably /var/log/ceph/ceph-osd.<n>.log)? Any kernel log pointing to
>>> hardware failure(s) around the time these events happened?
>>>
>>> Another point: you said that you had one disk "broken". Usually Ceph
>>> handles this case in the following manner:
>>> - the OSD detects the problem and commits suicide (unless it's
>>> configured to ignore IO errors, which is not the default),
>>> - your cluster is then in degraded state with one OSD down/in,
>>> - after a timeout (several minutes), Ceph decides that the OSD won't
>>> come up again soon and marks the OSD "out" (so one OSD down/out),
>>> - as the OSD is out, CRUSH adapts pg positions based on the remaining
>>> available OSDs and brings all degraded pgs back to a clean state by
>>> creating missing replicas while moving pgs around.
>>> You will see a lot of IO and many pgs in wait_backfill/backfilling
>>> states at this point,
>>> - when all is done the cluster is back to HEALTH_OK.
>>>
>>> When your disk was broken and you waited 24 hours, how far along this
>>> process was your cluster?
>>>
>>> Best regards,
>>>
>>> Lionel
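
(For reference, a rough sketch of the kind of checks and commands discussed
above, assuming the three HDDs were merely marked "out" by Ceph and the
disks themselves are healthy. The OSD id "3" and the pgid "2.3f" below are
placeholders for illustration, not values from this thread.)

  # See which pgs are incomplete/down and which OSDs they are waiting for
  ceph health detail
  ceph pg dump_stuck inactive

  # Confirm which OSDs are currently down and/or out
  ceph osd tree

  # If the daemon simply died (e.g. after the server reboot) but the disk
  # is fine, restart it and mark the OSD back "in" so CRUSH uses it again
  systemctl start ceph-osd@3     # or your distro's init script for osd.3
  ceph osd in 3

  # Optionally stop Ceph from marking OSDs out while you investigate
  # (remember to run "ceph osd unset noout" afterwards)
  ceph osd set noout

  # Ask a pg that stays down/incomplete why it is stuck
  ceph pg 2.3f query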