On Tue, May 14, 2019 at 5:13 PM Kevin Flöh <kevin.floeh@xxxxxxx> wrote:
>
> OK, so now we at least see a difference in the recovery state:
>
>     "recovery_state": [
>         {
>             "name": "Started/Primary/Peering/Incomplete",
>             "enter_time": "2019-05-14 14:15:15.650517",
>             "comment": "not enough complete instances of this PG"
>         },
>         {
>             "name": "Started/Primary/Peering",
>             "enter_time": "2019-05-14 14:15:15.243756",
>             "past_intervals": [
>                 {
>                     "first": "49767",
>                     "last": "59580",
>                     "all_participants": [
>                         {
>                             "osd": 2,
>                             "shard": 0
>                         },
>                         {
>                             "osd": 4,
>                             "shard": 1
>                         },
>                         {
>                             "osd": 23,
>                             "shard": 2
>                         },
>                         {
>                             "osd": 24,
>                             "shard": 0
>                         },
>                         {
>                             "osd": 72,
>                             "shard": 1
>                         },
>                         {
>                             "osd": 79,
>                             "shard": 3
>                         }
>                     ],
>                     "intervals": [
>                         {
>                             "first": "59562",
>                             "last": "59563",
>                             "acting": "4(1),24(0),79(3)"
>                         },
>                         {
>                             "first": "59564",
>                             "last": "59567",
>                             "acting": "23(2),24(0),79(3)"
>                         },
>                         {
>                             "first": "59570",
>                             "last": "59574",
>                             "acting": "4(1),23(2),79(3)"
>                         },
>                         {
>                             "first": "59577",
>                             "last": "59580",
>                             "acting": "4(1),23(2),24(0)"
>                         }
>                     ]
>                 }
>             ],
>             "probing_osds": [
>                 "2(0)",
>                 "4(1)",
>                 "23(2)",
>                 "24(0)",
>                 "72(1)",
>                 "79(3)"
>             ],
>             "down_osds_we_would_probe": [],
>             "peering_blocked_by": []
>         },
>         {
>             "name": "Started",
>             "enter_time": "2019-05-14 14:15:15.243663"
>         }
>     ],
>
> The peering does not seem to be blocked anymore, but there is still no
> recovery going on. Is there anything else we can try?

What is the state of the HDDs which had OSDs 4 & 23?
You may be able to use ceph-objectstore-tool to export those PG shards
and import them into another operable OSD.
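
As a rough illustration of why the shards on OSDs 4 and 23 matter, the sketch below tallies, for each past interval in the output above, which shards sit on OSDs other than 4 and 23. It assumes 3+1 erasure coding as described in this thread (so any 3 distinct shards are enough to rebuild an object) and treats OSDs 4 and 23 as unavailable. The helper names parse_acting and surviving_shards are made up for this example, and the check is far simpler than what Ceph's peering code actually does.

```python
import re
from typing import List, Set, Tuple

# Assumption from the thread: 3+1 erasure coding, so 3 distinct shards are needed.
K_DATA_SHARDS = 3


def parse_acting(acting: str) -> List[Tuple[int, int]]:
    """Turn '4(1),24(0),79(3)' into [(4, 1), (24, 0), (79, 3)] as (osd, shard) pairs."""
    return [(int(osd), int(shard)) for osd, shard in re.findall(r"(\d+)\((\d+)\)", acting)]


def surviving_shards(acting: str, lost_osds: Set[int]) -> Set[int]:
    """Return the shard ids of an acting set that are held by OSDs not in lost_osds."""
    return {shard for osd, shard in parse_acting(acting) if osd not in lost_osds}


# The "intervals" entries copied from the recovery_state output above.
intervals = [
    {"first": "59562", "last": "59563", "acting": "4(1),24(0),79(3)"},
    {"first": "59564", "last": "59567", "acting": "23(2),24(0),79(3)"},
    {"first": "59570", "last": "59574", "acting": "4(1),23(2),79(3)"},
    {"first": "59577", "last": "59580", "acting": "4(1),23(2),24(0)"},
]

# Assumption: the two OSDs that were marked lost and rebuilt hold no usable data.
lost = {4, 23}

for iv in intervals:
    shards = surviving_shards(iv["acting"], lost)
    enough = len(shards) >= K_DATA_SHARDS
    print(f"{iv['first']}-{iv['last']}: surviving shards {sorted(shards)}, "
          f"enough for 3+1: {enough}")
```

Under these assumptions no interval keeps three shards on the surviving OSDs, which fits with the suggestion to try exporting the PG shards from the old disks.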
-- dan

>
>
> On 14.05.19 11:02 AM, Dan van der Ster wrote:
> > On Tue, May 14, 2019 at 10:59 AM Kevin Flöh <kevin.floeh@xxxxxxx> wrote:
> >>
> >> On 14.05.19 10:08 AM, Dan van der Ster wrote:
> >>
> >> On Tue, May 14, 2019 at 10:02 AM Kevin Flöh <kevin.floeh@xxxxxxx> wrote:
> >>
> >> On 13.05.19 10:51 PM, Lionel Bouton wrote:
> >>
> >> On 13/05/2019 at 16:20, Kevin Flöh wrote:
> >>
> >> Dear ceph experts,
> >>
> >> [...] We have 4 nodes with 24 osds each and use 3+1 erasure coding. [...]
> >> Here is what happened: One osd daemon could not be started and
> >> therefore we decided to mark the osd as lost and set it up from
> >> scratch. Ceph started recovering and then we lost another osd with
> >> the same behavior. We did the same as for the first osd.
> >>
> >> With 3+1 you only allow a single OSD failure per pg at a given time.
> >> You have 4096 pgs and 96 osds; having 2 OSDs fail at the same time on 2
> >> separate servers (assuming standard crush rules) is a death sentence
> >> for the data on some pgs using both of those OSDs (the ones not fully
> >> recovered before the second failure).
> >>
> >> OK, so the 2 OSDs (4,23) failed one shortly after the other, but we think
> >> that the recovery of the first was finished before the second failed.
> >> Nonetheless, both problematic pgs have been on both OSDs. We think that
> >> we still have enough shards left.
> >> For one of the pgs, the recovery state
> >> looks like this:
> >>
> >>     "recovery_state": [
> >>         {
> >>             "name": "Started/Primary/Peering/Incomplete",
> >>             "enter_time": "2019-05-09 16:11:48.625966",
> >>             "comment": "not enough complete instances of this PG"
> >>         },
> >>         {
> >>             "name": "Started/Primary/Peering",
> >>             "enter_time": "2019-05-09 16:11:48.611171",
> >>             "past_intervals": [
> >>                 {
> >>                     "first": "49767",
> >>                     "last": "59313",
> >>                     "all_participants": [
> >>                         {
> >>                             "osd": 2,
> >>                             "shard": 0
> >>                         },
> >>                         {
> >>                             "osd": 4,
> >>                             "shard": 1
> >>                         },
> >>                         {
> >>                             "osd": 23,
> >>                             "shard": 2
> >>                         },
> >>                         {
> >>                             "osd": 24,
> >>                             "shard": 0
> >>                         },
> >>                         {
> >>                             "osd": 72,
> >>                             "shard": 1
> >>                         },
> >>                         {
> >>                             "osd": 79,
> >>                             "shard": 3
> >>                         }
> >>                     ],
> >>                     "intervals": [
> >>                         {
> >>                             "first": "58860",
> >>                             "last": "58861",
> >>                             "acting": "4(1),24(0),79(3)"
> >>                         },
> >>                         {
> >>                             "first": "58875",
> >>                             "last": "58877",
> >>                             "acting": "4(1),23(2),24(0)"
> >>                         },
> >>                         {
> >>                             "first": "59002",
> >>                             "last": "59009",
> >>                             "acting": "4(1),23(2),79(3)"
> >>                         },
> >>                         {
> >>                             "first": "59010",
> >>                             "last": "59012",
> >>                             "acting": "2(0),4(1),23(2),79(3)"
> >>                         },
> >>                         {
> >>                             "first": "59197",
> >>                             "last": "59233",
> >>                             "acting": "23(2),24(0),79(3)"
> >>                         },
> >>                         {
> >>                             "first": "59234",
> >>                             "last": "59313",
> >>                             "acting": "23(2),24(0),72(1),79(3)"
> >>                         }
> >>                     ]
> >>                 }
> >>             ],
> >>             "probing_osds": [
> >>                 "2(0)",
> >>                 "4(1)",
> >>                 "23(2)",
> >>                 "24(0)",
> >>                 "72(1)",
> >>                 "79(3)"
> >>             ],
> >>             "down_osds_we_would_probe": [],
> >>             "peering_blocked_by": [],
> >>             "peering_blocked_by_detail": [
> >>                 {
> >>                     "detail": "peering_blocked_by_history_les_bound"
> >>                 }
> >>             ]
> >>         },
> >>         {
> >>             "name": "Started",
> >>             "enter_time": "2019-05-09 16:11:48.611121"
> >>         }
> >>     ],
> >> Is there a chance to recover this pg from the shards on OSDs 2, 72, 79?
> >> ceph pg repair/deep-scrub/scrub did not work.
> >>
> >> repair/scrub are not related to this problem so they won't help.
> >>
> >> How exactly did you use the osd_find_best_info_ignore_history_les option?
> >>
> >> One correct procedure would be to set it to true in ceph.conf, then
> >> restart each of the probing_osds above.
> >> (Once the PG has peered, you need to unset the option and restart
> >> those osds again).
> >>
> >> We executed ceph --admin-daemon /var/run/ceph/ceph-osd.X.asok config set osd_find_best_info_ignore_history_les true
> >>
> >> And then we restarted the affected OSDs. I guess this is doing the same, right?
> > No, that doesn't work. That just sets it in memory, but then the option
> > is reset to the default when you restart the OSD.
> > You need to set it in ceph.conf on the OSD host.
> >
> > -- dan
> >
> >> We are also worried about the MDS being behind on trimming, or is this
> >> not too problematic?
> >>
> >> Trimming requires IO on PGs, and the mds is almost certainly stuck on
> >> those incomplete PGs.
> >> Solve the incomplete, and then address the MDS later if it doesn't
> >> resolve itself.
> >>
> >>
> >> -- dan
> >>
> >> ok, then we don't have to worry about this for now.
> >>
> >>
> >> Best regards,
> >>
> >> Kevin
> >>
> >>
> >>
> >>
> >> MDS_TRIM 1 MDSs behind on trimming
> >> mdsceph-node02.etp.kit.edu(mds.0): Behind on trimming (46178/128)
> >> max_segments: 128, num_segments: 46178
> >>
> >>
> >> Depending on the data stored (CephFS ?) you probably can recover most
> >> of it but some of it is irremediably lost.
> >>
> >> If you can recover the data from the failed OSDs as it was at the time they
> >> failed, you might be able to recover some of your lost data (with the
> >> help of Ceph devs); if not, there's nothing to do.
> >>
> >> In the latter case I'd add a new server to use at least 3+2 for a fresh
> >> pool instead of 3+1 and begin moving the data to it.
> >>
> >> The 12.2 + 13.2 mix is a potential problem in addition to the one
> >> above but it's a different one.
> >>
> >> Best regards,
> >>
> >> Lionel
> >>
> >> The idea for the future is to set up a new Ceph cluster with 3+2 erasure
> >> coding on 8 servers in total and of course with consistent versions on all nodes.
> >>
> >>
> >> Best regards,
> >>
> >> Kevin
> >>
> >> _______________________________________________
> >> ceph-users mailing list
> >> ceph-users@xxxxxxxxxxxxxx
> >> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com