On 14.05.19 5:24 PM, Dan van der Ster wrote:
On Tue, May 14, 2019 at 5:13 PM Kevin Flöh <kevin.floeh@xxxxxxx> wrote:

ok, so now we see at least a difference in the recovery state:

    "recovery_state": [
        {
            "name": "Started/Primary/Peering/Incomplete",
            "enter_time": "2019-05-14 14:15:15.650517",
            "comment": "not enough complete instances of this PG"
        },
        {
            "name": "Started/Primary/Peering",
            "enter_time": "2019-05-14 14:15:15.243756",
            "past_intervals": [
                {
                    "first": "49767",
                    "last": "59580",
                    "all_participants": [
                        { "osd": 2, "shard": 0 },
                        { "osd": 4, "shard": 1 },
                        { "osd": 23, "shard": 2 },
                        { "osd": 24, "shard": 0 },
                        { "osd": 72, "shard": 1 },
                        { "osd": 79, "shard": 3 }
                    ],
                    "intervals": [
                        { "first": "59562", "last": "59563", "acting": "4(1),24(0),79(3)" },
                        { "first": "59564", "last": "59567", "acting": "23(2),24(0),79(3)" },
                        { "first": "59570", "last": "59574", "acting": "4(1),23(2),79(3)" },
                        { "first": "59577", "last": "59580", "acting": "4(1),23(2),24(0)" }
                    ]
                }
            ],
            "probing_osds": [ "2(0)", "4(1)", "23(2)", "24(0)", "72(1)", "79(3)" ],
            "down_osds_we_would_probe": [],
            "peering_blocked_by": []
        },
        {
            "name": "Started",
            "enter_time": "2019-05-14 14:15:15.243663"
        }
    ],

the peering does not seem to be blocked anymore. But still there is no recovery going on. Is there anything else we can try?

What is the state of the HDDs which had osds 4 & 23? You may be able to use ceph-objectstore-tool to export those PG shards and import to another operable OSD.

-- dan

On 14.05.19 11:02 AM, Dan van der Ster wrote:
On Tue, May 14, 2019 at 10:59 AM Kevin Flöh <kevin.floeh@xxxxxxx> wrote:
On 14.05.19 10:08 AM, Dan van der Ster wrote:
On Tue, May 14, 2019 at 10:02 AM Kevin Flöh <kevin.floeh@xxxxxxx> wrote:
On 13.05.19 10:51 PM, Lionel Bouton wrote:
On 13/05/2019 at 16:20, Kevin Flöh wrote:

Dear ceph experts,

[...] We have 4 nodes with 24 osds each and use 3+1 erasure coding. [...]

Here is what happened: One osd daemon could not be started and therefore we decided to mark the osd as lost and set it up from scratch.
Ceph started recovering and then we lost another osd with the same behavior. We did the same as for the first osd.

With 3+1 you only allow a single OSD failure per pg at a given time. You have 4096 pgs and 96 osds; having 2 OSDs fail at the same time on 2 separate servers (assuming standard crush rules) is a death sentence for the data on some pgs using both of those OSDs (the ones not fully recovered before the second failure).

OK, so the 2 OSDs (4,23) failed shortly one after the other but we think that the recovery of the first was finished before the second failed. Nonetheless, both problematic pgs have been on both OSDs. We think that we still have enough shards left. For one of the pgs, the recovery state looks like this:

    "recovery_state": [
        {
            "name": "Started/Primary/Peering/Incomplete",
            "enter_time": "2019-05-09 16:11:48.625966",
            "comment": "not enough complete instances of this PG"
        },
        {
            "name": "Started/Primary/Peering",
            "enter_time": "2019-05-09 16:11:48.611171",
            "past_intervals": [
                {
                    "first": "49767",
                    "last": "59313",
                    "all_participants": [
                        { "osd": 2, "shard": 0 },
                        { "osd": 4, "shard": 1 },
                        { "osd": 23, "shard": 2 },
                        { "osd": 24, "shard": 0 },
                        { "osd": 72, "shard": 1 },
                        { "osd": 79, "shard": 3 }
                    ],
                    "intervals": [
                        { "first": "58860", "last": "58861", "acting": "4(1),24(0),79(3)" },
                        { "first": "58875", "last": "58877", "acting": "4(1),23(2),24(0)" },
                        { "first": "59002", "last": "59009", "acting": "4(1),23(2),79(3)" },
                        { "first": "59010", "last": "59012", "acting": "2(0),4(1),23(2),79(3)" },
                        { "first": "59197", "last": "59233", "acting": "23(2),24(0),79(3)" },
                        { "first": "59234", "last": "59313", "acting": "23(2),24(0),72(1),79(3)" }
                    ]
                }
            ],
            "probing_osds": [ "2(0)", "4(1)", "23(2)", "24(0)", "72(1)", "79(3)" ],
            "down_osds_we_would_probe": [],
            "peering_blocked_by": [],
            "peering_blocked_by_detail": [
                { "detail": "peering_blocked_by_history_les_bound" }
            ]
        },
        {
            "name": "Started",
            "enter_time": "2019-05-09 16:11:48.611121"
        }
    ],

Is there a chance to recover this pg from the shards on OSDs 2, 72, 79? ceph pg repair/deep-scrub/scrub did not work.

repair/scrub are not related to this problem so they won't help. How exactly did you use the osd_find_best_info_ignore_history_les option?

One correct procedure would be to set it to true in ceph.conf, then restart each of the probing_osds above. (Once the PG has peered, you need to unset the option and restart those osds again.)

We executed

    ceph --admin-daemon /var/run/ceph/ceph-osd.X.asok config set osd_find_best_info_ignore_history_les true

And then we restarted the affected OSDs. I guess this is doing the same, right?

No, that doesn't work. That just sets it in memory, but then the option is reset to the default when you restart the OSD. You need to set it in ceph.conf on the OSD host.

-- dan

We are also worried about the MDS being behind on trimming. Or is this not too problematic?

Trimming requires IO on PGs, and the mds is almost certainly stuck on those incomplete PGs. Solve the incomplete, and then address the MDS later if it doesn't resolve itself.

-- dan

ok, then we don't have to worry about this for now.

Best regards,
Kevin

MDS_TRIM 1 MDSs behind on trimming
    mdsceph-node02.etp.kit.edu(mds.0): Behind on trimming (46178/128) max_segments: 128, num_segments: 46178

Depending on the data stored (CephFS ?) you probably can recover most of it but some of it is irremediably lost.

If you can recover the data from the failed OSDs at the time they failed you might be able to recover some of your lost data (with the help of Ceph devs); if not, there's nothing to do.

In the latter case I'd add a new server to use at least 3+2 for a fresh pool instead of 3+1 and begin moving the data to it.

The 12.2 + 13.2 mix is a potential problem in addition to the one above but it's a different one.

Best regards,
Lionel

The idea for the future is to set up a new ceph with 3+2 with 8 servers in total and of course with consistent versions on all nodes.
Best regards,
Kevin

_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
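
The question raised in the thread, whether OSDs 2, 72 and 79 still cover enough shards of the 3+1 PG, can be checked against the "all_participants" list with a few lines of Python. This is only a rough sketch: the names `ALL_PARTICIPANTS`, `LOST_OSDS`, `surviving_shards` and `enough_shard_ids` are made up for illustration, the OSD/shard pairs are copied from the recovery_state dumps quoted in the thread, and the check counts distinct shard ids only. It does not say whether the copy on a surviving OSD is complete or up to date, which is exactly what the "not enough complete instances of this PG" comment is about.

```python
# OSD/shard pairs copied from the "all_participants" list in the thread.
ALL_PARTICIPANTS = [
    {"osd": 2, "shard": 0},
    {"osd": 4, "shard": 1},
    {"osd": 23, "shard": 2},
    {"osd": 24, "shard": 0},
    {"osd": 72, "shard": 1},
    {"osd": 79, "shard": 3},
]

# The two OSDs that were marked lost and set up from scratch.
LOST_OSDS = {4, 23}

# "3+1 erasure coding": 3 of the 4 shards are needed to rebuild an object.
K_DATA_CHUNKS = 3


def surviving_shards(participants, lost_osds):
    """Map each shard id to the sorted list of non-lost OSDs that ever held it."""
    shards = {}
    for entry in participants:
        if entry["osd"] not in lost_osds:
            shards.setdefault(entry["shard"], []).append(entry["osd"])
    return {shard: sorted(osds) for shard, osds in sorted(shards.items())}


def enough_shard_ids(participants, lost_osds, k):
    """True if at least k distinct shard ids exist on non-lost OSDs.

    Assumption: this ignores whether those copies are complete or current.
    """
    return len(surviving_shards(participants, lost_osds)) >= k


if __name__ == "__main__":
    print(surviving_shards(ALL_PARTICIPANTS, LOST_OSDS))
    # {0: [2, 24], 1: [72], 3: [79]}
    print(enough_shard_ids(ALL_PARTICIPANTS, LOST_OSDS, K_DATA_CHUNKS))
    # True
```

So on paper shards 0, 1 and 3 are still present on OSDs 2/24, 72 and 79. Whether Ceph can actually use them depends on the state of each copy, which this sketch cannot tell.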