We tried to export the shards from the OSDs, but there are only
two shards left for each of the pgs, so we decided to give up on
these pgs. Will the files of these pgs be deleted from the mds, or
do we have to delete them manually? Is this the correct command to
mark the pgs as lost:
ceph pg {pg-id} mark_unfound_lost revert|delete
Cheers,
Kevin
On 15.05.19 8:55 AM, Kevin Flöh wrote:
The hdds of OSDs 4 and 23 are completely lost; we cannot access them in any
way. Is it possible to use the shards that may still be stored on
working OSDs, as shown in the all_participants list?
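To make the question concrete, here is a minimal sketch that filters the all_participants list from the pg query quoted further down against the two lost OSDs. It only tabulates what the query output lists; whether a surviving OSD still holds a complete, current copy of its shard is an assumption this sketch cannot check. The names ALL_PARTICIPANTS, LOST_OSDS and surviving_shards are illustrative, not part of any Ceph tool.

```python
# all_participants as listed in the quoted `ceph pg query` output.
ALL_PARTICIPANTS = [
    {"osd": 2, "shard": 0},
    {"osd": 4, "shard": 1},
    {"osd": 23, "shard": 2},
    {"osd": 24, "shard": 0},
    {"osd": 72, "shard": 1},
    {"osd": 79, "shard": 3},
]
LOST_OSDS = {4, 23}  # the two OSDs whose hdds are gone


def surviving_shards(participants, lost_osds):
    """Map shard id -> OSDs listed for that shard that are not lost."""
    by_shard = {}
    for entry in participants:
        if entry["osd"] not in lost_osds:
            by_shard.setdefault(entry["shard"], []).append(entry["osd"])
    return by_shard


candidates = surviving_shards(ALL_PARTICIPANTS, LOST_OSDS)
for shard in sorted(candidates):
    print(f"shard {shard}: listed on osd(s) {candidates[shard]}")

# Shards that appear only on lost OSDs.
missing = sorted({e["shard"] for e in ALL_PARTICIPANTS} - set(candidates))
print("shards with no surviving candidate:", missing)
```

A shard being listed here means only that the OSD participated at some point in the past intervals; the shard on disk may be stale or absent.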
On 14.05.19 5:24 PM, Dan van der Ster wrote:
On Tue, May 14, 2019 at 5:13 PM Kevin Flöh
<kevin.floeh@xxxxxxx> wrote:
ok, so now we see at least a difference
in the recovery state:
"recovery_state": [
  {
    "name": "Started/Primary/Peering/Incomplete",
    "enter_time": "2019-05-14 14:15:15.650517",
    "comment": "not enough complete instances of this PG"
  },
  {
    "name": "Started/Primary/Peering",
    "enter_time": "2019-05-14 14:15:15.243756",
    "past_intervals": [
      {
        "first": "49767",
        "last": "59580",
        "all_participants": [
          { "osd": 2, "shard": 0 },
          { "osd": 4, "shard": 1 },
          { "osd": 23, "shard": 2 },
          { "osd": 24, "shard": 0 },
          { "osd": 72, "shard": 1 },
          { "osd": 79, "shard": 3 }
        ],
        "intervals": [
          { "first": "59562", "last": "59563", "acting": "4(1),24(0),79(3)" },
          { "first": "59564", "last": "59567", "acting": "23(2),24(0),79(3)" },
          { "first": "59570", "last": "59574", "acting": "4(1),23(2),79(3)" },
          { "first": "59577", "last": "59580", "acting": "4(1),23(2),24(0)" }
        ]
      }
    ],
    "probing_osds": [ "2(0)", "4(1)", "23(2)", "24(0)", "72(1)", "79(3)" ],
    "down_osds_we_would_probe": [],
    "peering_blocked_by": []
  },
  {
    "name": "Started",
    "enter_time": "2019-05-14 14:15:15.243663"
  }
],
The peering does not seem to be blocked anymore, but there is still no
recovery going on. Is there anything else we can try?
What is the state of the hdds which had osds 4 & 23?
You may be able to use ceph-objectstore-tool to export those PG shards
and import them to another operable OSD.
-- dan
On 14.05.19 11:02 AM, Dan van der Ster wrote:
On Tue, May 14, 2019 at 10:59 AM Kevin Flöh <kevin.floeh@xxxxxxx> wrote:
On 14.05.19 10:08 AM, Dan van der Ster wrote:
On Tue, May 14, 2019 at 10:02 AM Kevin Flöh
<kevin.floeh@xxxxxxx> wrote:
On 13.05.19 10:51 PM, Lionel Bouton wrote:
On 13/05/2019 at 16:20, Kevin Flöh wrote:
Dear ceph experts,
[...] We have 4 nodes with 24 osds each and use 3+1 erasure coding. [...]
Here is what happened: One osd daemon could not be started and
therefore we decided to mark the osd as lost and set it up from
scratch. Ceph started recovering and then we lost another osd with
the same behavior. We did the same as for the first osd.
With 3+1 you can only tolerate a single OSD failure per pg at a given time.
You have 4096 pgs and 96 osds; having 2 OSDs fail at the same time on 2
separate servers (assuming standard crush rules) is a death sentence
for the data on some pgs using both of those OSDs (the ones not fully
recovered before the second failure).
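A rough back-of-the-envelope estimate of how many pgs sit on both failed OSDs, using the numbers from this thread (4096 pgs, 4 nodes with 24 osds each, 3+1). The assumptions are mine and not confirmed anywhere above: the crush rule places exactly one shard per host, all OSDs have equal weight, and placement on different hosts is independent. The function name expected_shared_pgs is illustrative.

```python
from fractions import Fraction

NUM_PGS = 4096       # from the thread
OSDS_PER_HOST = 24   # from the thread: 4 nodes with 24 osds each


def expected_shared_pgs(num_pgs, osds_per_host):
    """Expected number of pgs that have a shard on both of two given
    OSDs located on different hosts.

    Assumption: with 4 shards (3+1) and 4 hosts, every pg places one
    shard on every host, picked uniformly among that host's OSDs, so a
    given OSD is in a given pg with probability 1 / osds_per_host.
    """
    p_one = Fraction(1, osds_per_host)
    return num_pgs * p_one * p_one


shared = expected_shared_pgs(NUM_PGS, OSDS_PER_HOST)
print(f"expected pgs on both OSDs: {float(shared):.1f}")
```

Under these assumptions only a handful of pgs share both OSDs, and only those not yet recovered at the time of the second failure are at risk.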
OK, so the 2 OSDs (4, 23) failed shortly one after the other, but we think
that the recovery of the first was finished before the second failed.
Nonetheless, both problematic pgs have been on both OSDs. We think that
we still have enough shards left. For one of the pgs, the recovery state
looks like this:
"recovery_state": [
  {
    "name": "Started/Primary/Peering/Incomplete",
    "enter_time": "2019-05-09 16:11:48.625966",
    "comment": "not enough complete instances of this PG"
  },
  {
    "name": "Started/Primary/Peering",
    "enter_time": "2019-05-09 16:11:48.611171",
    "past_intervals": [
      {
        "first": "49767",
        "last": "59313",
        "all_participants": [
          { "osd": 2, "shard": 0 },
          { "osd": 4, "shard": 1 },
          { "osd": 23, "shard": 2 },
          { "osd": 24, "shard": 0 },
          { "osd": 72, "shard": 1 },
          { "osd": 79, "shard": 3 }
        ],
        "intervals": [
          { "first": "58860", "last": "58861", "acting": "4(1),24(0),79(3)" },
          { "first": "58875", "last": "58877", "acting": "4(1),23(2),24(0)" },
          { "first": "59002", "last": "59009", "acting": "4(1),23(2),79(3)" },
          { "first": "59010", "last": "59012", "acting": "2(0),4(1),23(2),79(3)" },
          { "first": "59197", "last": "59233", "acting": "23(2),24(0),79(3)" },
          { "first": "59234", "last": "59313", "acting": "23(2),24(0),72(1),79(3)" }
        ]
      }
    ],
    "probing_osds": [ "2(0)", "4(1)", "23(2)", "24(0)", "72(1)", "79(3)" ],
    "down_osds_we_would_probe": [],
    "peering_blocked_by": [],
    "peering_blocked_by_detail": [
      {
        "detail": "peering_blocked_by_history_les_bound"
      }
    ]
  },
  {
    "name": "Started",
    "enter_time": "2019-05-09 16:11:48.611121"
  }
],
Is there a chance to recover this pg from the shards on
OSDs 2, 72, 79?
ceph pg repair/deep-scrub/scrub did not work.
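As a reading aid for the past intervals above, the following sketch parses each "acting" string into (osd, shard) pairs and counts how many members of each interval are on OSDs other than 4 and 23. It is only a tabulation of the quoted output; it makes no claim about what Ceph's peering logic requires. INTERVALS, parse_acting and surviving_members are illustrative names.

```python
import re

# (first, last, acting) copied from the "intervals" list quoted above.
INTERVALS = [
    ("58860", "58861", "4(1),24(0),79(3)"),
    ("58875", "58877", "4(1),23(2),24(0)"),
    ("59002", "59009", "4(1),23(2),79(3)"),
    ("59010", "59012", "2(0),4(1),23(2),79(3)"),
    ("59197", "59233", "23(2),24(0),79(3)"),
    ("59234", "59313", "23(2),24(0),72(1),79(3)"),
]
LOST_OSDS = {4, 23}


def parse_acting(acting):
    """Turn '4(1),24(0)' into [(4, 1), (24, 0)] as (osd, shard) pairs."""
    return [(int(osd), int(shard))
            for osd, shard in re.findall(r"(\d+)\((\d+)\)", acting)]


def surviving_members(acting, lost_osds):
    """Acting-set members whose OSD is not in lost_osds."""
    return [(osd, shard) for osd, shard in parse_acting(acting)
            if osd not in lost_osds]


for first, last, acting in INTERVALS:
    members = parse_acting(acting)
    alive = surviving_members(acting, LOST_OSDS)
    print(f"epochs {first}-{last}: {len(alive)} of {len(members)} "
          f"acting members on surviving OSDs: {alive}")
```

Only the last interval (59234 to 59313) has three acting members on surviving OSDs; every earlier interval has at most two.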
repair/scrub are not related to this problem so they won't help.
How exactly did you use the osd_find_best_info_ignore_history_les option?
One correct procedure would be to set it to true in ceph.conf, then
restart each of the probing_osds above.
(Once the PG has peered, you need to unset the option and restart
those osds again.)
We executed
ceph --admin-daemon /var/run/ceph/ceph-osd.X.asok config set osd_find_best_info_ignore_history_les true
and then we restarted the affected OSDs. I guess this does the same, right?
No, that doesn't work. That just sets it in memory, but then the option
is reset to the default when you restart the OSD.
You need to set it in ceph.conf on the OSD host.
-- dan
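For illustration only, here is a sketch of what "set it in ceph.conf" could look like, using Python's configparser on a made-up config text. Placing the option under an [osd] section is my assumption (the advice above names only the file), the mon_host value is a hypothetical placeholder, and configparser drops comments when writing, so a real ceph.conf is better edited by hand. The sketch does not restart any OSD.

```python
import configparser
import io

OPTION = "osd_find_best_info_ignore_history_les"


def set_osd_option(conf_text, enabled):
    """Return conf_text with OPTION set to true under [osd], or removed."""
    parser = configparser.ConfigParser(interpolation=None)
    parser.read_string(conf_text)
    if not parser.has_section("osd"):
        parser.add_section("osd")
    if enabled:
        parser["osd"][OPTION] = "true"
    else:
        # Unset again once the pg has peered.
        parser.remove_option("osd", OPTION)
    out = io.StringIO()
    parser.write(out)
    return out.getvalue()


# Hypothetical minimal config; a real ceph.conf has more entries.
EXAMPLE_CONF = "[global]\nmon_host = 192.0.2.1\n"

with_option = set_osd_option(EXAMPLE_CONF, True)
print(with_option)

without_option = set_osd_option(with_option, False)
```

Because the value lives in the file, it survives the OSD restart, which is the difference from the admin-socket "config set" that only changes the running daemon.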
We are also worried about the mds being behind on trimming, or is this
not too problematic?
Trimming requires IO on PGs, and the mds is almost certainly stuck on
those incomplete PGs.
Solve the incomplete, and then address the MDS later if it doesn't
resolve itself.
-- dan
ok, then we don't have to worry about this for now.
Best regards,
Kevin
MDS_TRIM 1 MDSs behind on trimming
mdsceph-node02.etp.kit.edu(mds.0): Behind on trimming (46178/128)
max_segments: 128, num_segments: 46178
Depending on the data stored (CephFS?) you can probably recover most
of it, but some of it is irretrievably lost.
If you can recover the data from the failed OSDs as of the time they
failed, you might be able to recover some of your lost data (with the
help of Ceph devs); if not, there's nothing to do.
In the latter case I'd add a new server to use at least 3+2 for a fresh
pool instead of 3+1 and begin moving the data to it.
The 12.2 + 13.2 mix is a potential problem in addition to the one
above, but it's a different one.
Best regards,
Lionel
The idea for the future is to set up a new ceph cluster with 3+2 with
8 servers in total and, of course, with consistent versions on all nodes.
Best regards,
Kevin
_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com