EC profile: https://pastebin.ubuntu.com/p/kjbdQXbs85/
ceph pg dump pgs | grep -v "active+clean":
https://pastebin.ubuntu.com/p/g6TdZXNXBR/
On 2020-10-28 02:23, Eugen Block wrote:
If you have that many spare hosts I would recommend deploying two more
MONs on them, and probably additional MGRs as well so they can fail over.
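For example (a rough sketch only; the exact commands depend on how the
cluster was deployed, and the host names below are placeholders):

  # cephadm / orchestrator based deployments:
  ceph orch apply mon 3
  ceph orch apply mgr 2

  # ceph-deploy based deployments:
  ceph-deploy mon add <new-mon-host>
  ceph-deploy mgr create <new-mgr-host>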
What is the EC profile for the data_storage pool?
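(You can print it with the commands below; "desoft" is the profile name
shown in the pool details further down in this thread.)

  ceph osd pool get data_storage erasure_code_profile
  ceph osd erasure-code-profile get desoft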
Can you also share
ceph pg dump pgs | grep -v "active+clean"
to see which PGs are affected?
The remaining issue with unfound objects and unknown PGs could be
because you removed OSDs. That could mean data loss, but maybe there's
still a chance to recover.
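A minimal set of read-only commands to dig into those unfound objects and
unknown PGs (replace <pgid> with one of the affected PGs from the dump):

  ceph health detail
  ceph pg <pgid> query          # shows which OSDs are still being probed
  ceph pg <pgid> list_unfound   # lists the unfound objects in that PG

  # Only as a last resort, once you are sure the data cannot come back:
  # ceph pg <pgid> mark_unfound_lost revert|delete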
Quoting "Ing. Luis Felipe Domínguez Vega" <luis.dominguez@xxxxxxxxx>:
Well, recovery is still not working... I started 6 more servers and
the cluster has not recovered yet.
Ceph status does not show any recovery progress.
ceph -s : https://pastebin.ubuntu.com/p/zRQPbvGzbw/
ceph osd tree : https://pastebin.ubuntu.com/p/sTDs8vd7Sk/
ceph osd df : https://pastebin.ubuntu.com/p/ysbh8r2VVz/
ceph osd pool ls detail : https://pastebin.ubuntu.com/p/GRdPjxhv3D/
crush rules : (ceph osd crush rule dump)
https://pastebin.ubuntu.com/p/cjyjmbQ4Wq/
On 2020-10-27 09:59, Eugen Block wrote:
Your pool 'data_storage' has a size of 7 (or rather 7 chunks, since it's
erasure-coded) and the rule requires each chunk on a different host, but
you currently have only 5 hosts available; that's why the recovery is not
progressing. It's waiting for two more hosts. Unfortunately, you can't
change the EC profile or the rule of that pool. I'm not sure if it would
work in the current cluster state, but if you can't add two more hosts
(which would be your best option for recovery) it might be possible to
create a new replicated pool (you seem to have enough free space) and
copy the contents from that EC pool. But as I said, I'm not sure whether
that would work in a degraded state; I've never tried it.
So your best bet is to get two more hosts somehow.
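If you do go the copy route, a rough sketch could look like the following
(pool name and PG count are placeholders; since the pool application is
rbd, a per-image copy is probably safer than rados cppool, which does not
handle snapshots well):

  # create a new replicated pool and tag it for rbd
  ceph osd pool create data_storage_repl 128 128 replicated
  ceph osd pool application enable data_storage_repl rbd

  # copy each image over (repeat per image)
  rbd ls data_storage
  rbd deep cp data_storage/<image> data_storage_repl/<image>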
pool 4 'data_storage' erasure profile desoft size 7 min_size 5
crush_rule 1 object_hash rjenkins pg_num 32 pgp_num 32
autoscale_mode off last_change 154384 lfor 0/121016/121014 flags
hashpspool,ec_overwrites,selfmanaged_snaps stripe_width 16384
application rbd
Quoting "Ing. Luis Felipe Domínguez Vega" <luis.dominguez@xxxxxxxxx>:
Needed data:
ceph -s : https://pastebin.ubuntu.com/p/S9gKjyZtdK/
ceph osd tree : https://pastebin.ubuntu.com/p/SCZHkk6Mk4/
ceph osd df : (coming later; I've been waiting for 10
minutes and there is no output yet)
ceph osd pool ls detail : https://pastebin.ubuntu.com/p/GRdPjxhv3D/
crush rules : (ceph osd crush rule dump)
https://pastebin.ubuntu.com/p/cjyjmbQ4Wq/
On 2020-10-27 07:14, Eugen Block wrote:
I understand, but I deleted the OSDs from the CRUSH map, so Ceph won't
wait for those OSDs, am I right?
It depends on your actual crush tree and rules. Can you share (maybe you
already did)
ceph osd tree
ceph osd df
ceph osd pool ls detail
and a dump of your crush rules?
As I already said, if you have rules in place that distribute data across
2 DCs and one of them is down, the PGs will never recover even if you
delete the OSDs from the failed DC.
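For reference, the relevant part of ceph osd crush rule dump is the
"type" in the choose/chooseleaf step, which is the failure domain the
rule enforces. An EC rule that splits chunks across hosts might contain
something roughly like this (illustrative snippet only, your actual dump
will differ):

  ceph osd crush rule dump
  ...
  "steps": [
      { "op": "set_chooseleaf_tries", "num": 5 },
      { "op": "take", "item": -1, "item_name": "default" },
      { "op": "chooseleaf_indep", "num": 0, "type": "host" },
      { "op": "emit" }
  ]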
Quoting "Ing. Luis Felipe Domínguez Vega" <luis.dominguez@xxxxxxxxx>:
I understand, but I deleted the OSDs from the CRUSH map, so Ceph won't
wait for those OSDs, am I right?
On 2020-10-27 04:06, Eugen Block wrote:
Hi,
just to clarify so I don't miss anything: you have two DCs and one of
them is down. And two of the MONs were in that failed DC? Now you removed
all OSDs and two MONs from the failed DC hoping that your cluster will
recover? If you have reasonable crush rules in place (e.g. to recover
from a failed DC) your cluster will never recover in the current state
unless you bring OSDs back up in the second DC. That's why you don't see
progress in the recovery process: the PGs are waiting for their peers in
the other DC so they can follow the crush rules.
Regards,
Eugen
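A quick, non-destructive way to double-check what CRUSH still sees after
the removals (just read-only inspection commands):

  ceph osd crush tree       # remaining DC/host buckets and OSDs
  ceph osd crush rule dump  # failure domain used by each rule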
Quoting "Ing. Luis Felipe Domínguez Vega" <luis.dominguez@xxxxxxxxx>:
I had 3 MONs, but I have 2 physical datacenters and one of them broke
with no short-term fix, so I removed all the OSDs and the Ceph MONs
(2 of them) from it, and now I only have the OSDs of 1 datacenter plus
the remaining monitor. I had stopped the Ceph manager, but I saw that
when I restart a Ceph manager, ceph -s shows recovery info for roughly
20 minutes and then all that info disappears.
The thing is that it seems the cluster is not recovering on its own and
the Ceph monitor is "eating" all of the HDD.
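On the growing MON store: during long recoveries the MON keeps old maps
and its store can balloon. A cautious sketch of what can be checked and
tried (the path and MON name are taken from the shell prompt in the log
excerpt below):

  du -sh /var/lib/ceph/mon/ceph-fond-beagle/store.db
  ceph tell mon.fond-beagle compact   # trigger an online compaction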
On 2020-10-26 15:57, Eugen Block wrote:
The recovery process (ceph -s) is independent of the MGR service; it only
depends on the MON service. It seems you only have the one MON, and if
the MGR is overloading it (not clear why) it could help to leave the MGR
off and see if the MON service then has enough RAM to proceed with the
recovery. Do you have any chance to add two more MONs? A single MON is of
course a single point of failure.
Quoting "Ing. Luis Felipe Domínguez Vega" <luis.dominguez@xxxxxxxxx>:
On 2020-10-26 15:16, Eugen Block wrote:
You could stop the MGRs and wait for the recovery to finish; MGRs are not
a critical component. You won't have a dashboard or metrics during that
time, but it would prevent the high RAM usage.
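For a package/systemd based install that would be something like the
following (the unit name is assumed from the host name in the logs):

  systemctl stop ceph-mgr@fond-beagle.service
  # or, to stop any mgr instance on this host:
  systemctl stop ceph-mgr.target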
Quoting "Ing. Luis Felipe Domínguez Vega" <luis.dominguez@xxxxxxxxx>:
On 2020-10-26 12:23, 胡 玮文 wrote:
On 2020-10-26 at 23:29, Ing. Luis Felipe Domínguez Vega
<luis.dominguez@xxxxxxxxx> wrote:
mgr: fond-beagle(active, since 39s)
Your manager seems to be crash looping; it only started 39s ago. Looking
at the mgr logs may help you identify why your cluster is not recovering.
You may be hitting some bug in the mgr.
Nope, I'm restarting the Ceph manager because it eats all of the server's
RAM. I have a script that restarts the manager whenever only 1 GB of RAM
is free (the server has 94 GB of RAM). I don't know why this happens, and
the manager logs show:
-----------------------------------
root@fond-beagle:/var/lib/ceph/mon/ceph-fond-beagle/store.db# tail -f /var/log/ceph/ceph-mgr.fond-beagle.log
2020-10-26T12:54:12.497-0400 7f2a8112b700 0
log_channel(cluster) log [DBG] : pgmap v584: 2305 pgs: 4
active+undersized+degraded+remapped, 4
active+recovery_unfound+undersized+degraded+remapped, 2104
active+clean, 5 active+undersized+degraded, 34
incomplete, 154 unknown; 1.7 TiB data, 2.9 TiB used,
21 TiB / 24 TiB avail; 347248/2606900 objects degraded
(13.320%); 107570/2606900 objects misplaced (4.126%);
19/404328 objects unfound (0.005%)
2020-10-26T12:54:12.497-0400 7f2a8112b700 0
log_channel(cluster) do_log log to syslog
2020-10-26T12:54:14.501-0400 7f2a8112b700 0
log_channel(cluster) log [DBG] : pgmap v585: 2305 pgs: 4
active+undersized+degraded+remapped, 4
active+recovery_unfound+undersized+degraded+remapped, 2104
active+clean, 5 active+undersized+degraded, 34
incomplete, 154 unknown; 1.7 TiB data, 2.9 TiB used,
21 TiB / 24 TiB avail; 347248/2606900 objects degraded
(13.320%); 107570/2606900 objects misplaced (4.126%);
19/404328 objects unfound (0.005%)
2020-10-26T12:54:14.501-0400 7f2a8112b700 0
log_channel(cluster) do_log log to syslog
2020-10-26T12:54:16.517-0400 7f2a8112b700 0
log_channel(cluster) log [DBG] : pgmap v586: 2305 pgs: 4
active+undersized+degraded+remapped, 4
active+recovery_unfound+undersized+degraded+remapped, 2104
active+clean, 5 active+undersized+degraded, 34
incomplete, 154 unknown; 1.7 TiB data, 2.9 TiB used,
21 TiB / 24 TiB avail; 347248/2606900 objects degraded
(13.320%); 107570/2606900 objects misplaced (4.126%);
19/404328 objects unfound (0.005%)
2020-10-26T12:54:16.517-0400 7f2a8112b700 0
log_channel(cluster) do_log log to syslog
2020-10-26T12:54:18.521-0400 7f2a8112b700 0
log_channel(cluster) log [DBG] : pgmap v587: 2305 pgs: 4
active+undersized+degraded+remapped, 4
active+recovery_unfound+undersized+degraded+remapped, 2104
active+clean, 5 active+undersized+degraded, 34
incomplete, 154 unknown; 1.7 TiB data, 2.9 TiB used,
21 TiB / 24 TiB avail; 347248/2606900 objects degraded
(13.320%); 107570/2606900 objects misplaced (4.126%);
19/404328 objects unfound (0.005%)
2020-10-26T12:54:18.521-0400 7f2a8112b700 0
log_channel(cluster) do_log log to syslog
2020-10-26T12:54:20.537-0400 7f2a8112b700 0
log_channel(cluster) log [DBG] : pgmap v588: 2305 pgs: 4
active+undersized+degraded+remapped, 4
active+recovery_unfound+undersized+degraded+remapped, 2104
active+clean, 5 active+undersized+degraded, 34
incomplete, 154 unknown; 1.7 TiB data, 2.9 TiB used,
21 TiB / 24 TiB avail; 347248/2606900 objects degraded
(13.320%); 107570/2606900 objects misplaced (4.126%);
19/404328 objects unfound (0.005%)
2020-10-26T12:54:20.537-0400 7f2a8112b700 0
log_channel(cluster) do_log log to syslog
2020-10-26T12:54:22.541-0400 7f2a8112b700 0
log_channel(cluster) log [DBG] : pgmap v589: 2305 pgs: 4
active+undersized+degraded+remapped, 4
active+recovery_unfound+undersized+degraded+remapped, 2104
active+clean, 5 active+undersized+degraded, 34
incomplete, 154 unknown; 1.7 TiB data, 2.9 TiB used,
21 TiB / 24 TiB avail; 347248/2606900 objects degraded
(13.320%); 107570/2606900 objects misplaced (4.126%);
19/404328 objects unfound (0.005%)
2020-10-26T12:54:22.541-0400 7f2a8112b700 0
log_channel(cluster) do_log log to syslog
---------------
Ok, I will do that... but the thing is that the cluster does not show any
recovery; it doesn't show that it is doing anything (like showing the
recovery info in the ceph -s output), so I don't know whether it is
recovering or what it is doing.
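A few read-only commands that can help you see whether anything is
actually moving while the MGR is off:

  ceph -s               # overall status, shows a recovery io line if active
  ceph pg stat          # one-line summary of PG states
  ceph health detail    # per-PG detail on degraded/unfound objects
  watch -n 10 ceph pg stat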
_______________________________________________
ceph-users mailing list -- ceph-users@xxxxxxx
To unsubscribe send an email to ceph-users-leave@xxxxxxx