Re: Huge HDD ceph monitor usage [EXT]

Eugen Block <eblock@xxxxxx> · Tue, 27 Oct 2020 11:14:31 +0000

I understand, but I deleted the OSDs from the CRUSH map, so Ceph
doesn't wait for these OSDs, am I right?

It depends on your actual crush tree and rules. Can you share (maybe  
you already did)

ceph osd tree
ceph osd df
ceph osd pool ls detail

and a dump of your crush rules?

As I already said, if you have rules in place that distribute data
across 2 DCs and one of them is down, the PGs will never recover, even
if you delete the OSDs from the failed DC.
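
To illustrate what to look for in the rule dump, here is a minimal sketch. It assumes the JSON layout printed by `ceph osd crush rule dump`; the sample rules and the helper name rules_using_bucket_type are made up for illustration and are not taken from this cluster.

```python
import json

# Hypothetical, abbreviated output of `ceph osd crush rule dump`.
# The rules of a real cluster will differ.
SAMPLE_RULE_DUMP = """
[
  {"rule_id": 0, "rule_name": "replicated_rule",
   "steps": [{"op": "take", "item": -1, "item_name": "default"},
             {"op": "chooseleaf_firstn", "num": 0, "type": "host"},
             {"op": "emit"}]},
  {"rule_id": 1, "rule_name": "two_dc_rule",
   "steps": [{"op": "take", "item": -1, "item_name": "default"},
             {"op": "choose_firstn", "num": 2, "type": "datacenter"},
             {"op": "chooseleaf_firstn", "num": 2, "type": "host"},
             {"op": "emit"}]}
]
"""


def rules_using_bucket_type(rule_dump_json, bucket_type="datacenter"):
    """Return the names of rules that have a choose/chooseleaf step over bucket_type."""
    names = []
    for rule in json.loads(rule_dump_json):
        for step in rule.get("steps", []):
            is_choose_step = step.get("op", "").startswith("choose")
            if is_choose_step and step.get("type") == bucket_type:
                names.append(rule["rule_name"])
                break  # one matching step is enough for this rule
    return names


print(rules_using_bucket_type(SAMPLE_RULE_DUMP))
```

A pool whose rule shows up in that list places replicas per datacenter, so removing the OSDs of the failed DC does not change what the rule asks for.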

Quoting "Ing. Luis Felipe Domínguez Vega" <luis.dominguez@xxxxxxxxx>:

I understand, but I deleted the OSDs from the CRUSH map, so Ceph
doesn't wait for these OSDs, am I right?

On 2020-10-27 04:06, Eugen Block wrote:
Hi,

just to clarify so I don't miss anything: you have two DCs and one of
them is down. And two of the MONs were in that failed DC? Now you
removed all OSDs and two MONs from the failed DC hoping that your
cluster will recover? If you have reasonable crush rules in place
(e.g. to recover from a failed DC) your cluster will never recover in
the current state unless you bring OSDs back up on the second DC.
That's why you don't see progress in the recovery process, the PGs are
waiting for their peers in the other DC so they can follow the crush
rules.

Regards,
Eugen

Quoting "Ing. Luis Felipe Domínguez Vega" <luis.dominguez@xxxxxxxxx>:

I had 3 MONs, but I have 2 physical datacenters. One of them broke
with no short-term fix, so I removed all of its OSDs and its Ceph MONs
(2 of them), and now I have only the OSDs of 1 datacenter plus the one
monitor. I had stopped the Ceph manager, but I saw that when I restart
a Ceph manager, ceph -s shows recovery info for a short time, about
20 minutes, and then all of that info disappears.

The thing is that it seems the cluster is not self-recovering and the
Ceph monitor is "eating" all of the HDD.

On 2020-10-26 15:57, Eugen Block wrote:
The recovery process (ceph -s) is independent of the MGR service but
only depends on the MON service. It seems you only have the one MON,
if the MGR is overloading it (not clear why) it could help to leave
MGR off and see if the MON service then has enough RAM to proceed with
the recovery. Do you have any chance to add two more MONs? A single
MON is of course a single point of failure.

Quoting "Ing. Luis Felipe Domínguez Vega" <luis.dominguez@xxxxxxxxx>:

On 2020-10-26 15:16, Eugen Block wrote:
You could stop the MGRs and wait for the recovery to finish; MGRs are
not a critical component. You won't have a dashboard or metrics
during that time, but it would prevent the high RAM usage.

Quoting "Ing. Luis Felipe Domínguez Vega" <luis.dominguez@xxxxxxxxx>:

On 2020-10-26 12:23, 胡 玮文 wrote:
On 26 Oct 2020, at 23:29, Ing. Luis Felipe Domínguez Vega
<luis.dominguez@xxxxxxxxx> wrote:

mgr: fond-beagle(active, since 39s)

Your manager seems to be crash looping; it was only started 39s ago.
Looking at the mgr logs may help you identify why your cluster is not
recovering. You may have hit a bug in mgr.

Nope, I'm restarting the Ceph manager myself because it eats all of
the server's RAM, so I have a script that restarts the manager when
only 1 GB of RAM is free (the server has 94 GB of RAM). I don't know
why it does that, and the manager logs are:

-----------------------------------
root@fond-beagle:/var/lib/ceph/mon/ceph-fond-beagle/store.db# tail -f /var/log/ceph/ceph-mgr.fond-beagle.log
2020-10-26T12:54:12.497-0400 7f2a8112b700  0 log_channel(cluster) log [DBG] : pgmap v584: 2305 pgs: 4 active+undersized+degraded+remapped, 4 active+recovery_unfound+undersized+degraded+remapped, 2104 active+clean, 5 active+undersized+degraded, 34 incomplete, 154 unknown; 1.7 TiB data, 2.9 TiB used, 21 TiB / 24 TiB avail; 347248/2606900 objects degraded (13.320%); 107570/2606900 objects misplaced (4.126%); 19/404328 objects unfound (0.005%)
2020-10-26T12:54:12.497-0400 7f2a8112b700  0 log_channel(cluster) do_log log to syslog
2020-10-26T12:54:14.501-0400 7f2a8112b700  0 log_channel(cluster) log [DBG] : pgmap v585: 2305 pgs: 4 active+undersized+degraded+remapped, 4 active+recovery_unfound+undersized+degraded+remapped, 2104 active+clean, 5 active+undersized+degraded, 34 incomplete, 154 unknown; 1.7 TiB data, 2.9 TiB used, 21 TiB / 24 TiB avail; 347248/2606900 objects degraded (13.320%); 107570/2606900 objects misplaced (4.126%); 19/404328 objects unfound (0.005%)
2020-10-26T12:54:14.501-0400 7f2a8112b700  0 log_channel(cluster) do_log log to syslog
2020-10-26T12:54:16.517-0400 7f2a8112b700  0 log_channel(cluster) log [DBG] : pgmap v586: 2305 pgs: 4 active+undersized+degraded+remapped, 4 active+recovery_unfound+undersized+degraded+remapped, 2104 active+clean, 5 active+undersized+degraded, 34 incomplete, 154 unknown; 1.7 TiB data, 2.9 TiB used, 21 TiB / 24 TiB avail; 347248/2606900 objects degraded (13.320%); 107570/2606900 objects misplaced (4.126%); 19/404328 objects unfound (0.005%)
2020-10-26T12:54:16.517-0400 7f2a8112b700  0 log_channel(cluster) do_log log to syslog
2020-10-26T12:54:18.521-0400 7f2a8112b700  0 log_channel(cluster) log [DBG] : pgmap v587: 2305 pgs: 4 active+undersized+degraded+remapped, 4 active+recovery_unfound+undersized+degraded+remapped, 2104 active+clean, 5 active+undersized+degraded, 34 incomplete, 154 unknown; 1.7 TiB data, 2.9 TiB used, 21 TiB / 24 TiB avail; 347248/2606900 objects degraded (13.320%); 107570/2606900 objects misplaced (4.126%); 19/404328 objects unfound (0.005%)
2020-10-26T12:54:18.521-0400 7f2a8112b700  0 log_channel(cluster) do_log log to syslog
2020-10-26T12:54:20.537-0400 7f2a8112b700  0 log_channel(cluster) log [DBG] : pgmap v588: 2305 pgs: 4 active+undersized+degraded+remapped, 4 active+recovery_unfound+undersized+degraded+remapped, 2104 active+clean, 5 active+undersized+degraded, 34 incomplete, 154 unknown; 1.7 TiB data, 2.9 TiB used, 21 TiB / 24 TiB avail; 347248/2606900 objects degraded (13.320%); 107570/2606900 objects misplaced (4.126%); 19/404328 objects unfound (0.005%)
2020-10-26T12:54:20.537-0400 7f2a8112b700  0 log_channel(cluster) do_log log to syslog
2020-10-26T12:54:22.541-0400 7f2a8112b700  0 log_channel(cluster) log [DBG] : pgmap v589: 2305 pgs: 4 active+undersized+degraded+remapped, 4 active+recovery_unfound+undersized+degraded+remapped, 2104 active+clean, 5 active+undersized+degraded, 34 incomplete, 154 unknown; 1.7 TiB data, 2.9 TiB used, 21 TiB / 24 TiB avail; 347248/2606900 objects degraded (13.320%); 107570/2606900 objects misplaced (4.126%); 19/404328 objects unfound (0.005%)
2020-10-26T12:54:22.541-0400 7f2a8112b700  0 log_channel(cluster) do_log log to syslog
---------------
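
The pgmap entries in the log above can be compared between versions to see whether any PG counts change. The following is a rough sketch that parses one such entry. The function name parse_pgmap is made up for illustration, and the regular expression only assumes the "pgmap vN: M pgs: ...;" shape visible in the log, not any documented log format.

```python
import re

# One pgmap entry from the mgr log above, joined onto a single line.
SAMPLE = (
    "2020-10-26T12:54:12.497-0400 7f2a8112b700  0 log_channel(cluster) "
    "log [DBG] : pgmap v584: 2305 pgs: 4 active+undersized+degraded+remapped, "
    "4 active+recovery_unfound+undersized+degraded+remapped, "
    "2104 active+clean, 5 active+undersized+degraded, 34 incomplete, "
    "154 unknown; 1.7 TiB data, 2.9 TiB used, 21 TiB / 24 TiB avail; "
    "347248/2606900 objects degraded (13.320%); 107570/2606900 objects "
    "misplaced (4.126%); 19/404328 objects unfound (0.005%)"
)


def parse_pgmap(line):
    """Return (version, total_pgs, {state: count}) for one pgmap log entry."""
    # Collapse whitespace so entries wrapped by a mail client still match.
    line = " ".join(line.split())
    m = re.search(r"pgmap v(\d+): (\d+) pgs: ([^;]+);", line)
    if m is None:
        raise ValueError("no pgmap summary found")
    states = {}
    for part in m.group(3).split(","):
        count, state = part.split()
        states[state] = int(count)
    return int(m.group(1)), int(m.group(2)), states


version, total, states = parse_pgmap(SAMPLE)
# PG states that are not serving I/O in this snapshot.
stuck = {s: n for s, n in states.items() if s in ("incomplete", "unknown")}
print(version, total, stuck)
```

Running this over v584 to v589 gives identical counts each time, which is consistent with recovery making no progress during that interval.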
_______________________________________________
ceph-users mailing list -- ceph-users@xxxxxxx
To unsubscribe send an email to ceph-users-leave@xxxxxxx

OK, I will do that... but the thing is that the cluster does not show
that it is recovering, nor that it is doing anything at all. ceph -s
does not show any recovery info, so I don't know whether it is
recovering or what it is doing.
