Re: Huge HDD ceph monitor usage [EXT]

The ceph mon logs... my log fills non-stop with many entries like these:

------------------------------------------------------
2020-10-26T15:40:28.875729-0400 osd.23 [WRN] slow request osd_op(client.86168166.0:9023356 5.56 5.1cd5a6d6 (undecoded) ondisk+retry+write+known_if_redirected e159644) initiated 2020-10-26T15:57:51.597394+0000 currently queued for pg
2020-10-26T15:40:28.875745-0400 osd.23 [WRN] slow request osd_op(client.86168166.0:9071950 5.56 5.1cd5a6d6 (undecoded) ondisk+retry+write+known_if_redirected e159644) initiated 2020-10-26T15:57:51.599033+0000 currently queued for pg
2020-10-26T15:40:28.875761-0400 osd.23 [WRN] slow request osd_op(client.86168166.0:9078184 5.56 5.1cd5a6d6 (undecoded) ondisk+retry+write+known_if_redirected e159644) initiated 2020-10-26T15:57:51.600244+0000 currently queued for pg
2020-10-26T15:40:28.875781-0400 osd.23 [WRN] slow request osd_op(client.86168166.0:9130749 5.56 5.1cd5a6d6 (undecoded) ondisk+write+known_if_redirected e159652) initiated 2020-10-26T15:58:36.457562+0000 currently queued for pg
2020-10-26T15:40:28.878905-0400 osd.23 [WRN] slow request osd_op(client.86168166.0:9130780 5.56 5.1cd5a6d6 (undecoded) ondisk+write+known_if_redirected e159653) initiated 2020-10-26T16:01:11.470983+0000 currently queued for pg
2020-10-26T15:40:28.878936-0400 osd.23 [WRN] slow request osd_op(client.86168166.0:9130812 5.56 5.1cd5a6d6 (undecoded) ondisk+write+known_if_redirected e159653) initiated 2020-10-26T16:03:51.480523+0000 currently queued for pg
------------------------------------------------------------
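
In case it is useful to see what those queued requests are actually blocked on, the OSD admin socket can dump them; a minimal example, assuming it is run on the host that carries osd.23:

------------------------------------------------------
# run on the host that carries osd.23
ceph daemon osd.23 dump_ops_in_flight    # ops currently queued/blocked on this OSD
ceph daemon osd.23 dump_historic_ops     # recently completed (slow) ops
ceph health detail                       # summarises which OSDs report slow ops
------------------------------------------------------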

On 2020-10-26 15:57, Eugen Block wrote:
The recovery process (what ceph -s shows) is independent of the MGR service and depends only on the MON service. It seems you only have the one MON; if the MGR is overloading it (it is not clear why), it could help to leave the MGR off and see whether the MON service then has enough RAM to proceed with the recovery. Do you have any chance to add two more MONs? A single MON is of course a single point of failure.
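
For completeness, checking the current MON situation is read-only; and if the cluster happens to be managed by cephadm, adding monitors could look roughly like the sketch below (the placement hosts are placeholders, not real hostnames from this cluster):

------------------------------------------------------
ceph mon stat                              # how many MONs exist and who is in quorum
ceph quorum_status --format json-pretty    # more detail about the monitor quorum
# cephadm-managed clusters only; host1..host3 are placeholder names:
# ceph orch apply mon --placement="host1,host2,host3"
------------------------------------------------------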


Quoting "Ing. Luis Felipe Domínguez Vega" <luis.dominguez@xxxxxxxxx>:

On 2020-10-26 15:16, Eugen Block wrote:
You could stop the MGRs and wait for the recovery to finish; the MGRs are not a critical component. You won't have a dashboard or metrics during that time, but it would prevent the high RAM usage.
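
On a systemd/package-based install, stopping the MGRs would be something along these lines, run on each host that carries a MGR (the per-instance unit follows the usual ceph-mgr@<hostname> pattern):

------------------------------------------------------
# on each MGR host (systemd deployments)
systemctl stop ceph-mgr.target
# or stop a single named instance, e.g. on this host:
systemctl stop ceph-mgr@fond-beagle.service
------------------------------------------------------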

Quoting "Ing. Luis Felipe Domínguez Vega" <luis.dominguez@xxxxxxxxx>:

On 2020-10-26 12:23, 胡 玮文 wrote:
On 2020-10-26, at 23:29, Ing. Luis Felipe Domínguez Vega <luis.dominguez@xxxxxxxxx> wrote:

mgr: fond-beagle(active, since 39s)

Your manager seems to be crash looping; it has only been up for 39s. Looking at the mgr logs may help you identify why your cluster is not recovering.
You may be hitting a bug in the mgr.
Nope, I'm restarting the ceph manager myself because it eats all of the server's RAM. I have a script that restarts the manager whenever free RAM drops to 1 GB (the server has 94 GB of RAM). I don't know why this happens, and the manager logs show:

-----------------------------------
root@fond-beagle:/var/lib/ceph/mon/ceph-fond-beagle/store.db# tail -f /var/log/ceph/ceph-mgr.fond-beagle.log
2020-10-26T12:54:12.497-0400 7f2a8112b700 0 log_channel(cluster) log [DBG] : pgmap v584: 2305 pgs: 4 active+undersized+degraded+remapped, 4 active+recovery_unfound+undersized+degraded+remapped, 2104 active+clean, 5 active+undersized+degraded, 34 incomplete, 154 unknown; 1.7 TiB data, 2.9 TiB used, 21 TiB / 24 TiB avail; 347248/2606900 objects degraded (13.320%); 107570/2606900 objects misplaced (4.126%); 19/404328 objects unfound (0.005%)
2020-10-26T12:54:12.497-0400 7f2a8112b700 0 log_channel(cluster) do_log log to syslog
2020-10-26T12:54:14.501-0400 7f2a8112b700 0 log_channel(cluster) log [DBG] : pgmap v585: 2305 pgs: 4 active+undersized+degraded+remapped, 4 active+recovery_unfound+undersized+degraded+remapped, 2104 active+clean, 5 active+undersized+degraded, 34 incomplete, 154 unknown; 1.7 TiB data, 2.9 TiB used, 21 TiB / 24 TiB avail; 347248/2606900 objects degraded (13.320%); 107570/2606900 objects misplaced (4.126%); 19/404328 objects unfound (0.005%)
2020-10-26T12:54:14.501-0400 7f2a8112b700 0 log_channel(cluster) do_log log to syslog
2020-10-26T12:54:16.517-0400 7f2a8112b700 0 log_channel(cluster) log [DBG] : pgmap v586: 2305 pgs: 4 active+undersized+degraded+remapped, 4 active+recovery_unfound+undersized+degraded+remapped, 2104 active+clean, 5 active+undersized+degraded, 34 incomplete, 154 unknown; 1.7 TiB data, 2.9 TiB used, 21 TiB / 24 TiB avail; 347248/2606900 objects degraded (13.320%); 107570/2606900 objects misplaced (4.126%); 19/404328 objects unfound (0.005%)
2020-10-26T12:54:16.517-0400 7f2a8112b700 0 log_channel(cluster) do_log log to syslog
2020-10-26T12:54:18.521-0400 7f2a8112b700 0 log_channel(cluster) log [DBG] : pgmap v587: 2305 pgs: 4 active+undersized+degraded+remapped, 4 active+recovery_unfound+undersized+degraded+remapped, 2104 active+clean, 5 active+undersized+degraded, 34 incomplete, 154 unknown; 1.7 TiB data, 2.9 TiB used, 21 TiB / 24 TiB avail; 347248/2606900 objects degraded (13.320%); 107570/2606900 objects misplaced (4.126%); 19/404328 objects unfound (0.005%)
2020-10-26T12:54:18.521-0400 7f2a8112b700 0 log_channel(cluster) do_log log to syslog
2020-10-26T12:54:20.537-0400 7f2a8112b700 0 log_channel(cluster) log [DBG] : pgmap v588: 2305 pgs: 4 active+undersized+degraded+remapped, 4 active+recovery_unfound+undersized+degraded+remapped, 2104 active+clean, 5 active+undersized+degraded, 34 incomplete, 154 unknown; 1.7 TiB data, 2.9 TiB used, 21 TiB / 24 TiB avail; 347248/2606900 objects degraded (13.320%); 107570/2606900 objects misplaced (4.126%); 19/404328 objects unfound (0.005%)
2020-10-26T12:54:20.537-0400 7f2a8112b700 0 log_channel(cluster) do_log log to syslog
2020-10-26T12:54:22.541-0400 7f2a8112b700 0 log_channel(cluster) log [DBG] : pgmap v589: 2305 pgs: 4 active+undersized+degraded+remapped, 4 active+recovery_unfound+undersized+degraded+remapped, 2104 active+clean, 5 active+undersized+degraded, 34 incomplete, 154 unknown; 1.7 TiB data, 2.9 TiB used, 21 TiB / 24 TiB avail; 347248/2606900 objects degraded (13.320%); 107570/2606900 objects misplaced (4.126%); 19/404328 objects unfound (0.005%)
2020-10-26T12:54:22.541-0400 7f2a8112b700 0 log_channel(cluster) do_log log to syslog
---------------
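
For context, a rough sketch of the kind of free-RAM watchdog described above could look like the following; this is a hypothetical reconstruction, not the actual script, and it assumes a systemd-managed ceph-mgr@fond-beagle unit:

------------------------------------------------------
#!/bin/bash
# Hypothetical watchdog: restart ceph-mgr when available RAM drops below ~1 GB.
THRESHOLD_MB=1024
while true; do
    avail_mb=$(free -m | awk '/^Mem:/ {print $7}')   # "available" column of free -m
    if [ "$avail_mb" -lt "$THRESHOLD_MB" ]; then
        systemctl restart ceph-mgr@fond-beagle.service
    fi
    sleep 60
done
------------------------------------------------------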

OK, I will do that... but the thing is that the cluster doesn't show that it is recovering; ceph -s shows no recovery activity at all, so I don't know whether it is recovering or doing anything.
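
If it helps, a few read-only commands that show whether recovery is actually moving; the degraded/misplaced object counts and PG states should change over time if anything is happening:

------------------------------------------------------
ceph -s               # overall status, degraded/misplaced/unfound object counts
ceph pg stat          # one-line PG summary, includes a recovery rate when active
ceph health detail    # lists the problematic PGs (incomplete, unfound objects, ...)
ceph pg ls incomplete # e.g. the 34 incomplete PGs visible in the mgr log above
------------------------------------------------------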
_______________________________________________
ceph-users mailing list -- ceph-users@xxxxxxx
To unsubscribe send an email to ceph-users-leave@xxxxxxx



