Re: MON slow ops and growing MON store

Thanks for the tip, but I do not have degraded PGs and the module is already disabled.


> On 25. Feb 2021, at 21:17, Seena Fallah <seenafallah@xxxxxxxxx> wrote:
> 
> I had the same problem in my cluster, and it was because of the insights mgr module, which was storing lots of data in RocksDB because my cluster was degraded.
> If you have degraded PGs, try disabling the insights module.
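> 
> A minimal sketch, assuming the standard mgr module CLI:
> 
>     # check whether insights is enabled
>     ceph mgr module ls | grep -i insights
>     # if it is, turn it off
>     ceph mgr module disable insights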
> 
> On Thu, Feb 25, 2021 at 11:40 PM Dan van der Ster <dan@xxxxxxxxxxxxxx> wrote:
> > "source": "osd.104...
> 
> What's happening on that OSD? Is it something new that corresponds to when
> your mon store started growing? Are other OSDs also flooding the mons with logs?
> 
> I'm on mobile so I can't check... Are those logging configs the defaults? If not,
> revert them to the defaults...
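> 
> Something like this should do it, assuming the overrides were made with
> "ceph config set" (repeat per option from your dump below):
> 
>     ceph config rm global mon_cluster_log_to_file
>     ceph config rm global clog_to_syslog_level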
> 
> BTW do your mons have stable quorum or are they flapping with this load?
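> 
> (A quick way to check, assuming the standard CLI:)
> 
>     ceph mon stat
>     ceph quorum_status --format json-pretty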
> 
> .. dan
> 
> 
> 
> On Thu, Feb 25, 2021, 8:58 PM Janek Bevendorff <janek.bevendorff@xxxxxxxxxxxxx> wrote:
> 
> > Thanks, Dan.
> >
> > On the first MON, the command doesn’t even return, but I was able to get a
> > dump from the one I restarted most recently. The oldest ops look like this:
> >
> >         {
> >             "description": "log(1000 entries from seq 17876238 at
> > 2021-02-25T15:13:20.306487+0100)",
> >             "initiated_at": "2021-02-25T20:40:34.698932+0100",
> >             "age": 183.762551121,
> >             "duration": 183.762599201,
> >             "type_data": {
> >                 "events": [
> >                     {
> >                         "time": "2021-02-25T20:40:34.698932+0100",
> >                         "event": "initiated"
> >                     },
> >                     {
> >                         "time": "2021-02-25T20:40:34.698636+0100",
> >                         "event": "throttled"
> >                     },
> >                     {
> >                         "time": "2021-02-25T20:40:34.698932+0100",
> >                         "event": "header_read"
> >                     },
> >                     {
> >                         "time": "2021-02-25T20:40:34.701407+0100",
> >                         "event": "all_read"
> >                     },
> >                     {
> >                         "time": "2021-02-25T20:40:34.701455+0100",
> >                         "event": "dispatched"
> >                     },
> >                     {
> >                         "time": "2021-02-25T20:40:34.701458+0100",
> >                         "event": "mon:_ms_dispatch"
> >                     },
> >                     {
> >                         "time": "2021-02-25T20:40:34.701459+0100",
> >                         "event": "mon:dispatch_op"
> >                     },
> >                     {
> >                         "time": "2021-02-25T20:40:34.701459+0100",
> >                         "event": "psvc:dispatch"
> >                     },
> >                     {
> >                         "time": "2021-02-25T20:40:34.701490+0100",
> >                         "event": "logm:wait_for_readable"
> >                     },
> >                     {
> >                         "time": "2021-02-25T20:40:34.701491+0100",
> >                         "event": "logm:wait_for_readable/paxos"
> >                     },
> >                     {
> >                         "time": "2021-02-25T20:40:34.701496+0100",
> >                         "event": "paxos:wait_for_readable"
> >                     },
> >                     {
> >                         "time": "2021-02-25T20:40:34.989198+0100",
> >                         "event": "callback finished"
> >                     },
> >                     {
> >                         "time": "2021-02-25T20:40:34.989199+0100",
> >                         "event": "psvc:dispatch"
> >                     },
> >                     {
> >                         "time": "2021-02-25T20:40:34.989208+0100",
> >                         "event": "logm:preprocess_query"
> >                     },
> >                     {
> >                         "time": "2021-02-25T20:40:34.989208+0100",
> >                         "event": "logm:preprocess_log"
> >                     },
> >                     {
> >                         "time": "2021-02-25T20:40:34.989278+0100",
> >                         "event": "forward_request_leader"
> >                     },
> >                     {
> >                         "time": "2021-02-25T20:40:34.989344+0100",
> >                         "event": "forwarded"
> >                     },
> >                     {
> >                         "time": "2021-02-25T20:41:58.658022+0100",
> >                         "event": "resend forwarded message to leader"
> >                     },
> >                     {
> >                         "time": "2021-02-25T20:42:27.735449+0100",
> >                         "event": "resend forwarded message to leader"
> >                     }
> >                 ],
> >                 "info": {
> >                     "seq": 41550,
> >                     "src_is_mon": false,
> >                     "source": "osd.104 v2:XXX:6864/16579",
> >                     "forwarded_to_leader": true
> >                 }
> >
> >
> > Any idea what that might be about? It almost looks like this:
> > https://tracker.ceph.com/issues/24180
> > I set debug_mon to 0, but I keep getting a lot of log spill in the journals.
> > It’s about 1-2 messages per second, mostly RocksDB stuff, but nothing that
> > actually looks serious or even log-worthy. I had already noticed before that,
> > despite logging being set to warning level, the cluster log keeps being
> > written to the MON log. But that shouldn’t cause such massive stability
> > issues, should it? The date on the log op is also weird: 15:13+0100 was
> > hours ago.
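> >
> > (For reference, debug_mon was lowered roughly like this; a sketch assuming
> > the Octopus "ceph config" interface:)
> >
> >     ceph config set mon debug_mon 0
> >     # or inject it into the running daemons directly:
> >     ceph tell mon.* injectargs '--debug_mon=0/0'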
> >
> > Here’s my log config:
> >
> > global    advanced  clog_to_syslog_level               warning
> > global    basic     err_to_syslog                      true
> > global    basic     log_to_file                        false
> > global    basic     log_to_stderr                      false
> > global    basic     log_to_syslog                      true
> > global    advanced  mon_cluster_log_file_level         error
> > global    advanced  mon_cluster_log_to_file            false
> > global    advanced  mon_cluster_log_to_stderr          false
> > global    advanced  mon_cluster_log_to_syslog          false
> > global    advanced  mon_cluster_log_to_syslog_level    warning
> >
> >
> >
> > Ceph version is 15.2.8.
> >
> > Janek
> >
> >
> > On 25. Feb 2021, at 20:33, Dan van der Ster <dan@xxxxxxxxxxxxxx> wrote:
> >
> > ceph daemon mon.`hostname -s` ops
> >
> > That should show you the accumulating ops.
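> >
> > If the dump is huge, something like this trims it down to the oldest few
> > (a sketch assuming jq is installed and the output has the usual "ops" array):
> >
> >     ceph daemon mon.$(hostname -s) ops | \
> >         jq '.ops | sort_by(.age) | reverse | .[:10] | map({description, age})'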
> >
> > .. dan
> >
> >
> > On Thu, Feb 25, 2021, 8:23 PM Janek Bevendorff <janek.bevendorff@xxxxxxxxxxxxx> wrote:
> >
> >> Hi,
> >>
> >> All of a sudden, we are experiencing very concerning MON behaviour. We
> >> have five MONs, and all of them have thousands to tens of thousands of
> >> slow ops, the oldest one blocking basically indefinitely (at least the
> >> timer keeps creeping up). Additionally, the MON stores keep inflating
> >> heavily. Under normal circumstances we have about 450-550 MB there; right
> >> now it's 27 GB and growing (rapidly).
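> >>
> >> (A quick way to watch the store size, assuming the default mon data path:)
> >>
> >>     du -sh /var/lib/ceph/mon/ceph-$(hostname -s)/store.db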
> >>
> >> I tried restarting all MONs, I disabled auto-scaling (just in case) and
> >> checked the system load and hardware. I also restarted the MGR and MDS
> >> daemons, but to no avail.
> >>
> >> Is there any way I can debug this properly? I can’t seem to find out how to
> >> actually view which ops are causing this and which client (if any) may be
> >> responsible for it.
> >>
> >> Thanks
> >> Janek
> >>
> >
> >

_______________________________________________
ceph-users mailing list -- ceph-users@xxxxxxx
To unsubscribe send an email to ceph-users-leave@xxxxxxx



