Thanks, I'll try that tomorrow.

> On 25. Feb 2021, at 21:59, Dan van der Ster <dan@xxxxxxxxxxxxxx> wrote:
>
> Maybe the debugging steps in that insights tracker can be helpful
> anyway: https://tracker.ceph.com/issues/39955
>
> -- dan
>
> On Thu, Feb 25, 2021 at 9:27 PM Janek Bevendorff
> <janek.bevendorff@xxxxxxxxxxxxx> wrote:
>>
>> Thanks for the tip, but I do not have degraded PGs and the module is
>> already disabled.
>>
>>
>> On 25. Feb 2021, at 21:17, Seena Fallah <seenafallah@xxxxxxxxx> wrote:
>>
>> I had the same problem in my cluster, and it was because of the insights
>> mgr module, which was storing lots of data in RocksDB because my cluster
>> was degraded.
>> If you have degraded PGs, try disabling the insights module.
>>
>> On Thu, Feb 25, 2021 at 11:40 PM Dan van der Ster <dan@xxxxxxxxxxxxxx> wrote:
>>>
>>>> "source": "osd.104...
>>>
>>> What's happening on that OSD? Is it something new that corresponds to
>>> when your mon started growing? Are other OSDs also flooding the mons
>>> with logs?
>>>
>>> I'm mobile so can't check... Are those logging configs the defaults? If
>>> not, revert to the defaults...
>>>
>>> BTW, do your mons have a stable quorum, or are they flapping under this
>>> load?
>>>
>>> .. dan
>>>
>>>
>>> On Thu, Feb 25, 2021, 8:58 PM Janek Bevendorff
>>> <janek.bevendorff@xxxxxxxxxxxxx> wrote:
>>>
>>>> Thanks, Dan.
>>>>
>>>> On the first MON, the command doesn't even return, but I was able to
>>>> get a dump from the one I restarted most recently. The oldest ops look
>>>> like this:
>>>>
>>>> {
>>>>     "description": "log(1000 entries from seq 17876238 at 2021-02-25T15:13:20.306487+0100)",
>>>>     "initiated_at": "2021-02-25T20:40:34.698932+0100",
>>>>     "age": 183.762551121,
>>>>     "duration": 183.762599201,
>>>>     "type_data": {
>>>>         "events": [
>>>>             {
>>>>                 "time": "2021-02-25T20:40:34.698932+0100",
>>>>                 "event": "initiated"
>>>>             },
>>>>             {
>>>>                 "time": "2021-02-25T20:40:34.698636+0100",
>>>>                 "event": "throttled"
>>>>             },
>>>>             {
>>>>                 "time": "2021-02-25T20:40:34.698932+0100",
>>>>                 "event": "header_read"
>>>>             },
>>>>             {
>>>>                 "time": "2021-02-25T20:40:34.701407+0100",
>>>>                 "event": "all_read"
>>>>             },
>>>>             {
>>>>                 "time": "2021-02-25T20:40:34.701455+0100",
>>>>                 "event": "dispatched"
>>>>             },
>>>>             {
>>>>                 "time": "2021-02-25T20:40:34.701458+0100",
>>>>                 "event": "mon:_ms_dispatch"
>>>>             },
>>>>             {
>>>>                 "time": "2021-02-25T20:40:34.701459+0100",
>>>>                 "event": "mon:dispatch_op"
>>>>             },
>>>>             {
>>>>                 "time": "2021-02-25T20:40:34.701459+0100",
>>>>                 "event": "psvc:dispatch"
>>>>             },
>>>>             {
>>>>                 "time": "2021-02-25T20:40:34.701490+0100",
>>>>                 "event": "logm:wait_for_readable"
>>>>             },
>>>>             {
>>>>                 "time": "2021-02-25T20:40:34.701491+0100",
>>>>                 "event": "logm:wait_for_readable/paxos"
>>>>             },
>>>>             {
>>>>                 "time": "2021-02-25T20:40:34.701496+0100",
>>>>                 "event": "paxos:wait_for_readable"
>>>>             },
>>>>             {
>>>>                 "time": "2021-02-25T20:40:34.989198+0100",
>>>>                 "event": "callback finished"
>>>>             },
>>>>             {
>>>>                 "time": "2021-02-25T20:40:34.989199+0100",
>>>>                 "event": "psvc:dispatch"
>>>>             },
>>>>             {
>>>>                 "time": "2021-02-25T20:40:34.989208+0100",
>>>>                 "event": "logm:preprocess_query"
>>>>             },
>>>>             {
>>>>                 "time": "2021-02-25T20:40:34.989208+0100",
>>>>                 "event": "logm:preprocess_log"
>>>>             },
>>>>             {
>>>>                 "time": "2021-02-25T20:40:34.989278+0100",
>>>>                 "event": "forward_request_leader"
>>>>             },
>>>>             {
>>>>                 "time": "2021-02-25T20:40:34.989344+0100",
>>>>                 "event": "forwarded"
>>>>             },
>>>>             {
>>>>                 "time": "2021-02-25T20:41:58.658022+0100",
>>>>                 "event": "resend forwarded message to leader"
>>>>             },
>>>>             {
>>>>                 "time": "2021-02-25T20:42:27.735449+0100",
>>>>                 "event": "resend forwarded message to leader"
>>>>             }
>>>>         ],
>>>>         "info": {
>>>>             "seq": 41550,
>>>>             "src_is_mon": false,
>>>>             "source": "osd.104 v2:XXX:6864/16579",
>>>>             "forwarded_to_leader": true
>>>>         }
>>>>
>>>> Any idea what that might be about? It almost looks like this one:
>>>> https://tracker.ceph.com/issues/24180
>>>> I set debug_mon to 0, but I keep getting a lot of log spill in the
>>>> journals. It's about 1-2 messages per second, mostly RocksDB stuff, but
>>>> nothing that actually looks serious or even log-worthy. I noticed
>>>> before that, despite logging being set to warning level, the cluster
>>>> log keeps being written to the MON log. But it shouldn't cause such
>>>> massive stability issues, should it? The date on the log op is also
>>>> weird: 15:13+0100 was hours ago.
>>>>
>>>> Here's my log config:
>>>>
>>>> global  advanced  clog_to_syslog_level             warning
>>>> global  basic     err_to_syslog                    true
>>>> global  basic     log_to_file                      false
>>>> global  basic     log_to_stderr                    false
>>>> global  basic     log_to_syslog                    true
>>>> global  advanced  mon_cluster_log_file_level       error
>>>> global  advanced  mon_cluster_log_to_file          false
>>>> global  advanced  mon_cluster_log_to_stderr        false
>>>> global  advanced  mon_cluster_log_to_syslog        false
>>>> global  advanced  mon_cluster_log_to_syslog_level  warning
>>>>
>>>> Ceph version is 15.2.8.
>>>>
>>>> Janek
>>>>
>>>>
>>>> On 25. Feb 2021, at 20:33, Dan van der Ster <dan@xxxxxxxxxxxxxx> wrote:
>>>>
>>>> ceph daemon mon.`hostname -s` ops
>>>>
>>>> That should show you the accumulating ops.
>>>>
>>>> .. dan
>>>>
>>>>
>>>> On Thu, Feb 25, 2021, 8:23 PM Janek Bevendorff
>>>> <janek.bevendorff@xxxxxxxxxxxxx> wrote:
>>>>
>>>>> Hi,
>>>>>
>>>>> All of a sudden, we are experiencing very concerning MON behaviour. We
>>>>> have five MONs, and all of them have thousands to tens of thousands of
>>>>> slow ops, with the oldest one blocking basically indefinitely (at
>>>>> least the timer keeps creeping up). Additionally, the MON stores keep
>>>>> inflating heavily. Under normal circumstances we have about 450-550 MB
>>>>> there. Right now it's 27 GB and growing (rapidly).
>>>>>
>>>>> I tried restarting all MONs, I disabled auto-scaling (just in case),
>>>>> and checked the system load and hardware. I also restarted the MGR and
>>>>> MDS daemons, but to no avail.
>>>>>
>>>>> Is there any way I can debug this properly? I can't seem to find out
>>>>> how to actually view which ops are causing this and which client (if
>>>>> any) may be responsible for it.
>>>>>
>>>>> Thanks
>>>>> Janek

_______________________________________________
ceph-users mailing list -- ceph-users@xxxxxxx
To unsubscribe send an email to ceph-users-leave@xxxxxxx