Re: MON slow ops and growing MON store

Thanks, I’ll try that tomorrow.


> On 25. Feb 2021, at 21:59, Dan van der Ster <dan@xxxxxxxxxxxxxx> wrote:
> 
> Maybe the debugging steps in that insights tracker can be helpful
> anyway: https://tracker.ceph.com/issues/39955
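> IIRC the insights module also keeps health history in the mon store; once the
> module is enabled that history can be pruned with something like the
> following -- please double-check against the tracker/docs first:
> 
>     ceph insights prune-health 0   # 0 should drop all stored health history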
> 
> -- dan
> 
> On Thu, Feb 25, 2021 at 9:27 PM Janek Bevendorff
> <janek.bevendorff@xxxxxxxxxxxxx> wrote:
>> 
>> Thanks for the tip, but I do not have degraded PGs and the module is already disabled.
>> 
>> 
>> On 25. Feb 2021, at 21:17, Seena Fallah <seenafallah@xxxxxxxxx> wrote:
>> 
>> I had the same problem in my cluster, and it was caused by the insights mgr module storing lots of data in the MON's RocksDB because my cluster was degraded.
>> If you have degraded PGs, try disabling the insights module.
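>> From memory, checking and disabling it should be roughly:
>> 
>>     ceph mgr module ls               # is "insights" among the enabled modules?
>>     ceph mgr module disable insights
>> 
>> (please double-check against the docs for your release)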
>> 
>> On Thu, Feb 25, 2021 at 11:40 PM Dan van der Ster <dan@xxxxxxxxxxxxxx> wrote:
>>> 
>>>> "source": "osd.104...
>>> 
>>> What's happening on that osd? Is it something new which corresponds to when
>>> your mon started growing? Are other OSDs also flooding the mons with logs?
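>>> If you want a quick overview, something like this (untested, assumes jq is
>>> installed) should group the pending mon ops by their source:
>>> 
>>>     ceph daemon mon.$(hostname -s) ops | \
>>>         jq -r '.ops[].type_data.info.source? // empty' | sort | uniq -c | sort -rn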
>>> 
>>> I'm on mobile so I can't check... Are those logging configs the defaults?
>>> If not, revert to the defaults...
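>>> From memory (can't verify right now), dropping each overridden entry from
>>> the config database should restore the defaults, e.g.:
>>> 
>>>     ceph config rm global mon_cluster_log_to_syslog_level
>>>     ceph config rm global clog_to_syslog_level
>>>     # ...and so on for the other non-default log settings in 'ceph config dump'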
>>> 
>>> BTW do your mons have stable quorum or are they flapping with this load?
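>>> Something like this should show whether quorum is holding:
>>> 
>>>     ceph quorum_status -f json-pretty   # or just: ceph mon stat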
>>> 
>>> .. dan
>>> 
>>> 
>>> 
>>> On Thu, Feb 25, 2021, 8:58 PM Janek Bevendorff <
>>> janek.bevendorff@xxxxxxxxxxxxx> wrote:
>>> 
>>>> Thanks, Dan.
>>>> 
>>>> On the first MON, the command doesn’t even return, but I was able to get a
>>>> dump from the one I restarted most recently. The oldest ops look like this:
>>>> 
>>>>        {
>>>>            "description": "log(1000 entries from seq 17876238 at
>>>> 2021-02-25T15:13:20.306487+0100)",
>>>>            "initiated_at": "2021-02-25T20:40:34.698932+0100",
>>>>            "age": 183.762551121,
>>>>            "duration": 183.762599201,
>>>>            "type_data": {
>>>>                "events": [
>>>>                    {
>>>>                        "time": "2021-02-25T20:40:34.698932+0100",
>>>>                        "event": "initiated"
>>>>                    },
>>>>                    {
>>>>                        "time": "2021-02-25T20:40:34.698636+0100",
>>>>                        "event": "throttled"
>>>>                    },
>>>>                    {
>>>>                        "time": "2021-02-25T20:40:34.698932+0100",
>>>>                        "event": "header_read"
>>>>                    },
>>>>                    {
>>>>                        "time": "2021-02-25T20:40:34.701407+0100",
>>>>                        "event": "all_read"
>>>>                    },
>>>>                    {
>>>>                        "time": "2021-02-25T20:40:34.701455+0100",
>>>>                        "event": "dispatched"
>>>>                    },
>>>>                    {
>>>>                        "time": "2021-02-25T20:40:34.701458+0100",
>>>>                        "event": "mon:_ms_dispatch"
>>>>                    },
>>>>                    {
>>>>                        "time": "2021-02-25T20:40:34.701459+0100",
>>>>                        "event": "mon:dispatch_op"
>>>>                    },
>>>>                    {
>>>>                        "time": "2021-02-25T20:40:34.701459+0100",
>>>>                        "event": "psvc:dispatch"
>>>>                    },
>>>>                    {
>>>>                        "time": "2021-02-25T20:40:34.701490+0100",
>>>>                        "event": "logm:wait_for_readable"
>>>>                    },
>>>>                    {
>>>>                        "time": "2021-02-25T20:40:34.701491+0100",
>>>>                        "event": "logm:wait_for_readable/paxos"
>>>>                    },
>>>>                    {
>>>>                        "time": "2021-02-25T20:40:34.701496+0100",
>>>>                        "event": "paxos:wait_for_readable"
>>>>                    },
>>>>                    {
>>>>                        "time": "2021-02-25T20:40:34.989198+0100",
>>>>                        "event": "callback finished"
>>>>                    },
>>>>                    {
>>>>                        "time": "2021-02-25T20:40:34.989199+0100",
>>>>                        "event": "psvc:dispatch"
>>>>                    },
>>>>                    {
>>>>                        "time": "2021-02-25T20:40:34.989208+0100",
>>>>                        "event": "logm:preprocess_query"
>>>>                    },
>>>>                    {
>>>>                        "time": "2021-02-25T20:40:34.989208+0100",
>>>>                        "event": "logm:preprocess_log"
>>>>                    },
>>>>                    {
>>>>                        "time": "2021-02-25T20:40:34.989278+0100",
>>>>                        "event": "forward_request_leader"
>>>>                    },
>>>>                    {
>>>>                        "time": "2021-02-25T20:40:34.989344+0100",
>>>>                        "event": "forwarded"
>>>>                    },
>>>>                    {
>>>>                        "time": "2021-02-25T20:41:58.658022+0100",
>>>>                        "event": "resend forwarded message to leader"
>>>>                    },
>>>>                    {
>>>>                        "time": "2021-02-25T20:42:27.735449+0100",
>>>>                        "event": "resend forwarded message to leader"
>>>>                    }
>>>>                ],
>>>>                "info": {
>>>>                    "seq": 41550,
>>>>                    "src_is_mon": false,
>>>>                    "source": "osd.104 v2:XXX:6864/16579",
>>>>                    "forwarded_to_leader": true
>>>>                }
>>>> 
>>>> 
>>>> Any idea what that might be about? Almost looks like this:
>>>> https://tracker.ceph.com/issues/24180
>>>> I set debug_mon to 0, but I keep getting a lot of log spill in the journal.
>>>> It's about 1-2 messages per second, mostly RocksDB stuff, but nothing that
>>>> actually looks serious or even log-worthy. I had noticed before that,
>>>> despite logging being set to warning level, the cluster log keeps being
>>>> written to the MON log. But that shouldn't cause such massive stability
>>>> issues, should it? The date on the log op is also weird: 15:13+0100 was
>>>> hours ago.
>>>> 
>>>> Here’s my log config:
>>>> 
>>>> global    advanced  clog_to_syslog_level               warning
>>>> global    basic     err_to_syslog                      true
>>>> global    basic     log_to_file                        false
>>>> global    basic     log_to_stderr                      false
>>>> global    basic     log_to_syslog                      true
>>>> global    advanced  mon_cluster_log_file_level         error
>>>> global    advanced  mon_cluster_log_to_file            false
>>>> global    advanced  mon_cluster_log_to_stderr          false
>>>> global    advanced  mon_cluster_log_to_syslog          false
>>>> global    advanced  mon_cluster_log_to_syslog_level    warning
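>>>> 
>>>> For completeness, the effective values on a running MON (via the admin
>>>> socket) can be cross-checked with something like:
>>>> 
>>>>     ceph daemon mon.$(hostname -s) config show | grep -E 'clog|cluster_log'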
>>>> 
>>>> 
>>>> 
>>>> Ceph version is 15.2.8.
>>>> 
>>>> Janek
>>>> 
>>>> 
>>>> On 25. Feb 2021, at 20:33, Dan van der Ster <dan@xxxxxxxxxxxxxx> wrote:
>>>> 
>>>> ceph daemon mon.`hostname -s` ops
>>>> 
>>>> That should show you the accumulating ops.
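>>>> If the list is long, sorting by age helps to spot the stuck ones, e.g.
>>>> (assuming jq is available):
>>>> 
>>>>     ceph daemon mon.$(hostname -s) ops | jq '.ops | sort_by(.age) | reverse | .[:5]'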
>>>> 
>>>> .. dan
>>>> 
>>>> 
>>>> On Thu, Feb 25, 2021, 8:23 PM Janek Bevendorff <
>>>> janek.bevendorff@xxxxxxxxxxxxx> wrote:
>>>> 
>>>>> Hi,
>>>>> 
>>>>> All of a sudden, we are experiencing very concerning MON behaviour. We
>>>>> have five MONs, and all of them have thousands to tens of thousands of
>>>>> slow ops, the oldest one blocking basically indefinitely (at least the
>>>>> timer keeps creeping up). Additionally, the MON stores keep inflating
>>>>> heavily. Under normal circumstances we have about 450-550 MB there; right
>>>>> now it's 27 GB and growing (rapidly).
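>>>>> 
>>>>> (I'm measuring the store size with something like the following, assuming
>>>>> the default MON data path:
>>>>> 
>>>>>     du -sh /var/lib/ceph/mon/*/store.db   # e.g. /var/lib/ceph/mon/ceph-$(hostname -s)/store.db
>>>>> 
>>>>> Compacting via "ceph tell mon.<id> compact" would be an option, but it
>>>>> wouldn't address whatever is filling the store.)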
>>>>> 
>>>>> I tried restarting all MONs, I disabled auto-scaling (just in case) and
>>>>> checked the system load and hardware. I also restarted the MGR and MDS
>>>>> daemons, but to no avail.
>>>>> 
>>>>> Is there any way I can debug this properly? I can't seem to find out how
>>>>> to actually view which ops are causing this and which client (if any) may
>>>>> be responsible for it.
>>>>> 
>>>>> Thanks
>>>>> Janek
>>>>> 
>>>> 
>>>> 
>> 
>> 
_______________________________________________
ceph-users mailing list -- ceph-users@xxxxxxx
To unsubscribe send an email to ceph-users-leave@xxxxxxx



