I have found the logs showing the progress module failure:

debug 2022-10-25T05:06:08.877+0000 7f40868e7700 0 [rbd_support INFO root] execute_trash_remove: task={"sequence": 150, "id": "fcc864a0-9bde-4512-9f84-be10976613db", "message": "Removing image fulen-hdd/f3f237d2f7e304 from trash", "refs": {"action": "trash remove", "pool_name": "fulen-hdd", "pool_namespace": "", "image_id": "f3f237d2f7e304"}, "in_progress": true, "progress": 0.0}
debug 2022-10-25T05:06:08.884+0000 7f4106e90700 -1 log_channel(cluster) log [ERR] : Unhandled exception from module 'progress' while running on mgr.naret-monitor03.escwyg: ('42efb95d-ceaa-4a91-a9b2-b91f65f1834d',)
debug 2022-10-25T05:06:08.884+0000 7f4106e90700 -1 progress.serve:
debug 2022-10-25T05:06:08.897+0000 7f4139e96700 0 log_channel(audit) log [DBG] : from='client.22182342 -' entity='client.combin' cmd=[{"format":"json","group_name":"combin","prefix":"fs subvolume info","sub_name":"combin-4b53e28d-2f59-11ed-8aa5-9aa9e2c5aae2","vol_name":"cephfs"}]: dispatch
debug 2022-10-25T05:06:08.884+0000 7f4106e90700 -1 Traceback (most recent call last):
  File "/usr/share/ceph/mgr/progress/module.py", line 716, in serve
    self._process_pg_summary()
  File "/usr/share/ceph/mgr/progress/module.py", line 629, in _process_pg_summary
    ev = self._events[ev_id]
KeyError: '42efb95d-ceaa-4a91-a9b2-b91f65f1834d'

On 25.10.22, 09:58, "Lo Re Giuseppe" <giuseppe.lore@xxxxxxx> wrote:

Hi,

Some weeks ago we started to use pg autoscale on our pools. We run version 16.2.7.
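The traceback shows the progress module indexing its internal events dict with an event id that is no longer present. A minimal sketch of that failure mode and of a defensive lookup that would tolerate the stale id (variable names are made up for illustration; this is not the actual mgr code):

```python
# Illustration of the crash in _process_pg_summary: an event id captured
# earlier is looked up again after the event was removed from the dict.
events = {"42efb95d": {"message": "Global Recovery Event"}}

stale_id = "42efb95d"
del events[stale_id]  # e.g. the event completed and was cleaned up

# Direct indexing reproduces the unhandled exception:
try:
    ev = events[stale_id]
except KeyError as exc:
    print(f"KeyError: {exc}")  # KeyError: '42efb95d'

# A defensive lookup simply skips ids that no longer exist:
ev = events.get(stale_id)
if ev is None:
    pass  # event already gone; nothing to update
```

The fix in later Ceph releases is along these lines: the module has to assume an event can disappear between iterations of its serve loop.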
Maybe a coincidence, maybe not, but for some weeks we have been experiencing mgr progress module failures:

"""
[root@naret-monitor01 ~]# ceph -s
  cluster:
    id:     63334166-d991-11eb-99de-40a6b72108d0
    health: HEALTH_ERR
            Module 'progress' has failed: ('346ee7e0-35f0-4fdf-960e-a36e7e2441e4',)
            1 pool(s) full

  services:
    mon: 3 daemons, quorum naret-monitor01,naret-monitor02,naret-monitor03 (age 5d)
    mgr: naret-monitor02.ciqvgv(active, since 6d), standbys: naret-monitor03.escwyg, naret-monitor01.suwugf
    mds: 1/1 daemons up, 2 standby
    osd: 760 osds: 760 up (since 4d), 760 in (since 4d); 10 remapped pgs
    rgw: 3 daemons active (3 hosts, 1 zones)

  data:
    volumes: 1/1 healthy
    pools:   32 pools, 6250 pgs
    objects: 977.79M objects, 3.6 PiB
    usage:   5.7 PiB used, 5.1 PiB / 11 PiB avail
    pgs:     4602612/5990777501 objects misplaced (0.077%)
             6214 active+clean
             25   active+clean+scrubbing+deep
             10   active+remapped+backfilling
             1    active+clean+scrubbing

  io:
    client:   243 MiB/s rd, 292 MiB/s wr, 1.68k op/s rd, 842 op/s wr
    recovery: 430 MiB/s, 109 objects/s

  progress:
    Global Recovery Event (14h)
      [===========================.]
      (remaining: 70s)
"""

In the mgr logs I see:

"""
debug 2022-10-20T23:09:03.859+0000 7fba5f300700 0 [pg_autoscaler ERROR root] pool 2 has overlapping roots: {-60, -1}
debug 2022-10-20T23:09:03.863+0000 7fba5f300700 0 [pg_autoscaler ERROR root] pool 3 has overlapping roots: {-60, -1, -2}
debug 2022-10-20T23:09:03.866+0000 7fba5f300700 0 [pg_autoscaler ERROR root] pool 5 has overlapping roots: {-60, -1, -2}
debug 2022-10-20T23:09:03.870+0000 7fba5f300700 0 [pg_autoscaler ERROR root] pool 6 has overlapping roots: {-60, -1, -2}
debug 2022-10-20T23:09:03.873+0000 7fba5f300700 0 [pg_autoscaler ERROR root] pool 10 has overlapping roots: {-105, -60, -1, -2}
debug 2022-10-20T23:09:03.877+0000 7fba5f300700 0 [pg_autoscaler ERROR root] pool 11 has overlapping roots: {-105, -60, -1, -2}
debug 2022-10-20T23:09:03.880+0000 7fba5f300700 0 [pg_autoscaler ERROR root] pool 12 has overlapping roots: {-105, -60, -1, -2}
debug 2022-10-20T23:09:03.884+0000 7fba5f300700 0 [pg_autoscaler ERROR root] pool 13 has overlapping roots: {-105, -60, -1, -2}
debug 2022-10-20T23:09:03.887+0000 7fba5f300700 0 [pg_autoscaler ERROR root] pool 14 has overlapping roots: {-105, -60, -1, -2}
debug 2022-10-20T23:09:03.891+0000 7fba5f300700 0 [pg_autoscaler ERROR root] pool 15 has overlapping roots: {-105, -60, -1, -2}
debug 2022-10-20T23:09:03.894+0000 7fba5f300700 0 [pg_autoscaler ERROR root] pool 26 has overlapping roots: {-105, -60, -1, -2}
debug 2022-10-20T23:09:03.898+0000 7fba5f300700 0 [pg_autoscaler ERROR root] pool 28 has overlapping roots: {-105, -60, -1, -2}
debug 2022-10-20T23:09:03.901+0000 7fba5f300700 0 [pg_autoscaler ERROR root] pool 29 has overlapping roots: {-105, -60, -1, -2}
debug 2022-10-20T23:09:03.905+0000 7fba5f300700 0 [pg_autoscaler ERROR root] pool 30 has overlapping roots: {-105, -60, -1, -2}
...
"""

Do you have any explanation/fix for these errors?
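For context on those messages: the pg_autoscaler cannot compute a PG target for a pool whose CRUSH rule can draw OSDs from more than one CRUSH root, because capacity cannot then be attributed to the pool unambiguously. The shape of the check amounts to something like the sketch below (an illustration only, not the actual autoscaler code; the pool ids and root ids are invented to mirror the log lines):

```python
# Sketch: given the set of CRUSH roots each pool's rule can reach
# (negative ids are CRUSH bucket ids, as in the log), flag every pool
# whose rule spans more than one root.
def pools_with_overlapping_roots(pool_roots):
    """Return {pool_id: roots} for pools whose rules span multiple roots."""
    return {pool: roots for pool, roots in pool_roots.items() if len(roots) > 1}

pool_roots = {
    2: {-60, -1},             # reachable from two roots -> autoscaler error
    10: {-105, -60, -1, -2},  # reachable from four roots -> autoscaler error
    42: {-1},                 # confined to a single root -> fine
}

for pool, roots in sorted(pools_with_overlapping_roots(pool_roots).items()):
    print(f"pool {pool} has overlapping roots: {sorted(roots)}")
```

The usual remedy is to adjust CRUSH rules (or device classes) so that each pool's rule resolves to exactly one root.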
Regards,

Giuseppe

_______________________________________________
ceph-users mailing list -- ceph-users@xxxxxxx
To unsubscribe send an email to ceph-users-leave@xxxxxxx