Is this a Mimic/Nautilus cluster? I think I remember a similar issue about 3-4 years ago with Mimic (or maybe even Luminous). AFAIR, we had to restart all mgrs, mons and finally all OSDs until things started to stabilise.

Best regards,

Nico

Eneko Lacunza <elacunza@xxxxxxxxx> writes:

> Thanks, yes I have stopped the active mgr and let the standby take over,
> twice at least, but no change.
>
> On 26/5/22 at 16:08, Eugen Block wrote:
>> First thing I would try is a mgr failover.
>>
>> Quoting Eneko Lacunza <elacunza@xxxxxxxxx>:
>>
>>> Hi all,
>>>
>>> I'm trying to diagnose an issue in a tiny cluster that is showing
>>> the following status:
>>>
>>> root@proxmox3:~# ceph -s
>>>   cluster:
>>>     id:     80d78bb2-6be6-4dff-b41d-60d52e650016
>>>     health: HEALTH_WARN
>>>             1/3 mons down, quorum 0,proxmox3
>>>             Reduced data availability: 513 pgs inactive
>>>
>>>   services:
>>>     mon: 3 daemons, quorum 0,proxmox3 (age 3h), out of quorum: 1
>>>     mgr: proxmox3(active, since 16m), standbys: proxmox2
>>>     osd: 12 osds: 8 up (since 3h), 8 in (since 3h)
>>>
>>>   task status:
>>>
>>>   data:
>>>     pools:   2 pools, 513 pgs
>>>     objects: 0 objects, 0 B
>>>     usage:   0 B used, 0 B / 0 B avail
>>>     pgs:     100.000% pgs unknown
>>>              513 unknown
>>>
>>> The cluster has 3 nodes, each with 4 OSDs. One of the nodes was
>>> offline for 3 weeks, and when we brought it back online VMs stalled
>>> on disk I/O.
>>>
>>> That node has been shut down again and we're trying to understand
>>> the current status; then we will try to diagnose the issue with the
>>> troubled node.
>>>
>>> Currently VMs are working and can read RBD volumes, but there seems
>>> to be some kind of mgr issue (?) with stats.
>>>
>>> There is no firewall on the nodes nor between the 3 nodes (all on
>>> the same switch). Ping works on both the Ceph public and private
>>> networks.
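Note that ping only proves ICMP reachability; Ceph daemon traffic is TCP. A minimal sketch for checking the actual daemon ports, assuming the addresses and the 6800-6815 range shown in the port listing further down:

    # From the mgr node, test TCP reachability of one OSD on both networks
    nc -zv -w 2 192.168.133.102 6800
    nc -zv -w 2 192.168.134.102 6800

    # Or sweep the whole listed OSD port range on one host
    for p in $(seq 6800 6815); do nc -zv -w 2 192.168.133.102 "$p"; done

If these connect, raw connectivity is probably not the culprit.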
>>> The MGR log shows this continuously:
>>>
>>> 2022-05-26T13:49:45.603+0200 7fb78ba3f700  0 auth: could not find secret_id=1892
>>> 2022-05-26T13:49:45.603+0200 7fb78ba3f700  0 cephx: verify_authorizer could not get service secret for service mgr secret_id=1892
>>> 2022-05-26T13:49:45.983+0200 7fb77a18d700  1 mgr.server send_report Not sending PG status to monitor yet, waiting for OSDs
>>> 2022-05-26T13:49:47.983+0200 7fb77a18d700  1 mgr.server send_report Not sending PG status to monitor yet, waiting for OSDs
>>> 2022-05-26T13:49:49.983+0200 7fb77a18d700  1 mgr.server send_report Not sending PG status to monitor yet, waiting for OSDs
>>> 2022-05-26T13:49:51.983+0200 7fb77a18d700  1 mgr.server send_report Giving up on OSDs that haven't reported yet, sending potentially incomplete PG state to mon
>>> 2022-05-26T13:49:51.983+0200 7fb77a18d700  0 log_channel(cluster) log [DBG] : pgmap v3: 513 pgs: 513 unknown; 0 B data, 0 B used, 0 B / 0 B avail
>>> 2022-05-26T13:49:53.983+0200 7fb77a18d700  0 log_channel(cluster) log [DBG] : pgmap v4: 513 pgs: 513 unknown; 0 B data, 0 B used, 0 B / 0 B avail
>>> 2022-05-26T13:49:55.983+0200 7fb77a18d700  0 log_channel(cluster) log [DBG] : pgmap v5: 513 pgs: 513 unknown; 0 B data, 0 B used, 0 B / 0 B avail
>>> 2022-05-26T13:49:57.987+0200 7fb77a18d700  0 log_channel(cluster) log [DBG] : pgmap v6: 513 pgs: 513 unknown; 0 B data, 0 B used, 0 B / 0 B avail
>>> 2022-05-26T13:49:58.403+0200 7fb78ba3f700  0 auth: could not find secret_id=1892
>>> 2022-05-26T13:49:58.403+0200 7fb78ba3f700  0 cephx: verify_authorizer could not get service secret for service mgr secret_id=1892
>>>
>>> So it seems that the mgr is unable to contact the OSDs for stats and
>>> then reports bad info to the mon.
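The repeating "could not find secret_id" cephx errors may be a clue: cephx service tickets rotate and are time-sensitive, so this pattern is often associated with clock skew or daemons holding stale rotating keys (plausible after a node sat offline for weeks). A minimal check, assuming NTP/systemd-managed nodes:

    # On each node: is the clock synchronized?
    timedatectl | grep -E 'Local time|synchronized'

    # Ask the monitors how far apart they think their clocks are
    ceph time-sync-status

If skew shows up, fix time sync first and then restart the affected daemons so they fetch fresh rotating keys.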
>>> I see the following OSD ports open:
>>>
>>> tcp 0 0 192.168.134.102:6800 0.0.0.0:* LISTEN 2268/ceph-osd
>>> tcp 0 0 192.168.133.102:6800 0.0.0.0:* LISTEN 2268/ceph-osd
>>> tcp 0 0 192.168.134.102:6801 0.0.0.0:* LISTEN 2268/ceph-osd
>>> tcp 0 0 192.168.133.102:6801 0.0.0.0:* LISTEN 2268/ceph-osd
>>> tcp 0 0 192.168.134.102:6802 0.0.0.0:* LISTEN 2268/ceph-osd
>>> tcp 0 0 192.168.133.102:6802 0.0.0.0:* LISTEN 2268/ceph-osd
>>> tcp 0 0 192.168.134.102:6803 0.0.0.0:* LISTEN 2268/ceph-osd
>>> tcp 0 0 192.168.133.102:6803 0.0.0.0:* LISTEN 2268/ceph-osd
>>> tcp 0 0 192.168.134.102:6804 0.0.0.0:* LISTEN 2271/ceph-osd
>>> tcp 0 0 192.168.133.102:6804 0.0.0.0:* LISTEN 2271/ceph-osd
>>> tcp 0 0 192.168.134.102:6805 0.0.0.0:* LISTEN 2271/ceph-osd
>>> tcp 0 0 192.168.133.102:6805 0.0.0.0:* LISTEN 2271/ceph-osd
>>> tcp 0 0 192.168.134.102:6806 0.0.0.0:* LISTEN 2271/ceph-osd
>>> tcp 0 0 192.168.133.102:6806 0.0.0.0:* LISTEN 2271/ceph-osd
>>> tcp 0 0 192.168.134.102:6807 0.0.0.0:* LISTEN 2271/ceph-osd
>>> tcp 0 0 192.168.133.102:6807 0.0.0.0:* LISTEN 2271/ceph-osd
>>> tcp 0 0 192.168.134.102:6808 0.0.0.0:* LISTEN 2267/ceph-osd
>>> tcp 0 0 192.168.133.102:6808 0.0.0.0:* LISTEN 2267/ceph-osd
>>> tcp 0 0 192.168.134.102:6809 0.0.0.0:* LISTEN 2267/ceph-osd
>>> tcp 0 0 192.168.133.102:6809 0.0.0.0:* LISTEN 2267/ceph-osd
>>> tcp 0 0 192.168.134.102:6810 0.0.0.0:* LISTEN 2267/ceph-osd
>>> tcp 0 0 192.168.133.102:6810 0.0.0.0:* LISTEN 2267/ceph-osd
>>> tcp 0 0 192.168.134.102:6811 0.0.0.0:* LISTEN 2267/ceph-osd
>>> tcp 0 0 192.168.133.102:6811 0.0.0.0:* LISTEN 2267/ceph-osd
>>> tcp 0 0 192.168.134.102:6812 0.0.0.0:* LISTEN 2274/ceph-osd
>>> tcp 0 0 192.168.133.102:6812 0.0.0.0:* LISTEN 2274/ceph-osd
>>> tcp 0 0 192.168.134.102:6813 0.0.0.0:* LISTEN 2274/ceph-osd
>>> tcp 0 0 192.168.133.102:6813 0.0.0.0:* LISTEN 2274/ceph-osd
>>> tcp 0 0 192.168.134.102:6814 0.0.0.0:* LISTEN 2274/ceph-osd
>>> tcp 0 0 192.168.133.102:6814 0.0.0.0:* LISTEN 2274/ceph-osd
>>> tcp 0 0 192.168.134.102:6815 0.0.0.0:* LISTEN 2274/ceph-osd
>>> tcp 0 0 192.168.133.102:6815 0.0.0.0:* LISTEN 2274/ceph-osd
>>>
>>> Any idea what I can check / what's going on?
>>>
>>> Thanks
>>>
>>> Eneko Lacunza
>>> Zuzendari teknikoa | Director técnico
>>> Binovo IT Human Project
>>>
>>> Tel. +34 943 569 206 | https://www.binovo.es
>>> Astigarragako Bidea, 2 - 2º izda. Oficina 10-11, 20180 Oiartzun
>>>
>>> https://www.youtube.com/user/CANALBINOVO
>>> https://www.linkedin.com/company/37269706/
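To spell out the restart sequence from the top of this mail: on a systemd-based deployment like Proxmox (not cephadm/containers, which uses different unit names), a rolling restart might look like this sketch; the hostname and OSD ids here are taken from the status output above and are otherwise assumptions:

    # One node at a time, letting the cluster settle between steps
    systemctl restart ceph-mgr@proxmox3.service
    systemctl restart ceph-mon@proxmox3.service

    # Then that node's OSDs (ids assumed; check 'ceph osd tree')
    systemctl restart ceph-osd@0.service ceph-osd@1.service

    # Watch the cluster before moving on to the next node
    ceph -s

A mgr failover can also be forced from the CLI with 'ceph mgr fail proxmox3' instead of stopping the daemon.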
--
Sustainable and modern Infrastructures by ungleich.ch

_______________________________________________
ceph-users mailing list -- ceph-users@xxxxxxx
To unsubscribe send an email to ceph-users-leave@xxxxxxx