How can I recover PGs in state 'unknown', where OSD location seems to be lost?

Hi all,

I have a large distributed Ceph cluster that recently broke: all PGs housed at a single site were marked 'unknown' after a run of the Ceph Ansible playbook (which was being used to expand the cluster to a third site).  Is there a way to recover the location of PGs in this state, or to fall back to a previous configuration where things were working?  Failing that, is there a way to scan the OSDs themselves to determine which PGs they hold (see the sketch below for what I have in mind)?  All the OSDs are still in place and reporting as healthy; it is only the PG locations that appear to be lost.

For context: the cluster provides a single shared CephFS mount for a distributed batch cluster, and it includes workers and pools of OSDs from three different OpenStack clouds.

Ceph version: 13.2.8
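
To make the last question concrete: by "scan the OSDs" I mean something along the lines of the following, run on an OSD host with that OSD stopped (my understanding is that ceph-objectstore-tool can list the PGs an OSD holds; the data path and the <id> placeholder are just the defaults for our deployment, so treat this as a sketch rather than something I have already run here):

  systemctl stop ceph-osd@<id>
  ceph-objectstore-tool --data-path /var/lib/ceph/osd/ceph-<id> --op list-pgs
  systemctl start ceph-osd@<id>

I would rather not walk all 269 OSDs that way if the mapping can be recovered less invasively, hence the question.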

Here is the system health:

[root@euclid-edi-ctrl-0 ~]# ceph -s
  cluster:
    id:     0fe7e967-ecd6-46d4-9f6b-224539073d3b
    health: HEALTH_WARN
            insufficient standby MDS daemons available
            1 MDSs report slow metadata IOs
            Reduced data availability: 1024 pgs inactive
            6 slow ops, oldest one blocked for 244669 sec, mon.euclid-edi-ctrl-0 has slow ops
            too few PGs per OSD (26 < min 30)

  services:
    mon: 4 daemons, quorum euclid-edi-ctrl-0,euclid-cam-proxy-0,euclid-imp-proxy-0,euclid-ral-proxy-0
    mgr: euclid-edi-ctrl-0(active), standbys: euclid-imp-proxy-0, euclid-cam-proxy-0, euclid-ral-proxy-0
    mds: cephfs-2/2/2 up  {0=euclid-ral-proxy-0=up:active,1=euclid-cam-proxy-0=up:active}
    osd: 269 osds: 269 up, 269 in

  data:
    pools:   5 pools, 5120 pgs
    objects: 30.54 M objects, 771 GiB
    usage:   3.8 TiB used, 41 TiB / 45 TiB avail
    pgs:     20.000% pgs unknown
             4095 active+clean
             1024 unknown
             1    active+clean+scrubbing

OSD Pools:
[root@euclid-edi-ctrl-0 ~]# ceph osd lspools
1 cephfs_data
2 cephfs_metadata
3 euclid_cam
4 euclid_ral
5 euclid_imp
[root@euclid-edi-ctrl-0 ~]# ceph pg dump_pools_json
dumped pools
POOLID OBJECTS  MISSING_ON_PRIMARY DEGRADED MISPLACED UNFOUND BYTES        OMAP_BYTES* OMAP_KEYS* LOG     DISK_LOG
5             0                  0        0         0       0            0           0          0       0        0
1      16975540                  0        0         0       0  79165311663           0          0 6243475  6243475
2       5171099                  0        0         0       0    551991405   126879876     270829 3122183  3122183
3       8393436                  0        0         0       0 748466429315           0          0 1556647  1556647
4             0                  0        0         0       0            0           0          0       0        0

[root@euclid-edi-ctrl-0 ~]# ceph health detail
...
PG_AVAILABILITY Reduced data availability: 1024 pgs inactive
    pg 4.3c8 is stuck inactive for 246794.767182, current state unknown, last acting []
    pg 4.3ca is stuck inactive for 246794.767182, current state unknown, last acting []
    pg 4.3cb is stuck inactive for 246794.767182, current state unknown, last acting []
    pg 4.3d0 is stuck inactive for 246794.767182, current state unknown, last acting []
    pg 4.3d1 is stuck inactive for 246794.767182, current state unknown, last acting []
    pg 4.3d2 is stuck inactive for 246794.767182, current state unknown, last acting []
    pg 4.3d3 is stuck inactive for 246794.767182, current state unknown, last acting []
    pg 4.3d4 is stuck inactive for 246794.767182, current state unknown, last acting []
    pg 4.3d5 is stuck inactive for 246794.767182, current state unknown, last acting []
    pg 4.3d6 is stuck inactive for 246794.767182, current state unknown, last acting []
    pg 4.3d7 is stuck inactive for 246794.767182, current state unknown, last acting []
    pg 4.3d8 is stuck inactive for 246794.767182, current state unknown, last acting []
    pg 4.3d9 is stuck inactive for 246794.767182, current state unknown, last acting []
    pg 4.3da is stuck inactive for 246794.767182, current state unknown, last acting []
...
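
Every PG listed above is in pool 4 (euclid_ral).  To double-check that the problem really is confined to that one pool, I believe something like this should tally the unknown PGs per pool (assuming the 13.2.x pgs_brief column order of PG_STAT STATE UP ...):

  ceph pg dump pgs_brief 2>/dev/null | awk '$2 == "unknown" {print $1}' | cut -d. -f1 | sort | uniq -c
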
[root@euclid-edi-ctrl-0 ~]# ceph pg map 4.3c8
osdmap e284992 pg 4.3c8 (4.3c8) -> up [] acting []
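
If I am reading that correctly, the osdmap/CRUSH calculation currently maps these PGs to no OSDs at all (empty up and acting sets), which makes me wonder whether the playbook run altered the CRUSH rule or buckets used by that pool.  These are the standard, read-only inspection commands I was planning to use to compare against our old configuration (pool name as above, output paths arbitrary):

  ceph osd pool get euclid_ral crush_rule
  ceph osd crush rule dump
  ceph osd tree
  ceph osd getcrushmap -o /tmp/crushmap.bin
  crushtool -d /tmp/crushmap.bin -o /tmp/crushmap.txt

If the rule or buckets do turn out to have changed, would editing the decompiled map back, recompiling it with crushtool -c, and injecting it with ceph osd setcrushmap -i be the sanctioned way to roll back, or is there a better route?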

Cheers,
  Mark

