Hi all,

I have a large distributed Ceph cluster that recently broke: all PGs housed at a single site were marked 'unknown' after a run of the Ceph Ansible playbook (which was being used to expand the cluster at a third site). Is there a way to recover the location of PGs in this state, or to fall back to a previous configuration where things were working? Or a way to scan the OSDs to determine which PGs they hold? All the OSDs are still in place and reporting as healthy; it is only the PG locations that are missing.

For info: the cluster provides a single shared CephFS mount for a distributed batch cluster, and it includes workers and pools of OSDs from three different OpenStack clouds.

Ceph version: 13.2.8

Here is the system health:

[root@euclid-edi-ctrl-0 ~]# ceph -s
  cluster:
    id:     0fe7e967-ecd6-46d4-9f6b-224539073d3b
    health: HEALTH_WARN
            insufficient standby MDS daemons available
            1 MDSs report slow metadata IOs
            Reduced data availability: 1024 pgs inactive
            6 slow ops, oldest one blocked for 244669 sec, mon.euclid-edi-ctrl-0 has slow ops
            too few PGs per OSD (26 < min 30)

  services:
    mon: 4 daemons, quorum euclid-edi-ctrl-0,euclid-cam-proxy-0,euclid-imp-proxy-0,euclid-ral-proxy-0
    mgr: euclid-edi-ctrl-0(active), standbys: euclid-imp-proxy-0, euclid-cam-proxy-0, euclid-ral-proxy-0
    mds: cephfs-2/2/2 up {0=euclid-ral-proxy-0=up:active,1=euclid-cam-proxy-0=up:active}
    osd: 269 osds: 269 up, 269 in

  data:
    pools:   5 pools, 5120 pgs
    objects: 30.54 M objects, 771 GiB
    usage:   3.8 TiB used, 41 TiB / 45 TiB avail
    pgs:     20.000% pgs unknown
             4095 active+clean
             1024 unknown
             1    active+clean+scrubbing

OSD Pools:

[root@euclid-edi-ctrl-0 ~]# ceph osd lspools
1 cephfs_data
2 cephfs_metadata
3 euclid_cam
4 euclid_ral
5 euclid_imp

[root@euclid-edi-ctrl-0 ~]# ceph pg dump_pools_json
dumped pools
POOLID OBJECTS  MISSING_ON_PRIMARY DEGRADED MISPLACED UNFOUND BYTES        OMAP_BYTES* OMAP_KEYS* LOG     DISK_LOG
5      0        0                  0        0         0       0            0           0          0       0
1      16975540 0                  0        0         0       79165311663  0           0          6243475 6243475
2      5171099  0                  0        0         0       551991405    126879876   270829     3122183 3122183
3      8393436  0                  0        0         0       748466429315 0           0          1556647 1556647
4      0        0                  0        0         0       0            0           0          0       0

[root@euclid-edi-ctrl-0 ~]# ceph health detail
...
PG_AVAILABILITY Reduced data availability: 1024 pgs inactive
    pg 4.3c8 is stuck inactive for 246794.767182, current state unknown, last acting []
    pg 4.3ca is stuck inactive for 246794.767182, current state unknown, last acting []
    pg 4.3cb is stuck inactive for 246794.767182, current state unknown, last acting []
    pg 4.3d0 is stuck inactive for 246794.767182, current state unknown, last acting []
    pg 4.3d1 is stuck inactive for 246794.767182, current state unknown, last acting []
    pg 4.3d2 is stuck inactive for 246794.767182, current state unknown, last acting []
    pg 4.3d3 is stuck inactive for 246794.767182, current state unknown, last acting []
    pg 4.3d4 is stuck inactive for 246794.767182, current state unknown, last acting []
    pg 4.3d5 is stuck inactive for 246794.767182, current state unknown, last acting []
    pg 4.3d6 is stuck inactive for 246794.767182, current state unknown, last acting []
    pg 4.3d7 is stuck inactive for 246794.767182, current state unknown, last acting []
    pg 4.3d8 is stuck inactive for 246794.767182, current state unknown, last acting []
    pg 4.3d9 is stuck inactive for 246794.767182, current state unknown, last acting []
    pg 4.3da is stuck inactive for 246794.767182, current state unknown, last acting []
...
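To clarify the "scan the OSDs" part of my question: what I had in mind was something along these lines, using ceph-objectstore-tool against a stopped OSD to list the PGs it holds. The OSD id and data path below are only placeholders, and I have not tried this yet, so I would appreciate confirmation that it is a sensible approach on a cluster in this state:

# On one of the OSD hosts at the affected site; osd.42 and its path are examples only.
systemctl stop ceph-osd@42
ceph-objectstore-tool --data-path /var/lib/ceph/osd/ceph-42 --op list-pgs
systemctl start ceph-osd@42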
The PG map for one of the stuck PGs shows empty up and acting sets:

[root@euclid-edi-ctrl-0 ~]# ceph pg map 4.3c8
osdmap e284992 pg 4.3c8 (4.3c8) -> up [] acting []

Cheers,
Mark
_______________________________________________
ceph-users mailing list -- ceph-users@xxxxxxx
To unsubscribe send an email to ceph-users-leave@xxxxxxx