We operate a tiny Ceph cluster (v16.2.7) across three machines, each
running two OSDs plus one MDS, one MGR, and one MON. The cluster serves
one main erasure-coded (2+1) storage pool and a few other
management-related pools. The cluster has been running smoothly for
several months.
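For reference, the main pool is a plain 2+1 EC pool backing the
mounted filesystem; it was created roughly along these lines (a sketch
from memory, with placeholder profile/pool names and PG counts rather
than our exact values):
--------------------------------------------------------------------------------
# Rough sketch of how the 2+1 EC data pool was set up; the profile name,
# pool name, PG count, and failure domain are placeholders, not copied
# from our deployment.
ceph osd erasure-code-profile set ec-2-1 k=2 m=1 crush-failure-domain=host
ceph osd pool create fs-data-ec 128 128 erasure ec-2-1
ceph osd pool set fs-data-ec allow_ec_overwrites true   # required for CephFS data on EC
--------------------------------------------------------------------------------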
A few weeks ago we noticed a health warning reporting
backfillfull/nearfull OSDs and pools. Here is the output of `ceph -s` at
that point (extracted from our logs):
--------------------------------------------------------------------------------
  cluster:
    health: HEALTH_WARN
            1 backfillfull osd(s)
            2 nearfull osd(s)
            Reduced data availability: 163 pgs inactive, 1 pg peering
            Low space hindering backfill (add storage if this doesn't resolve itself): 2 pgs backfill_toofull
            Degraded data redundancy: 1486709/10911157 objects degraded (13.626%), 68 pgs degraded, 68 pgs undersized
            162 pgs not scrubbed in time
            6 pool(s) backfillfull

  services:
    mon: 3 daemons, quorum mon.101,mon.102,mon.100 (age 5m)
    mgr: mgr-102(active, since 54m), standbys: mgr-101, mgr-100
    mds: 1/1 daemons up, 1 standby, 1 hot standby
    osd: 6 osds: 6 up (since 4m), 6 in (since 2w); 7 remapped pgs

  data:
    volumes: 1/1 healthy
    pools:   6 pools, 338 pgs
    objects: 3.64M objects, 14 TiB
    usage:   13 TiB used, 1.7 TiB / 15 TiB avail
    pgs:     47.929% pgs unknown
             0.296% pgs not active
             1486709/10911157 objects degraded (13.626%)
             52771/10911157 objects misplaced (0.484%)
             162 unknown
             106 active+clean
              67 active+undersized+degraded
               1 active+undersized+degraded+remapped+backfill_toofull
               1 remapped+peering
               1 active+remapped+backfill_toofull
--------------------------------------------------------------------------------
Looking at this now, the large number of PGs in state unknown stands
out, as does the significant fraction of degraded objects despite all
OSDs being up, but we missed both at the time. Because the cluster
continued to behave fine from the perspective of the mounted
filesystem, we did not appreciate the problem and did not intervene.
From then on, things have mostly gone downhill. Today, `ceph -s`
reports the following:
--------------------------------------------------------------------------------
  cluster:
    health: HEALTH_WARN
            noout flag(s) set
            Reduced data availability: 117 pgs inactive
            Degraded data redundancy: 2095625/12121767 objects degraded (17.288%), 114 pgs degraded, 114 pgs undersized
            117 pgs not scrubbed in time

  services:
    mon: 3 daemons, quorum mon.101,mon.102,mon.100 (age 15h)
    mgr: mgr-102(active, since 7d), standbys: mgr-100, mgr-101
    mds: 1/1 daemons up, 1 standby, 1 hot standby
    osd: 6 osds: 6 up (since 55m), 6 in (since 3w)
         flags noout

  data:
    volumes: 1/1 healthy
    pools:   6 pools, 338 pgs
    objects: 4.04M objects, 15 TiB
    usage:   12 TiB used, 2.8 TiB / 15 TiB avail
    pgs:     34.615% pgs unknown
             2095625/12121767 objects degraded (17.288%)
             117 unknown
             114 active+undersized+degraded
             107 active+clean
--------------------------------------------------------------------------------
Note in particular the still very large number of PGs in state
unknown, which hasn't changed in days; the same goes for the degraded
PGs. Also, the cluster should have around 37 TiB of raw storage
available, but it now reports only 15 TiB.
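If it helps, this is a minimal sketch of how the affected PGs can be
listed and counted (the state is the second column of the brief PG
dump):
--------------------------------------------------------------------------------
# Minimal sketch: list the PGs currently flagged as unknown/undersized/
# degraded, then count PGs per state to see whether the numbers move.
ceph pg dump pgs_brief 2>/dev/null | grep -E 'unknown|undersized|degraded'
ceph pg dump pgs_brief 2>/dev/null | awk 'NR>1 {print $2}' | sort | uniq -c | sort -rn
--------------------------------------------------------------------------------
I can post that output as well if it is useful.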
We did a bit of digging but couldn't really get to the bottom of the
unknown PGs or how to recover from them. One other data point: the
command `ceph osd df tree` hangs on two of the three machines, and on
the one where it does return something, the output looks like this:
--------------------------------------------------------------------------------
ID   CLASS  WEIGHT    REWEIGHT  SIZE     RAW USE  DATA     OMAP    META    AVAIL    %USE   VAR   PGS  STATUS  TYPE NAME
 -1         47.67506         -      0 B      0 B      0 B     0 B     0 B      0 B      0     0     -          root default
-13         18.26408         -      0 B      0 B      0 B     0 B     0 B      0 B      0     0     -          datacenter dc.100
 -5         18.26408         -      0 B      0 B      0 B     0 B     0 B      0 B      0     0     -          host osd-100
  3  hdd    10.91409   1.00000      0 B      0 B      0 B     0 B     0 B      0 B      0     0    91      up  osd.3
  5  hdd     7.34999   1.00000      0 B      0 B      0 B     0 B     0 B      0 B      0     0    48      up  osd.5
 -9         14.69998         -      0 B      0 B      0 B     0 B     0 B      0 B      0     0     -          datacenter dc.101
 -7         14.69998         -      0 B      0 B      0 B     0 B     0 B      0 B      0     0     -          host osd-101
  0  hdd     7.34999   1.00000      0 B      0 B      0 B     0 B     0 B      0 B      0     0    83      up  osd.0
  1  hdd     7.34999   1.00000      0 B      0 B      0 B     0 B     0 B      0 B      0     0    86      up  osd.1
-11         14.71100         -   15 TiB   12 TiB   12 TiB  77 MiB  21 GiB  2.6 TiB  82.00  1.00     -          datacenter dc.102
-17          7.35550         -  7.4 TiB  6.3 TiB  6.2 TiB  16 MiB  11 GiB  1.1 TiB  85.16  1.04     -          host osdroid-102-1
  4  hdd     7.35550   1.00000  7.4 TiB  6.3 TiB  6.2 TiB  16 MiB  11 GiB  1.1 TiB  85.16  1.04   114      up  osd.4
-15          7.35550         -  7.4 TiB  5.8 TiB  5.7 TiB  61 MiB  10 GiB  1.6 TiB  78.83  0.96     -          host osdroid-102-2
  2  hdd     7.35550   1.00000  7.4 TiB  5.8 TiB  5.7 TiB  61 MiB  10 GiB  1.6 TiB  78.83  0.96   107      up  osd.2
                         TOTAL   15 TiB   12 TiB   12 TiB  77 MiB  21 GiB  2.6 TiB  82.00
MIN/MAX VAR: 0/1.04  STDDEV: 66.97
--------------------------------------------------------------------------------
The odd part here is that only osd.2 and osd.4 appear to contribute
any capacity to the cluster. Interestingly, accessing content from the
storage pool mostly works without issues, which shouldn't be possible
if 4 out of 6 OSDs weren't actually up.
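In case it matters, these are the kinds of per-daemon checks that can
be run on the hosts whose OSDs show 0 B above (a sketch; osd.3 is just
an example id, and the unit name differs on containerized
deployments):
--------------------------------------------------------------------------------
# Sketch of per-OSD sanity checks on a host whose OSDs report 0 B above.
# osd.3 is an example id; adjust to the daemons local to that host.
systemctl status ceph-osd@3   # is the daemon process actually running?
ceph daemon osd.3 status      # ask the OSD via its admin socket (state, num_pgs, map epochs)
ceph osd metadata 3           # what the cluster itself records about this OSD
--------------------------------------------------------------------------------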
Even more odd: while `ceph health detail` reports many PGs in state
unknown, undersized, or degraded, querying those same PGs with
`ceph pg <pgid> query` returns active+clean for *all* of them... I'm
not sure which of the two pieces of information to trust...
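For completeness, the cross-check looks roughly like this (a sketch;
it feeds every PG the cluster map flags back into `ceph pg query` and
prints the state the PG itself reports):
--------------------------------------------------------------------------------
# Rough sketch of the cross-check: for every PG the cluster map flags as
# unknown/undersized/degraded, ask the PG itself which state it reports.
# ("state" is a top-level field in the JSON printed by `ceph pg <pgid> query`.)
ceph pg dump pgs_brief 2>/dev/null \
  | awk '$2 ~ /unknown|undersized|degraded/ {print $1, $2}' \
  | while read -r pgid flagged; do
      reported=$(ceph pg "$pgid" query | grep -m1 '"state"')
      echo "$pgid  map: $flagged  query: $reported"
    done
--------------------------------------------------------------------------------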
Any ideas what we can do to get our cluster back into a sane state?
I'm happy to provide more logs or command output; just let me know.
Thanks!