We operate a tiny Ceph cluster (v16.2.7) across three machines, each
running two OSDs plus one MDS, one MGR, and one MON. The cluster serves
one main erasure-coded (2+1) storage pool and a few other
management-related pools. The cluster has been running smoothly for
several months.
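For reference, the main pool is a plain 2+1 EC pool backing the
mounted filesystem; it was created roughly along these lines (a sketch
from memory, with placeholder profile/pool names and PG counts rather
than our exact values):
--------------------------------------------------------------------------------
# Rough sketch of how the 2+1 EC data pool was set up; the profile name,
# pool name, PG count, and failure domain are placeholders, not copied
# from our deployment.
ceph osd erasure-code-profile set ec-2-1 k=2 m=1 crush-failure-domain=host
ceph osd pool create fs-data-ec 128 128 erasure ec-2-1
ceph osd pool set fs-data-ec allow_ec_overwrites true   # required for CephFS data on EC
--------------------------------------------------------------------------------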
A few weeks ago we noticed a health warning reporting
backfillfull/nearfull OSDs and pools. Here is the output of `ceph -s` at
that point (extracted from our logs):
--------------------------------------------------------------------------------
  cluster:
    health: HEALTH_WARN
            1 backfillfull osd(s)
            2 nearfull osd(s)
            Reduced data availability: 163 pgs inactive, 1 pg peering
            Low space hindering backfill (add storage if this doesn't resolve itself): 2 pgs backfill_toofull
            Degraded data redundancy: 1486709/10911157 objects degraded (13.626%), 68 pgs degraded, 68 pgs undersized
            162 pgs not scrubbed in time
            6 pool(s) backfillfull

  services:
    mon: 3 daemons, quorum mon.101,mon.102,mon.100 (age 5m)
    mgr: mgr-102(active, since 54m), standbys: mgr-101, mgr-100
    mds: 1/1 daemons up, 1 standby, 1 hot standby
    osd: 6 osds: 6 up (since 4m), 6 in (since 2w); 7 remapped pgs

  data:
    volumes: 1/1 healthy
    pools:   6 pools, 338 pgs
    objects: 3.64M objects, 14 TiB
    usage:   13 TiB used, 1.7 TiB / 15 TiB avail
    pgs:     47.929% pgs unknown
             0.296% pgs not active
             1486709/10911157 objects degraded (13.626%)
             52771/10911157 objects misplaced (0.484%)
             162 unknown
             106 active+clean
              67 active+undersized+degraded
               1 active+undersized+degraded+remapped+backfill_toofull
               1 remapped+peering
               1 active+remapped+backfill_toofull
--------------------------------------------------------------------------------
Looking at this now, the large number of PGs in state unknown stands
out, as does the significant fraction of degraded objects despite all
OSDs being up, but we missed both at the time. Because the cluster
continued to behave fine from the perspective of the mounted
filesystem, we did not appreciate the problem and did not intervene.
From then on, things have mostly gone downhill. Today, `ceph -s`
reports the following:
--------------------------------------------------------------------------------
  cluster:
    health: HEALTH_WARN
            noout flag(s) set
            Reduced data availability: 117 pgs inactive
            Degraded data redundancy: 2095625/12121767 objects degraded (17.288%), 114 pgs degraded, 114 pgs undersized
            117 pgs not scrubbed in time

  services:
    mon: 3 daemons, quorum mon.101,mon.102,mon.100 (age 15h)
    mgr: mgr-102(active, since 7d), standbys: mgr-100, mgr-101
    mds: 1/1 daemons up, 1 standby, 1 hot standby
    osd: 6 osds: 6 up (since 55m), 6 in (since 3w)
         flags noout

  data:
    volumes: 1/1 healthy
    pools:   6 pools, 338 pgs
    objects: 4.04M objects, 15 TiB
    usage:   12 TiB used, 2.8 TiB / 15 TiB avail
    pgs:     34.615% pgs unknown
             2095625/12121767 objects degraded (17.288%)
             117 unknown
             114 active+undersized+degraded
             107 active+clean
--------------------------------------------------------------------------------
Note in particular the still very large number of PGs in state
unknown, which hasn't changed in days; the same goes for the degraded
PGs. Also, the cluster should have around 37 TiB of raw storage
available, but it now reports only 15 TiB.
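If it helps, this is a minimal sketch of how the affected PGs can be
listed and counted (the state is the second column of the brief PG
dump):
--------------------------------------------------------------------------------
# Minimal sketch: list the PGs currently flagged as unknown/undersized/
# degraded, then count PGs per state to see whether the numbers move.
ceph pg dump pgs_brief 2>/dev/null | grep -E 'unknown|undersized|degraded'
ceph pg dump pgs_brief 2>/dev/null | awk 'NR>1 {print $2}' | sort | uniq -c | sort -rn
--------------------------------------------------------------------------------
I can post that output as well if it is useful.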
We did a bit of digging but couldn't really get to the bottom of the
unknown PGs or how to recover from them. One other data point: the
command `ceph osd df tree` hangs on two of the three machines, and on
the one where it does return something, the output looks like this:
--------------------------------------------------------------------------------
ID   CLASS  WEIGHT    REWEIGHT  SIZE     RAW USE  DATA     OMAP    META    AVAIL    %USE   VAR   PGS  STATUS  TYPE NAME
 -1         47.67506         -      0 B      0 B      0 B     0 B     0 B      0 B      0     0     -          root default
-13         18.26408         -      0 B      0 B      0 B     0 B     0 B      0 B      0     0     -          datacenter dc.100
 -5         18.26408         -      0 B      0 B      0 B     0 B     0 B      0 B      0     0     -          host osd-100
  3  hdd    10.91409   1.00000      0 B      0 B      0 B     0 B     0 B      0 B      0     0    91      up  osd.3
  5  hdd     7.34999   1.00000      0 B      0 B      0 B     0 B     0 B      0 B      0     0    48      up  osd.5
 -9         14.69998         -      0 B      0 B      0 B     0 B     0 B      0 B      0     0     -          datacenter dc.101
 -7         14.69998         -      0 B      0 B      0 B     0 B     0 B      0 B      0     0     -          host osd-101
  0  hdd     7.34999   1.00000      0 B      0 B      0 B     0 B     0 B      0 B      0     0    83      up  osd.0
  1  hdd     7.34999   1.00000      0 B      0 B      0 B     0 B     0 B      0 B      0     0    86      up  osd.1
-11         14.71100         -   15 TiB   12 TiB   12 TiB  77 MiB  21 GiB  2.6 TiB  82.00  1.00     -          datacenter dc.102
-17          7.35550         -  7.4 TiB  6.3 TiB  6.2 TiB  16 MiB  11 GiB  1.1 TiB  85.16  1.04     -          host osdroid-102-1
  4  hdd     7.35550   1.00000  7.4 TiB  6.3 TiB  6.2 TiB  16 MiB  11 GiB  1.1 TiB  85.16  1.04   114      up  osd.4
-15          7.35550         -  7.4 TiB  5.8 TiB  5.7 TiB  61 MiB  10 GiB  1.6 TiB  78.83  0.96     -          host osdroid-102-2
  2  hdd     7.35550   1.00000  7.4 TiB  5.8 TiB  5.7 TiB  61 MiB  10 GiB  1.6 TiB  78.83  0.96   107      up  osd.2
                         TOTAL   15 TiB   12 TiB   12 TiB  77 MiB  21 GiB  2.6 TiB  82.00
MIN/MAX VAR: 0/1.04  STDDEV: 66.97
--------------------------------------------------------------------------------
The odd part here is that only osd.2 and osd.4 appear to contribute
any capacity to the cluster. Interestingly, accessing content from the
storage pool mostly works without issues, which shouldn't be possible
if 4 out of 6 OSDs weren't actually up.
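In case it matters, these are the kinds of per-daemon checks that can
be run on the hosts whose OSDs show 0 B above (a sketch; osd.3 is just
an example id, and the unit name differs on containerized
deployments):
--------------------------------------------------------------------------------
# Sketch of per-OSD sanity checks on a host whose OSDs report 0 B above.
# osd.3 is an example id; adjust to the daemons local to that host.
systemctl status ceph-osd@3   # is the daemon process actually running?
ceph daemon osd.3 status      # ask the OSD via its admin socket (state, num_pgs, map epochs)
ceph osd metadata 3           # what the cluster itself records about this OSD
--------------------------------------------------------------------------------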
Even more odd: while `ceph health detail` reports many PGs in state
unknown, undersized, or degraded, querying those same PGs with
`ceph pg <pgid> query` returns active+clean for *all* of them... I'm
not sure which of the two pieces of information to trust...
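For completeness, the cross-check looks roughly like this (a sketch;
it feeds every PG the cluster map flags back into `ceph pg query` and
prints the state the PG itself reports):
--------------------------------------------------------------------------------
# Rough sketch of the cross-check: for every PG the cluster map flags as
# unknown/undersized/degraded, ask the PG itself which state it reports.
# ("state" is a top-level field in the JSON printed by `ceph pg <pgid> query`.)
ceph pg dump pgs_brief 2>/dev/null \
  | awk '$2 ~ /unknown|undersized|degraded/ {print $1, $2}' \
  | while read -r pgid flagged; do
      reported=$(ceph pg "$pgid" query | grep -m1 '"state"')
      echo "$pgid  map: $flagged  query: $reported"
    done
--------------------------------------------------------------------------------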
Any ideas what we can do to get our cluster back into a sane state?
I'm happy to provide more logs or command output; just let me know.
Thanks!