Thanks Mehmet; I took a closer look at what I sent you and the problem
appears to be in the CRUSH map. At some point since anything was last
rebooted, I created rack buckets and moved the OSD nodes in under them:
# ceph osd crush add-bucket rack-0 rack
# ceph osd crush add-bucket rack-1 rack
# ceph osd crush move bcgonen-r0h0 rack=rack-0
# ceph osd crush move bcgonen-r0h1 rack=rack-0
# ceph osd crush move bcgonen-r1h0 rack=rack-1
All seemed fine at the time; it was not until bcgonen-r1h0 was rebooted
that things got weird. But as the "ceph osd tree" output shows, those
rack buckets were sitting next to the default root rather than under
it. That is now fixed, and the cluster is backfilling the remapped PGs.
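For the record, the fix amounts to moving the rack buckets under the default root. A sketch of the commands (reconstructed from the description above, not copied from a terminal):

```shell
# Re-parent the stray rack buckets under the default root so the
# CRUSH hierarchy is root -> rack -> host -> osd again:
ceph osd crush move rack-0 root=default
ceph osd crush move rack-1 root=default
```

In hindsight, the racks could have been created in the right place in one step, since add-bucket accepts a CRUSH location: `ceph osd crush add-bucket rack-0 rack root=default`.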
// J
On 2023-03-31 16:01, Johan Hattne wrote:
Here goes:
# ceph -s
  cluster:
    id:     e1327a10-8b8c-11ed-88b9-3cecef0e3946
    health: HEALTH_OK

  services:
    mon: 5 daemons, quorum bcgonen-a,bcgonen-b,bcgonen-c,bcgonen-r0h0,bcgonen-r0h1 (age 16h)
    mgr: bcgonen-b.furndm(active, since 8d), standbys: bcgonen-a.qmmqxj
    mds: 1/1 daemons up, 2 standby
    osd: 36 osds: 36 up (since 16h), 36 in (since 3d); 1041 remapped pgs

  data:
    volumes: 1/1 healthy
    pools:   3 pools, 1041 pgs
    objects: 5.42M objects, 6.5 TiB
    usage:   19 TiB used, 428 TiB / 447 TiB avail
    pgs:     27087125/16252275 objects misplaced (166.667%)
             1039 active+clean+remapped
             2    active+clean+remapped+scrubbing+deep
# ceph osd tree
ID   CLASS  WEIGHT     TYPE NAME            STATUS  REWEIGHT  PRI-AFF
-14         149.02008  rack rack-1
 -7         149.02008      host bcgonen-r1h0
 20    hdd   14.55269          osd.20           up   1.00000  1.00000
 21    hdd   14.55269          osd.21           up   1.00000  1.00000
 22    hdd   14.55269          osd.22           up   1.00000  1.00000
 23    hdd   14.55269          osd.23           up   1.00000  1.00000
 24    hdd   14.55269          osd.24           up   1.00000  1.00000
 25    hdd   14.55269          osd.25           up   1.00000  1.00000
 26    hdd   14.55269          osd.26           up   1.00000  1.00000
 27    hdd   14.55269          osd.27           up   1.00000  1.00000
 28    hdd   14.55269          osd.28           up   1.00000  1.00000
 29    hdd   14.55269          osd.29           up   1.00000  1.00000
 34    ssd    1.74660          osd.34           up   1.00000  1.00000
 35    ssd    1.74660          osd.35           up   1.00000  1.00000
-13         298.04016  rack rack-0
 -3         149.02008      host bcgonen-r0h0
  0    hdd   14.55269          osd.0            up   1.00000  1.00000
  1    hdd   14.55269          osd.1            up   1.00000  1.00000
  2    hdd   14.55269          osd.2            up   1.00000  1.00000
  3    hdd   14.55269          osd.3            up   1.00000  1.00000
  4    hdd   14.55269          osd.4            up   1.00000  1.00000
  5    hdd   14.55269          osd.5            up   1.00000  1.00000
  6    hdd   14.55269          osd.6            up   1.00000  1.00000
  7    hdd   14.55269          osd.7            up   1.00000  1.00000
  8    hdd   14.55269          osd.8            up   1.00000  1.00000
  9    hdd   14.55269          osd.9            up   1.00000  1.00000
 30    ssd    1.74660          osd.30           up   1.00000  1.00000
 31    ssd    1.74660          osd.31           up   1.00000  1.00000
 -5         149.02008      host bcgonen-r0h1
 10    hdd   14.55269          osd.10           up   1.00000  1.00000
 11    hdd   14.55269          osd.11           up   1.00000  1.00000
 12    hdd   14.55269          osd.12           up   1.00000  1.00000
 13    hdd   14.55269          osd.13           up   1.00000  1.00000
 14    hdd   14.55269          osd.14           up   1.00000  1.00000
 15    hdd   14.55269          osd.15           up   1.00000  1.00000
 16    hdd   14.55269          osd.16           up   1.00000  1.00000
 17    hdd   14.55269          osd.17           up   1.00000  1.00000
 18    hdd   14.55269          osd.18           up   1.00000  1.00000
 19    hdd   14.55269          osd.19           up   1.00000  1.00000
 32    ssd    1.74660          osd.32           up   1.00000  1.00000
 33    ssd    1.74660          osd.33           up   1.00000  1.00000
 -1                 0  root default
# ceph osd pool ls detail
pool 1 '.mgr' replicated size 3 min_size 2 crush_rule 0 object_hash
rjenkins pg_num 1 pgp_num 1 autoscale_mode on last_change 31 flags
hashpspool stripe_width 0 pg_num_max 32 pg_num_min 1 application mgr
pool 2 'cephfs.cephfs.meta' replicated size 3 min_size 2 crush_rule 2
object_hash rjenkins pg_num 16 pgp_num 16 autoscale_mode on last_change
9833 lfor 0/0/584 flags hashpspool stripe_width 0 pg_autoscale_bias 4
pg_num_min 16 recovery_priority 5 application cephfs
pool 3 'cephfs.cephfs.data' replicated size 3 min_size 2 crush_rule 1
object_hash rjenkins pg_num 1024 pgp_num 1024 autoscale_mode on
last_change 7630 lfor 0/1831/6544 flags hashpspool,bulk stripe_width 0
application cephfs
CRUSH rules 1 and 2 are just used to assign the data and metadata pools
to HDD and SSD, respectively (failure domain: host).
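For context, class-based replicated rules like these are typically created along the following lines. This is a sketch; the rule names here are illustrative, not necessarily the ones used in this cluster:

```shell
# Hypothetical recreation of the two device-class rules: replicated,
# rooted at "default", failure domain "host", restricted to one class.
ceph osd crush rule create-replicated replicated-hdd default host hdd
ceph osd crush rule create-replicated replicated-ssd default host ssd

# A pool is then pointed at a rule with, e.g.:
#   ceph osd pool set cephfs.cephfs.data crush_rule replicated-hdd
```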
// J
On 2023-03-31 15:37, ceph@xxxxxxxxxx wrote:
Need to know some more about your cluster...
Ceph -s
Ceph osd df tree
Replica or ec?
...
Perhaps this can give us some insight
Mehmet
Am 31. März 2023 18:08:38 MESZ schrieb Johan Hattne <johan@xxxxxxxxx>:
Dear all;
Up until a few hours ago, I had a seemingly normally behaving
cluster (Quincy, 17.2.5) with 36 OSDs, evenly distributed across 3 of
its 6 nodes. The cluster is only used for CephFS, and the only
non-standard configuration I can think of is that I had 2 active MDSs
but only 1 standby. I had also doubled mds_cache_memory_limit to 8 GB
(all OSD hosts have 256 GB of RAM) at some point in the past.
Then I rebooted one of the OSD nodes. The rebooted node held one
of the active MDSs. Now the node is back up: ceph -s says the cluster
is healthy, but all PGs are in an active+clean+remapped state and
166.67% of the objects are misplaced (dashboard: -66.66% healthy).
The data pool is a threefold replica with 5.4M objects; the number
of misplaced objects is reported as 27087410/16252446. The
denominator in the ratio makes sense to me (16.2M / 3 = 5.4M), but the
numerator does not. I also note that the ratio is *exactly* 5 / 3.
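The exact 5/3 ratio is easy to verify from the reported counts (a quick sanity check, not Ceph tooling):

```shell
# 16252446 object instances (5.4M objects x 3 replicas), of which
# 27087410 are counted as misplaced -- exactly 5/3 of the total:
awk 'BEGIN { printf "%.6f\n", 27087410 / 16252446 }'   # 1.666667
awk 'BEGIN { print 16252446 / 3 * 5 }'                 # 27087410
```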
The filesystem is still mounted and appears to be usable, but df
reports it as 100% full; I suspect it would say 167% but that is
capped somewhere.
Any ideas about what is going on? Any suggestions for recovery?
// Best wishes; Johan
------------------------------------------------------------------------
ceph-users mailing list -- ceph-users@xxxxxxx
To unsubscribe send an email to ceph-users-leave@xxxxxxx