Hi,
I got into a weird and unexpected situation today. I added 6 hosts to
an existing Pacific cluster (16.2.13, 20 existing OSD hosts across 2
DCs). The hosts were added to the root=default subtree; their
designated location is one of the two datacenters underneath the
default root. Nothing unusual, I believe many people use different
subtrees to organize their cluster, as we do in our own (and we
haven't seen this issue there yet).
The main application is RGW; the main pool is erasure-coded (k=7,
m=11). The crush rule looks like this:
rule rule-ec-k7m11 {
    id 1
    type erasure
    min_size 3
    max_size 18
    step set_chooseleaf_tries 5
    step set_choose_tries 100
    step take default class hdd
    step choose indep 2 type datacenter
    step chooseleaf indep 9 type host
    step emit
}
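
(In case someone wants to check a rule like this offline: crushtool
can simulate the mappings against the compiled crush map. The file
name below is only an example.)

    ceph osd getcrushmap -o crushmap.bin
    # simulate rule id 1 with 18 chunks (k=7 + m=11)
    crushtool -i crushmap.bin --test --rule 1 --num-rep 18 --show-mappings
    # print only mappings that don't satisfy the rule (ideally no output)
    crushtool -i crushmap.bin --test --rule 1 --num-rep 18 --show-bad-mappings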
After almost all peering had finished, the status showed 6 inactive +
peering PGs for a while. I had to fail the mgr because it didn't
report correct stats anymore; after that it showed 16 unknown PGs. The
application noticed the (unexpected) disruption. After moving the
hosts into their designated crush bucket (datacenter) the situation
resolved, but I can't make any sense of it. I tried to reproduce it in
my lab environment (Quincy), but to no avail. In my tests it behaves
as expected: after the new OSDs become active there are remapped PGs,
but nothing happens until I add them to their designated location.
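
For reference, these are the generic commands to inspect such PGs and
to fail over the active mgr (nothing cluster-specific here):

    # list unhealthy PGs and their states
    ceph health detail
    ceph pg dump_stuck inactive
    ceph pg dump_stuck unclean
    # fail over the active mgr
    ceph mgr fail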
I know I could have prevented this by either setting
osd_crush_initial_weight = 0, then moving the new hosts into their
crush buckets and reweighting the OSDs, or by adding the crush buckets
first, but usually I don't need to bother with these things.
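
Just to spell out those two options (host/datacenter names and the
weight below are made up):

    # option 1: let new OSDs come up with crush weight 0 ...
    ceph config set osd osd_crush_initial_weight 0
    # ... then move the new host into its datacenter and reweight the OSDs
    ceph osd crush move host07 datacenter=dc1
    ceph osd crush reweight osd.200 16.4

    # option 2: create the host bucket in the right place before deploying OSDs
    ceph osd crush add-bucket host07 host
    ceph osd crush move host07 datacenter=dc1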
Does anyone have an explanation? I'd appreciate any comments.
Thanks!
Eugen