Hi Cephers,
At the University of Zurich we are using Ceph as the storage back-end for our
OpenStack installation. Since we recently reached 70% occupancy (mostly
caused by the cinder pool, which is served by 16384 PGs), we are in the
process of extending the cluster with additional storage nodes of the same
type (except for a slightly more powerful CPU).
We opted for a gradual OSD deployment: we created a temporary "root"
bucket called "fresh-install" containing the newly installed nodes, and then
move OSDs from this bucket to the current production root via:
ceph osd crush set osd.{id} {weight} host={hostname} root={production_root}
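For reference, a minimal sketch of how such a temporary tree can be created
and populated (the bucket names match our crushmap below; the OSD id and
weight are just examples):

# create the temporary root and a per-host bucket under it
ceph osd crush add-bucket fresh-install root
ceph osd crush add-bucket osd-k7-41-fresh host
ceph osd crush move osd-k7-41-fresh root=fresh-install

# newly deployed OSDs are first placed in the temporary tree
ceph osd crush add osd.900 3.640 host=osd-k7-41-fresh root=fresh-install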
Everything seemed nicely planned, but when we started adding a few new
OSDs to the cluster, thereby triggering a rebalancing, one of the OSDs,
already at 84% disk usage, passed the 85% threshold. This in turn
triggered the "near full osd(s)" warning, and more than 20 PGs previously
in the "wait_backfill" state were marked as "wait_backfill+backfill_toofull".
Since that OSD kept filling up until it reached 90% disk usage, we decided
to reduce its relative weight from 1 to 0.95.
This recalculated the crushmap and remapped a few PGs, but did not appear
to move any data off the almost full OSD. Only when, in steps of 0.05, we
reached a relative weight of 0.50 was data moved and were some
"backfill_toofull" requests released. However, we had to go down almost to
a relative weight of 0.10 in order to trigger some additional data movement
and finally get the backfilling to finish.
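For completeness, a minimal sketch of the checks and reweighting commands
involved (the OSD id is just an example):

# see which OSDs are near full and which PGs are too full to backfill
ceph health detail
ceph osd df
ceph pg dump | grep backfill_toofull

# lower the relative weight of the almost full OSD in steps of 0.05
ceph osd reweight 281 0.95
ceph osd reweight 281 0.90
...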
We are now adding new OSDs, but the problem is constantly triggered since
we have multiple OSDs above 83% usage that start filling up during the
rebalance.
My questions are:
- Is there something wrong with our process of adding new OSDs (some
additional details below)?
- We also noticed that the problem tends to cluster around the newly added
OSDs; could those two things be correlated?
- Why does reweighting not trigger immediate data movement? What is the
logic behind remapped PGs? Is there some sort of flat queue of tasks, or
are priorities defined somewhere?
- Has anybody experienced this situation, and if so, how was it eventually
solved or worked around?
Cluster details are as follows:
- version: 0.94.9
- 5 monitors,
- 40 storage hosts, each with 24 x 4 TB disks and 1 OSD per disk (960 OSDs in total),
- osd pool default size = 3,
- journaling is on SSDs.
We have "hosts" failure domain. Relevant crushmap details:
# rules
rule sas {
        ruleset 1
        type replicated
        min_size 1
        max_size 10
        step take sas
        step chooseleaf firstn 0 type host
        step emit
}
root sas {
        id -41          # do not change unnecessarily
        # weight 3283.279
        alg straw
        hash 0          # rjenkins1
        item osd-l2-16 weight 87.360
        item osd-l4-06 weight 87.360
        ...
        item osd-k7-41 weight 14.560
        item osd-l4-36 weight 14.560
        item osd-k5-36 weight 14.560
}
host osd-k7-21 {
        id -46          # do not change unnecessarily
        # weight 87.360
        alg straw
        hash 0          # rjenkins1
        item osd.281 weight 3.640
        item osd.282 weight 3.640
        item osd.285 weight 3.640
        ...
}
host osd-k7-41 {
        id -50          # do not change unnecessarily
        # weight 14.560
        alg straw
        hash 0          # rjenkins1
        item osd.900 weight 3.640
        item osd.901 weight 3.640
        item osd.902 weight 3.640
        item osd.903 weight 3.640
}
As mentioned before, we created a temporary bucket called "fresh-install"
containing the newly installed nodes:
root fresh-install {
        id -34          # do not change unnecessarily
        # weight 218.400
        alg straw
        hash 0          # rjenkins1
        item osd-k5-36-fresh weight 72.800
        item osd-k7-41-fresh weight 72.800
        item osd-l4-36-fresh weight 72.800
}
Then, in batches of 6 OSDs (2 OSDs from each new host), we move OSDs from
the "fresh-install" bucket to the "sas" root.
Thank you in advance for all the suggestions.
Cheers,
Tyanko