Yes, this did turn out to be our main issue. We also had a smaller issue, but this was the one that caused parts of our pools to go offline for a short time. Or rather, the cause was us adding some new NVMe drives that were much larger than the ones we already had, so too many PGs got mapped to them; we just didn't realize at first that this was the problem. Taking those OSDs down again allowed us to recover quickly, though.

It was a little hard to figure out, mostly because we had two separate problems at the same time. Some kind of explicit warning message would have been nice (we couldn't find anything in the logs), and perhaps the PGs could be allowed to activate anyway, with the cluster put into HEALTH_WARN?
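
In hindsight, something like the sketch below would have pointed straight at the overloaded OSDs. It is only a sketch, not something we actually ran at the time, and it assumes the JSON output of `ceph osd df` (the per-OSD `name` and `pgs` fields) plus the default mon_max_pg_per_osd of 200, so adjust for your own cluster:

#!/usr/bin/env python
# Sketch: list PGs per OSD and flag the ones over mon_max_pg_per_osd.
import json
import subprocess

MAX_PG_PER_OSD = 200  # Luminous default, as far as I recall; use your own setting

raw = subprocess.check_output(['ceph', 'osd', 'df', '--format', 'json'])
nodes = json.loads(raw)['nodes']

# Sort OSDs by PG count, highest first, and mark anything over the limit
for osd in sorted(nodes, key=lambda n: n['pgs'], reverse=True):
    marker = '  <-- over the limit' if osd['pgs'] > MAX_PG_PER_OSD else ''
    print('{0:>8}  {1:4d} PGs{2}'.format(osd['name'], osd['pgs'], marker))

That would at least have made the skew towards the much larger drives obvious.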
My colleague built a virtualized lab copy of our environment, and we used that to reproduce and then fix our issues.

We are also working on installing more OSDs, as was our original plan, so PGs per OSD will decrease over time. At the time we thought to aim for 300 PGs per OSD, which I now realize was probably not a great idea; something like 150 would have been better.
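
For reference, the back-of-the-envelope arithmetic (with made-up numbers, not our actual pools): each PG lands on `size` OSDs, so a single replicated pool averages pg_num * size / num_osds PGs per OSD.

# Average PGs per OSD for one replicated pool; numbers below are hypothetical
def pgs_per_osd(pg_num, size, num_osds):
    return pg_num * size / float(num_osds)

print(pgs_per_osd(4096, 3, 40))  # ~307 PGs per OSD -- too high
print(pgs_per_osd(4096, 3, 80))  # ~154 once the OSD count is doubled

And that is only the average; OSDs with a much larger CRUSH weight take a proportionally bigger share, which is exactly what bit us with the new drives.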
/Peter

On 2018-01-31 at 13:42, Thomas Bennett wrote: