Upgrade Documentation: Wait for recovery

Richard Bade <hitrich@xxxxxxxxx> · Tue, 18 Jun 2019 12:29:51 +1200

Hi Everyone,
Recently we moved a bunch of our servers from one rack to another. In
the late stages of this we hit a point where some requests were
blocked because one PG was in the "peered" state.

This was unexpected to us, but after discussing it with Wido we
understand why it happened. However, it has raised another point: we
believed we were following the instructions in the upgrade
documentation. We've done our upgrades this way in the past without
hitting this "peered" state. The documentation says this:
"Ensure each upgraded Ceph OSD Daemon has rejoined the cluster"

We read this as meaning that you can go through and restart all the
OSDs in the whole cluster one by one without waiting for recovery to
happen. Whereas it seems it should be more like:
"Ensure each upgraded Ceph OSD Daemon has rejoined the cluster" and
"ensure recovery has completed before moving on to the next {failure
domain}", where failure domain is host, rack, etc., depending on what
is in your CRUSH map.
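
As a rough sketch of what "ensure recovery has completed" could look
like in practice, something like the following could be run between
failure domains. It assumes the JSON from `ceph status --format json`
has a "pgmap" section with "num_pgs" and a "pgs_by_state" list of
{"state_name", "count"} entries; please check that against your
release. The function names, poll interval and timeout are only
illustrative and are not from the documentation.

```python
import json
import subprocess
import time


def all_pgs_active_clean(status):
    """Return True if every PG in the status report is active and clean."""
    pgmap = status.get("pgmap", {})
    total = pgmap.get("num_pgs", 0)
    healthy = 0
    for entry in pgmap.get("pgs_by_state", []):
        # A state name is a "+"-joined set of flags, e.g.
        # "active+clean+scrubbing"; scrubbing PGs still count as healthy.
        flags = set(entry["state_name"].split("+"))
        if {"active", "clean"} <= flags:
            healthy += entry["count"]
    return total > 0 and healthy == total


def ceph_status():
    """Fetch cluster status as a dict (assumes the ceph CLI is on PATH)."""
    out = subprocess.check_output(["ceph", "status", "--format", "json"])
    return json.loads(out)


def wait_for_recovery(get_status=ceph_status, poll_seconds=10,
                      timeout=3600, sleep=time.sleep):
    """Poll until all PGs are active+clean; return False on timeout."""
    waited = 0
    while waited <= timeout:
        if all_pgs_active_clean(get_status()):
            return True
        sleep(poll_seconds)
        waited += poll_seconds
    return False
```

The idea would be to restart the OSDs in one host or rack, call
wait_for_recovery(), and only move on to the next failure domain once
it returns True.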

Thoughts? Should the documentation be clearer on this to help people
such as myself avoid making this mistake?

Rich