On Wed, Sep 5, 2018 at 8:38 AM Marc Roos <M.Roos@xxxxxxxxxxxxxxxxx> wrote:
>
> The advised solution is to upgrade Ceph only in HEALTH_OK state. And I
> also read somewhere that it is bad to have your cluster in a
> HEALTH_ERR state for a long time.
>
> But why is this bad?

Aside from the obvious (errors are bad things!), many people have
external monitoring systems that alert them on transitions between
OK/WARN/ERR. If the system is stuck in ERR for a long time, they are
unlikely to notice new errors or warnings, so the cluster can
accumulate faults without the operator noticing.

> Why is this bad during upgrading?

It depends on what has gone wrong. For example:

- If your cluster is degraded (fewer than the desired number of
  replicas of data), then taking more services offline (even briefly)
  to do an upgrade puts the data at greater risk by reducing the
  number of copies available.
- If your system is in an error state because something has gone bad
  on disk, then recovering it with the same software that wrote the
  data is a more tested code path than running newer code against a
  system left in a strange state by an older version.

There will always be exceptions to this (e.g. where the upgrade is the
fix for whatever caused the error), but the general-purpose advice is
to get the system nice and clean before starting the upgrade.

John

> Can I quantify how bad it is? (like with large log/journal file?)
>
> _______________________________________________
> ceph-users mailing list
> ceph-users@xxxxxxxxxxxxxx
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
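
One way to act on the "get it clean before upgrading" advice is a small
pre-upgrade gate that refuses to proceed unless the cluster reports
HEALTH_OK. The following is only a rough sketch and not part of the
original reply: parse_health, safe_to_upgrade and current_health are
hypothetical helper names, and it assumes that the plain-text output of
`ceph health` begins with one of HEALTH_OK, HEALTH_WARN or HEALTH_ERR.

```python
import subprocess

# The three overall health states discussed in this thread.
KNOWN_STATES = ("HEALTH_OK", "HEALTH_WARN", "HEALTH_ERR")


def parse_health(output: str) -> str:
    """Return the leading health token from `ceph health` text output.

    Assumption: the output starts with HEALTH_OK, HEALTH_WARN or
    HEALTH_ERR, optionally followed by a summary of the active checks.
    """
    tokens = output.split()
    if not tokens or tokens[0] not in KNOWN_STATES:
        raise ValueError(f"unrecognised health output: {output!r}")
    return tokens[0]


def safe_to_upgrade(output: str) -> bool:
    """Hypothetical gate: only a clean HEALTH_OK counts as safe."""
    return parse_health(output) == "HEALTH_OK"


def current_health() -> str:
    """Run `ceph health` on a node that has admin credentials.

    Defined for completeness; it is not called in this sketch.
    """
    result = subprocess.run(
        ["ceph", "health"], capture_output=True, text=True, check=True
    )
    return result.stdout
```

On an admin node, safe_to_upgrade(current_health()) could be checked
before each upgrade step. The gate is deliberately strict: it would
also block the exceptional case mentioned above where the upgrade
itself is the fix, so an operator override would still be needed there.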