> > The advised solution is to upgrade ceph only in HEALTH_OK state. And I
> > also read somewhere that it is bad to have your cluster for a long time
> > in a HEALTH_ERR state.
> >
> > But why is this bad?
>
> Aside from the obvious (errors are bad things!), many people have
> external monitoring systems that will alert them on the transitions
> between OK/WARN/ERR. If the system is stuck in ERR for a long time,
> they are unlikely to notice new errors or warnings. These systems can
> accumulate faults without the operator noticing.

All obvious, I would expect such an answer on a psychology mailing list ;)
I am mostly testing with ceph and trying to educate myself a bit. I am
asking because I had this error in Sep 2017; it disappeared when I changed
the crush reweight, reappeared in Jan 2018 after scrubbing, and now, after
adding the 4th node, it has disappeared again.

> > Why is this bad during upgrading?
>
> It depends what's gone wrong. For example:
> - If your cluster is degraded (fewer than the desired number of replicas
> of data), then taking more services offline (even briefly) to do an
> upgrade will create greater risk to the data by reducing the number of
> copies available.
> - If your system is in an error state because something has gone bad
> on disk, then recovering it with the same software that wrote the data
> is a more tested code path than running some newer code against a
> system left in a strange state by an older version.
>
> There will always be exceptions to this (e.g. where the upgrade is the
> fix for whatever caused the error), but the general-purpose advice is
> to get a system nice and clean before starting the upgrade.
>
> John
>
> > Can I quantify how bad it is? (like with large log/journal file?)
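
To make the "upgrade only in HEALTH_OK" advice actionable, a minimal
pre-upgrade gate could look like the sketch below. This is only an
illustration, assuming the standard ceph CLI is installed and can reach
the monitors with a readable client keyring; adapt it to your own
environment and tooling.

    #!/bin/sh
    # Refuse to start an upgrade unless the cluster reports HEALTH_OK.
    status="$(ceph health)"
    if [ "$status" = "HEALTH_OK" ]; then
        echo "Cluster is HEALTH_OK, proceeding with the upgrade."
    else
        echo "Cluster is in state '$status', not upgrading."
        # Show which checks are failing so they can be cleared first.
        ceph health detail
        exit 1
    fi

If your release appends summary text to the status line, matching on the
HEALTH_OK prefix (e.g. case "$status" in HEALTH_OK*) ... esac) instead of
strict equality is a safer variant.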