Thanks, interesting to read. So in Luminous it is not really a problem. I
was expecting to get into trouble with the monitors/MDS, because my
failover takes quite long, and I thought that was related to the damaged PG.

Luminous:
"When the past intervals tracking structure was rebuilt around exactly the
information required, it became extremely compact and relatively
insensitive to extended periods of cluster unhealthiness"

> > The advised solution is to upgrade Ceph only in HEALTH_OK state. And I
> > also read somewhere that it is bad to have your cluster for a long time in
> > a HEALTH_ERR state.
> >
> > But why is this bad?

See https://ceph.com/community/new-luminous-pg-overdose-protection
under "Problems with past intervals":

"if the cluster becomes unhealthy, and especially if it remains unhealthy
for an extended period of time, a combination of effects can cause
problems."

"If a cluster is unhealthy for an extended period of time (e.g., days or
even weeks), the past interval set can become large enough to require a
significant amount of memory."
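
For what it's worth, the "upgrade only in HEALTH_OK" advice quoted above could be enforced with a small pre-upgrade check. This is only a rough sketch, not an official procedure: the helper names are made up for illustration, and the JSON field names ("status" on Luminous and later, "overall_status" on older releases) are my assumption about what `ceph status --format json` returns, so please verify them against your own cluster first.

```python
import json
import subprocess


def health_status(status_json: str) -> str:
    """Extract the overall health string from `ceph status --format json` output.

    Assumption: the health section carries "status" (Luminous and later) or
    "overall_status" (older releases). Check this against your own cluster.
    """
    health = json.loads(status_json).get("health", {})
    # Prefer the newer field; fall back to the older one if it is absent.
    return health.get("status") or health.get("overall_status") or "UNKNOWN"


def safe_to_upgrade(status_json: str) -> bool:
    """Return True only when the cluster reports HEALTH_OK."""
    return health_status(status_json) == "HEALTH_OK"


def read_cluster_status() -> str:
    """Run the ceph CLI and return its JSON status output (hypothetical helper)."""
    result = subprocess.run(
        ["ceph", "status", "--format", "json"],
        check=True,
        capture_output=True,
        text=True,
    )
    return result.stdout
```

On a node with the ceph CLI and a working keyring, `safe_to_upgrade(read_cluster_status())` would then be the gate to check before starting the upgrade.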