> 2) RadosRely.py uses the following rebuild_time formula:
>
>     seconds = float(self.disk.size * self.full) / (speed * self.pgs)
>
> with self.pgs (declustering factor) = number of PGs in the OSD and
> speed = expected recovery rate (bytes/second).
>
> I think that assumption is not realistic in large deployments with
> multiple failing nodes. Typically the replication bandwidth will reach
> a limit when too many nodes are recovering in parallel. I assume,
> however, that ignoring that problem is enough to start. Do you agree?

I agree; in fact I have a card on my kanban board that's been sitting in
"next" for a while. Ceph is actually configurable in this respect, so
that you can prioritize client operations over recovery operations as
well as control the concurrency of placement group recovery. The
configuration parameters that people tend to adjust depending on their
cluster configuration are:

    osd recovery op priority
    osd max backfills
    osd recovery max active
    osd recovery threads

You can read a little bit about them here:

https://ceph.com/docs/master/rados/configuration/osd-config-ref/

Any improvement that makes the model more realistic would be a great
contribution. A good first pass might be adding a configuration setting
that is populated by the operator with osd_max_backfills multiplied by
the number of OSDs per host. This way you can divide the restore
bandwidth by the number of current restorations to get a per placement
group restoration rate. I'm not sure how best to track placement groups
that are waiting for a recovery slot. (There's a rough sketch of this
first pass at the end of this mail.)

> 3) The above formula doesn't say anything about the replication
> latency, the cost of encoding, etc. For the case of erasure coding,
> such values are significant. Furthermore, repairing means reading k
> chunks and storing k+m chunks again. What happens with the previous
> chunks that are still available? Is only the missing chunk replaced?
> I wonder how to adapt the previous formula with k and m. I add the
> factor 2 because the k chunks have to be fetched from somewhere else
> to do the re-coding and then k+m chunks have to be stored again:
>
>     seconds = float(self.disk.size * self.full * (2*k + m)) / (speed * self.pgs) ??

I'm curious how significant a difference this would make; I'd be
interested in seeing some rough tests of encoding and decoding. With the
test data you could probably extrapolate the results and get a rough
idea of the impact it would have on recovery time. (A sketch of how the
proposed traffic factor would plug into the formula is also appended
below.)

> 4) Declustering: I am not sure I understand how it works. From the
> reliability model: "The number of OSDs read-from and written-to is
> assumed to be equal to the specified declustering factor, with each of
> those transfers happening at the specified recovery speed." I also
> read something written by Sage in
> https://www.mail-archive.com/ceph-devel@xxxxxxxxxxxxxxx/msg01650.html
> but I am confused. It is not clear to me how to use self.pgs (in the
> context of the tool) and k & m in the formula above.

I'll use the example of a Ceph block device, which will be striped
across many placement groups. In this example the number of placement
groups the block volume can be striped across would be the declustering
factor (4MB stripe boundary by default). If we limit the number of
placement groups, then we decrease the likelihood that a lost placement
group will affect any given block device. (A small probability sketch of
this is appended below as well.) I'm not sure how erasure coding might
affect this, so I'll defer that part to Loic or Sam.
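Purely as an illustration of the first-pass idea above, here is a rough
sketch. None of these names match the actual RadosRely.py attributes;
recovery_bandwidth, osds_per_host and the min() handling of concurrency
are my assumptions, not how the tool works today:

    def rebuild_time(disk_size, full, recovery_bandwidth, pgs,
                     osd_max_backfills, osds_per_host):
        # recovery_bandwidth: aggregate restore bandwidth (bytes/second),
        # supplied by the operator rather than a per-OSD recovery rate.
        # pgs: declustering factor, i.e. number of PGs on the failed OSD.
        concurrent = osd_max_backfills * osds_per_host
        # Divide the restore bandwidth by the number of concurrent
        # restorations to get a per-placement-group restoration rate.
        per_pg_rate = float(recovery_bandwidth) / concurrent
        # At most min(pgs, concurrent) PGs recover at once. Assuming the
        # bandwidth stays saturated sidesteps tracking which PGs are
        # waiting for a recovery slot, which is the part I'm unsure about.
        active = min(pgs, concurrent)
        return float(disk_size * full) / (per_pg_rate * active)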
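And a sketch of how the (2k+m) traffic factor from question 3 would slot
into the existing formula. Whether 2k+m is really the right multiplier
(i.e. whether the surviving chunks actually get rewritten) is exactly the
open question, so treat this as illustrative only:

    def ec_rebuild_time(disk_size, full, speed, pgs, k, m):
        # Proposed traffic multiplier: read k chunks to re-code,
        # then write k + m chunks back out.
        traffic_factor = 2 * k + m
        return float(disk_size * full * traffic_factor) / (speed * pgs)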
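Finally, a back-of-the-envelope illustration of the declustering point:
if losses are uniform over placement groups, a volume striped across
fewer PGs is less likely to be touched by any given lost PG. The helper
and numbers below are made up for illustration, not taken from the model:

    from math import comb

    def p_volume_affected(total_pgs, volume_pgs, lost_pgs=1):
        # Probability that losing lost_pgs placement groups touches a
        # volume striped across volume_pgs of the pool's total_pgs.
        return 1.0 - comb(total_pgs - volume_pgs, lost_pgs) / comb(total_pgs, lost_pgs)

    print(p_volume_affected(total_pgs=4096, volume_pgs=1024))  # 0.25
    print(p_volume_affected(total_pgs=4096, volume_pgs=256))   # 0.0625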
--
Kyle