> =================Faster Peering/Lower Tail Latency====================
>
> https://wiki.ceph.com/Planning/Blueprints/Infernalis/osd%3A_Faster_Peering
>
> https://wiki.ceph.com/Planning/Blueprints/Infernalis/Improve_tail_latency
>
> http://pad.ceph.com/p/I-faster-peering_tailing
>
> In addition to what is in the blueprint, Sage suggested that the primary
> in some cases can keep the peer_info and peer_missing sets which it
> already has if the acting set stays the same or shrinks.
>
> We also touched on prepopulating pg_temp at the monitor and setting a
> different temp pg primary at the monitor in the map which marks an osd
> back up to avoid that pg being primary immediately (and having to block
> reads and writes on recovery).
>

Hi Sam,

In our experience, peering is most painful when one or more OSDs stay
down (but still in) for a while and then come back up, for example after
an OSD or an OSD host crashes without notice, or when the hardware takes
time to repair. When the OSD comes back up, PG::recovery_map has to be
populated. Say N objects are missing and there are M replicas; the
search for missing objects currently costs N*M*logN. When N is large
(the OSD was down for a while), M is large (EC pool), and many PGs are
going through this process at once, that cost is non-trivial. Tracker
#9558 has some logs with more details.

I am thinking of a simple optimization: detect the case where only one
replica (in the actingbackfill set) has missing objects and all the
others are complete. In that case we can populate the recovery_map
directly by specifying the (M - 1) replicas that have no missing objects
as recovery sources, which could bring the complexity down to N*logN.

Does that make sense? If it does, I will go ahead and provide a patch.

Thanks,
Guang
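
To make the proposed fast path concrete, here is a minimal Python sketch. It is not Ceph code: the function names, the dict-of-sets model of per-replica missing sets, and the integer OSD ids are all assumptions made for illustration. Python sets are hash-based, so the logN factor of an ordered map does not appear here; the point is only that the fast path stops probing every replica for every missing object.

```python
from typing import Dict, FrozenSet, Set

# Hypothetical model: peer_missing maps an OSD id to the set of object
# names that replica is missing. An empty set means the replica is complete.
PeerMissing = Dict[int, Set[str]]
RecoveryMap = Dict[str, FrozenSet[int]]


def build_recovery_map_general(peer_missing: PeerMissing) -> RecoveryMap:
    """General path: for each missing object, probe every replica (N*M probes)."""
    needed = set().union(*peer_missing.values())
    recovery_map: RecoveryMap = {}
    for obj in needed:
        # A replica is a valid source if it is not itself missing the object.
        recovery_map[obj] = frozenset(
            osd for osd, missing in peer_missing.items() if obj not in missing
        )
    return recovery_map


def build_recovery_map(peer_missing: PeerMissing) -> RecoveryMap:
    """Use the fast path when exactly one replica has missing objects."""
    incomplete = [osd for osd, missing in peer_missing.items() if missing]
    if len(incomplete) == 1:
        lagging = incomplete[0]
        # Every other replica is complete, so the same (M - 1) sources
        # apply to all N objects; compute them once and share them.
        sources = frozenset(osd for osd in peer_missing if osd != lagging)
        return {obj: sources for obj in peer_missing[lagging]}
    return build_recovery_map_general(peer_missing)


if __name__ == "__main__":
    # OSD 1 was down for a while; OSDs 0 and 2 are complete.
    example = {0: set(), 1: {"obj_a", "obj_b"}, 2: set()}
    print(build_recovery_map(example))
```

In the single-lagging-replica case both functions return the same map, but the fast path does one pass over the replicas plus one pass over the missing objects instead of a replica scan per object.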