On Tue, 5 Jan 2016, Guang Yang wrote:
> On Mon, Jan 4, 2016 at 7:21 PM, Sage Weil <sage@xxxxxxxxxxxx> wrote:
> > On Mon, 4 Jan 2016, Guang Yang wrote:
> >> Hi Cephers,
> >> Happy New Year! I have a question regarding long PG peering.
> >>
> >> Over the last several days I have been looking into the *long peering*
> >> problem when we start an OSD / OSD host. What I observed was that the
> >> two peering worker threads were throttled (stuck) when trying to
> >> queue new transactions (writing the pg log), so the peering process
> >> slowed down dramatically.
> >>
> >> The first question that came to me was: what were the transactions in
> >> the queue? The major ones, as I saw, included:
> >>
> >> - The osd_map and incremental osd_map. This happens if the OSD had
> >> been down for a while (in a large cluster), or when the cluster got
> >> upgraded, which left the osd_map epoch the down OSD had far behind
> >> the latest osd_map epoch. During OSD booting, it would need to
> >> persist all those osd_maps and generate lots of filestore transactions
> >> (linear with the epoch gap).
> >> > As the PG was not involved in most of those epochs, could we only
> >> > take and persist those osd_maps which matter to the PGs on the OSD?
> >
> > This part should happen before the OSD sends the MOSDBoot message, before
> > anyone knows it exists. There is a tunable threshold that controls how
> > recent the map has to be before the OSD tries to boot. If you're
> > seeing this in the real world, you probably just need to adjust that value
> > way down to something small(er).
> It would queue the transactions and then send out the MOSDBoot, so
> there is still a chance that it could contend with the peering
> ops (especially on large clusters where there is a lot of activity
> generating many osdmap epochs). Any chance we can change
> *queue_transactions* to *apply_transactions*, so that we block there
> waiting for the osdmap to be persisted? At least we may be able to
> do that during OSD booting? The concern is that, if the OSD is active,
> apply_transaction would take longer while holding the osd_lock.
> I can't find such a tunable; could you elaborate? Thanks!

Yeah, that sounds like a good idea (and clearly safe). Probably a simpler
fix is to just call store->flush() or similar before sending the boot
message?

sage
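
To make the queue-then-flush idea above concrete, here is a minimal sketch. It is not Ceph code: ToyStore, its queue_transaction() and flush() methods, and boot_sketch() are hypothetical stand-ins written for illustration, and the real ObjectStore interface may behave differently. The sketch only shows the ordering being proposed: queue one write per missed epoch asynchronously, then block until the queue has drained before doing anything that would announce the OSD.

```cpp
#include <cassert>
#include <condition_variable>
#include <cstddef>
#include <deque>
#include <functional>
#include <mutex>
#include <thread>
#include <utility>

// Hypothetical toy store with one background worker. NOT Ceph's ObjectStore.
class ToyStore {
public:
  ToyStore() : worker_([this] { run(); }) {}

  ~ToyStore() {
    {
      std::lock_guard<std::mutex> l(lock_);
      stop_ = true;
    }
    cond_.notify_all();
    worker_.join();  // the worker drains anything still queued, then exits
  }

  // Asynchronous: enqueue the transaction and return immediately.
  void queue_transaction(std::function<void()> txn) {
    {
      std::lock_guard<std::mutex> l(lock_);
      queue_.push_back(std::move(txn));
    }
    cond_.notify_all();
  }

  // Block until every transaction queued so far has been applied.
  void flush() {
    std::unique_lock<std::mutex> l(lock_);
    drained_.wait(l, [this] { return queue_.empty() && !busy_; });
  }

private:
  void run() {
    std::unique_lock<std::mutex> l(lock_);
    for (;;) {
      cond_.wait(l, [this] { return stop_ || !queue_.empty(); });
      if (queue_.empty()) return;  // stop requested and nothing left to apply
      auto txn = std::move(queue_.front());
      queue_.pop_front();
      busy_ = true;
      l.unlock();
      txn();  // apply outside the lock so producers are not blocked
      l.lock();
      busy_ = false;
      if (queue_.empty()) drained_.notify_all();
    }
  }

  std::mutex lock_;
  std::condition_variable cond_, drained_;
  std::deque<std::function<void()>> queue_;
  bool busy_ = false;
  bool stop_ = false;
  std::thread worker_;  // declared last so it starts after the other members
};

// Hypothetical boot sequence: persist one "map" per missed epoch, then flush
// before the point where a boot message would be sent.
std::size_t boot_sketch(std::size_t epoch_gap) {
  std::size_t persisted = 0;  // written only by the store's worker thread
  ToyStore store;
  for (std::size_t e = 0; e < epoch_gap; ++e)
    store.queue_transaction([&persisted] { ++persisted; });
  store.flush();                    // wait here instead of racing with peering
  assert(persisted == epoch_gap);   // all map writes are done at this point
  // ... a real OSD would send its boot message only after this ...
  return persisted;
}
```

In this sketch the caller pays the wait once, at boot, rather than on every write, which is the trade-off being discussed: the queue stays asynchronous, but nothing queued during startup can still be pending when peering work begins.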