On Mon, Jan 4, 2016 at 7:21 PM, Sage Weil <sage@xxxxxxxxxxxx> wrote:
> On Mon, 4 Jan 2016, Guang Yang wrote:
>> Hi Cephers,
>> Happy New Year! I've got a question regarding the long PG peering.
>>
>> Over the last several days I have been looking into the *long peering*
>> problem when we start an OSD / OSD host. What I observed was that the
>> two peering worker threads were throttled (stuck) when trying to
>> queue new transactions (writing the pg log), so the peering process
>> was dramatically slowed down.
>>
>> The first question that came to me was: what were the transactions in
>> the queue? The major ones, as I saw, included:
>>
>> - The osd_map and incremental osd_map. This happens if the OSD had
>> been down for a while (in a large cluster), or when the cluster got
>> upgraded, which left the osd_map epoch the down OSD had far behind
>> the latest osd_map epoch. During OSD boot, it needs to persist all
>> those osd_maps, which generates lots of filestore transactions
>> (linear in the epoch gap).
>>
>> As the PG was not involved in most of those epochs, could we only
>> take and persist those osd_maps which matter to the PGs on the OSD?
>
> This part should happen before the OSD sends the MOSDBoot message,
> before anyone knows it exists. There is a tunable threshold that
> controls how recent the map has to be before the OSD tries to boot.
> If you're seeing this in the real world, we probably just need to
> adjust that value way down to something small(er).

It queues the transactions and then sends out the MOSDBoot, so there is
still a chance that they contend with the peering ops (especially on
large clusters, where lots of activity generates many osdmap epochs).

Any chance we could change *queue_transactions* to *apply_transactions*,
so that we block there waiting for the osdmap to be persisted? At least
we might be able to do that during OSD boot. The concern is that if the
OSD is active, apply_transaction would take longer while holding the
osd_lock.

I couldn't find such a tunable; could you elaborate? Thanks!

> sage
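
PS: to make the idea concrete, here is a minimal toy sketch (plain C++,
not the actual ObjectStore/FileStore code; the class, the queue bound,
and the transaction type are all made up for illustration) contrasting
the throttled async queue path with a synchronous apply path that would
be used only at boot:

#include <condition_variable>
#include <cstddef>
#include <deque>
#include <functional>
#include <iostream>
#include <mutex>
#include <thread>

// Hypothetical stand-in for an ObjectStore-style transaction.
struct Transaction { std::function<void()> work; };

class ToyStore {
  std::mutex m_;
  std::condition_variable not_full_, not_empty_;
  std::deque<Transaction> q_;
  const std::size_t max_queued_ = 4;  // stand-in for the filestore throttle
  bool stopping_ = false;
  std::thread worker_;                // declared last so members exist first

public:
  ToyStore() : worker_([this] { run(); }) {}
  ~ToyStore() {
    { std::lock_guard<std::mutex> l(m_); stopping_ = true; }
    not_empty_.notify_all();
    worker_.join();
  }

  // Async path: blocks only while the queue is full (the throttle the
  // peering threads were hitting), then returns before the txn is durable.
  void queue_transaction(Transaction t) {
    std::unique_lock<std::mutex> l(m_);
    not_full_.wait(l, [this] { return q_.size() < max_queued_; });
    q_.push_back(std::move(t));
    not_empty_.notify_one();
  }

  // Sync path: do the work inline and return only once it is done, so
  // boot-time osdmap writes never occupy the shared queue at all.
  void apply_transaction(Transaction t) { t.work(); }

private:
  void run() {
    for (;;) {
      Transaction t;
      {
        std::unique_lock<std::mutex> l(m_);
        not_empty_.wait(l, [this] { return stopping_ || !q_.empty(); });
        if (q_.empty()) return;  // stopping and drained
        t = std::move(q_.front());
        q_.pop_front();
        not_full_.notify_one();
      }
      t.work();
    }
  }
};

int main() {
  ToyStore store;
  // Boot: persist the map-epoch backlog synchronously, off the queue.
  for (int epoch = 100; epoch < 110; ++epoch)
    store.apply_transaction({[epoch] {
      std::cout << "persisted osdmap epoch " << epoch << "\n";
    }});
  // Steady state: peering ops keep using the async, throttled path.
  store.queue_transaction({[] { std::cout << "pg log write queued\n"; }});
}

In the real code the synchronous path would of course still have to
journal/fsync; the sketch only shows how boot-time map writes would stay
out of the shared, throttled queue that the peering threads depend on.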