Re: Long peering - throttle at FileStore::queue_transactions

Sage Weil <sage@xxxxxxxxxxxx> · Wed, 6 Jan 2016 09:09:42 -0500 (EST)

On Tue, 5 Jan 2016, Guang Yang wrote:
> On Mon, Jan 4, 2016 at 7:21 PM, Sage Weil <sage@xxxxxxxxxxxx> wrote:
> > On Mon, 4 Jan 2016, Guang Yang wrote:
> >> Hi Cephers,
> >> Happy New Year! I have a question regarding long PG peering.
> >>
> >> Over the last several days I have been looking into the *long peering*
> >> problem when we start an OSD / OSD host. What I observed was that the
> >> two peering worker threads were throttled (stuck) when trying to
> >> queue new transactions (writing pg log), so the peering process
> >> slowed down dramatically.
> >>
> >> The first question that came to me was: what were the transactions in
> >> the queue? The major ones, as I saw, included:
> >>
> >> - The osd_map and incremental osd_map. This happens if the OSD had
> >> been down for a while (in a large cluster), or when the cluster got
> >> upgraded, which left the osd_map epoch held by the down OSD far behind
> >> the latest osd_map epoch. During OSD boot, it would need to
> >> persist all those osd_maps, generating lots of filestore transactions
> >> (linear in the epoch gap).
> >> As the PG was not involved in most of those epochs, could we only take
> >> and persist those osd_maps which matter to the PGs on the OSD?
> >
> > This part should happen before the OSD sends the MOSDBoot message, before
> > anyone knows it exists.  There is a tunable threshold that controls how
> > recent the map has to be before the OSD tries to boot.  If you're
> > seeing this in the real world, we probably just need to adjust that value
> > way down to something small(er).
> It would queue the transactions and then send out the MOSDBoot, thus
> there is still a chance that it could contend with the peering
> OPs (especially on large clusters where there is lots of activity
> which generates many osdmap epochs). Any chance we can change
> *queue_transactions* to *apply_transactions*, so that we block there
> waiting for the osdmap to be persisted? At least we may be able to
> do that during OSD booting? The concern is, if the OSD is active,
> apply_transaction would take longer while holding the osd_lock.
> I can't find such a tunable, could you elaborate? Thanks!

Yeah, that sounds like a good idea (and clearly safe).  Probably a simpler 
fix is to just call store->flush() or similar before sending the boot 
message?
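
To illustrate the ordering only, here is a minimal sketch of "queue the map
transactions, flush, then send boot". It uses a toy store, not the real
FileStore: ToyStore, boot_after_flush and the persisted counter are
hypothetical names made up for this example, and the single worker thread is
an assumption. The only part taken from the thread is that flush() should
block until everything queued so far has been applied.

```cpp
#include <atomic>
#include <cassert>
#include <condition_variable>
#include <cstddef>
#include <deque>
#include <functional>
#include <mutex>
#include <thread>
#include <utility>

// Toy stand-in for an object store with an asynchronous transaction queue.
// Hypothetical: this is NOT Ceph's ObjectStore/FileStore interface.
class ToyStore {
 public:
  ToyStore() : worker_([this] { run(); }) {}

  ~ToyStore() {
    {
      std::lock_guard<std::mutex> l(m_);
      stop_ = true;
    }
    cv_.notify_all();
    worker_.join();
  }

  // Asynchronous: returns as soon as the transaction is queued.
  void queue_transaction(std::function<void()> txn) {
    {
      std::lock_guard<std::mutex> l(m_);
      q_.push_back(std::move(txn));
    }
    cv_.notify_all();
  }

  // Blocks until every transaction queued so far has been applied.
  void flush() {
    std::unique_lock<std::mutex> l(m_);
    cv_.wait(l, [this] { return q_.empty() && !busy_; });
  }

 private:
  // Worker loop: apply queued transactions one at a time.
  void run() {
    std::unique_lock<std::mutex> l(m_);
    for (;;) {
      cv_.wait(l, [this] { return stop_ || !q_.empty(); });
      if (q_.empty()) return;  // stop requested and nothing left to apply
      std::function<void()> txn = std::move(q_.front());
      q_.pop_front();
      busy_ = true;
      l.unlock();
      txn();  // apply outside the lock so callers can keep queueing
      l.lock();
      busy_ = false;
      cv_.notify_all();  // wake any flush() waiters
    }
  }

  std::mutex m_;
  std::condition_variable cv_;
  std::deque<std::function<void()>> q_;
  bool busy_ = false;
  bool stop_ = false;
  std::thread worker_;  // declared last so the members above exist first
};

// Hypothetical boot helper: queue one "map" transaction per epoch, flush,
// and report how many were applied at the point where the boot message
// would be sent.
std::size_t boot_after_flush(std::size_t epochs) {
  std::atomic<std::size_t> persisted{0};
  ToyStore store;
  for (std::size_t e = 0; e < epochs; ++e) {
    store.queue_transaction([&persisted] { ++persisted; });
  }
  store.flush();            // wait for the backlog to drain
  return persisted.load();  // "send boot" would happen here
}
```

With the flush in place, boot_after_flush(n) returns n, i.e. no map
transactions are still queued when peering work starts to arrive. Without the
flush() call the returned count could be anything from 0 to n.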

sage
