RE: perf counters from a performance discrepancy

Sage Weil <sweil@xxxxxxxxxx> · Thu, 8 Oct 2015 08:55:51 -0700 (PDT)

On Thu, 8 Oct 2015, Deneau, Tom wrote:
> > -----Original Message-----
> > From: Sage Weil [mailto:sweil@xxxxxxxxxx]
> > Sent: Wednesday, October 07, 2015 9:48 PM
> > To: Deneau, Tom
> > Cc: Mark Nelson; Gregory Farnum; ceph-devel@xxxxxxxxxxxxxxx
> > Subject: RE: perf counters from a performance discrepancy
> > 
> > > I finally got around to looking at the dump_historic_ops output for
> > > the 1-client and 2-client cases.
> > > As you recall these are all read-ops.  so the events in the dump are
> > >    initiated
> > >    reached_pg
> > >    started
> > >    done
> > >
> > > The pattern I see for most of the slow ops recorded in the dump is:
> > >
> > >    * In the 1-client case the typical slow op has duration between 50-65
> > ms
> > >      and usually most of this is the interval between reached_pg and
> > started.
> > >
> > >    * In the 2-client case the typical slow op has duration between 95-
> > 120 ms
> > >      and again usually most of this is the interval between reached_pg
> > >      and started.
> > >
> > > Could someone describe what the interval between reached_pg and
> > > started means?
> > 
> > I think the slow part is probably find_object_context() (although to be
> > fair tons of stuff happens here, see do_op()).  You could test this theory
> > or otherwise narrow this down with additional event markers lke
> > 
> > diff --git a/src/osd/ReplicatedPG.cc b/src/osd/ReplicatedPG.cc index
> > d6f3084..6faccc2 100644
> > --- a/src/osd/ReplicatedPG.cc
> > +++ b/src/osd/ReplicatedPG.cc
> > @@ -1691,10 +1691,12 @@ void ReplicatedPG::do_op(OpRequestRef& op)
> >      return;
> >    }
> > 
> > +  op->mark_event("about to find");
> >    int r = find_object_context(
> >      oid, &obc, can_create,
> >      m->has_flag(CEPH_OSD_FLAG_MAP_SNAP_CLONE),
> >      &missing_oid);
> > +  op->mark_event("found");
> > 
> >    if (r == -EAGAIN) {
> >      // If we're not the primary of this OSD, and we have
> > 
> > 
> > sage
> 
> Sage --
> 
> Is it likely that find_object_context would take longer when there are two clients
> each using their own pool (compared to one client using one pool)?
> 
> And would two clients using the same pool spend less time in find_object_context?

Maybe.. find_object_context is where we lookup the file, and if there 
are caching effects that would be one place where we'd see it.

sage

--
To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at  http://vger.kernel.org/majordomo-info.html