(resending due to a bounce)

+1 for using OpenTracing as the protocol for tracing. Specifically, it would be great if each Ceph role (MON, OSD, etc.) could emit OpenTracing data on a local socket, such that a sidecar could collect it using Jaeger or whatever else. That would let us trace a high-level application/user request all the way down to the Ceph layer (rough sketch below). This seems like a good first step.

I'm also unsure about building a dependency on Jaeger into ceph-mgr. It seems to me that some features, like "top" or a Ceph dashboard, will inherently require software/services outside of Ceph proper (Jaeger, Prometheus, web servers, etc.). If we only consider software that can become a ceph-mgr plugin, our choices are going to be limited, or we're going to end up building more than we need to. Instead, I think "top" could just assume that Jaeger and Prometheus exist in the cluster, and fall back to basic/limited information if they don't. We could provide recipes for running Jaeger and Prometheus via Ansible, Kubernetes, etc.

My 2 cents.
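To make the sidecar idea a bit more concrete, here is a rough sketch of what the daemon side could look like. It's in Python purely for illustration (the daemons would do this from C++ via opentracing-cpp), it assumes the jaeger_client package, and the service name, span name and op-handler shape are all made up -- the only real point is that the daemon reports spans to an agent on localhost and never talks to a central collector itself.

import opentracing
from jaeger_client import Config

def init_tracer(service_name='ceph-osd'):
    # The daemon only knows about localhost; whatever collects and forwards
    # the spans (a Jaeger agent, or anything speaking the same protocol)
    # runs next to it as a sidecar.
    config = Config(
        config={
            'sampler': {'type': 'const', 'param': 1},  # sample everything, for the sketch
            'local_agent': {
                'reporting_host': '127.0.0.1',
                'reporting_port': 6831,  # default Jaeger agent UDP port the sidecar listens on
            },
            'logging': False,
        },
        service_name=service_name,
        validate=True,
    )
    return config.initialize_tracer()

tracer = init_tracer()

def handle_op(op_name, carrier=None):
    # Hypothetical op handler: if the client propagated its span context in
    # the message, the daemon's span becomes a child of the application's
    # request span -- that's what gives the end-to-end view.
    parent_ctx = None
    if carrier:
        parent_ctx = tracer.extract(opentracing.Format.TEXT_MAP, carrier)
    with tracer.start_span(op_name, child_of=parent_ctx) as span:
        span.set_tag('component', 'osd')
        # ... actual op handling would go here ...

The point of the split is that the daemon only needs the OpenTracing API plus a local endpoint to send to; the sidecar decides where the data actually goes.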
> On Jan 22, 2018, at 10:12 AM, John Spray <jspray@xxxxxxxxxx> wrote:
>
> On Mon, Jan 22, 2018 at 5:59 PM, Mohamad Gebai <mgebai@xxxxxxx> wrote:
>> I really doubt anyone uses Blkin. I've tried to get it working on multiple occasions and was never able to. There's also the fact that there were problems in the actual code [1][2] that went unnoticed for a while. :) I don't know if anyone uses the LTTng tracepoints though, but I think they're more for developers than users.
>>
>> Regarding Sage's email, I think there are different aspects to this:
>>
>> 1. Debug logs
>> There's a compromise between ease of use and performance that has to be made. For debug logs, we currently use plain text logging, which is the simplest mechanism, but it's not very efficient. Moving to a "third-party" tracer for this might be risky since it requires the cluster admin to actually deal with the tracer (LTTng or other) or make Ceph and the tracer work closely together (enable tracing, manage the trace, etc.). A potential hybrid solution for Ceph would be to rework the current logging subsystem to be more efficient. For example, we could move to a compact binary format such as CTF [3]. This would make internal logging much more efficient in time (no locking or string handling required) and disk space.
>>
>> 2. rbd top and rados top
>> I don't know enough about what this feature implies to add anything to the discussion, but from what I understand, tracing might not be a good match for it. Do we really need tracing to get the top consumers of ops in the cluster? If we want the full history and the ability to replay, then sure. Otherwise, it might be more efficient to just do the bookkeeping internally (similar to perf dumps). Maybe I'm missing something or misunderstood these features?
>
> Just to fill out the context here, when we talk about "rbd top" it's a reference to some mechanism like (http://tracker.ceph.com/projects/ceph/wiki/Live_Performance_Probes), or any other mechanism that lets people see their system activity broken down by which client is most active, which rbd image is most busy, etc.
>
> The tracing relevance comes from the proposed underlying mechanism, which is to sample the ops on the OSD (and MDS for cephfs) and then aggregate them based on what the user is interested in (i.e. those per-client, per-image, etc) counts -- that's what makes it a tracing system, albeit with a very small number of tracepoints.
>
> John
>
>> 3. Distributed tracing (Jaeger/OpenTracing)
>> I agree with John that this might add significant infrastructure complexity. However, it could be that instrumenting the code according to the specification is enough, and let the user take care of plugging their tracer and infrastructure to it. I'm happy to do more research and see how this can be used in Ceph.
>>
>> Any thought on these points?
>>
>> [1] https://github.com/ceph/blkin/commit/ad1302e8aadadc1b68489a5576ee2d350fd502a3#diff-d470631b3b4002102866f0bcd25fcf5eL61
>> [2] https://github.com/ceph/ceph/commit/9049a97d414332d0d37a80a05c44d76b8dd5e8a0
>> [3] http://diamon.org/ctf/
>>
>> Mohamad
>>
>>
>> On 01/22/2018 05:07 AM, John Spray wrote:
>>
>> On Fri, Jan 19, 2018 at 7:52 PM, Sage Weil <sweil@xxxxxxxxxx> wrote:
>>
>> I think we need to take a step back and reconsider our approach to tracing. Thus far, it has been an ad hoc combination of our home-brew debug logs and a few lttng tracepoints. We have some initial integration of blkin tracepoints as well, but I'm not sure if anyone has actually used them.
>>
>> I'm looking at opentracing.io (see e.g. http://opentracing.io/documentation/) and this looks like a more viable path forward since it is not tied to specific tracing tools and is being adopted by CNCF projects. There's also the new Jaeger tool that is (from what I gather) a newer dapper/zipkin type tool that will presumably be usable if we go this path.
>>
>> I was on a call recently with a partner and they mentioned that their tool would consume tracepoints via opentracing tracepoints and that one of the key features was that it would sample instead of pulling an exhaustive trace. (From what I gather this is outside of the opentracing library in the application, though--it's up to the tracer to sample or not to sample.)
>>
>> One of the looming features that we'd like to work on now that the mgr is in place is a 'rados top' or 'rbd top' like function that samples requests at the OSDs (or clients?) to build an aggregate view of the top consumers of ops in the cluster. I'm wondering whether it makes sense to build this sort of functionality on top of generic tracepoints instead of our own purpose-built instrumentation.
>>
>> Given that the main "top" style stuff all comes from a single tracepoint per daemon ("handle op"), it seems like the actual tracing library is a relatively unimportant piece, unless there is something special about the way it does sampling. If the "top" stuff can use a generic tracing library then that's probably more of a bonus than a driver for adoption.
>>
>> For the central aggregation piece, I'm a little suspicious of packages like Jaeger that back onto a full Cassandra/Elasticsearch backend -- while they claim good performance, I don't know how big those database servers have to be for it all to work well. For something to be "out of the box" on Ceph (i.e. not require users to think about extra hardware) we need things that will work with relatively constrained system resources.
>>
>> It's definitely worth investigating though.
>>
>> John
>>
>>
>> Is there anyone who is interested in heading this effort/investigation up?
>>
>> sage
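P.S. On the "top" point above: whether the per-op samples come from a tracer or from internal bookkeeping as Mohamad suggests, the aggregation side is mostly just grouping sampled op records and keeping the top N, which could plausibly live in a mgr module. A rough sketch of that grouping step (plain Python; the record fields like 'client' and 'image' are made up for illustration):

from collections import Counter

def top_consumers(sampled_ops, group_by='client', n=10):
    # sampled_ops: iterable of dicts describing sampled ops, e.g.
    # {'client': 'client.4123', 'pool': 'rbd', 'image': 'vm-7'}
    counts = Counter()
    for op in sampled_ops:
        counts[op.get(group_by, 'unknown')] += 1
    return counts.most_common(n)

# Same sample stream, different groupings:
samples = [
    {'client': 'client.4123', 'image': 'vm-7'},
    {'client': 'client.4123', 'image': 'vm-9'},
    {'client': 'client.5001', 'image': 'vm-7'},
]
print(top_consumers(samples, group_by='client'))  # top clients
print(top_consumers(samples, group_by='image'))   # top images

Whether the OSDs push these samples to the mgr or the mgr pulls them is a separate question; the point is that none of this needs a Cassandra/Elasticsearch-backed trace store.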
--
To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at http://vger.kernel.org/majordomo-info.html