Hi,

please find attached a patch for ceph-0.48argonaut which adds the
timestamp code. The only binary which utilizes these timestamps is the
ceph-osd daemon. Timestamps are kept in memory. Configuration and
extraction of timestamps is achieved through the "ceph osd tell"
interface. Timestamps are only recorded for objects which match a
filtering string.

  ceph tell osd.0 code_profile help

  profiling data
  code_profile operations:
    info:                          get summary info
    setfilter <regex>:             set filter for object IDs
    completed-oids:                object IDs for which complete profiling data is available
    pending-oids:                  object IDs for which incomplete profiling data is available
    get-completed [prefix]:        get profiling data for completed objects
    get-primary-writes [prefix]:   get condensed profiling data for primary write operations
    get-secondary-writes [prefix]: get condensed profiling data for replica write operations
    get-pending [prefix]:          get profiling data for pending operations
    clear:                         free storage of profiled data and clear the object ID filter

In order to collect data for an object in a rados block device, the
object ID filter can be set like:

  ceph tell osd.0 code_profile setfilter rb.0.0.*

The timestamp code adds its own logging subsystem and hence is
configurable via ceph.conf, like:

  debug ITX = 2

The places where a timestamp is recorded are specified in
src/common/ITX.h (enum itx_checkpoint). Note that the patch reflects
just a snapshot of the ongoing analysis. Example: checkpoints
ITX_OSD_DISPATCH4 and ITX_OSD_DISPATCH5 have been used in
OSD::ms_dispatch(Message *m) just around the call to _dispatch() and
show that most of the computing time required for dispatching is spent
in this routine. I therefore don't believe that throttling is causing
the delays.
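To give a rough idea of the mechanism without opening the attachment,
here is a stripped-down, self-contained sketch (illustration only, not
the code from the attached patch; itx_record(), itx_set_filter() and
the in-memory layout are names invented for this sketch): a checkpoint
from enum itx_checkpoint plus a timestamp is appended to a per-object
list, but only when the object ID matches the configured filter.

  // Stripped-down illustration of the in-memory checkpoint timestamps.
  // NOT the attached patch: itx_record(), itx_set_filter() and the data
  // layout are invented for this sketch; only the two checkpoints
  // mentioned above are listed.
  #include <chrono>
  #include <iostream>
  #include <map>
  #include <mutex>
  #include <regex>
  #include <string>
  #include <vector>

  enum itx_checkpoint {
    ITX_OSD_DISPATCH4,   // just before the call to _dispatch() in OSD::ms_dispatch()
    ITX_OSD_DISPATCH5,   // just after the call to _dispatch()
    ITX_CHECKPOINT_MAX
  };

  struct itx_sample {
    itx_checkpoint cp;
    std::chrono::steady_clock::time_point t;
  };

  static std::mutex itx_lock;
  static std::regex itx_filter(".*");  // set via "code_profile setfilter <regex>"
  static std::map<std::string, std::vector<itx_sample>> itx_data;  // object ID -> samples

  void itx_set_filter(const std::string &re) {
    std::lock_guard<std::mutex> l(itx_lock);
    itx_filter = std::regex(re);
  }

  void itx_record(const std::string &oid, itx_checkpoint cp) {
    std::lock_guard<std::mutex> l(itx_lock);
    if (!std::regex_match(oid, itx_filter))
      return;  // only objects matching the filter are profiled
    itx_data[oid].push_back({cp, std::chrono::steady_clock::now()});
  }

  int main() {
    itx_set_filter("rb.0.0.*");                        // as in the rbd example above
    itx_record("rb.0.0.000004", ITX_OSD_DISPATCH4);    // made-up object ID
    // ... _dispatch() would run here ...
    itx_record("rb.0.0.000004", ITX_OSD_DISPATCH5);
    std::cout << "samples recorded: " << itx_data["rb.0.0.000004"].size() << "\n";
    return 0;
  }

The attached patch of course covers more checkpoints and exposes the
collected samples through the code_profile commands listed above.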
Best Regards

Andreas


On Thu, 16 Aug 2012 09:44:23 -0700
Yehuda Sadeh <yehuda@xxxxxxxxxxx> wrote:

> On Thu, Aug 16, 2012 at 9:08 AM, Andreas Bluemle
> <andreas.bluemle@xxxxxxxxxxx> wrote:
> >
> > Hi,
> >
> > I have been trying to migrate a ceph cluster (ceph-0.48argonaut)
> > to a high speed cluster network and encounter scalability problems:
> > the overall performance of the ceph cluster does not scale well
> > with an increase in the underlying networking speed.
> >
> > In short:
> >
> > I believe that the dispatching from SimpleMessenger to
> > OSD worker queues causes that scalability issue.
> >
> > Question: is it possible that this dispatching is causing
> > performance problems?
> >
> >
> > In detail:
> >
> > In order to find out more about this problem, I have added
> > profiling to the ceph code in various places; for write operations
> > to the primary or the secondary, timestamps are recorded for the
> > OSD object, offset and length of such a write request.
> >
> > Timestamps record:
> > - receipt time at SimpleMessenger
> > - processing time at OSD
> > - for primary write operations: wait time until the replication
> >   operation is acknowledged.
>
> Did you make any code changes? We'd love to see those.
>
> > What I believe is happening: dispatching requests from
> > SimpleMessenger to OSD worker threads seems to consume a fair
> > amount of time. This ends up in a widening gap between subsequent
> > receipts of requests and the start of OSD processing them.
> >
> > A primary write suffers twice from this problem: first because
> > the delay happens on the primary OSD and second because the
> > replicating OSD also suffers from the same problem - and hence
> > causes additional delays at the primary OSD when it waits for the
> > commit from the replicating OSD.
> >
> > In the attached graphics, the x-axis shows the time (in seconds).
> > The y-axis shows the offset at which a write request happened.
> >
> > The red bar represents the SimpleMessenger receive, i.e. from
> > reading the message header until enqueuing the completely decoded
> > message into the SimpleMessenger dispatch queue.
>
> Could it be that messages were throttled here?
> There's a configurable that can be set (ms dispatch throttle bytes),
> might affect that.
>
> > The green bar represents the time required for local processing,
> > i.e. dispatching to the OSD worker, writing to filesystem and
> > journal, and sending out the replication operation to the
> > replicating OSD. The right end of the green bar is the time when
> > locally everything has finished and a commit could happen.
> >
> > The blue bar represents the time until the replicating OSD has
> > sent a commit back to the primary OSD and the original write
> > request can be committed to the client.
> >
> > The green bar is interrupted by a black bar: the left end
> > represents the time when the request has been enqueued on the OSD
> > worker queue. The right end gives the time when the request is
> > taken off the OSD worker queue and actual OSD processing starts.
> >
> > The test was a simple sequential write to a rados block device.
> >
> > Reception of the write requests at the OSD is also sequential in
> > the graphics: a bar closer to the bottom of the graphics shows an
> > earlier write request.
> >
> > Note that the dispatching of a later request in all cases relates
> > to the enqueue time at the OSD worker queue of the previous write
> > request: the left end of a black bar lines up nicely with the
> > beginning of the green bar above it.
> >
>
> Thanks,
> Yehuda


--
Andreas Bluemle                       mailto:Andreas.Bluemle@xxxxxxxxxxx
ITXperts GmbH                         http://www.itxperts.de
Balanstrasse 73, Geb. 08              Phone: (+49) 89 89044917
D-81541 Muenchen (Germany)            Fax:   (+49) 89 89044910

Company details: http://www.itxperts.de/imprint.htm