What rbd block size were you using?
-Sam

On Tue, Aug 21, 2012 at 10:29 PM, Andreas Bluemle
<andreas.bluemle@xxxxxxxxxxx> wrote:
> Hi,
>
> Samuel Just wrote:
>>
>> Was the cluster completely healthy at the time that those traces were taken?
>> If there were osds going in/out/up/down, it would trigger osdmap updates,
>> which would tend to hold the osd_lock for an extended period of time.
>>
>
> The cluster was completely healthy.
>
>> v0.50 included some changes that drastically reduce the purview of osd_lock.
>> In particular, pg op handling no longer grabs the osd_lock, and handle_osd_map
>> can proceed independently of the pg worker threads. Trying that might be
>> interesting.
>>
>
> I'll grab v0.50 and take a look.
>
>> -Sam
>>
>> On Tue, Aug 21, 2012 at 12:20 PM, Sage Weil <sage@xxxxxxxxxxx> wrote:
>>
>>> On Tue, 21 Aug 2012, Sage Weil wrote:
>>>
>>>> On Tue, 21 Aug 2012, Andreas Bluemle wrote:
>>>>
>>>>> Hi Sage,
>>>>>
>>>>> as mentioned, the workload is a single sequential write on
>>>>> the client. The write is not O_DIRECT; consequently, the
>>>>> messages arrive at the OSD with 124 KByte per write request.
>>>>>
>>>>> The attached pdf shows a timing diagram of two concurrent
>>>>> write operations (one primary and one replication / secondary).
>>>>>
>>>>> The time spent on the primary write to get the OSD::osd_lock
>>>>> correlates nicely with the time when this lock is released by
>>>>> the secondary write.
>>>>>
>>>
>>> Looking again at this diagram, I'm a bit confused. Is the Y axis the
>>> thread id or something? And the X axis is time in seconds?
>>>
>
> The X axis is time, the Y axis is the absolute offset of the write request
> on the rados block device.
>
>>> The big question for me is what on earth the secondary write (or primary,
>>> for that matter) is doing with osd_lock for a full 3 ms... If my reading
>>> of the units is correct, *that* is the real problem. It shouldn't be
>>> doing anything that takes that long. The exception is osdmap handling,
>>> which can do more work, but request processing should be very fast.
>>>
>>> Thanks-
>>> sage
>>>
>>>> Ah, I see.
>>>>
>>>> There isn't a trivial way to pull osd_lock out of the picture; there are
>>>> several data structures it's protecting (pg_map, osdmaps, peer epoch map,
>>>> etc.). Before we try going down that road, I suspect it might be more
>>>> fruitful to see where cpu time is being spent while osd_lock is held.
>>>>
>>>> How much of an issue does this specific contention seem to be for you?
>>>> Does it go away with larger writes?
>>>>
>>>> sage
>>>>
>>>>> Hope this helps
>>>>>
>>>>> Andreas
>>>>>
>>>>> Sage Weil wrote:
>>>>>>
>>>>>> On Mon, 20 Aug 2012, Andreas Bluemle wrote:
>>>>>>
>>>>>>> Hi Sage,
>>>>>>>
>>>>>>> Sage Weil wrote:
>>>>>>>>
>>>>>>>> Hi Andreas,
>>>>>>>>
>>>>>>>> On Thu, 16 Aug 2012, Andreas Bluemle wrote:
>>>>>>>>
>>>>>>>>> Hi,
>>>>>>>>>
>>>>>>>>> I have been trying to migrate a ceph cluster (ceph-0.48argonaut)
>>>>>>>>> to a high speed cluster network and am encountering scalability
>>>>>>>>> problems: the overall performance of the ceph cluster does not
>>>>>>>>> scale well with an increase in the underlying networking speed.
>>>>>>>>>
>>>>>>>>> In short:
>>>>>>>>>
>>>>>>>>> I believe that the dispatching from SimpleMessenger to
>>>>>>>>> OSD worker queues causes this scalability issue.
>>>>>>>>>
>>>>>>>>> Question: is it possible that this dispatching is causing
>>>>>>>>> performance problems?
>>>>>>>>
>>>>>>>> There is a single 'dispatch' thread that's processing this queue, and
>>>>>>>> conveniently perf lets you break down its profiling data on a
>>>>>>>> per-thread basis. Once you've ruled out the throttler as the culprit,
>>>>>>>> you might try running the daemon with 'perf record -g -- ceph-osd ...'
>>>>>>>> and then look specifically at where that thread is spending its time.
>>>>>>>> We shouldn't be burning that much CPU just doing the sanity checks and
>>>>>>>> then handing requests off to PGs...
>>>>>>>>
>>>>>>>> sage
>>>>>>>
>>>>>>> The effect which I am seeing may be related to a locking issue.
>>>>>>> As I read the code, there are multiple dispatchers running: one per
>>>>>>> SimpleMessenger.
>>>>>>>
>>>>>>> On a typical OSD node, there are
>>>>>>>
>>>>>>> - the instance of the SimpleMessenger processing input from the client
>>>>>>>   (primary writes)
>>>>>>> - other instances of SimpleMessenger, which process input from
>>>>>>>   neighboring OSD nodes
>>>>>>>
>>>>>>> the latter generate replication writes to the OSD I am looking at.
>>>>>>>
>>>>>>> On the other hand, there is a single instance of the OSD object within
>>>>>>> the ceph-osd daemon. When dispatching messages to the OSD, the
>>>>>>> OSD::osd_lock is held for the complete dispatch process; see the code
>>>>>>> below.
>>>>>>>
>>>>>>> When the write load increases, multiple SimpleMessenger instances start
>>>>>>> to contend for the OSD::osd_lock, and this may cause delays in the
>>>>>>> individual dispatch process.
>>>>>>
>>>>>> This is definitely possible, yes, although it would surprise me if it's
>>>>>> happening here (unless your workload is all small writes). Just to
>>>>>> confirm, are you actually observing osd_lock contention, or speculating
>>>>>> about what is causing the delays you're seeing?
>>>>>>
>>>>>> I'm not sure what the best tool is to measure lock contention. Mark was
>>>>>> playing with a 'poor man's wall clock profiler' using stack trace
>>>>>> sampling from gdb. That would tell us whether threads were really
>>>>>> blocking while obtaining the osd_lock...
>>>>>>
>>>>>> Can you tell us a bit more about what your workload is?
>>>>>>
>>>>>> sage
>>>>>>
>>>>>>> bool OSD::ms_dispatch(Message *m)
>>>>>>> {
>>>>>>>   // lock!
>>>>>>>   osd_lock.Lock();
>>>>>>>
>>>>>>>   while (dispatch_running) {
>>>>>>>     dout(10) << "ms_dispatch waiting for other dispatch thread to complete" << dendl;
>>>>>>>     dispatch_cond.Wait(osd_lock);
>>>>>>>   }
>>>>>>>   dispatch_running = true;
>>>>>>>
>>>>>>>   do_waiters();
>>>>>>>   _dispatch(m);
>>>>>>>   do_waiters();
>>>>>>>
>>>>>>>   dispatch_running = false;
>>>>>>>   dispatch_cond.Signal();
>>>>>>>
>>>>>>>   osd_lock.Unlock();
>>>>>>>   return true;
>>>>>>> }
>>>>>>>
>>>>>>>>>
>>>>>>>>> In detail:
>>>>>>>>>
>>>>>>>>> In order to find out more about this problem, I have added profiling
>>>>>>>>> to the ceph code in various places; for write operations to the
>>>>>>>>> primary or the secondary, timestamps are recorded for the OSD object,
>>>>>>>>> offset and length of such a write request.
>>>>>>>>>
>>>>>>>>> Timestamps recorded:
>>>>>>>>> - receipt time at the SimpleMessenger
>>>>>>>>> - processing time at the OSD
>>>>>>>>> - for primary write operations: wait time until the replication
>>>>>>>>>   operation is acknowledged
>>>>>>>>>
>>>>>>>>> What I believe is happening: dispatching requests from SimpleMessenger
>>>>>>>>> to OSD worker threads seems to consume a fair amount of time. This
>>>>>>>>> results in a widening gap between the receipt of subsequent requests
>>>>>>>>> and the start of their processing by the OSD.
>>>>>>>>>
>>>>>>>>> A primary write suffers twice from this problem: first because the
>>>>>>>>> delay happens on the primary OSD, and second because the replicating
>>>>>>>>> OSD suffers from the same problem - and hence causes additional delays
>>>>>>>>> at the primary OSD while it waits for the commit from the replicating
>>>>>>>>> OSD.
>>>>>>>>>
>>>>>>>>> In the attached graphics, the x-axis shows the time (in seconds) and
>>>>>>>>> the y-axis shows the offset at which a write request happened.
>>>>>>>>>
>>>>>>>>> The red bar represents the SimpleMessenger receive, i.e. from reading
>>>>>>>>> the message header until enqueuing the completely decoded message into
>>>>>>>>> the SimpleMessenger dispatch queue.
>>>>>>>>>
>>>>>>>>> The green bar represents the time required for local processing, i.e.
>>>>>>>>> dispatching to the OSD worker, writing to the filesystem and journal,
>>>>>>>>> and sending out the replication operation to the replicating OSD. The
>>>>>>>>> right end of the green bar is the time when everything has finished
>>>>>>>>> locally and a commit could happen.
>>>>>>>>>
>>>>>>>>> The blue bar represents the time until the replicating OSD has sent a
>>>>>>>>> commit back to the primary OSD and the original write request can be
>>>>>>>>> committed to the client.
>>>>>>>>>
>>>>>>>>> The green bar is interrupted by a black bar: the left end represents
>>>>>>>>> the time when the request has been enqueued on the OSD worker queue;
>>>>>>>>> the right end gives the time when the request is taken off the OSD
>>>>>>>>> worker queue and actual OSD processing starts.
>>>>>>>>>
>>>>>>>>> The test was a simple sequential write to a rados block device.
>>>>>>>>>
>>>>>>>>> Reception of the write requests at the OSD is also sequential in the
>>>>>>>>> graphics: a bar lower down in the graphics shows an earlier write
>>>>>>>>> request.
>>>>>>>>>
>>>>>>>>> Note that the dispatching of a later request in all cases relates to
>>>>>>>>> the enqueue time at the OSD worker queue of the previous write request:
>>>>>>>>> the left end of a black bar lines up nicely with the beginning of the
>>>>>>>>> green bar above it.
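
A rough way to answer the "are you actually observing osd_lock contention, or
speculating" question above is to instrument ms_dispatch() itself: time how long
the call blocks in osd_lock.Lock() and how long it then holds the lock. The sketch
below reuses the function as quoted above; the std::chrono timing and the 1 ms
reporting threshold are illustrative additions, not existing Ceph code (in the
actual tree one would presumably use Ceph's own clock helpers instead).

#include <chrono>

bool OSD::ms_dispatch(Message *m)
{
  using clock = std::chrono::steady_clock;

  clock::time_point t_enter = clock::now();
  osd_lock.Lock();                          // how long did we block here?
  clock::time_point t_locked = clock::now();

  while (dispatch_running) {
    dout(10) << "ms_dispatch waiting for other dispatch thread to complete" << dendl;
    dispatch_cond.Wait(osd_lock);
  }
  dispatch_running = true;

  do_waiters();
  _dispatch(m);
  do_waiters();

  dispatch_running = false;
  dispatch_cond.Signal();

  clock::time_point t_done = clock::now();  // lock is released just below
  osd_lock.Unlock();

  // report any dispatch that waited for, or held, the lock longer than ~1 ms
  std::chrono::duration<double, std::milli> waited = t_locked - t_enter;
  std::chrono::duration<double, std::milli> held   = t_done - t_locked;
  if (waited.count() > 1.0 || held.count() > 1.0)
    dout(1) << "ms_dispatch waited " << waited.count() << " ms for osd_lock,"
            << " held it for " << held.count() << " ms" << dendl;

  return true;
}

With something like this in place, the "3 ms under osd_lock" question above turns
into concrete log lines rather than guesswork about where the gap comes from.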
>>>>>>>>>
>>>>>>>>> --
>>>>>>>>> Andreas Bluemle                   mailto:Andreas.Bluemle@xxxxxxxxxxx
>>>>>>>>> ITXperts GmbH                     http://www.itxperts.de
>>>>>>>>> Balanstrasse 73, Geb. 08          Phone: (+49) 89 89044917
>>>>>>>>> D-81541 Muenchen (Germany)        Fax:   (+49) 89 89044910
>>>>>>>>>
>>>>>>>>> Company details: http://www.itxperts.de/imprint.htm
>>>>>>>
>>>>>>> --
>>>>>>> Andreas Bluemle                   mailto:Andreas.Bluemle@xxxxxxxxxxx
>>>>>>> ITXperts GmbH                     http://www.itxperts.de
>>>>>>> Balanstrasse 73, Geb. 08          Phone: (+49) 89 89044917
>>>>>>> D-81541 Muenchen (Germany)        Fax:   (+49) 89 89044910
>>>>>>>
>>>>>>> Company details: http://www.itxperts.de/imprint.htm
>>>>>
>>>>> --
>>>>> Andreas Bluemle                   mailto:Andreas.Bluemle@xxxxxxxxxxx
>>>>> ITXperts GmbH                     http://www.itxperts.de
>>>>> Balanstrasse 73, Geb. 08          Phone: (+49) 89 89044917
>>>>> D-81541 Muenchen (Germany)        Fax:   (+49) 89 89044910
>>>>>
>>>>> Company details: http://www.itxperts.de/imprint.htm
>
> --
> Andreas Bluemle                   mailto:Andreas.Bluemle@xxxxxxxxxxx
> ITXperts GmbH                     http://www.itxperts.de
> Balanstrasse 73, Geb. 08          Phone: (+49) 89 89044917
> D-81541 Muenchen (Germany)        Fax:   (+49) 89 89044910
>
> Company details: http://www.itxperts.de/imprint.htm
>
--
To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at http://vger.kernel.org/majordomo-info.html
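
As a closing illustration of the direction the v0.50 changes mentioned at the top
of the thread point in (pg op handling no longer grabbing osd_lock), here is a
minimal standalone sketch of the general idea of narrowing the lock: the messenger
dispatch thread only takes a short per-shard lock to enqueue a message, and worker
threads drain the queues without touching any global lock. This is purely an
illustration under those assumptions, not the actual v0.50 implementation; all
names here (ShardedDispatchQueue, Shard, enqueue, dequeue) are made up.

#include <condition_variable>
#include <deque>
#include <mutex>

struct Message;   // opaque placeholder for an incoming request

class ShardedDispatchQueue {
  struct Shard {
    std::mutex lock;                    // small lock, held only to enqueue/dequeue
    std::condition_variable cond;
    std::deque<Message*> queue;
  };
  static const int NUM_SHARDS = 8;
  Shard shards[NUM_SHARDS];

public:
  // called by each messenger dispatch thread: cheap, short critical section
  void enqueue(unsigned pg_id, Message *m) {
    Shard &s = shards[pg_id % NUM_SHARDS];
    std::lock_guard<std::mutex> l(s.lock);
    s.queue.push_back(m);
    s.cond.notify_one();
  }

  // called by a worker thread bound to one shard: blocks until work arrives
  Message *dequeue(unsigned shard_id) {
    Shard &s = shards[shard_id % NUM_SHARDS];
    std::unique_lock<std::mutex> l(s.lock);
    s.cond.wait(l, [&] { return !s.queue.empty(); });
    Message *m = s.queue.front();
    s.queue.pop_front();
    return m;
  }
};

A messenger thread would simply call enqueue(pg_id, m) and return, while per-shard
worker threads loop on dequeue() and do the real request processing. The point of
the sketch is that the dispatch threads' critical section shrinks to a queue push,
which is the kind of change the v0.50 notes above describe.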