Hi,
the size of the rbd data objects on the OSD is 4 MByte (default).
Best Regards
Andreas
Samuel Just wrote:
What rbd block size were you using?
-Sam
On Tue, Aug 21, 2012 at 10:29 PM, Andreas Bluemle
<andreas.bluemle@xxxxxxxxxxx> wrote:
Hi,
Samuel Just wrote:
Was the cluster complete healthy at the time that those traces were taken?
If there were osds going in/out/up/down, it would trigger osdmap updates
which
would tend to hold the osd_lock for an extended period of time.
The cluster was completely healthy.
v0.50 included some changes that drastically reduce the purview of
osd_lock.
In particular, pg op handling no longer grabs the osd_lock and
handle_osd_map
can proceed independently of the pg worker threads. Trying that might be
interesting.
I'll grab v0.50 and take a look.
-Sam
On Tue, Aug 21, 2012 at 12:20 PM, Sage Weil <sage@xxxxxxxxxxx> wrote:
On Tue, 21 Aug 2012, Sage Weil wrote:
On Tue, 21 Aug 2012, Andreas Bluemle wrote:
Hi Sage,
as mentioned, the workload is a single sequential write on
the client. The write is not O_DIRECT; and consequently
the messages arrive at the OSD with 124 KByte per write request.
The attached pdf shows a timing diagram of two concurrent
write operations (one primary and one replication / secondary).
The time spent on the primary write to get the OSD.:osd_lock
releates nicely with the time when this lock is released by the
secondary write.
Looking again at this diagram, I'm a bit confused. Is the Y access the
thread id or something? And the X axis is time in seconds?
X-Axis is time, Y Axis is absolute offset of the write request on the rados
block device.
The big question for me is what on earth the secondary write (or primary,
for that matter) is doing with osd_lock for a full 3 ms... If my reading
of the units is correct, *that* is the real problem. It shouldn't be
doing anything that takes that long. The exception is osdmap handling,
which can do more work, but request processing should be very fast.
Thanks-
sage
Ah, I see.
There isn't a trivial way to pull osd_lock out of the picture; there are
several data structures it's protecting (pg_map, osdmaps, peer epoch
map,
etc.). Before we try going down that road, I suspect it might be more
fruitful to see where cpu time is being spent while osd_lock is held.
How much of an issue does it look like this specific contention is for
you? Does it go away with larger writes?
sage
Hope this helps
Andreas
Sage Weil wrote:
On Mon, 20 Aug 2012, Andreas Bluemle wrote:
Hi Sage,
Sage Weil wrote:
Hi Andreas,
On Thu, 16 Aug 2012, Andreas Bluemle wrote:
Hi,
I have been trying to migrate a ceph cluster (ceph-0.48argonaut)
to a high speed cluster network and encounter scalability problems:
the overall performance of the ceph cluster does not scale well
with an increase in the underlying networking speed.
In short:
I believe that the dispatching from SimpleMessenger to
OSD worker queues causes that scalability issue.
Question: is it possible that this dispatching is causing
performance
problems?
There is a single 'dispatch' thread that's processing this queue,
and
conveniently perf lets you break down its profiling data on a
per-thread
basis. Once you've ruled out the throttler as the culprit, you
might
try
running the daemon with 'perf record -g -- ceph-osd ...' and then
look
specifically at where that thread is spending its time. We
shouldn't be
burning that much CPU just doing the sanity checks and then handing
requests
off to PGs...
sage
The effect, which I am seeing, may be related to some locking
issue.
As I read the code, there are multiple dispatchers running: one per
SimpleMessenger.
On a typical OSD node, there is
- the instance of the SimpleMessenger processing input from the
client
(primary writes)
- other instances of SimpleMessenger, which process input from
neighbor
OSD
nodes
the latter generate replication writes to the OSD I am looking at.
On the other hand, there is a single instance of the OSD object
within the
ceph-osd daemon.
When dispatching messages to the OSD, then the OSD::osd_lock is held
for
the
complete
process of dispatching; see code below.
When the write load increases, then multiple SimpleMessenger
instances
start
to
congest on the OSD::osd_lock.
And this may cause delays in the individual dispatch process.
This is definitely possible, yes, although it would surprise me if
it's
happening here (unless your workload is all small writes). Just to
confirm,
are you actually observing osd_lock contention, or speculating about
what is
causing the delays you're seeing?
I'm not sure what the best tool is to measure lock contention. Mark
was
playing with a 'poor man's wall clock profiler' using stack trace
sampling
from gdb. That would tell us whether threads were really blocking
while
obtaining the osd_lock...
Can you tell us a bit more about what your workload is?
sage
bool OSD::ms_dispatch(Message *m)
{
// lock!
osd_lock.Lock();
while (dispatch_running) {
dout(10) << "ms_dispatch waiting for other dispatch thread to
complete" <<
dendl;
dispatch_cond.Wait(osd_lock);
}
dispatch_running = true;
do_waiters();
_dispatch(m);
do_waiters();
dispatch_running = false;
dispatch_cond.Signal();
osd_lock.Unlock();
return true;
}
In detail:
In order to find out more about this problem, I have added
profiling
to
the ceph code in various place; for write operations to the primary
or
the
secondary, timestamps are recorded for OSD object, offset and
length
of
the such a write request.
Timestamps record:
- receipt time at SimpleMessenger
- processing time at osd
- for primary write operations: wait time until replication
operation
is acknowledged.
What I believe is happening: dispatching requests from
SimpleMessenger
to
OSD worker threads seems to consume a fair amount of time. This
ends
up in a widening gap between subsequent receipts of requests and
the
start
of OSD processing them.
A primary write suffers twice from this problem: first because
the delay happens on the primary OSD and second because the
replicating
OSD also suffers from the same problem - and hence causes
additional
delays
at the primary OSD when it waits for the commit from the
replicating
OSD.
In the attached graphics, the x-axis shows the time (in seconds)
The y-axis shows the offset where a request to write happened.
The red bar represents the SimpleMessenger receive, i.e. from
reading
the message header until enqueuing the completely decoded message
into
the SImpleMessenger dispatch queue.
The green bar represents the time required for local processing,
i.e.
dispatching the the OSD worker, writing to filesystem and journal,
send
out the replication operation to the replicating OSD. It right
end of the green bar is the time when locally everything has
finished
and a commit could happen.
The blue bar represents the time until the replicating OSD has sent
a
commit
back to the primary OSD and the original write request can be
committed to
the client.
The green bar is interrupted by a black bar: the left end
represents
the time when the request has been enqueued on the OSD worker
queue.
The
right end gives the time when the request is taken off the OSD
worker
queue and actual OSD processing starts.
The test was a simple sequential write to a rados block device.
Receiption of the write requests at the OSD is also sequential in
the
graphics: the bar to the bottom of the graphics shows an earlier
write
request.
Note that the dispatching of a later request in all cases relates
to
the
enqueue time at the OSD worker queue of the previous write request:
the
left
end of a black bar relates nicely to the beginning of a green bar
above
it.
--
Andreas Bluemle
mailto:Andreas.Bluemle@xxxxxxxxxxx
ITXperts GmbH http://www.itxperts.de
Balanstrasse 73, Geb. 08 Phone: (+49) 89 89044917
D-81541 Muenchen (Germany) Fax: (+49) 89 89044910
Company details: http://www.itxperts.de/imprint.htm
--
Andreas Bluemle
mailto:Andreas.Bluemle@xxxxxxxxxxx
ITXperts GmbH http://www.itxperts.de
Balanstrasse 73, Geb. 08 Phone: (+49) 89 89044917
D-81541 Muenchen (Germany) Fax: (+49) 89 89044910
Company details: http://www.itxperts.de/imprint.htm
--
To unsubscribe from this list: send the line "unsubscribe ceph-devel"
in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at http://vger.kernel.org/majordomo-info.html
--
Andreas Bluemle mailto:Andreas.Bluemle@xxxxxxxxxxx
ITXperts GmbH http://www.itxperts.de
Balanstrasse 73, Geb. 08 Phone: (+49) 89 89044917
D-81541 Muenchen (Germany) Fax: (+49) 89 89044910
Company details: http://www.itxperts.de/imprint.htm
--
To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at http://vger.kernel.org/majordomo-info.html
--
Andreas Bluemle mailto:Andreas.Bluemle@xxxxxxxxxxx
ITXperts GmbH http://www.itxperts.de
Balanstrasse 73, Geb. 08 Phone: (+49) 89 89044917
D-81541 Muenchen (Germany) Fax: (+49) 89 89044910
Company details: http://www.itxperts.de/imprint.htm
--
Andreas Bluemle mailto:Andreas.Bluemle@xxxxxxxxxxx
ITXperts GmbH http://www.itxperts.de
Balanstrasse 73, Geb. 08 Phone: (+49) 89 89044917
D-81541 Muenchen (Germany) Fax: (+49) 89 89044910
Company details: http://www.itxperts.de/imprint.htm
--
To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at http://vger.kernel.org/majordomo-info.html