Re: RADOS Bench strange behavior

On Wed, Jul 10, 2013 at 12:38 AM, Erwan Velu <erwan@xxxxxxxxxxxx> wrote:
> Hi,
>
> I've just subscribed to the mailing list. I may be breaking the thread as I
> cannot "reply to all" ;o)
>
> I'd like to share my research on understanding this behavior.
>
> A rados put shows the expected behavior while rados bench doesn't, even
> with concurrency set to one.
>
> As a newcomer, I've been reading the code to understand the difference
> between the "put" and "bench" approaches.
>
> The first one is pretty straightforward: we perform the IO via do_put,
> which calls io_ctx.write{full}.
>
> On the other hand, the benchmark uses much more complicated machinery,
> namely aio.
> If I understand properly, that's mostly to be able to increase concurrency.
> After a few calls we reach the write_bench() function, which is the main
> loop of the benchmark
> (https://github.com/ceph/ceph/blob/master/src/common/obj_bencher.cc#L302).
>
> That's mostly where I have trouble understanding how it can work as
> expected; here's why:
>
> From this point,
> https://github.com/ceph/ceph/blob/master/src/common/obj_bencher.cc#L330, we
> prepare as many objects as we have concurrent_ios.
>
> From this point,
> https://github.com/ceph/ceph/blob/master/src/common/obj_bencher.cc#L344, we
> dispatch as many IOs as we have concurrent_ios.
>
> From this point,
> https://github.com/ceph/ceph/blob/master/src/common/obj_bencher.cc#L368, we
> start the main loop, which runs until we reach the limit (time or number of
> objects).
>
> Starting this loop,
> https://github.com/ceph/ceph/blob/master/src/common/obj_bencher.cc#L371, we
> wait for all sent IOs (up to concurrent_ios) to complete. By the way, I
> didn't understand how the end of an IO is detected. AIO supports callbacks,
> signals or polling. Which one is used? I saw that we rely on
> completion_is_done(), which does a return completions[slot]->complete; I
> only found something here but I'm not sure it's the right one:
> https://github.com/ceph/ceph/blob/master/src/tools/rest_bench.cc#L329

This code has all been refactored several times, but from my memory we
have an array of completions matching the array of in-flight objects.
When an object op has been completed, its completion gets marked
complete.
So, once we've spun off the initial async io, we enter a loop. The
first thing we do in that loop is look through the list of completions
for one that's marked complete. Then:
> Then we reach
> https://github.com/ceph/ceph/blob/master/src/common/obj_bencher.cc#L389.
> That's where I'm confused, as from my understanding we are rescheduling _a
> single IO_ and going back to the waiting loop. So I don't really get how the
> concurrency is maintained.

So here we've found that *one* of the IOs (not all of them) has
completed, and we're spinning up a replacement IO for that one IO that
finished. If more IOs have finished while we were setting that one up
then we'll notice that very quickly in the while(1) loop.
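
For illustration, a stripped-down sketch of that pattern using the librados
C++ aio API (this is not the actual obj_bencher code; object_name(),
next_object() and finished() are placeholders) would look roughly like:

  // Launch concurrent_ios writes, one completion per in-flight object.
  std::vector<librados::AioCompletion*> completions(concurrent_ios);
  for (int slot = 0; slot < concurrent_ios; ++slot) {
    completions[slot] = librados::Rados::aio_create_completion();
    io_ctx.aio_write(object_name(slot), completions[slot], bl, bl.length(), 0);
  }

  while (!finished()) {
    // Scan the completion array for *one* finished IO.
    int slot = -1;
    for (int i = 0; i < concurrent_ios; ++i) {
      if (completions[i]->is_complete()) { slot = i; break; }
    }
    if (slot < 0)
      continue;                      // nothing done yet, keep polling
    completions[slot]->release();

    // Immediately refill that slot so concurrent_ios writes stay in flight.
    completions[slot] = librados::Rados::aio_create_completion();
    io_ctx.aio_write(object_name(next_object()), completions[slot],
                     bl, bl.length(), 0);
  }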

> To be more direct about my thoughts, I think that somewhere the aio
> machinery acks the IO too soon, so we end up sending a new IO while the
> previous one hasn't completed. That would explain the kind of behavior we
> see with Sébastien.

Yeah, the IO is acked here once it's in the journal, but after that
it still needs to get into the backing store. (Actually IIRC the ack
it's looking for is the in-memory one, but with xfs the journal is
write-ahead and it handles the ordering for you.) However, this
shouldn't really result in any different behavior than what you'd see
with a bunch of looped "rados put" commands (especially on XFS).
My guess, Sébastien, is that the BBU/RAID card is just reordering
things in ways you weren't expecting but that don't show up with the
"rados put" because it's running at about half the speed of the rados
bench.
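
For context, the librados C++ aio completions of that era expose both stages,
roughly like this (a minimal sketch, assuming an IoCtx io_ctx and a bufferlist
bl are already set up; check the API of your Ceph version):

  librados::AioCompletion *c = librados::Rados::aio_create_completion();
  io_ctx.aio_write("myobject", c, bl, bl.length(), 0);
  c->wait_for_complete();   // returns on the "ack": replicas have it in memory/journal
  c->wait_for_safe();       // returns only once it's committed to the backing store
  c->release();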

> As a side note, I saw that ceph_clock_now uses gettimeofday, which is not
> resilient to system date changes (like when an ntp update occurs).
> clock_gettime with CLOCK_MONOTONIC is clearly preferred for this kind of
> "time difference" computation.

Hmm, that's probably correct. Getting that kind of flag through the
current interfaces sounds a little annoying, though. :/
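
For reference, the monotonic-clock pattern Erwan is describing is roughly the
following (a minimal sketch, not Ceph code):

  #include <time.h>

  // Duration measurement that is immune to ntp/date jumps.
  static double elapsed_seconds(const struct timespec &start) {
    struct timespec now;
    clock_gettime(CLOCK_MONOTONIC, &now);
    return (now.tv_sec - start.tv_sec) + (now.tv_nsec - start.tv_nsec) / 1e9;
  }
  // Usage: clock_gettime(CLOCK_MONOTONIC, &start) before the benchmark loop,
  // then elapsed_seconds(start) wherever a time difference is needed.
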
-Greg
Software Engineer #42 @ http://inktank.com | http://ceph.com
_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com




