Re: rados bench output question

Hi Christian,

Thanks a lot for all your information!

(especially the bit that Ceph never reads from the journal but writes to the OSD from memory; that was new for me)

MJ

On 09/07/2016 03:20 AM, Christian Balzer wrote:

Hello,

On Tue, 6 Sep 2016 13:38:45 +0200 lists wrote:

Hi Christian,

Thanks for your reply.

What SSD model (be precise)?
Samsung 480GB PM863 SSD

So that's not your culprit then (they are supposed to handle sync writes
at full speed).

Only one SSD?
Yes. With a 5GB partition-based journal for each OSD.

A bit small, but in normal scenarios that shouldn't be a problem.
Read:
https://www.mail-archive.com/ceph-users@xxxxxxxxxxxxxx/msg28003.html

During the 0 MB/sec intervals, there is NO increased CPU usage: it is usually
around 15-20% for the four ceph-osd processes.

Watch your node(s) with atop or iostat.
Ok, I will do.

Best results will be had with 3 large terminals (one per node) running
atop, with the interval set to at least 5 seconds, down from the default of 10.
Same with iostat, using the parameters "-x 2".
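
Concretely, something along these lines should do on each OSD node (this
assumes atop and the sysstat package that provides iostat are installed):

    atop 5          # refresh every 5 seconds instead of the default 10
    iostat -x 2     # extended per-device statistics, refreshed every 2 seconds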

Do we have an issue..? And if yes: anyone with a suggestion where to
look?

You will find that either your journal SSD is overwhelmed (a single
SSD peaking around 500MB/s wouldn't be that surprising), or that your HDDs
can't scribble away at more than the speed above, which is the more likely
reason. It could even be a combination of both.

Ceph needs to flush data to the OSDs eventually (and that is usually more
or less immediately with default parameters), so for a sustained,
sequential write test you're looking at the speed of your HDDs.
And that will be spiky of sorts, due to FS journals, seeks for other
writes (replicas), etc.
But would we expect the MB/sec to drop to ZERO during journal-to-OSD
flushes?

A common misconception when people start out with Ceph, and probably
something that should be better explained in the docs. Or not, given that
Bluestore is on the shimmering horizon.

Ceph never reads from the journals, unless there has been a crash.
(Now would be a good time to read that link above if you haven't yet)

What happens is that (depending on the various filestore and journal
parameters) Ceph starts flushing the still-in-memory data to the OSD
(disk, FS) after the journal has been written, as I mentioned above.

The logic here is to not create an I/O storm after letting things pile up
for a long time.
People with fast storage subsystems and/or SSDs/NVMes as OSDs tend to tune
these parameters.
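
For illustration only, the knobs in question live in ceph.conf under [osd];
the values below are made up purely to show the shape, not recommendations
(check the defaults for your release before touching any of them):

    [osd]
    filestore min sync interval = 1
    filestore max sync interval = 10
    filestore queue max bytes   = 209715200
    journal max write bytes     = 209715200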

So now think about what happens during that rados bench run:
A 4MB object gets written (created, then filled), so the client talks to
the OSD that holds the primary PG for that object.
That OSD writes the data to the journal and sends it to the other OSDs
(replicas).
Once all journals have been written, the primary OSD acks the write to
the client.

And this happens with 16 threads by default, making things nicely busy.
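
For reference, that is roughly what a default run looks like spelled out on
the command line (the pool name here is just an example):

    rados bench -p rbd 60 write -t 16 -b 4194304

i.e. a 60-second write test, where -t 16 threads and -b 4MB objects are the
defaults anyway.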
Now, keeping in mind the above description and the fact that you have a
small cluster, a single OSD that gets too busy will basically block the
whole cluster.

So things dropping to zero means that at least one OSD was so busy (not
CPU in your case, IOwait) that it couldn't take in more data.
The fact that your drops happen at a rather predictable, roughly 9-second
interval also suggests the possibility that the actual journal got full,
but that's not conclusive.
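
As a back-of-the-envelope check of that theory: a 5GB journal partition
being filled at roughly the SSD's sustained write speed would take about

    5000 MB / ~500-550 MB/s  ≈  9-10 seconds

which lines up with the observed interval, but again, that's circumstantial
rather than conclusive.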

Christian

Thanks for the quick feedback, and I'll dive into atop and iostat next.

Regards,
MJ
_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


