Hello,

On Tue, 6 Sep 2016 13:38:45 +0200 lists wrote:

> Hi Christian,
>
> Thanks for your reply.
>
> > What SSD model (be precise)?
> Samsung 480GB PM863 SSD
>
So that's not your culprit then (they are supposed to handle sync
writes at full speed).

> > Only one SSD?
> Yes. With a 5GB partition-based journal for each OSD.
>
A bit small, but in normal scenarios that shouldn't be a problem.
Read:
https://www.mail-archive.com/ceph-users@xxxxxxxxxxxxxx/msg28003.html

> >> During the 0 MB/sec, there is NO increased CPU usage: it is
> >> usually around 15-20% for the four ceph-osd processes.
> >>
> > Watch your node(s) with atop or iostat.
> Ok, I will do.
>
Best results will be had with 3 large terminals (one per node) running
atop, with the interval set to at least 5 seconds, down from the
default 10. Same with iostat, parameters "-x 2". (Exact invocations at
the end of this mail.)

> >> Do we have an issue..? And if yes: anyone with a suggestion where
> >> to look?
> >>
> > You will find that either your journal SSD is overwhelmed (and a
> > single SSD peaking around 500MB/s wouldn't be that surprising), or
> > that your HDDs can't scribble away at more than the speed above,
> > the more likely reason. Even a combination of both.
> >
> > Ceph needs to flush data to the OSDs eventually (and that is
> > usually more or less immediately with default parameters), so for a
> > sustained, sequential write test you're looking at the speed of
> > your HDDs. And that will be spiky of sorts, due to FS journals,
> > seeks for other writes (replicas), etc.
> But would we expect the MB/sec to drop to ZERO during journal-to-OSD
> flushes?
>
This is a common misconception when people start out with Ceph, and
probably something that should be better explained in the docs. Or
not, given that BlueStore is on the shimmering horizon.

Ceph never reads from the journals, unless there has been a crash.
(Now would be a good time to read that link above, if you haven't
yet.)

What happens is that (depending on the various filestore and journal
parameters) Ceph starts flushing the still-in-memory data to the OSD
(disk, FS) after the journal has been written, as I mentioned above.
The logic here is to not create an I/O storm after letting things pile
up for a long time. People with fast storage subsystems and/or
SSDs/NVMes as OSDs tend to tune these parameters (more on the usual
suspects below).

So now think about what happens during that rados bench run (exact
invocation below): a 4MB object gets written (created, then filled),
so the client talks to the OSD that holds the primary PG for that
object. That OSD writes the data to its journal and sends it to the
other OSDs (replicas). Once all journals have been written, the
primary OSD acks the write to the client. And this happens with 16
threads by default, making things nicely busy.

Now, keeping in mind the description above and the fact that you have
a small cluster, a single OSD that gets too busy will basically block
the whole cluster. So things dropping to zero means that at least one
OSD was so busy (not CPU in your case, but IOwait) that it couldn't
take in more data.

The fact that your drops happen at a rather predictable, roughly
9-second interval also suggests the possibility that the actual
journal got full, but that's not conclusive.
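As a back-of-the-envelope check (my numbers, not measurements from
your gear): with a 5GB journal partition and the SSD absorbing writes
at the 500-550MB/s these drives peak at, the journal fills in

  5120 MB / ~550 MB/s  =  roughly 9-10 seconds

if the HDDs behind it can't drain it anywhere near that fast. That
lines up suspiciously well with your 9-second interval, which is why
I wouldn't rule it out just yet.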
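For the record, the run I'm dissecting above is the stock rados bench
write test. Something like the following, with the defaults spelled
out (substitute your own pool name):

  # 60-second write test; 16 concurrent ops on 4MB objects (defaults)
  rados bench -p <yourpool> 60 write -t 16 -b 4194304

Varying -t is a cheap way to see at which concurrency your cluster
starts to stall.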
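As for the flush knobs people tune, the usual suspects are the
filestore sync intervals. Names and defaults below are from memory
(Jewel era), so verify them against your release before touching
anything:

  # see what your OSDs are actually running with
  ceph daemon osd.0 config show | grep -E 'filestore_(min|max)_sync_interval'

The defaults (0.01 and 5 seconds, if memory serves) mean dirty data
starts hitting the HDDs almost immediately, which is the "more or
less immediately" I mentioned above.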
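Lastly, the exact invocations I had in mind for the watching part, so
we end up comparing the same numbers:

  # one per node, 5-second samples
  atop 5

  # extended per-device stats every 2 seconds; keep an eye on %util
  # and await for the journal SSD and the OSD HDDs
  iostat -x 2

If one device sits at 100% util while the bench reports 0MB/s, you
have your culprit.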
Christian

> Thanks for the quick feedback, and I'll dive into atop and iostat
> next.
>
> Regards,
> MJ

-- 
Christian Balzer        Network/Systems Engineer
chibi@xxxxxxx           Global OnLine Japan/Rakuten Communications
http://www.gol.com/
_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com