Re: rados bench output question

On the topic of understanding the general ebb and flow of a cluster's throughput:
-Is there a way to monitor/observe how full a journal partition becomes before it is flushed?

I've been interested in increasing filestore max sync interval from its default for my 10GB journals, but I would like to know more about how my journals actually behave before I go tuning.
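In case it helps frame the question, this is roughly what I had in mind (a sketch of my own, not something from the docs): polling an OSD's admin socket with "ceph daemon osd.N perf dump" and printing any journal-related counters, e.g. journal_queue_bytes. The counter names and where they sit in the JSON are assumptions on my part and differ between releases, so your own perf dump output is the thing to trust.

#!/usr/bin/env python
# Sketch only: print every numeric perf counter whose name mentions
# "journal" for one OSD, every few seconds, to watch how much data is
# queued up between flushes.  Counter names (e.g. journal_queue_bytes)
# are assumptions; check "ceph daemon osd.N perf dump" on your release.
import json
import subprocess
import time

OSD_ID = 0        # which local OSD to watch
INTERVAL = 2      # seconds between samples

def perf_dump(osd_id):
    out = subprocess.check_output(
        ["ceph", "daemon", "osd.%d" % osd_id, "perf", "dump"])
    return json.loads(out.decode("utf-8"))

def journal_counters(tree, prefix=""):
    # Walk the nested perf-dump JSON, keeping plain values whose key
    # mentions "journal".
    found = {}
    for key, value in tree.items():
        if isinstance(value, dict):
            found.update(journal_counters(value, prefix + key + "."))
        elif "journal" in key:
            found[prefix + key] = value
    return found

while True:
    counters = journal_counters(perf_dump(OSD_ID))
    snapshot = " ".join("%s=%s" % item for item in sorted(counters.items()))
    print("%s %s" % (time.strftime("%H:%M:%S"), snapshot))
    time.sleep(INTERVAL)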


On Sep 6, 2016 8:20 PM, "Christian Balzer" <chibi@xxxxxxx> wrote:

hello,

On Tue, 6 Sep 2016 13:38:45 +0200 lists wrote:

> Hi Christian,
>
> Thanks for your reply.
>
> > What SSD model (be precise)?
> Samsung 480GB PM863 SSD
>
So that's not your culprit then (they are supposed to handle sync writes
at full speed).

> > Only one SSD?
> Yes. With a 5GB partition based journal for each osd.
>
A bit small, but in normal scenarios that shouldn't be a problem.
Read:
https://www.mail-archive.com/ceph-users@xxxxxxxxxxxxxx/msg28003.html

> >> During the 0 MB/sec, there is NO increased cpu usage: it is usually
> >> around 15 - 20% for the four ceph-osd processes.
> >>
> > Watch your node(s) with atop or iostat.
> Ok, I will do.
>
Best results will be had with 3 large terminals (one per node) running
atop, with the interval set to 5 seconds, down from the default 10.
The same goes for iostat, run as "iostat -x 2".
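If staring at the raw iostat scroll gets tedious, something along these lines can flag the saturated device for you (a quick sketch, assuming sysstat's iostat and that %util is the last column of the extended output; verify both on your version):

#!/usr/bin/env python
# Sketch: run "iostat -x 2" and print any device whose %util crosses a
# threshold, to make the busy-OSD moments easier to spot.
import subprocess

THRESHOLD = 90.0  # percent utilisation considered saturated

proc = subprocess.Popen(["iostat", "-x", "2"],
                        stdout=subprocess.PIPE,
                        universal_newlines=True)

in_device_block = False
for line in proc.stdout:
    fields = line.split()
    if not fields:
        in_device_block = False
        continue
    if fields[0] in ("Device:", "Device"):
        in_device_block = True
        continue
    if in_device_block:
        try:
            util = float(fields[-1])
        except ValueError:
            continue
        if util >= THRESHOLD:
            print("%s at %.1f%% util" % (fields[0], util))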

> >> Do we have an issue..? And if yes: Anyone with a suggestions where to
> >> look at?
> >>
> > You will find that either your journal SSD is overwhelmed and a single
> > SSD peaking around 500MB/s wouldn't be that surprising.
> > Or that your HDDs can't scribble away at more than the speed above, the
> > more likely reason.
> > Even a combination of both.
> >
> > Ceph needs to flush data to the OSDs eventually (and that is usually more
> > or less immediately with default parameters), so for a sustained,
> > sequential write test you're looking at the speed of your HDDs.
> > And that will be spiky of sorts, due to FS journals, seeks for other
> > writes (replicas), etc.
> But would we expect the MB/sec to drop to ZERO, during journal-to-osd
> flushes?
>
That's a common misconception when people start out with Ceph, and probably
something that should be better explained in the docs. Or not, given that
Bluestore is on the shimmering horizon.

Ceph never reads from the journals, unless there has been a crash.
(Now would be a good time to read that link above if you haven't yet)

What happens is that (depending on the various filestore and journal
parameters) Ceph starts flushing the still-in-memory data to the OSD
(disk, FS) after the journal has been written, as I mentioned above.

The logic here is to not create an I/O storm after letting things pile up
for a long time.
People with fast storage subsystems and/or SSDs/NVMes as OSDs tend to tune
these parameters.
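For reference, these are the sort of knobs involved; the values below are purely illustrative (not a recommendation for your cluster) and the exact option names and defaults should be checked against the documentation for your release:

[osd]
# how long filestore may let writes accumulate before syncing them
# out to the OSD filesystem
filestore min sync interval = 0.01
filestore max sync interval = 5

# back-pressure limits on the filestore op queue
filestore queue max ops = 50
filestore queue max bytes = 104857600

# how much the journal writes out in one go
journal max write bytes = 10485760
journal max write entries = 100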

So now think about what happens during that rados bench run:
A 4MB object gets written (created, then filled), so the client talks to
the OSD that holds the primary PG for that object.
That OSD writes the data to the journal and sends it to the other OSDs
(replicas).
Once all journals have been written, the primary OSD acks the write to
the client.

And this happens with 16 threads by default, making things nicely busy.
Now, keeping in mind the above description and the fact that you have a
small cluster, a single OSD that gets too busy will basically block the
whole cluster.
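To put rough numbers on it (assuming a size=3 pool and the rados bench defaults, since neither is spelled out in this thread): 16 threads x 4MB means about 64MB in flight from the client at any moment, and every one of those objects has to hit three journals and later be flushed to three filestore HDDs. So each MB from the client becomes roughly 3MB of journal writes plus 3MB of filestore writes, spread over what looks like three journal SSDs and a dozen HDD-backed OSDs, and the slowest participant determines when the client sees its acks.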

So things dropping to zero means that at least one OSD was so busy (not
CPU-bound in your case, but stuck in I/O wait) that it couldn't take in
more data.
The fact that your drops happen at a rather predictable, roughly 9-second
interval also suggests the possibility that the actual journal got full,
but that's not conclusive.
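For what it's worth, the rough numbers would fit: a 5GB journal partition fed at something like 500MB/s fills in about 10 seconds, which lines up suspiciously well with the ~9 second period of your stalls; once the journal is full, new writes have to wait until enough data has been flushed out to the HDDs.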

Christian

> Thanks for the quick feedback, and I'll dive into atop and iostat next.
>
> Regards,
> MJ
> _______________________________________________
> ceph-users mailing list
> ceph-users@xxxxxxxxxxxxxx
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>


--
Christian Balzer        Network/Systems Engineer
chibi@xxxxxxx           Global OnLine Japan/Rakuten Communications
http://www.gol.com/
_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
