Re: rados bench output question

Brady Deetz <bdeetz@xxxxxxxxx> · Tue, 6 Sep 2016 20:30:30 -0500

On the topic of understanding the general ebb and flow of a cluster's throughput:

- Is there a way to monitor/observe how full a journal partition becomes before it is flushed?
I've been interested in increasing max sync interval from its default for my 10GB journals, but would like to know more about my journals before I go tuning.

On Sep 6, 2016 8:20 PM, "Christian Balzer" <chibi@xxxxxxx> wrote:

Hello,

On Tue, 6 Sep 2016 13:38:45 +0200 lists wrote:

> Hi Christian,

>

> Thanks for your reply.

>

> > What SSD model (be precise)?

> Samsung 480GB PM863 SSD

>

So that's not your culprit then (they are supposed to handle sync writes

at full speed).

> > Only one SSD?

> Yes. With a 5GB partition based journal for each osd.

>

A bit small, but in normal scenarios that shouldn't be a problem.

Read:

https://www.mail-archive.com/ceph-users@xxxxxxxxxxxxxx/msg28003.html

> >> During the 0 MB/sec, there is NO increased cpu usage: it is usually

> >> around 15 - 20% for the four ceph-osd processes.

> >>

> > Watch your node(s) with atop or iostat.

> Ok, I will do.

>

Best results will be had with 3 large terminals (one per node) running

atop, with the interval set to 5 seconds or less, down from the default of 10 seconds.

The same goes for iostat, with the parameters "-x 2".
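
If you capture the iostat output to a file instead of watching it live, something like the following Python sketch could pick out the saturated devices afterwards. It assumes each report has a header line starting with "Device" that contains a "%util" column (the other columns differ between sysstat versions, so the column is looked up by name). The function name and the 90% threshold are made up for illustration.

```python
def busy_devices(iostat_text, threshold=90.0):
    """Return (device, %util) pairs at or above threshold from `iostat -x` text.

    Assumes each device table starts with a header line beginning with
    "Device" and containing a "%util" column. The threshold is arbitrary.
    """
    busy = []
    util_idx = None
    for line in iostat_text.splitlines():
        fields = line.split()
        if not fields:
            util_idx = None  # a blank line ends the current device table
            continue
        if fields[0].startswith("Device"):
            # Locate the column by name, since its position varies.
            util_idx = fields.index("%util") if "%util" in fields else None
            continue
        if util_idx is not None and len(fields) > util_idx:
            try:
                # Some locales print a decimal comma.
                util = float(fields[util_idx].replace(",", "."))
            except ValueError:
                continue
            if util >= threshold:
                busy.append((fields[0], util))
    return busy
```

Run it over the text captured on each node during the bench; a journal SSD or HDD that shows up in every report around the 0 MB/sec moments is the first thing to look at.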

> >> Do we have an issue..? And if yes: Anyone with a suggestions where to

> >> look at?

> >>

> > You will find that either your journal SSD is overwhelmed and a single

> > SSD peaking around 500MB/s wouldn't be that surprising.

> > Or that your HDDs can't scribble away at more than the speed above, the

> > more likely reason.

> > Even a combination of both.

> >

> > Ceph needs to flush data to the OSDs eventually (and that is usually more

> > or less immediately with default parameters), so for a sustained,

> > sequential write test you're looking at the speed of your HDDs.

> > And that will be spiky of sorts, due to FS journals, seeks for other

> > writes (replicas), etc.

> But would we expect the MB/sec to drop to ZERO, during journal-to-osd

> flushes?

>

A common misconception when people start out with Ceph and probably

something that should be better explained in the docs. Or not, given that

BlueStore is on the shimmering horizon.

Ceph never reads from the journals, unless there has been a crash.

(Now would be a good time to read that link above if you haven't yet)

What happens is that (depending on the various filestore and journal

parameters) Ceph starts flushing the still-in-memory data to the OSD

(disk, FS) after the journal has been written, as I mentioned above.

The logic here is to not create an I/O storm after letting things pile up

for a long time.

People with fast storage subsystems and/or SSDs/NVMes as OSDs tend to tune

these parameters.

So now think about what happens during that rados bench run:

A 4MB object gets written (created, then filled), so the client talks to

the OSD that holds the primary PG for that object.

That OSD writes the data to the journal and sends it to the other OSDs

(replicas).

Once all journals have been written, the primary OSD acks the write to

the client.

And this happens with 16 threads by default, making things nicely busy.
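
To make the sequence above concrete, here is a toy Python model (not Ceph code; the function name, OSD ids and latency numbers are invented for illustration, and network time to the replicas is ignored). What it shows is that the client only gets its ack once the slowest journal in the set has been written, so the write latency is the maximum over the journals involved, not the average.

```python
def client_ack_latency(acting_set, journal_latency_ms):
    """Time until the primary OSD can ack a write to the client.

    acting_set: OSD ids holding the PG, primary first (hypothetical ids).
    journal_latency_ms: maps OSD id -> time that OSD needs to get the
    write into its journal (made-up numbers, network time ignored).
    """
    # The primary acks only after every journal, its own included, is written.
    return max(journal_latency_ms[osd] for osd in acting_set)


# Three healthy journals: the ack comes after 6 ms.
healthy = {0: 5, 1: 6, 2: 5}
print(client_ack_latency([0, 1, 2], healthy))

# Same write, but osd.2 is stuck in IOwait: the client waits 2000 ms.
one_busy = {0: 5, 1: 6, 2: 2000}
print(client_ack_latency([0, 1, 2], one_busy))
```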

Now keeping in mind the above description and the fact that you have a

small cluster, a single OSD that gets too busy will block the whole

cluster basically.

So things dropping to zero means that at least one OSD was so busy (in your

case with IOwait, not CPU) that it couldn't take in more data.
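
A back-of-the-envelope sketch in Python of why one stalled OSD is enough to stop the bench. It assumes 12 OSDs (going by the three nodes with four OSDs each mentioned in this thread), the 16 default bench threads, and a replication size of 3, which is my assumption since the pool size was not stated. It also assumes writes are spread evenly over the OSDs. The function name is made up.

```python
def expected_free_threads(writes_per_thread, threads=16, num_osds=12, replicas=3):
    """Expected number of bench threads not yet stuck behind one stalled OSD.

    Assumes each write lands on `replicas` distinct OSDs spread evenly over
    `num_osds`, so a write touches the stalled OSD with probability
    replicas / num_osds. A thread whose write touches it waits for the ack
    and issues nothing further.
    """
    p_hit = replicas / num_osds
    return threads * (1 - p_hit) ** writes_per_thread


for n in (1, 5, 10):
    print(n, f"{expected_free_threads(n):.1f}")
# prints:
# 1 12.0
# 5 3.8
# 10 0.9
```

Under these assumptions, after about ten writes per thread fewer than one thread on average is still making progress, and the bench reports 0 MB/sec until the stalled OSD catches up.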

The fact that your drops happen at a rather predictable interval of roughly

9 seconds also suggests the possibility that the actual journal

got full, but that's not conclusive.
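
One way to sanity-check the full-journal theory is to work out the net rate (journal writes minus what has been flushed to the OSD disk) that would be needed to fill a journal in that time. The Python sketch below does just that arithmetic for the 5GB journal and the roughly 9 second interval from this thread. It assumes the journal starts out empty and that 5GB means 5 * 10^9 bytes; the function name is made up.

```python
def net_fill_rate_mb_s(journal_bytes, fill_seconds):
    """Net rate in MB/s (journal writes minus flushes to the OSD disk)
    needed to fill an initially empty journal in fill_seconds."""
    return journal_bytes / fill_seconds / 1e6


# 5GB journal (taken as 5 * 10**9 bytes), filled in 9 seconds.
print(round(net_fill_rate_mb_s(5 * 10**9, 9)))  # prints 556
```

Compare that figure with what iostat reports for the journal partitions during the run. If they see far less, a full journal is unlikely to be the whole story.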

Christian

> Thanks for the quick feedback, and I'll dive into atop and iostat next.

>

> Regards,

> MJ

> _______________________________________________

> ceph-users mailing list

> ceph-users@xxxxxxxxxxxxxx

> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com

>

--

Christian Balzer        Network/Systems Engineer

chibi@xxxxxxx           Global OnLine Japan/Rakuten Communications

http://www.gol.com/
