Hello,

On Tue, 6 Sep 2016 20:30:30 -0500 Brady Deetz wrote:

> On the topic of understanding the general ebb and flow of a cluster's
> throughput:
> - Is there a way to monitor/observe how full a journal partition becomes
> before it is flushed?
>
> I've been interested in increasing the max sync interval from its default
> for my 10GB journals, but would like to know more about my journals before
> I go tuning.
>

Of course, the obvious place, the Ceph perf counters, a la:

ceph --admin-daemon /var/run/ceph/ceph-osd.nn.asok perf dump

For all your OSDs, of course.

I tend to graph filestore_journal_bytes with graphite, which is where the
numbers in the mail I referred to came from.
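For a quick look per node, something along these lines works (a sketch only:
it assumes jq is installed and that the counter sits at
.filestore.journal_bytes, as in Hammer/Jewel-era perf dumps; check your own
"perf dump" output for the exact key names):

#!/bin/sh
# Sketch: print the filestore journal counter for every OSD that has an
# admin socket on this node. The key name is an assumption (Hammer/Jewel
# era); adjust it to whatever your "perf dump" actually shows.
for sock in /var/run/ceph/ceph-osd.*.asok; do
    osd=$(basename "$sock" .asok)
    bytes=$(ceph --admin-daemon "$sock" perf dump | jq '.filestore.journal_bytes')
    echo "$osd journal_bytes=$bytes"
done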
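And since the question above was about raising the max sync interval, these
are the knobs involved, shown here purely for illustration with the
documented Jewel-era defaults; I'd watch the journal counters for a while
before touching them:

[osd]
# Defaults (Jewel-era docs). Raising the max interval lets more data pile up
# between filestore syncs, so check the journal counters first.
filestore min sync interval = .01
filestore max sync interval = 5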
> On Sep 6, 2016 8:20 PM, "Christian Balzer" <chibi@xxxxxxx> wrote:
>
> >
> > Hello,
> >
> > On Tue, 6 Sep 2016 13:38:45 +0200 lists wrote:
> >
> > > Hi Christian,
> > >
> > > Thanks for your reply.
> > >
> > > > What SSD model (be precise)?
> > > Samsung 480GB PM863 SSD
> > >
> > So that's not your culprit then (they are supposed to handle sync writes
> > at full speed).
> >
> > > > Only one SSD?
> > > Yes. With a 5GB partition-based journal for each OSD.
> > >
> > A bit small, but in normal scenarios that shouldn't be a problem.
> > Read:
> > https://www.mail-archive.com/ceph-users@xxxxxxxxxxxxxx/msg28003.html
> >
> > > >> During the 0 MB/sec, there is NO increased cpu usage: it is usually
> > > >> around 15 - 20% for the four ceph-osd processes.
> > > >>
> > > > Watch your node(s) with atop or iostat.
> > > Ok, I will do.
> > >
> > Best results will be had with 3 large terminals (one per node) running
> > atop, interval set to at least 5, down from the default 10 seconds.
> > Same diff with iostat, parameters "-x 2".
> >
> > > >> Do we have an issue..? And if yes: anyone with a suggestion where to
> > > >> look?
> > > >>
> > > > You will find that either your journal SSD is overwhelmed (and a single
> > > > SSD peaking around 500MB/s wouldn't be that surprising), or that your
> > > > HDDs can't scribble away at more than the speed above, the more likely
> > > > reason. It could even be a combination of both.
> > > >
> > > > Ceph needs to flush data to the OSDs eventually (and that is usually
> > > > more or less immediately with default parameters), so for a sustained,
> > > > sequential write test you're looking at the speed of your HDDs.
> > > > And that will be spiky of sorts, due to FS journals, seeks for other
> > > > writes (replicas), etc.
> > > But would we expect the MB/sec to drop to ZERO during journal-to-OSD
> > > flushes?
> > >
> > That's a common misconception when people start up with Ceph, and probably
> > something that should be better explained in the docs. Or not, given that
> > Bluestore is on the shimmering horizon.
> >
> > Ceph never reads from the journals, unless there has been a crash.
> > (Now would be a good time to read the link above if you haven't yet.)
> >
> > What happens is that (depending on the various filestore and journal
> > parameters) Ceph starts flushing the still-in-memory data to the OSD
> > (disk, FS) after the journal has been written, as I mentioned above.
> >
> > The logic here is to not create an I/O storm after letting things pile up
> > for a long time.
> > People with fast storage subsystems and/or SSDs/NVMes as OSDs tend to tune
> > these parameters.
> >
> > So now think about what happens during that rados bench run:
> > A 4MB object gets written (created, then filled), so the client talks to
> > the OSD that holds the primary PG for that object.
> > That OSD writes the data to the journal and sends it to the other OSDs
> > (replicas).
> > Once all journals have been written, the primary OSD acks the write to
> > the client.
> >
> > And this happens with 16 threads by default, making things nicely busy.
> > Now, keeping in mind the above description and the fact that you have a
> > small cluster, a single OSD that gets too busy will basically block the
> > whole cluster.
> >
> > So things dropping to zero means that at least one OSD was so busy (not
> > CPU in your case, but IOwait) that it couldn't take in more data.
> > The fact that your drops happen at a rather predictable, roughly 9-second
> > interval also suggests the possibility that the actual journal got full,
> > but that's not conclusive.
> >
> > Christian
> >
> > > Thanks for the quick feedback, and I'll dive into atop and iostat next.
> > >
> > > Regards,
> > > MJ
> > > _______________________________________________
> > > ceph-users mailing list
> > > ceph-users@xxxxxxxxxxxxxx
> > > http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
> > >
> >
> >
> > --
> > Christian Balzer        Network/Systems Engineer
> > chibi@xxxxxxx           Global OnLine Japan/Rakuten Communications
> > http://www.gol.com/
> > _______________________________________________
> > ceph-users mailing list
> > ceph-users@xxxxxxxxxxxxxx
> > http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
> >

--
Christian Balzer        Network/Systems Engineer
chibi@xxxxxxx           Global OnLine Japan/Rakuten Communications
http://www.gol.com/
_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com