Hello,

On Tue, 6 Sep 2016 20:30:30 -0500 Brady Deetz wrote:

> On the topic of understanding the general ebb and flow of a cluster's
> throughput:
> - Is there a way to monitor/observe how full a journal partition becomes
> before it is flushed?
>
> I've been interested in increasing the max sync interval from its default
> for my 10GB journals, but would like to know more about my journals before
> I go tuning.
>

Of course, the obvious place, the Ceph perf counters, a la:

ceph --admin-daemon /var/run/ceph/ceph-osd.nn.asok perf dump

For all your OSDs, of course.

I tend to graph filestore_journal_bytes with graphite, which is where the
numbers in the mail I referred to came from.
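For a quick look per node, something along these lines works (a sketch only:
it assumes jq is installed and that the counter sits at
.filestore.journal_bytes, as in Hammer/Jewel-era perf dumps; check your own
"perf dump" output for the exact key names):

#!/bin/sh
# Sketch: print the filestore journal counter for every OSD that has an
# admin socket on this node. The key name is an assumption (Hammer/Jewel
# era); adjust it to whatever your "perf dump" actually shows.
for sock in /var/run/ceph/ceph-osd.*.asok; do
    osd=$(basename "$sock" .asok)
    bytes=$(ceph --admin-daemon "$sock" perf dump | jq '.filestore.journal_bytes')
    echo "$osd journal_bytes=$bytes"
done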
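And since the question above was about raising the max sync interval, these
are the knobs involved, shown here purely for illustration with the
documented Jewel-era defaults; I'd watch the journal counters for a while
before touching them:

[osd]
# Defaults (Jewel-era docs). Raising the max interval lets more data pile up
# between filestore syncs, so check the journal counters first.
filestore min sync interval = .01
filestore max sync interval = 5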
> On Sep 6, 2016 8:20 PM, "Christian Balzer" <chibi@xxxxxxx> wrote:
>
> >
> > Hello,
> >
> > On Tue, 6 Sep 2016 13:38:45 +0200 lists wrote:
> >
> > > Hi Christian,
> > >
> > > Thanks for your reply.
> > >
> > > > What SSD model (be precise)?
> > > Samsung 480GB PM863 SSD
> > >
> > So that's not your culprit then (they are supposed to handle sync writes
> > at full speed).
> >
> > > > Only one SSD?
> > > Yes. With a 5GB partition-based journal for each OSD.
> > >
> > A bit small, but in normal scenarios that shouldn't be a problem.
> > Read:
> > https://www.mail-archive.com/ceph-users@xxxxxxxxxxxxxx/msg28003.html
> >
> > > >> During the 0 MB/sec, there is NO increased cpu usage: it is usually
> > > >> around 15 - 20% for the four ceph-osd processes.
> > > >>
> > > > Watch your node(s) with atop or iostat.
> > > Ok, I will do.
> > >
> > Best results will be had with 3 large terminals (one per node) running
> > atop, interval set to at least 5, down from the default 10 seconds.
> > Same diff with iostat, parameters "-x 2".
> >
> > > >> Do we have an issue..? And if yes: anyone with a suggestion where to
> > > >> look?
> > > >>
> > > > You will find that either your journal SSD is overwhelmed (and a single
> > > > SSD peaking around 500MB/s wouldn't be that surprising), or that your
> > > > HDDs can't scribble away at more than the speed above, the more likely
> > > > reason. It could even be a combination of both.
> > > >
> > > > Ceph needs to flush data to the OSDs eventually (and that is usually
> > > > more or less immediately with default parameters), so for a sustained,
> > > > sequential write test you're looking at the speed of your HDDs.
> > > > And that will be spiky of sorts, due to FS journals, seeks for other
> > > > writes (replicas), etc.
> > > But would we expect the MB/sec to drop to ZERO during journal-to-OSD
> > > flushes?
> > >
> > That's a common misconception when people start up with Ceph, and probably
> > something that should be better explained in the docs. Or not, given that
> > Bluestore is on the shimmering horizon.
> >
> > Ceph never reads from the journals, unless there has been a crash.
> > (Now would be a good time to read the link above if you haven't yet.)
> >
> > What happens is that (depending on the various filestore and journal
> > parameters) Ceph starts flushing the still-in-memory data to the OSD
> > (disk, FS) after the journal has been written, as I mentioned above.
> >
> > The logic here is to not create an I/O storm after letting things pile up
> > for a long time.
> > People with fast storage subsystems and/or SSDs/NVMes as OSDs tend to tune
> > these parameters.
> >
> > So now think about what happens during that rados bench run:
> > A 4MB object gets written (created, then filled), so the client talks to
> > the OSD that holds the primary PG for that object.
> > That OSD writes the data to the journal and sends it to the other OSDs
> > (replicas).
> > Once all journals have been written, the primary OSD acks the write to
> > the client.
> >
> > And this happens with 16 threads by default, making things nicely busy.
> > Now, keeping in mind the above description and the fact that you have a
> > small cluster, a single OSD that gets too busy will basically block the
> > whole cluster.
> >
> > So things dropping to zero means that at least one OSD was so busy (not
> > CPU in your case, but IOwait) that it couldn't take in more data.
> > The fact that your drops happen at a rather predictable, roughly 9-second
> > interval also suggests the possibility that the actual journal got full,
> > but that's not conclusive.
> >
> > Christian
> >
> > > Thanks for the quick feedback, and I'll dive into atop and iostat next.
> > >
> > > Regards,
> > > MJ
> > > _______________________________________________
> > > ceph-users mailing list
> > > ceph-users@xxxxxxxxxxxxxx
> > > http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
> > >
> >
> >
> > --
> > Christian Balzer        Network/Systems Engineer
> > chibi@xxxxxxx           Global OnLine Japan/Rakuten Communications
> > http://www.gol.com/
> > _______________________________________________
> > ceph-users mailing list
> > ceph-users@xxxxxxxxxxxxxx
> > http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
> >

--
Christian Balzer        Network/Systems Engineer
chibi@xxxxxxx           Global OnLine Japan/Rakuten Communications
http://www.gol.com/
_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com