On Mon, 29 Sep 2014 09:04:51 +0000 Quenten Grasso wrote:

> Hi Alexandre,
>
> No problem, I hope this saves you some pain.
>
> It's probably worth going for a larger journal, around 20 GB, if you
> wish to play with tuning of "filestore max sync interval"; that could
> have some interesting results. Also, you probably already know this,
> however most of us when starting with Ceph use an xfs file for the
> journal instead of a raw partition; using a raw partition removes the
> file system overhead on the journal.
>
> I highly recommend looking into dedicated journals for your systems, as
> your spinning disks are going to work very hard trying to keep up with
> all the read/write seeking on these disks, particularly if you're going
> to be using them for VMs. Also, you'll get about 1/3 of the write
> performance as a "best case scenario" using journals on the same disk,
> and this comes down to the disks' IOPS.
>
> Depending on your hardware & budget you could look into using one of
> these options for dedicated journals:
>
> Intel DC P3700 400GB PCIe - these are good for about ~1000 MB/s write
> (haven't tested these myself, however we are looking to use these in
> our additional nodes)
> Intel DC S3700 200GB - these are good for about ~360 MB/s write
>
> At the time we used the Intel DC S3700 100GB; these drives don't have
> enough throughput, so I'd recommend you stay away from this particular
> 100GB model.
>
That's of course a very subjective statement. ^o^
I'm using 4 of those with 8 HDDs because it was the best bang for the
proverbial buck for me and they are plenty fast enough in that scenario.
Also, in my world I run out of IOPS way, way before I run out of
bandwidth.

> So if you have spare hard disk slots in your servers the 200GB DC S3700
> is the best bang for buck. Usually I run 6 spinning disks to 1 SSD; in
> an ideal world I'd like to cut this back to 4 instead of 6, though, when
> using the 200GB disks.
>
Precisely. A 6:1 ratio might actually still be sufficient to keep the
HDDs busy, but unless you have a large cluster, losing 6 OSDs because one
SSD misbehaved is painful.

> Both of these SSD options would do nicely and have on-board capacitors
> and very high write/wear ratings as well.
>
Agreed.

Christian

> Cheers,
> Quenten Grasso
>
> -----Original Message-----
> From: Bécholey Alexandre [mailto:alexandre.becholey at nagra.com]
> Sent: Monday, 29 September 2014 4:15 PM
> To: Quenten Grasso; ceph-users at lists.ceph.com
> Cc: Aviolat Romain
> Subject: RE: IO wait spike in VM
>
> Hello Quenten,
>
> Thanks for your reply.
>
> We have a 5GB journal for each OSD on the same disk.
>
> Right now, we are migrating our OSDs to XFS and we'll add a 5th monitor.
> We will perform the benchmarks afterwards.
>
> Cheers,
> Alexandre
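
As a side note on the journal settings discussed above: if you want to see
what an OSD is actually running with, the admin socket is handy. A rough
sketch (the socket path and OSD id assume a default deployment; adjust to
your setup, and treat the interval value as purely illustrative):

# show the journal size and filestore sync interval an OSD currently uses
ceph --admin-daemon /var/run/ceph/ceph-osd.0.asok config show \
    | grep -E 'osd_journal_size|filestore_max_sync_interval'

# change the sync interval at runtime on all OSDs (example value only)
ceph tell osd.\* injectargs '--filestore_max_sync_interval 10'

Anything injected this way is lost on restart, so settings you want to
keep should also go into ceph.conf.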
>
> -----Original Message-----
> From: Quenten Grasso [mailto:qgrasso at onq.com.au]
> Sent: Monday, 29 September 2014 01:56
> To: Bécholey Alexandre; ceph-users at lists.ceph.com
> Cc: Aviolat Romain
> Subject: RE: IO wait spike in VM
>
> G'day Alexandre,
>
> I'm not sure if this is causing your issues, however it could be
> contributing to them.
>
> I noticed you have 4 mons; this could be contributing to your problems,
> as it's recommended, due to the Paxos algorithm which Ceph uses for
> achieving quorum of mons, that you run an odd number of mons: 1, 3, 5,
> 7, etc. Also worth mentioning: running 4 mons would still only tolerate
> the failure of 1 mon without an outage.
>
> Spec-wise the machines look pretty good; the only thing I can see is the
> lack of dedicated journals and the use of btrfs at this stage.
>
> You could try some iperf testing between the machines to make sure the
> networking is working as expected.
>
> If you do rados benches for an extended time, what kind of stats do you
> see?
>
> For example,
>
> Write)
> ceph osd pool create benchmark1 XXXX XXXX
> ceph osd pool set benchmark1 size 3
> rados bench -p benchmark1 180 write --no-cleanup --concurrent-ios=32
>
> * I suggest you create a 2nd benchmark pool and write for another 180
> seconds or so to ensure nothing is cached, then do a read test.
>
> Read)
> rados bench -p benchmark1 180 seq --concurrent-ios=32
>
> You can also try the same using 4k blocks:
>
> rados bench -p benchmark1 180 write -b 4096 --no-cleanup --concurrent-ios=32
> rados bench -p benchmark1 180 seq -b 4096
>
> As you may know, increasing the concurrent IOs will increase cpu/disk
> load.
>
> XXXX = total PGs = (OSDs * 100) / replicas
> i.e. a 50-OSD system with 3 replicas would be around 1600.
>
> Hope this helps a little,
>
> Cheers,
> Quenten Grasso
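
To put a number on that formula for the 16-OSD cluster described below
(assuming a replica count of 3, which the original post doesn't actually
state), a quick back-of-the-envelope calculation:

# total PGs ~= (OSDs * 100) / replicas; 3 replicas is an assumption here
echo $(( 16 * 100 / 3 ))   # prints 533
# most people then round pg_num to a nearby power of two, e.g. 512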
>
> -----Original Message-----
> From: ceph-users [mailto:ceph-users-bounces at lists.ceph.com] On Behalf
> Of Bécholey Alexandre
> Sent: Thursday, 25 September 2014 1:27 AM
> To: ceph-users at lists.ceph.com
> Cc: Aviolat Romain
> Subject: IO wait spike in VM
>
> Dear Ceph guru,
>
> We have a Ceph cluster (version 0.80.5
> 38b73c67d375a2552d8ed67843c8a65c2c0feba6) with 4 MONs and 16 OSDs (4 per
> host) used as a backend storage for libvirt.
>
> Hosts:
> OS: Ubuntu 14.04
> CPU: 2x Xeon X5650
> RAM: 48 GB (no swap)
> No SSD for journals
> HDD: 4x WDC WD2003FYYS-02W0B0 (2 TB, 7200 rpm) dedicated to OSDs (one
> partition for the journal, the rest for the OSD)
> FS: btrfs (I know it's not recommended in the doc and I hope it's not
> the culprit)
> Network: dedicated 10GbE
>
> As we added some VMs to the cluster, we saw some sporadic huge IO wait
> in the VMs. The hosts running the OSDs seem fine. I followed a similar
> discussion here:
> http://lists.ceph.com/pipermail/ceph-users-ceph.com/2014-June/040621.html
>
> Here is an example of a transaction that took some time:
>
> { "description": "osd_op(client.5275.0:262936
>       rbd_data.22e42ae8944a.0000000000000807 [] 3.c9699248
>       ack+ondisk+write e3158)",
>   "received_at": "2014-09-23 15:23:30.820958",
>   "age": "108.329989",
>   "duration": "5.814286",
>   "type_data": [
>     "commit sent; apply or cleanup",
>     { "client": "client.5275",
>       "tid": 262936},
>     [ { "time": "2014-09-23 15:23:30.821097",
>         "event": "waiting_for_osdmap"},
>       { "time": "2014-09-23 15:23:30.821282",
>         "event": "reached_pg"},
>       { "time": "2014-09-23 15:23:30.821384",
>         "event": "started"},
>       { "time": "2014-09-23 15:23:30.821401",
>         "event": "started"},
>       { "time": "2014-09-23 15:23:30.821459",
>         "event": "waiting for subops from 14"},
>       { "time": "2014-09-23 15:23:30.821561",
>         "event": "commit_queued_for_journal_write"},
>       { "time": "2014-09-23 15:23:30.821666",
>         "event": "write_thread_in_journal_buffer"},
>       { "time": "2014-09-23 15:23:30.822591",
>         "event": "op_applied"},
>       { "time": "2014-09-23 15:23:30.824707",
>         "event": "sub_op_applied_rec"},
>       { "time": "2014-09-23 15:23:31.225157",
>         "event": "journaled_completion_queued"},
>       { "time": "2014-09-23 15:23:31.225297",
>         "event": "op_commit"},
>       { "time": "2014-09-23 15:23:36.635085",
>         "event": "sub_op_commit_rec"},
>       { "time": "2014-09-23 15:23:36.635132",
>         "event": "commit_sent"},
>       { "time": "2014-09-23 15:23:36.635244",
>         "event": "done"}]]}
>
> sub_op_commit_rec took about 5 seconds to complete, which makes me think
> that the replication is the bottleneck.
>
> I didn't find any timeout in Ceph's logs. The cluster is healthy. When
> some VMs have high IO wait, the cluster op/s is between 2 and 40. I
> would gladly submit more information if needed.
>
> How can I dig deeper? For example, is there a way to know to which OSD
> the replication is done for that specific transaction?
>
> Cheers,
> Alexandre Bécholey
>
> _______________________________________________
> ceph-users mailing list
> ceph-users at lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>

-- 
Christian Balzer        Network/Systems Engineer
chibi at gol.com       Global OnLine Japan/Fusion Communications
http://www.gol.com/
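
A follow-up note on the closing question in Alexandre's original post: the
op dump above already names the replica, since the "waiting for subops
from 14" event means the sub-op for that write went to osd.14. More
generally, a rough sketch of how to trace this (the pool name is a
placeholder because the thread only shows pool id 3, and the admin socket
path assumes a default deployment):

# show which PG an object maps to and which OSDs (primary first) serve it
ceph osd map <pool-name> rbd_data.22e42ae8944a.0000000000000807

# compare per-OSD journal commit / filestore apply latencies to spot a
# consistently slow disk
ceph osd perf

# then look at the suspect OSD's own record of its recent worst ops
ceph --admin-daemon /var/run/ceph/ceph-osd.14.asok dump_historic_ops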