On Mon, 29 Sep 2014 09:04:51 +0000 Quenten Grasso wrote:

> Hi Alexandre,
>
> No problem, I hope this saves you some pain.
>
> It's probably worth going for a larger journal, around 20 GB, if you
> wish to play with tuning of "filestore max sync interval"; that could
> have some interesting results. Also, you probably already know this,
> however most of us when starting with Ceph use an xfs file for the
> journal instead of a raw partition; using a raw partition removes the
> file system overhead on the journal.
>
> I highly recommend looking into dedicated journals for your systems, as
> your spinning disks are going to work very hard trying to keep up with
> all the read/write seeking on these disks, particularly if you're going
> to be using them for VMs. Also, you'll get about 1/3 of the write
> performance as a "best case scenario" using journals on the same disk,
> and this comes down to the disks' IOPS.
>
> Depending on your hardware & budget you could look into using one of
> these options for dedicated journals:
>
> Intel DC P3700 400GB PCIe - these are good for about ~1000 MB/s write
> (haven't tested these myself, however we are looking to use these in
> our additional nodes)
> Intel DC S3700 200GB - these are good for about ~360 MB/s write
>
> At the time we used the Intel DC S3700 100GB; these drives don't have
> enough throughput, so I'd recommend you stay away from this particular
> 100GB model.
>
That's of course a very subjective statement. ^o^
I'm using 4 of those with 8 HDDs because it was the best bang for the
proverbial buck for me and they are plenty fast enough in that scenario.
Also, in my world I run out of IOPS way, way before I run out of
bandwidth.

> So if you have spare hard disk slots in your servers the 200GB DC S3700
> is the best bang for buck. Usually I run 6 spinning disks to 1 SSD; in
> an ideal world I'd like to cut this back to 4 instead of 6, though, when
> using the 200GB disks.
>
Precisely. A 6:1 ratio might actually still be sufficient to keep the
HDDs busy, but unless you have a large cluster, losing 6 OSDs because one
SSD misbehaved is painful.

> Both of these SSD options would do nicely and have on-board capacitors
> and very high write/wear ratings as well.
>
Agreed.

Christian

> Cheers,
> Quenten Grasso
>
> -----Original Message-----
> From: Bécholey Alexandre [mailto:alexandre.becholey at nagra.com]
> Sent: Monday, 29 September 2014 4:15 PM
> To: Quenten Grasso; ceph-users at lists.ceph.com
> Cc: Aviolat Romain
> Subject: RE: IO wait spike in VM
>
> Hello Quenten,
>
> Thanks for your reply.
>
> We have a 5GB journal for each OSD on the same disk.
>
> Right now, we are migrating our OSDs to XFS and we'll add a 5th monitor.
> We will perform the benchmarks afterwards.
>
> Cheers,
> Alexandre
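
As a side note on the journal settings discussed above: if you want to see
what an OSD is actually running with, the admin socket is handy. A rough
sketch (the socket path and OSD id assume a default deployment; adjust to
your setup, and treat the interval value as purely illustrative):

# show the journal size and filestore sync interval an OSD currently uses
ceph --admin-daemon /var/run/ceph/ceph-osd.0.asok config show \
    | grep -E 'osd_journal_size|filestore_max_sync_interval'

# change the sync interval at runtime on all OSDs (example value only)
ceph tell osd.\* injectargs '--filestore_max_sync_interval 10'

Anything injected this way is lost on restart, so settings you want to
keep should also go into ceph.conf.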
>
> -----Original Message-----
> From: Quenten Grasso [mailto:qgrasso at onq.com.au]
> Sent: Monday, 29 September 2014 01:56
> To: Bécholey Alexandre; ceph-users at lists.ceph.com
> Cc: Aviolat Romain
> Subject: RE: IO wait spike in VM
>
> G'day Alexandre,
>
> I'm not sure if this is causing your issues, however it could be
> contributing to them.
>
> I noticed you have 4 mons; this could be contributing to your problems,
> as it's recommended, due to the Paxos algorithm which Ceph uses for
> achieving quorum of mons, that you run an odd number of mons: 1, 3, 5,
> 7, etc. Also worth mentioning: running 4 mons would still only tolerate
> the failure of 1 mon without an outage.
>
> Spec-wise the machines look pretty good; the only thing I can see is the
> lack of dedicated journals and the use of btrfs at this stage.
>
> You could try some iperf testing between the machines to make sure the
> networking is working as expected.
>
> If you do rados benches for an extended time, what kind of stats do you
> see?
>
> For example,
>
> Write)
> ceph osd pool create benchmark1 XXXX XXXX
> ceph osd pool set benchmark1 size 3
> rados bench -p benchmark1 180 write --no-cleanup --concurrent-ios=32
>
> * I suggest you create a 2nd benchmark pool and write for another 180
> seconds or so to ensure nothing is cached, then do a read test.
>
> Read)
> rados bench -p benchmark1 180 seq --concurrent-ios=32
>
> You can also try the same using 4k blocks:
>
> rados bench -p benchmark1 180 write -b 4096 --no-cleanup --concurrent-ios=32
> rados bench -p benchmark1 180 seq -b 4096
>
> As you may know, increasing the concurrent IOs will increase cpu/disk
> load.
>
> XXXX = total PGs = (OSDs * 100) / replicas
> i.e. a 50-OSD system with 3 replicas would be around 1600.
>
> Hope this helps a little,
>
> Cheers,
> Quenten Grasso
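
To put a number on that formula for the 16-OSD cluster described below
(assuming a replica count of 3, which the original post doesn't actually
state), a quick back-of-the-envelope calculation:

# total PGs ~= (OSDs * 100) / replicas; 3 replicas is an assumption here
echo $(( 16 * 100 / 3 ))   # prints 533
# most people then round pg_num to a nearby power of two, e.g. 512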
>
> -----Original Message-----
> From: ceph-users [mailto:ceph-users-bounces at lists.ceph.com] On Behalf
> Of Bécholey Alexandre
> Sent: Thursday, 25 September 2014 1:27 AM
> To: ceph-users at lists.ceph.com
> Cc: Aviolat Romain
> Subject: IO wait spike in VM
>
> Dear Ceph guru,
>
> We have a Ceph cluster (version 0.80.5
> 38b73c67d375a2552d8ed67843c8a65c2c0feba6) with 4 MONs and 16 OSDs (4 per
> host) used as a backend storage for libvirt.
>
> Hosts:
> OS: Ubuntu 14.04
> CPU: 2x Xeon X5650
> RAM: 48 GB (no swap)
> No SSD for journals
> HDD: 4x WDC WD2003FYYS-02W0B0 (2 TB, 7200 rpm) dedicated to OSDs (one
> partition for the journal, the rest for the OSD)
> FS: btrfs (I know it's not recommended in the doc and I hope it's not
> the culprit)
> Network: dedicated 10GbE
>
> As we added some VMs to the cluster, we saw some sporadic huge IO wait
> in the VMs. The hosts running the OSDs seem fine. I followed a similar
> discussion here:
> http://lists.ceph.com/pipermail/ceph-users-ceph.com/2014-June/040621.html
>
> Here is an example of a transaction that took some time:
>
> { "description": "osd_op(client.5275.0:262936
>       rbd_data.22e42ae8944a.0000000000000807 [] 3.c9699248
>       ack+ondisk+write e3158)",
>   "received_at": "2014-09-23 15:23:30.820958",
>   "age": "108.329989",
>   "duration": "5.814286",
>   "type_data": [
>     "commit sent; apply or cleanup",
>     { "client": "client.5275",
>       "tid": 262936},
>     [ { "time": "2014-09-23 15:23:30.821097",
>         "event": "waiting_for_osdmap"},
>       { "time": "2014-09-23 15:23:30.821282",
>         "event": "reached_pg"},
>       { "time": "2014-09-23 15:23:30.821384",
>         "event": "started"},
>       { "time": "2014-09-23 15:23:30.821401",
>         "event": "started"},
>       { "time": "2014-09-23 15:23:30.821459",
>         "event": "waiting for subops from 14"},
>       { "time": "2014-09-23 15:23:30.821561",
>         "event": "commit_queued_for_journal_write"},
>       { "time": "2014-09-23 15:23:30.821666",
>         "event": "write_thread_in_journal_buffer"},
>       { "time": "2014-09-23 15:23:30.822591",
>         "event": "op_applied"},
>       { "time": "2014-09-23 15:23:30.824707",
>         "event": "sub_op_applied_rec"},
>       { "time": "2014-09-23 15:23:31.225157",
>         "event": "journaled_completion_queued"},
>       { "time": "2014-09-23 15:23:31.225297",
>         "event": "op_commit"},
>       { "time": "2014-09-23 15:23:36.635085",
>         "event": "sub_op_commit_rec"},
>       { "time": "2014-09-23 15:23:36.635132",
>         "event": "commit_sent"},
>       { "time": "2014-09-23 15:23:36.635244",
>         "event": "done"}]]}
>
> sub_op_commit_rec took about 5 seconds to complete, which makes me think
> that the replication is the bottleneck.
>
> I didn't find any timeout in Ceph's logs. The cluster is healthy. When
> some VMs have high IO wait, the cluster op/s is between 2 and 40. I
> would gladly submit more information if needed.
>
> How can I dig deeper? For example, is there a way to know to which OSD
> the replication is done for that specific transaction?
>
> Cheers,
> Alexandre Bécholey
>
> _______________________________________________
> ceph-users mailing list
> ceph-users at lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>

-- 
Christian Balzer        Network/Systems Engineer
chibi at gol.com       Global OnLine Japan/Fusion Communications
http://www.gol.com/
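
A follow-up note on the closing question in Alexandre's original post: the
op dump above already names the replica, since the "waiting for subops
from 14" event means the sub-op for that write went to osd.14. More
generally, a rough sketch of how to trace this (the pool name is a
placeholder because the thread only shows pool id 3, and the admin socket
path assumes a default deployment):

# show which PG an object maps to and which OSDs (primary first) serve it
ceph osd map <pool-name> rbd_data.22e42ae8944a.0000000000000807

# compare per-OSD journal commit / filestore apply latencies to spot a
# consistently slow disk
ceph osd perf

# then look at the suspect OSD's own record of its recent worst ops
ceph --admin-daemon /var/run/ceph/ceph-osd.14.asok dump_historic_ops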