G'day Alexandre,

I'm not sure if this is causing your issues, but it could be contributing to them. I noticed you have 4 mons. Because of the Paxos algorithm Ceph uses to achieve mon quorum, it's recommended to run an odd number of mons (1, 3, 5, 7, etc.), so this could be contributing to your problems. It's also worth mentioning that running 4 mons would still only tolerate the failure of 1 mon without an outage.

Spec-wise the machines look pretty good; the only things I can see are the lack of journals and the use of btrfs at this stage.

You could try some iperf testing between the machines to make sure the networking is working as expected.

If you run rados benches for an extended time, what kind of stats do you see? For example:

Write)
ceph osd pool create benchmark1 XXXX XXXX
ceph osd pool set benchmark1 size 3
rados bench -p benchmark1 180 write --no-cleanup --concurrent-ios=32

* I suggest you create a 2nd benchmark pool and write to it for another 180 seconds or so to ensure nothing is cached, then do a read test.

Read)
rados bench -p benchmark1 180 seq --concurrent-ios=32

You can also try the same using 4k blocks:

rados bench -p benchmark1 180 write -b 4096 --no-cleanup --concurrent-ios=32
rados bench -p benchmark1 180 seq -b 4096

As you may know, increasing the concurrent IOs will increase CPU/disk load.

XXXX = total PGs = (OSDs * 100) / replicas
i.e. a 50-OSD system with 3 replicas would be around 1600.

Hope this helps a little,

Cheers,
Quenten Grasso

-----Original Message-----
From: ceph-users [mailto:ceph-users-bounces@xxxxxxxxxxxxxx] On Behalf Of B?choley Alexandre
Sent: Thursday, 25 September 2014 1:27 AM
To: ceph-users at lists.ceph.com
Cc: Aviolat Romain
Subject: IO wait spike in VM

Dear Ceph gurus,

We have a Ceph cluster (version 0.80.5 38b73c67d375a2552d8ed67843c8a65c2c0feba6) with 4 MONs and 16 OSDs (4 per host), used as backend storage for libvirt.
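As a side note, the "total PGs = OSDs * 100 / replicas" rule of thumb quoted in the reply above can be sketched as a quick calculation. This is only an illustration of that formula; the 50-OSD/3-replica numbers are the example from the email, and applying it to a 16-OSD cluster is my own extrapolation:

```python
def total_pgs(num_osds, replicas):
    # Rule of thumb from the thread: total PGs ~= (OSDs * 100) / replicas.
    # Integer division used for simplicity; in practice the result is
    # usually rounded to a convenient nearby value.
    return num_osds * 100 // replicas

print(total_pgs(50, 3))  # the email rounds this to "around 1600"
print(total_pgs(16, 3))  # a 16-OSD cluster under the same rule
```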
Hosts: Ubuntu 14.04
CPU: 2x Xeon X5650
RAM: 48 GB (no swap)
No SSD for journals
HDD: 4x WDC WD2003FYYS-02W0B0 (2 TB, 7200 rpm) dedicated to the OSDs (one partition for the journal, the rest for the OSD)
FS: btrfs (I know it's not recommended in the doc and I hope it's not the culprit)
Network: dedicated 10GbE

As we added some VMs to the cluster, we saw sporadic huge IO wait in the VMs. The hosts running the OSDs seem fine.

I followed a similar discussion here:
http://lists.ceph.com/pipermail/ceph-users-ceph.com/2014-June/040621.html

Here is an example of a transaction that took some time:

{ "description": "osd_op(client.5275.0:262936 rbd_data.22e42ae8944a.0000000000000807 [] 3.c9699248 ack+ondisk+write e3158)",
  "received_at": "2014-09-23 15:23:30.820958",
  "age": "108.329989",
  "duration": "5.814286",
  "type_data": [
    "commit sent; apply or cleanup",
    { "client": "client.5275",
      "tid": 262936},
    [ { "time": "2014-09-23 15:23:30.821097",
        "event": "waiting_for_osdmap"},
      { "time": "2014-09-23 15:23:30.821282",
        "event": "reached_pg"},
      { "time": "2014-09-23 15:23:30.821384",
        "event": "started"},
      { "time": "2014-09-23 15:23:30.821401",
        "event": "started"},
      { "time": "2014-09-23 15:23:30.821459",
        "event": "waiting for subops from 14"},
      { "time": "2014-09-23 15:23:30.821561",
        "event": "commit_queued_for_journal_write"},
      { "time": "2014-09-23 15:23:30.821666",
        "event": "write_thread_in_journal_buffer"},
      { "time": "2014-09-23 15:23:30.822591",
        "event": "op_applied"},
      { "time": "2014-09-23 15:23:30.824707",
        "event": "sub_op_applied_rec"},
      { "time": "2014-09-23 15:23:31.225157",
        "event": "journaled_completion_queued"},
      { "time": "2014-09-23 15:23:31.225297",
        "event": "op_commit"},
      { "time": "2014-09-23 15:23:36.635085",
        "event": "sub_op_commit_rec"},
      { "time": "2014-09-23 15:23:36.635132",
        "event": "commit_sent"},
      { "time": "2014-09-23 15:23:36.635244",
        "event": "done"}]]}

sub_op_commit_rec took about 5 seconds to complete, which makes me think that the replication is the bottleneck.
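On the question of where the replication for this op went: the event list above already names the target in "waiting for subops from 14", i.e. the sub-op was sent to osd.14. A small, hypothetical helper to pull those OSD ids out of a historic-op entry of this shape (the event string format is an assumption based only on the trace above, not a documented API):

```python
import re

def subop_targets(op):
    # Scan the event list (third element of "type_data") for
    # "waiting for subops from N[,M...]" and return the OSD ids.
    # The string format is inferred from the trace above.
    targets = []
    for ev in op["type_data"][2]:
        m = re.match(r"waiting for subops from ([\d,]+)$", ev["event"])
        if m:
            targets.extend(int(x) for x in m.group(1).split(","))
    return targets

# Trimmed-down version of the op shown above.
op = {"type_data": [
    "commit sent; apply or cleanup",
    {"client": "client.5275", "tid": 262936},
    [{"time": "2014-09-23 15:23:30.821459",
      "event": "waiting for subops from 14"}]]}
print(subop_targets(op))  # → [14]
```

Independently of the dump, "ceph osd map <pool-name> <object-name>" prints the PG and the acting OSD set for a given object, which answers the same question for any object in the pool.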
I didn't find any timeouts in Ceph's logs, and the cluster is healthy. When some VMs have high IO wait, the cluster is doing only between 2 and 40 op/s.

I would gladly submit more information if needed. How can I dig deeper? For example, is there a way to know to which OSD the replication is done for that specific transaction?

Cheers,
Alexandre B?choley

_______________________________________________
ceph-users mailing list
ceph-users at lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com