Dear Ceph gurus,

We have a Ceph cluster (version 0.80.5, 38b73c67d375a2552d8ed67843c8a65c2c0feba6) with 4 MONs and 16 OSDs (4 per host) used as backend storage for libvirt.

Hosts: Ubuntu 14.04
CPU: 2x Xeon X5650
RAM: 48 GB (no swap)
No SSD for journals
HDD: 4x WDC WD2003FYYS-02W0B0 (2 TB, 7200 rpm), each dedicated to an OSD (one partition for the journal, the rest for the OSD data)
FS: btrfs (I know it is not recommended in the documentation and I hope it is not the culprit)
Network: dedicated 10GbE

As we added more VMs to the cluster, we started to see sporadic, very high I/O wait inside the VMs. The hosts running the OSDs seem fine. I followed a similar discussion here:
http://lists.ceph.com/pipermail/ceph-users-ceph.com/2014-June/040621.html

Here is an example of an operation that took a long time:

{ "description": "osd_op(client.5275.0:262936 rbd_data.22e42ae8944a.0000000000000807 [] 3.c9699248 ack+ondisk+write e3158)",
  "received_at": "2014-09-23 15:23:30.820958",
  "age": "108.329989",
  "duration": "5.814286",
  "type_data": [
        "commit sent; apply or cleanup",
        { "client": "client.5275",
          "tid": 262936},
        [ { "time": "2014-09-23 15:23:30.821097",
            "event": "waiting_for_osdmap"},
          { "time": "2014-09-23 15:23:30.821282",
            "event": "reached_pg"},
          { "time": "2014-09-23 15:23:30.821384",
            "event": "started"},
          { "time": "2014-09-23 15:23:30.821401",
            "event": "started"},
          { "time": "2014-09-23 15:23:30.821459",
            "event": "waiting for subops from 14"},
          { "time": "2014-09-23 15:23:30.821561",
            "event": "commit_queued_for_journal_write"},
          { "time": "2014-09-23 15:23:30.821666",
            "event": "write_thread_in_journal_buffer"},
          { "time": "2014-09-23 15:23:30.822591",
            "event": "op_applied"},
          { "time": "2014-09-23 15:23:30.824707",
            "event": "sub_op_applied_rec"},
          { "time": "2014-09-23 15:23:31.225157",
            "event": "journaled_completion_queued"},
          { "time": "2014-09-23 15:23:31.225297",
            "event": "op_commit"},
          { "time": "2014-09-23 15:23:36.635085",
            "event": "sub_op_commit_rec"},
          { "time": "2014-09-23 15:23:36.635132",
            "event": "commit_sent"},
          { "time": "2014-09-23 15:23:36.635244",
            "event": "done"}]]}

sub_op_commit_rec took about 5 seconds to complete, which makes me think that replication is the bottleneck. I did not find any timeouts in the Ceph logs, and the cluster is healthy. When some VMs show high I/O wait, the cluster is only doing between 2 and 40 op/s.

I would gladly provide more information if needed. How can I dig deeper? For example, is there a way to know to which OSD the replication was sent for that specific operation?

Cheers,
Alexandre Bécholey
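P.S. To see where the time actually went, I ranked the gaps between consecutive events with a quick Python script. The timestamps below are copied verbatim from the dump above; nothing else is assumed:

```python
from datetime import datetime

# (timestamp, event) pairs copied from the dump_historic_ops output above
events = [
    ("2014-09-23 15:23:30.821097", "waiting_for_osdmap"),
    ("2014-09-23 15:23:30.821282", "reached_pg"),
    ("2014-09-23 15:23:30.821384", "started"),
    ("2014-09-23 15:23:30.821401", "started"),
    ("2014-09-23 15:23:30.821459", "waiting for subops from 14"),
    ("2014-09-23 15:23:30.821561", "commit_queued_for_journal_write"),
    ("2014-09-23 15:23:30.821666", "write_thread_in_journal_buffer"),
    ("2014-09-23 15:23:30.822591", "op_applied"),
    ("2014-09-23 15:23:30.824707", "sub_op_applied_rec"),
    ("2014-09-23 15:23:31.225157", "journaled_completion_queued"),
    ("2014-09-23 15:23:31.225297", "op_commit"),
    ("2014-09-23 15:23:36.635085", "sub_op_commit_rec"),
    ("2014-09-23 15:23:36.635132", "commit_sent"),
    ("2014-09-23 15:23:36.635244", "done"),
]

fmt = "%Y-%m-%d %H:%M:%S.%f"
times = [datetime.strptime(t, fmt) for t, _ in events]

# Time elapsed between each event and the one before it
deltas = [(events[i + 1][1], (times[i + 1] - times[i]).total_seconds())
          for i in range(len(events) - 1)]

# Print the three largest gaps
for name, dt in sorted(deltas, key=lambda x: -x[1])[:3]:
    print(f"{name}: +{dt:.3f}s")
```

The wait before sub_op_commit_rec (~5.4 s) accounts for almost all of the 5.8 s total duration, which is why I suspect the replica side rather than the primary's own journal or apply path.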