IO wait spike in VM

Hi Alexandre,

No problem, I hope this saves you some pain.

It's probably worth going for a larger journal, around 20 GB, if you wish to play with tuning "filestore max sync interval"; that could have some interesting results.
You probably already know this, but most of us, when starting with Ceph, put the journal in a file on XFS instead of using a raw partition; using a raw partition removes the file system overhead on the journal.
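For reference, here's a rough sketch of the relevant ceph.conf settings (the values are purely illustrative, not a recommendation):

[osd]
# 20 GB journal (value is in MB)
osd journal size = 20480
# default is 5 seconds
filestore max sync interval = 30

You'd want to benchmark before and after any change to the sync interval.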

I highly recommend looking into dedicated journals for your systems, as your spinning disks are going to work very hard trying to keep up with all the read/write seeking, particularly if you're going to be using them for VMs.
Also, with journals on the same disk you'll get about 1/3 of the write performance as a best-case scenario, since every write hits the disk twice (journal then data) and it all comes down to the disk's IOPS.

Depending on your hardware & budget, you could look into one of these options for dedicated journals:

Intel DC P3700 400GB PCIe - good for about ~1000 MB/s write (I haven't tested these myself, however we are looking to use them in our additional nodes)
Intel DC S3700 200GB - good for about ~360 MB/s write

At the time we used the Intel DC S3700 100GB; these drives don't have enough throughput, so I'd recommend you stay away from this particular 100GB model.

So if you have spare hard disk slots in your servers, the 200GB DC S3700 is the best bang for buck. Usually I run 6 spinning disks to 1 SSD; in an ideal world I'd cut this back to 4 instead of 6 when using the 200GB disks.

Both of these SSD options would do nicely; they have on-board capacitors and very high write endurance ratings as well.

Cheers,
Quenten Grasso

-----Original Message-----
From: Bécholey Alexandre [mailto:alexandre.becholey@xxxxxxxxx]
Sent: Monday, 29 September 2014 4:15 PM
To: Quenten Grasso; ceph-users at lists.ceph.com
Cc: Aviolat Romain
Subject: RE: IO wait spike in VM

Hello Quenten,

Thanks for your reply.

We have a 5GB journal for each OSD on the same disk.

Right now, we are migrating our OSD to XFS and we'll add a 5th monitor. We will perform the benchmarks afterwards.

Cheers,
Alexandre

-----Original Message-----
From: Quenten Grasso [mailto:qgrasso@xxxxxxxxxx] 
Sent: Monday, 29 September 2014 01:56
To: Bécholey Alexandre; ceph-users at lists.ceph.com
Cc: Aviolat Romain
Subject: RE: IO wait spike in VM

G'day Alexandre

I'm not sure if this is causing your issues, however it could be contributing to them. 

I noticed you have 4 MONs; this could be contributing to your problems. Because of the Paxos algorithm which Ceph uses to achieve quorum among the monitors, you should be running an odd number of MONs (1, 3, 5, 7, etc.). It's also worth mentioning that running 4 MONs still only allows you to lose 1 MON without an outage.

Spec-wise the machines look pretty good; the only things I can see are the lack of dedicated journals and the use of btrfs at this stage.

You could try some iperf testing between the machines to make sure the networking is working as expected.
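For example, something like this (plain iperf2 usage, hostnames are placeholders):

  on node A:  iperf -s
  on node B:  iperf -c <node-A> -t 30 -P 4

With -P 4 (four parallel streams) you should see close to line rate on a healthy 10GbE link.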

If you do rados benches for extended time what kind of stats do you see?

For example,

Write)
ceph osd pool create benchmark1 XXXX XXXX
ceph osd pool set benchmark1 size 3
rados bench -p benchmark1 180 write --no-cleanup --concurrent-ios=32

* I suggest you create a 2nd benchmark pool and write to it for another 180 seconds or so, to ensure nothing is cached, then do a read test.

Read)
rados bench -p benchmark1 180 seq --concurrent-ios=32

You can also try the same using 4k blocks

rados bench -p benchmark1 180 write -b 4096 --no-cleanup --concurrent-ios=32
rados bench -p benchmark1 180 seq -b 4096

As you may know, increasing the concurrent IOs will increase CPU/disk load.

XXXX = total PGs = OSDs * 100 / replicas
i.e. a 50 OSD system with 3 replicas would be around 1600
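In your case, with 16 OSDs and assuming 3 replicas, that works out to 16 * 100 / 3 ≈ 533, so something in the region of 512 (PG counts are usually rounded to a nearby power of two).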

Hope this helps a little,

Cheers,
Quenten Grasso


-----Original Message-----
From: ceph-users [mailto:ceph-users-bounces@xxxxxxxxxxxxxx] On Behalf Of Bécholey Alexandre
Sent: Thursday, 25 September 2014 1:27 AM
To: ceph-users at lists.ceph.com
Cc: Aviolat Romain
Subject: IO wait spike in VM

Dear Ceph guru,

We have a Ceph cluster (version 0.80.5 38b73c67d375a2552d8ed67843c8a65c2c0feba6) with 4 MONs and 16 OSDs (4 per host) used as backend storage for libvirt.

Hosts:
Ubuntu 14.04
CPU: 2 Xeon X5650
RAM: 48 GB (no swap)
No SSD for journals
HDD: 4 WDC WD2003FYYS-02W0B0 (2 TB, 7200 rpm) dedicated to OSD (one partition for the journal, the rest for the OSD)
FS: btrfs (I know it's not recommended in the doc and I hope it's not the culprit)
Network: dedicated 10GbE

As we added some VMs to the cluster, we started seeing sporadic, huge IO waits inside the VMs. The hosts running the OSDs seem fine.
I followed a similar discussion here: http://lists.ceph.com/pipermail/ceph-users-ceph.com/2014-June/040621.html

Here is an example of a transaction that took some time:

        { "description": "osd_op(client.5275.0:262936 rbd_data.22e42ae8944a.0000000000000807 [] 3.c9699248 ack+ondisk+write e3158)",
          "received_at": "2014-09-23 15:23:30.820958",
          "age": "108.329989",
          "duration": "5.814286",
          "type_data": [
                "commit sent; apply or cleanup",
                { "client": "client.5275",
                  "tid": 262936},
                [
                    { "time": "2014-09-23 15:23:30.821097",
                      "event": "waiting_for_osdmap"},
                    { "time": "2014-09-23 15:23:30.821282",
                      "event": "reached_pg"},
                    { "time": "2014-09-23 15:23:30.821384",
                      "event": "started"},
                    { "time": "2014-09-23 15:23:30.821401",
                      "event": "started"},
                    { "time": "2014-09-23 15:23:30.821459",
                      "event": "waiting for subops from 14"},
                    { "time": "2014-09-23 15:23:30.821561",
                      "event": "commit_queued_for_journal_write"},
                    { "time": "2014-09-23 15:23:30.821666",
                      "event": "write_thread_in_journal_buffer"},
                    { "time": "2014-09-23 15:23:30.822591",
                      "event": "op_applied"},
                    { "time": "2014-09-23 15:23:30.824707",
                      "event": "sub_op_applied_rec"},
                    { "time": "2014-09-23 15:23:31.225157",
                      "event": "journaled_completion_queued"},
                    { "time": "2014-09-23 15:23:31.225297",
                      "event": "op_commit"},
                    { "time": "2014-09-23 15:23:36.635085",
                      "event": "sub_op_commit_rec"},
                    { "time": "2014-09-23 15:23:36.635132",
                      "event": "commit_sent"},
                    { "time": "2014-09-23 15:23:36.635244",
                      "event": "done"}]]}

sub_op_commit_rec took about 5 seconds to complete (op_commit at 15:23:31.225297, sub_op_commit_rec at 15:23:36.635085), which makes me think that the replication is the bottleneck.

I didn't find any timeouts in Ceph's logs. The cluster is healthy. When some VMs have high IO wait, the cluster op/s is between 2 and 40.
I would gladly provide more information if needed.

How can I dig deeper? For example, is there a way to know which OSD the replication goes to for that specific transaction?

Cheers,
Alexandre Bécholey


_______________________________________________
ceph-users mailing list
ceph-users at lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com

