Re: cluster down during backfilling, Jewel tunables and client IO optimisations

On 22/06/16 17:54, Andrei Mikhailovsky wrote:
> Hi Daniel,
> 
> Many thanks for your useful tests and your results.
> 
> How much IO wait do you have on your client vms? Has it significantly increased or not?
> 

Hi Andrei,

Bearing in mind that this cluster is tiny (four nodes, each with four
OSDs), our metrics may not be that meaningful. However, on a VM that is
running ElasticSearch, collecting logs from Graylog, we're seeing no
more than about 5% iowait for a 5s period, and most of the time it's
below 1%. This VM is really not writing a lot of data though.
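For anyone wanting to reproduce this kind of measurement inside a guest, here is a minimal sketch that samples the aggregate iowait percentage over a window by diffing /proc/stat (Linux only; the field layout follows proc(5), and the 5-second window mirrors the interval quoted above — the function name and structure are just illustrative, not a tool from Ceph):

```python
# Sketch: estimate CPU iowait% over a sampling window from /proc/stat.
# Field order on the aggregate "cpu" line, per proc(5):
#   user nice system idle iowait irq softirq ...
import time

def read_cpu_times():
    with open("/proc/stat") as f:
        fields = f.readline().split()[1:]  # aggregate "cpu" line
    return [int(x) for x in fields]

def iowait_percent(interval=5.0):
    before = read_cpu_times()
    time.sleep(interval)
    after = read_cpu_times()
    deltas = [b - a for a, b in zip(before, after)]
    total = sum(deltas)
    # Index 4 is the iowait jiffies column.
    return 100.0 * deltas[4] / total if total else 0.0

print(f"iowait over window: {iowait_percent(1.0):.1f}%")
```

Tools like iostat or vmstat report the same counter, but diffing /proc/stat directly makes it obvious what is being measured.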

The cluster as a whole is peaking at only about 1200 write op/s,
according to ceph -w.

Executing a "sync" in a VM does of course show a noticeable delay due to
the recovery happening in the background, but nothing is waiting on IO
long enough to trigger the kernel's 120-second hung-task warning.

The recovery has been running for about four hours now, and is down to
20% misplaced objects. So far we have not had any clients block
indefinitely, so I think the migration of VMs to Jewel-capable
hypervisors did the trick.

Best,
Daniel

_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


