A very small 3-node Ceph cluster with this OSD tree: http://pastebin.com/mUhayBk9 has some performance issues. All 27 OSDs are 5TB SATA drives, it keeps two copies of everything, and it's really only intended for nearline backups of large data objects.

All of the OSDs look OK in terms of utilization. iostat reports them around 5-10 %util with occasional spikes up to 20%, typically doing 5-20 IOPS each. (Which, if we figure 100 IOPS per drive, matches up nicely with the %util.) The Ceph nodes also look healthy: CPU is 75-90% idle, and there is plenty of RAM (the smaller node has 32GiB, the larger ones have 64GiB). They have dual 10GBase-T NICs on an unloaded, dedicated storage switch.

The pool has 4096 placement groups. We currently have noscrub and nodeep-scrub set to eliminate scrubbing as a source of performance problems. When we increased the placement group count from the default to 4096, a ton of data moved, and quickly; the cluster is clearly capable of pushing data around at multi-gigabit speeds when it wants to.

Yet this cluster has (currently) one client: a KVM machine backed by a 32TB RBD image. It is a Linux VM running ZFS that receives ZFS snapshots from other machines to back them up. Its performance is pretty bad. While receiving even a single snapshot, iostat -x on the client reports that it is choking on I/O to the Ceph RBD image: 99.88 %util while doing 50-75 IOPS and 5-20 MB/sec of throughput.

Is there anything we can do to improve the performance of this configuration, or at least figure out why it is so bad? These are large SATA drives with no SSD OSDs, so we don't expect miracles, but it sure would be nice to see client I/O better than 50 IOPS and 20MB/sec.

Thanks in advance for any help or guidance on this!
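
P.S. In case it helps to separate the RADOS layer from the RBD/VM path, here is a rough sketch of the kind of sequential-write probe we could run from the client host using the Python rados bindings. The pool name 'rbd', object count, and object names are placeholders, not our actual values; corrections welcome if this isn't a sensible way to test.

    #!/usr/bin/env python
    # Rough sequential-write probe against the pool backing the RBD image.
    # Pool name, object count, and object names are placeholders.
    import time
    import rados

    cluster = rados.Rados(conffile='/etc/ceph/ceph.conf')
    cluster.connect()
    ioctx = cluster.open_ioctx('rbd')          # assumed pool name

    obj_size = 4 * 1024 * 1024                 # 4 MiB, the default RBD object size
    payload = b'\0' * obj_size
    count = 64                                 # 256 MiB total

    start = time.time()
    for i in range(count):
        # Synchronous full-object writes, one at a time.
        ioctx.write_full('perf-probe-%d' % i, payload)
    elapsed = time.time() - start
    print('%d x 4 MiB in %.1fs -> %.1f MB/s' %
          (count, elapsed, count * obj_size / elapsed / 1e6))

    # Clean up the probe objects.
    for i in range(count):
        ioctx.remove_object('perf-probe-%d' % i)

    ioctx.close()
    cluster.shutdown()

If this reports multi-hundred-MB/sec numbers while the VM still crawls, that would at least point us at the RBD/KVM side rather than the OSDs.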
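
P.P.S. Since zfs recv tends to issue small, mostly synchronous writes, per-write latency may matter more than raw throughput here. Below is a similarly hedged sketch of a latency probe against a throwaway RBD image, using the Python rbd bindings; the image name, sample count, and 4 KiB write size are made-up values for illustration. (For reference, 50-75 IOPS at ~100% utilization works out to roughly 13-20 ms per I/O, so if the probe reports medians in that range the client may simply be serializing on write latency.)

    #!/usr/bin/env python
    # Per-write latency probe against a scratch RBD image (not the production image).
    # Image name, sample count, and write size are placeholders.
    import time
    import rados
    import rbd

    cluster = rados.Rados(conffile='/etc/ceph/ceph.conf')
    cluster.connect()
    ioctx = cluster.open_ioctx('rbd')          # assumed pool name

    name = 'latency-probe'
    rbd.RBD().create(ioctx, name, 1024 * 1024 * 1024)   # 1 GiB scratch image
    image = rbd.Image(ioctx, name)

    block = b'\0' * 4096
    samples = []
    for i in range(200):
        t0 = time.time()
        image.write(block, i * 4096)           # small write at increasing offsets
        image.flush()                          # force it out, like a sync write
        samples.append(time.time() - t0)

    samples.sort()
    print('median write+flush latency: %.1f ms' % (samples[len(samples) // 2] * 1000))

    # Remove the scratch image again.
    image.close()
    rbd.RBD().remove(ioctx, name)
    ioctx.close()
    cluster.shutdown()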