Hi Greg,
thank you for your (fast) answer.
Since we're going more in-depth, I must say:
- we're running 2 Gentoo GNU/Linux servers doing both storage
and virtualization (I know this is not recommended, but we mostly
have a low load and virtually no writes outside of Ceph)
- sys-cluster/ceph-0.56.4 USE="radosgw -debug -fuse -gtk -libatomic -static-libs
-tcmalloc"
- app-emulation/qemu-1.2.2-r3 USE="aio caps curl jpeg
ncurses png rbd sasl seccomp threads uuid vhost-net vnc -alsa -bluetooth -brltty -debug -doc -fdt
-mixemu -opengl -pulseaudio -python -sdl (-selinux)
-smartcard -spice -static -static-softmmu -static-user
-systemtap -tci -tls -usbredir -vde -virtfs -xattr -xen -xfs"
- app-emulation/libvirt-1.0.2-r2 USE="caps iscsi libvirtd
lvm lxc macvtap nls pcap python qemu rbd sasl udev vepa
virt-network -audit -avahi -firewalld
-fuse -nfs -numa -openvz -parted -phyp -policykit (-selinux)
-uml -virtualbox -xen"
- 1 SSD, 3 HDDs per host.
- monitor filesystems on SSD
- OSD journals on SSD
- OSD data on spinnies
- [client]
rbd cache = true
rbd cache size = 128M
rbd cache max dirty = 32M
- We can pay for some support if required ;)
- I know Cuttlefish has some scrub-related optimizations, but we
cannot upgrade right now (the knobs I am considering on bobtail
instead are sketched just after this list)
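For reference, this is the kind of [osd] tuning I have in mind to
soften scrubbing and recovery impact on 0.56. The option names are the
scrub/recovery settings I know of in bobtail; the values are only my
guesses, not something we run yet:

    [osd]
        # at most this many concurrent scrubs per OSD
        osd max scrubs = 1
        # do not start new scrubs when the host load average is above this
        osd scrub load threshold = 0.5
        # spread regular scrubs over a longer window (seconds)
        osd scrub min interval = 86400
        osd scrub max interval = 604800
        # throttle recovery/backfill so client I/O keeps some headroom
        osd recovery max active = 2
        osd max backfills = 2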
On 09/07/2013 13:04, Gregory Farnum wrote:
> What kinds of performance drops are you seeing during recovery?
Mostly high latencies that make some websites unresponsive (LAMP
stacks, mostly). Same thing for some email servers. Another problem
is that Munin has difficulty fetching its data from the VMs during
scrubs (the Munin server is also a VM, and writing at that time is
okay).
On a sample host HDD, my latency averages are:

                            | Read (ms) | Write (ms) | Utilization (%) | Read throughput (kB/s) | Write throughput (kB/s)
not scrubbing (07:26-09:58) |     10.08 |     195.41 |           19.06 |                  80.40 |                  816.84
scrubbing (10:00-11:20)     |     14.02 |     198.08 |           27.73 |                 102.30 |                  797.76
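Those averages come from munin. For an ad-hoc check while a scrub is
running, I would use something like the following (sdb is just an
example OSD data disk):

    # extended per-device stats in kB: one 60-second sample after the
    # initial since-boot summary; watch await, %util, rkB/s and wkB/s
    iostat -xk sdb 60 2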
On a sample web and email server:

                            | data coverage (approx.) | Read (ms) | Write (ms)
not scrubbing (07:26-09:58) |                    100% |     45.02 |       7.36
scrubbing (10:00-11:20)     |                  20-30% |    432.73 |     181.19
> If, for instance, you've got clients sending lots of operations
> that are small compared to object size, then the bounding won't
> work out quite right, or maybe you're just knocking out a bunch of
> servers and getting bad long-tail latency effects.
I don't think I can answer this. I tend to think it's the first
case, because the drives don't seem to hit even 50% utilization (CPU
is around 3% and I have more than 40 GB of "free" RAM).
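If it helps, I can also pull numbers straight from the OSD admin
sockets to see what the daemons themselves report; a rough sketch,
assuming the default socket path and osd.0 as an example:

    # internal per-OSD counters (op counts, latencies, journal queue...)
    ceph --admin-daemon /var/run/ceph/ceph-osd.0.asok perf dump
    # running configuration, to double-check the scrub/recovery settings
    ceph --admin-daemon /var/run/ceph/ceph-osd.0.asok config show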