Re: osd client op priority vs osd recovery op priority

Hi Greg,

thank you for your (fast) answer.

Since we're going more in-depth, I must say:
  • we're running 2 Gentoo GNU/Linux servers doing both storage and virtualization (I know this is not recommended but we mostly have a low load and virtually no writes outside of ceph)
  • sys-cluster/ceph-0.56.4  USE="radosgw -debug -fuse -gtk -libatomic -static-libs -tcmalloc"
  • app-emulation/qemu-1.2.2-r3  USE="aio caps curl jpeg ncurses png rbd sasl seccomp threads uuid vhost-net vnc -alsa -bluetooth -brltty -debug -doc -fdt -mixemu -opengl -pulseaudio -python -sdl (-selinux) -smartcard -spice -static -static-softmmu -static-user -systemtap -tci -tls -usbredir -vde -virtfs -xattr -xen -xfs"
  • app-emulation/libvirt-1.0.2-r2  USE="caps iscsi libvirtd lvm lxc macvtap nls pcap python qemu rbd sasl udev vepa virt-network -audit -avahi -firewalld -fuse -nfs -numa -openvz -parted -phyp -policykit (-selinux) -uml -virtualbox -xen"
  • 1 SSD, 3 HDDs per host.
  • monitor filesystems on SSD
  • OSD journals on SSD
  • OSD data on spinnies
  • [client]
        rbd cache = true
        rbd cache size = 128M
        rbd cache max dirty = 32M
  • We can pay for some support if required ;)
  • I know Cuttlefish has some scrub-related optimizations, but we cannot upgrade right now (see the ceph.conf sketch after this list)
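
Picking up the subject line: my understanding is that the client-vs-recovery balance lives in the [osd] section of ceph.conf. A minimal, untested sketch, assuming bobtail-era option names and that the defaults (63 for client ops, 10 for recovery ops) still apply to 0.56.4; please correct me if any of these are wrong:

    [osd]
        ; prefer client I/O over recovery I/O (63 and 10 are the assumed defaults)
        osd client op priority = 63
        osd recovery op priority = 1
        ; limit how much recovery/backfill and scrub work runs concurrently
        osd recovery max active = 1
        osd max backfills = 1
        osd max scrubs = 1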

On 09/07/2013 13:04, Gregory Farnum wrote:
> What kinds of performance drops are you seeing during recovery?

Mostly high latencies that make some websites unresponsive (LAMP stacks, mostly). The same goes for some email servers. Another problem is that my Munin has trouble fetching its data from the VMs during scrubs (the Munin server is also a VM, and writing at that time is okay).

On a sample host HDD, my latency averages are:

                             not scrubbing    scrubbing
                             (07:26-09:58)    (10:00-11:20)
  Read (ms)                          10.08            14.02
  Write (ms)                        195.41           198.08
  Utilization (%)                    19.06            27.73
  Read throughput (kB/s)             80.40           102.30
  Write throughput (kB/s)           816.84           797.76

On a sample web and email server:

                             not scrubbing    scrubbing
                             (07:26-09:58)    (10:00-11:20)
  data coverage (approx.)             100%           20-30%
  Read (ms)                          45.02           432.73
  Write (ms)                          7.36           181.19
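
If it helps, one way to double-check that those windows really line up with scrub activity would be to look for PGs in a scrubbing state. A rough sketch, assuming the plain 'ceph pg dump' output where scrubbing PGs carry "scrubbing" in their state column:

import subprocess

# List PGs that are currently scrubbing, to correlate with the latency windows above.
# Assumes the plain-text 'ceph pg dump' format: PG id in the first column, and a state
# column that contains 'scrubbing' while a scrub is running.
out = subprocess.check_output(["ceph", "pg", "dump"]).decode()
scrubbing = [line.split()[0] for line in out.splitlines() if "scrubbing" in line]
print("%d PG(s) scrubbing: %s" % (len(scrubbing), ", ".join(scrubbing) or "none"))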


> If for instance you've got clients sending lots of operations that are small compared to object size, then the bounding won't work out quite right, or maybe you're just knocking out a bunch of servers and getting bad long-tail latency effects.

I don't think I can answer this. I tend to think it's the first case, because the drives don't seem to hit even 50% utilization (CPU is around 3% and I have more than 40 GB of "free" RAM).
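
To check Greg's first case, I suppose one could compare the average client write size an OSD reports against the 4 MB default RBD object size. A rough sketch against the OSD admin socket; the socket path and the counter names (op_w, op_w_in_bytes) are assumptions on my side and may differ on 0.56:

import json
import subprocess

# Hypothetical admin socket path; adjust to one of the local OSDs.
ASOK = "/var/run/ceph/ceph-osd.0.asok"

# 'perf dump' prints the daemon's performance counters as JSON.
dump = json.loads(
    subprocess.check_output(["ceph", "--admin-daemon", ASOK, "perf", "dump"]).decode())

osd = dump.get("osd", {})
writes = osd.get("op_w", 0)                # client write ops (assumed counter name)
write_bytes = osd.get("op_w_in_bytes", 0)  # bytes carried by those ops (assumed counter name)

if writes:
    print("average client write: %.1f kB over %d ops"
          % (write_bytes / 1024.0 / writes, writes))
else:
    print("no client write ops recorded yet")

If that comes out at a few kB against 4 MB objects, it would point to the "small ops" case rather than long-tail effects.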

_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
