Re: Improving responsiveness of KVM guests on Ceph storage

On Mon, Dec 31, 2012 at 2:58 PM, Jens Kristian Søgaard
<jens@xxxxxxxxxxxxxxxxxxxx> wrote:
> Hi Andrey,
>
>
>> As I understand it, you have an md device holding both the journal and
>> the filestore? What type of RAID do you have here?
>
>
> Yes, the same md device holds both the journal and the filestore. It is a RAID5.

Ahem, of course you need to reassemble it into something faster :)
>
>
>> Of course you'll need a
>> separate device (for experimental purposes, a fast disk may be enough)
>> for the journal
>
>
> Is there a way to tell if the journal is the bottleneck without actually
> adding such an extra device?
>
In theory, yes - but your setup is already dying under a high amount of
write seeks, so it may not be necessary. I also don't see a good way to
measure the bottleneck when the same disk device is used for both the
filestore and the journal. With separate devices you can measure the
maximum throughput with fio and compare it against the actual figures
calculated from /proc/diskstats; the "all-in-one" case is obviously hard
to measure, even if you were able to log writes to the journal file and
the filestore files separately without significant overhead.
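
For what it's worth, a rough sketch of such a measurement on a dedicated
journal device could look like this (the device name and the fio
parameters are placeholders, not taken from your setup; writing to a raw
device is destructive, so point it at a spare partition or a file):

  # measure the sequential write ceiling of the candidate journal device
  fio --name=journal-test --filename=/dev/sdX --rw=write --bs=4m \
      --direct=1 --ioengine=libaio --iodepth=16 --runtime=60 --time_based

  # then compare with what the OSD actually pushes to it; field 10 of
  # /proc/diskstats is cumulative sectors written (512-byte units),
  # so sample it twice over an interval to get a rate
  awk '$3 == "sdX" { print $10 * 512 / 1024 / 1024 " MB written" }' /proc/diskstats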

>
>> filestore partition, you may also change it to a simple RAID0, or even
>> separate disks, and create one osd over every disk (you should see to
>
>
> I have only 3 OSDs with 4 disks each. I was afraid that it would be too
> brittle as a RAID0, and if I created separate OSDs for each disk, it would
> stall the file system due to recovery if a server crashes.

No, it isn't too bad in most cases. The recovery process does not affect
operations on the rbd storage beyond a small performance degradation, so
you can split your RAID setup into a lightweight RAID0. It depends on the
controller: on a plain SATA controller, a software RAID0 under one OSD
will do a better job than two or more separate OSDs with one disk each,
while on a cache-backed controller separate OSDs are preferable, as long
as the controller is still able to align writes given the overall write
bandwidth.
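
If you do end up splitting to one OSD per disk, the ceph.conf layout is
just one [osd.N] section per spindle, something like this sketch (the
hostname, paths and devices are placeholders for your setup):

  [osd.0]
      host = node1
      ; e.g. /dev/sdb mounted at this path
      osd data = /var/lib/ceph/osd/ceph-0
      osd journal = /var/lib/ceph/osd/ceph-0/journal

  [osd.1]
      host = node1
      ; e.g. /dev/sdc mounted at this path
      osd data = /var/lib/ceph/osd/ceph-1
      osd journal = /var/lib/ceph/osd/ceph-1/journal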

>
>
>> What size of cache_size/max_dirty do you have inside ceph.conf
>
>
> I haven't set them explicitly, so I imagine the cache_size is 32 MB and the
> max_dirty is 24 MB.
>
>
>> and which
>> qemu version do you use?
>
>
> Using the default 0.15 version in Fedora 16.
>
>
>> tasks increasing cache may help the OS to align writes more smoothly. Also,
>> you don't need to set rbd_cache explicitly in the disk config with
>> qemu 1.2 and newer releases; for older ones
>> http://lists.gnu.org/archive/html/qemu-devel/2012-05/msg02500.html
>> should be applied.
>
>
> I read somewhere that I needed to enable it specifically for older qemu-kvm
> versions, which I did like this:
>
>   format=rbd,file=rbd:data/image1:rbd_cache=1,if=virtio
>
> However now I read in the docs for qemu-rbd that it needs to be set like
> this:
>
>   format=raw,file=rbd:data/squeeze:rbd_cache=true,cache=writeback
>
> I'm not sure if 1 and true are interpreted the same way?
>
> I'll try using "true" and see if I get any noticeable changes in behaviour.
>
> The link you sent me seems to indicate that I need to compile my own version
> of qemu-kvm to be able to test this?
>

No, there have been no significant changes from 0.15 to the current
version, and your options will work just fine. So the general
recommendation would be to remove the redundancy from your disk backend
and then move the journal out to a separate disk or SSD.
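
As a sketch, moving the journal and bumping the client-side rbd cache
would look roughly like this in ceph.conf (the device path and the sizes
are placeholders, pick what fits your SSD and RAM):

  [osd.0]
      ; journal on a partition of a separate fast disk or SSD
      osd journal = /dev/sdY1

  [client]
      rbd cache = true
      ; 64 MB instead of the 32 MB default
      rbd cache size = 67108864
      ; 48 MB instead of the 24 MB default
      rbd cache max dirty = 50331648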

>
>
> --
> Jens Kristian Søgaard, Mermaid Consulting ApS,
> jens@xxxxxxxxxxxxxxxxxxxx,
> http://www.mermaidconsulting.com/

