Sorry, I haven't had a chance to attempt to reproduce yet. I do know that the librbd in-memory cache does not restrict in-flight IO to the configured cache size. Therefore, if you are performing 4MB writes with a queue depth of 256, you might see up to 1GB of memory allocated from the heap for handling the cache. QEMU would also duplicate the IO memory for a bounce buffer (eliminated in the latest versions of QEMU and librbd), and librbd copies the IO memory again to ensure ownership (a known issue we would like to solve) -- that would account for an additional 2GB of memory allocations under this scenario. This would only be a transient spike of heap usage while the IO is in flight, but since I'm pretty sure the glibc allocator's default behavior is not to return freed slabs to the OS, I would expect the high memory overhead to remain for the life of the process.

Please feel free to open a tracker ticket here [1] and I can look into it when I get some time.

[1] http://tracker.ceph.com/projects/rbd/issues

On Tue, May 16, 2017 at 2:52 AM, nick <nick@xxxxxxx> wrote:
> Hi Jason,
> did you have some time to check whether you can reproduce the high memory usage? I am not sure if I should create a bug report for this or if this is expected behaviour.
>
> Cheers
> Nick
>
> On Monday, May 08, 2017 08:55:55 AM you wrote:
>> Thanks. One more question: was the image a clone or a stand-alone image?
>>
>> On Fri, May 5, 2017 at 2:42 AM, nick <nick@xxxxxxx> wrote:
>> > Hi,
>> > I used one of the fio example files and changed it a bit:
>> >
>> > """
>> > # This job file tries to mimic the Intel IOMeter File Server Access Pattern
>> > [global]
>> > description=Emulation of Intel IOmeter File Server Access Pattern
>> > randrepeat=0
>> > filename=/root/test.dat
>> > # IOMeter defines the server loads as the following:
>> > # iodepth=1 Linear
>> > # iodepth=4 Very Light
>> > # iodepth=8 Light
>> > # iodepth=64 Moderate
>> > # iodepth=256 Heavy
>> > iodepth=8
>> > size=80g
>> > direct=0
>> > ioengine=libaio
>> >
>> > [iometer]
>> > stonewall
>> > bs=4M
>> > rw=randrw
>> >
>> > [iometer_just_write]
>> > stonewall
>> > bs=4M
>> > rw=write
>> >
>> > [iometer_just_read]
>> > stonewall
>> > bs=4M
>> > rw=read
>> > """
>> >
>> > Then let it run:
>> > $> while true; do fio stress.fio; rm /root/test.dat; done
>> >
>> > I had this running over a weekend.
>> >
>> > Cheers
>> > Sebastian
>> >
>> > On Tuesday, May 02, 2017 02:51:06 PM Jason Dillaman wrote:
>> >> Can you share the fio job file that you utilized so I can attempt to repeat locally?
>> >>
>> >> On Tue, May 2, 2017 at 2:51 AM, nick <nick@xxxxxxx> wrote:
>> >> > Hi Jason,
>> >> > thanks for your feedback. I have now done some tests over the weekend to verify the memory overhead.
>> >> > I was using qemu 2.8 (taken from the Ubuntu Cloud Archive) with librbd 10.2.7 on Ubuntu 16.04 hosts. I suspected the ceph rbd cache to be the cause of the overhead, so I just generated a lot of IO with the help of fio in the VMs (with a data size of 80GB). All VMs had 3GB of memory. I had to run fio multiple times before reaching high RSS values.
>> >> > I also noticed that when using larger blocksizes during writes (like 4M) the memory overhead in the KVM process increased faster.
>> >> > I ran several fio tests (one after another) and the results are:
>> >> >
>> >> > KVM with writeback RBD cache: max. 85% memory overhead (2.5 GB overhead)
>> >> > KVM with writethrough RBD cache: max. 50% memory overhead
>> >> > KVM without RBD caching: less than 10% overhead all the time
>> >> > KVM with local storage (logical volume used): 8% overhead all the time
>> >> >
>> >> > I did not reach those >200% memory overhead results that we see on our live cluster, but those virtual machines have a much longer uptime as well.
>> >> >
>> >> > I also tried to reduce the RSS memory value with cache dropping on the physical host and in the VM. Neither led to any change. A reboot of the VM does not change anything either (a reboot inside the VM, not a new KVM process). So far the only way to reduce the RSS memory value is a live migration. Might this be a bug? The memory overhead seems a bit too high to me.
>> >> >
>> >> > Best Regards
>> >> > Sebastian
>> >> >
>> >> > On Thursday, April 27, 2017 10:08:36 AM you wrote:
>> >> >> I know we noticed high memory usage due to librados in the Ceph multipathd checker [1] -- on the order of hundreds of megabytes. That client was probably nearly as trivial as an application can get, and I just assumed it was due to large monitor maps being sent to the client for whatever reason. Since we changed course on our RBD iSCSI implementation, unfortunately the investigation into this high memory usage fell by the wayside.
>> >> >>
>> >> >> [1] http://git.opensvc.com/gitweb.cgi?p=multipath-tools/.git;a=blob;f=libmultipath/checkers/rbd.c;h=9ea0572f2b5bd41b80bf2601137b74f92bdc7278;hb=HEAD
>> >> >>
>> >> >> On Thu, Apr 27, 2017 at 5:26 AM, nick <nick@xxxxxxx> wrote:
>> >> >> > Hi Christian,
>> >> >> > thanks for your answer.
>> >> >> > The highest value I can see for a local storage VM in our infrastructure is a memory overhead of 39%. This is big, but the majority (>90%) of our local storage VMs are using less than 10% memory overhead.
>> >> >> > For ceph storage based VMs this looks quite different. The highest value I can currently see is 244% memory overhead. So that specific VM, with 3GB of allocated memory, is now using 10.3 GB RSS memory on the physical host. This is a really huge value. In general I can see that the majority of the ceph based VMs have more than 60% memory overhead.
>> >> >> >
>> >> >> > Maybe this is also a bug related to qemu+librbd. It would just be nice to know if other people are seeing those high values as well.
>> >> >> >
>> >> >> > Cheers
>> >> >> > Sebastian
>> >> >> >
>> >> >> > On Thursday, April 27, 2017 06:10:48 PM you wrote:
>> >> >> >> Hello,
>> >> >> >>
>> >> >> >> Definitely seeing about 20% overhead with Hammer as well, so not version specific from where I'm standing.
>> >> >> >>
>> >> >> >> While non-RBD storage VMs by and large tend to be closer to the specified size, I've seen them exceed it by a few % at times, too. For example a 4317968KB RSS one that ought to be 4GB.
>> >> >> >>
>> >> >> >> Regards,
>> >> >> >>
>> >> >> >> Christian
>> >> >> >>
>> >> >> >> On Thu, 27 Apr 2017 09:56:48 +0200 nick wrote:
>> >> >> >> > Hi,
>> >> >> >> > we are running a jewel ceph cluster which serves RBD volumes for our KVM virtual machines.
>> >> >> >> > Recently we noticed that our KVM machines use a lot more memory on the physical host system than what they should use. We collect the data with a python script which basically executes 'virsh dommemstat <virtual machine name>'. We also verified the results of the script against the memory stats in 'cat /proc/<kvm PID>/status' for each virtual machine, and the results are the same.
>> >> >> >> >
>> >> >> >> > Here is an excerpt for one physical host where all virtual machines have been running since yesterday (virtual machine names removed):
>> >> >> >> >
>> >> >> >> > """
>> >> >> >> > overhead    actual    percent_overhead  rss
>> >> >> >> > ----------  --------  ----------------  --------
>> >> >> >> > 423.8 MiB   2.0 GiB   20                2.4 GiB
>> >> >> >> > 460.1 MiB   4.0 GiB   11                4.4 GiB
>> >> >> >> > 471.5 MiB   1.0 GiB   46                1.5 GiB
>> >> >> >> > 472.6 MiB   4.0 GiB   11                4.5 GiB
>> >> >> >> > 681.9 MiB   8.0 GiB   8                 8.7 GiB
>> >> >> >> > 156.1 MiB   1.0 GiB   15                1.2 GiB
>> >> >> >> > 278.6 MiB   1.0 GiB   27                1.3 GiB
>> >> >> >> > 290.4 MiB   1.0 GiB   28                1.3 GiB
>> >> >> >> > 291.5 MiB   1.0 GiB   28                1.3 GiB
>> >> >> >> > 0.0 MiB     16.0 GiB  0                 13.7 GiB
>> >> >> >> > 294.7 MiB   1.0 GiB   28                1.3 GiB
>> >> >> >> > 135.6 MiB   1.0 GiB   13                1.1 GiB
>> >> >> >> > 0.0 MiB     2.0 GiB   0                 1.4 GiB
>> >> >> >> > 1.5 GiB     4.0 GiB   37                5.5 GiB
>> >> >> >> > """
>> >> >> >> >
>> >> >> >> > We are using the rbd client cache for our virtual machines, but it is set to only 128MB per machine. There is also only one rbd volume per virtual machine. We have seen more than 200% memory overhead per KVM machine on other physical machines. After a live migration of the virtual machine to another host the overhead is back to 0 and slowly increases back to high values.
>> >> >> >> >
>> >> >> >> > Here are our ceph.conf settings for the clients:
>> >> >> >> > """
>> >> >> >> > [client]
>> >> >> >> > rbd cache writethrough until flush = False
>> >> >> >> > rbd cache max dirty = 100663296
>> >> >> >> > rbd cache size = 134217728
>> >> >> >> > rbd cache target dirty = 50331648
>> >> >> >> > """
>> >> >> >> >
>> >> >> >> > We have noticed this behavior since we started using the jewel librbd libraries. We did not encounter this behavior when using the ceph infernalis librbd version. We also do not see this issue when using local storage instead of ceph.
>> >> >> >> >
>> >> >> >> > Some version information of the physical host which runs the KVM machines:
>> >> >> >> > """
>> >> >> >> > OS: Ubuntu 16.04
>> >> >> >> > kernel: 4.4.0-75-generic
>> >> >> >> > librbd: 10.2.7-1xenial
>> >> >> >> > """
>> >> >> >> >
>> >> >> >> > We did try to flush and invalidate the client cache via the ceph admin socket, but this did not change any memory usage values.
>> >> >> >> >
>> >> >> >> > Does anyone encounter similar issues or have an explanation for the high memory overhead?
>> >> >> >> >
>> >> >> >> > Best Regards
>> >> >> >> > Sebastian
>> >> >> >
>> >> >> > --
>> >> >> > Sebastian Nickel
>> >> >> > Nine Internet Solutions AG, Albisriederstr. 243a, CH-8047 Zuerich
>> >> >> > Tel +41 44 637 40 00 | Support +41 44 637 40 40 | www.nine.ch
>> >> >> > _______________________________________________
>> >> >> > ceph-users mailing list
>> >> >> > ceph-users@xxxxxxxxxxxxxx
>> >> >> > http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>> >> >
>> >> > --
>> >> > Sebastian Nickel
>> >> > Nine Internet Solutions AG, Albisriederstr. 243a, CH-8047 Zuerich
>> >> > Tel +41 44 637 40 00 | Support +41 44 637 40 40 | www.nine.ch
>> >
>> > --
>> > Sebastian Nickel
>> > Nine Internet Solutions AG, Albisriederstr. 243a, CH-8047 Zuerich
>> > Tel +41 44 637 40 00 | Support +41 44 637 40 40 | www.nine.ch
>
> --
> Sebastian Nickel
> Nine Internet Solutions AG, Albisriederstr. 243a, CH-8047 Zuerich
> Tel +41 44 637 40 00 | Support +41 44 637 40 40 | www.nine.ch

--
Jason

_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
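
For reference, the in-flight memory accounting Jason describes at the top of the thread can be written out explicitly. The following is a minimal back-of-the-envelope sketch in Python, assuming his example figures (4 MiB writes at a queue depth of 256); the real numbers depend on the guest workload and on the QEMU and librbd versions in use.

"""
# Rough estimate of the transient heap spike described above: the librbd
# cache does not bound in-flight IO, QEMU keeps a bounce-buffer copy, and
# librbd makes one more copy to take ownership of the data.
MiB = 1024 * 1024

block_size = 4 * MiB    # assumed guest write size
queue_depth = 256       # assumed number of writes in flight

in_flight = block_size * queue_depth   # data held for the librbd cache (~1 GiB)
bounce_buffer = in_flight              # QEMU bounce-buffer duplicate (~1 GiB)
librbd_copy = in_flight                # librbd ownership copy (~1 GiB)

total = in_flight + bounce_buffer + librbd_copy
print("transient heap spike: about %.1f GiB" % (total / float(1024 * MiB)))
"""

Because glibc typically keeps such freed heap around instead of returning it to the OS, a spike like this can show up as permanently increased RSS, which would match the long-lived overhead reported above.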
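The overhead table in the original post comes from comparing the memory a domain is supposed to use ('actual' from 'virsh dommemstat') with the RSS of the corresponding QEMU process. Below is a minimal sketch of that kind of collector, assuming the 'virsh' CLI is available and that the 'actual' and 'rss' fields are present (both reported in KiB); it is an illustration, not the script actually used.

"""
#!/usr/bin/env python3
# Sketch of an overhead collector: for every running libvirt domain, compare
# the balloon target ('actual') with the process RSS reported by dommemstat.
import subprocess

def dommemstat(domain):
    out = subprocess.check_output(["virsh", "dommemstat", domain]).decode()
    # Output is "name value" pairs, one per line, values in KiB.
    return {k: int(v) for k, v in (line.split() for line in out.splitlines() if line.strip())}

def main():
    domains = subprocess.check_output(["virsh", "list", "--name"]).decode().split()
    print("%-12s %-10s %-18s %s" % ("overhead", "actual", "percent_overhead", "rss"))
    for dom in domains:
        stats = dommemstat(dom)
        actual, rss = stats["actual"], stats["rss"]
        overhead = max(rss - actual, 0)
        print("%8.1f MiB %6.1f GiB %6d %10.1f GiB" % (
            overhead / 1024.0,                  # KiB -> MiB
            actual / (1024.0 * 1024.0),         # KiB -> GiB
            100 * overhead // actual,           # percent overhead
            rss / (1024.0 * 1024.0)))           # KiB -> GiB

if __name__ == "__main__":
    main()
"""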
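The cache flush/invalidate attempt mentioned in the original post goes through the librbd admin socket of the QEMU process. A hedged sketch follows; the socket path is a placeholder (it depends on the 'admin socket' option in the [client] section of ceph.conf), and the exact cache command names and arguments should be confirmed from the socket's own 'help' output before relying on them.

"""
# Sketch: list the commands exposed by a librbd client admin socket so the
# cache flush/invalidate commands can be located and invoked. The socket
# path below is hypothetical.
import subprocess

ASOK = "/var/run/ceph/ceph-client.admin.asok"   # placeholder path

def asok(*words):
    return subprocess.check_output(["ceph", "--admin-daemon", ASOK] + list(words)).decode()

# 'help' lists the commands registered on this socket; librbd registers
# per-image cache flush/invalidate entries when an RBD image is open.
print(asok("help"))

# Example calls, assuming a hypothetical open image 'rbd/vm-disk-1' and that
# 'help' shows matching command names:
# print(asok("rbd", "cache", "flush", "rbd/vm-disk-1"))
# print(asok("rbd", "cache", "invalidate", "rbd/vm-disk-1"))
"""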