Sorry, I haven't had a chance to attempt to reproduce yet. I do know that the librbd in-memory cache does not restrict in-flight IO to the configured cache size. Therefore, if you are performing 4MB writes with a queue depth of 256, you might see up to 1GB of memory allocated from the heap for handling the cache. QEMU would also duplicate the IO memory for a bounce buffer (eliminated in the latest versions of QEMU and librbd), and librbd copies the IO memory again to ensure ownership (a known issue we would like to solve) -- that would account for an additional 2GB of memory allocations under this scenario. This would only be a transient spike of heap usage while the IO is in flight, but since I'm pretty sure the glibc allocator's default behavior is not to return freed slabs to the OS, I would expect the high memory overhead to remain for the life of the process.

Please feel free to open a tracker ticket here [1] and I can look into it when I get some time.

[1] http://tracker.ceph.com/projects/rbd/issues

On Tue, May 16, 2017 at 2:52 AM, nick <nick@xxxxxxx> wrote:
> Hi Jason,
> did you have some time to check whether you can reproduce the high memory usage? I am not sure if I should create a bug report for this or if this is expected behaviour.
>
> Cheers
> Nick
>
> On Monday, May 08, 2017 08:55:55 AM you wrote:
>> Thanks. One more question: was the image a clone or a stand-alone image?
>>
>> On Fri, May 5, 2017 at 2:42 AM, nick <nick@xxxxxxx> wrote:
>> > Hi,
>> > I used one of the fio example files and changed it a bit:
>> >
>> > """
>> > # This job file tries to mimic the Intel IOMeter File Server Access Pattern
>> > [global]
>> > description=Emulation of Intel IOmeter File Server Access Pattern
>> > randrepeat=0
>> > filename=/root/test.dat
>> > # IOMeter defines the server loads as the following:
>> > # iodepth=1 Linear
>> > # iodepth=4 Very Light
>> > # iodepth=8 Light
>> > # iodepth=64 Moderate
>> > # iodepth=256 Heavy
>> > iodepth=8
>> > size=80g
>> > direct=0
>> > ioengine=libaio
>> >
>> > [iometer]
>> > stonewall
>> > bs=4M
>> > rw=randrw
>> >
>> > [iometer_just_write]
>> > stonewall
>> > bs=4M
>> > rw=write
>> >
>> > [iometer_just_read]
>> > stonewall
>> > bs=4M
>> > rw=read
>> > """
>> >
>> > Then let it run:
>> > $> while true; do fio stress.fio; rm /root/test.dat; done
>> >
>> > I had this running over a weekend.
>> >
>> > Cheers
>> > Sebastian
>> >
>> > On Tuesday, May 02, 2017 02:51:06 PM Jason Dillaman wrote:
>> >> Can you share the fio job file that you utilized so I can attempt to repeat locally?
>> >>
>> >> On Tue, May 2, 2017 at 2:51 AM, nick <nick@xxxxxxx> wrote:
>> >> > Hi Jason,
>> >> > thanks for your feedback. I have now done some tests over the weekend to verify the memory overhead.
>> >> > I was using qemu 2.8 (taken from the Ubuntu Cloud Archive) with librbd 10.2.7 on Ubuntu 16.04 hosts. I suspected the ceph rbd cache to be the cause of the overhead, so I just generated a lot of IO with the help of fio in the VMs (with a data size of 80GB). All VMs had 3GB of memory. I had to run fio multiple times before reaching high RSS values.
>> >> > I also noticed that when using larger blocksizes during writes (like 4M) the memory overhead in the KVM process increased faster.
>> >> > I ran several fio tests (one after another) and the results are:
>> >> >
>> >> > KVM with writeback RBD cache: max. 85% memory overhead (2.5 GB overhead)
>> >> > KVM with writethrough RBD cache: max. 50% memory overhead
>> >> > KVM without RBD caching: less than 10% overhead all the time
>> >> > KVM with local storage (logical volume used): 8% overhead all the time
>> >> >
>> >> > I did not reach those >200% memory overhead results that we see on our live cluster, but those virtual machines have a much longer uptime as well.
>> >> >
>> >> > I also tried to reduce the RSS memory value with cache dropping on the physical host and in the VM. Neither led to any change. A reboot of the VM does not change anything either (a reboot inside the VM, not a new KVM process). So far the only way to reduce the RSS memory value is a live migration. Might this be a bug? The memory overhead seems a bit too high to me.
>> >> >
>> >> > Best Regards
>> >> > Sebastian
>> >> >
>> >> > On Thursday, April 27, 2017 10:08:36 AM you wrote:
>> >> >> I know we noticed high memory usage due to librados in the Ceph multipathd checker [1] -- on the order of hundreds of megabytes. That client was probably nearly as trivial as an application can get, and I just assumed it was due to large monitor maps being sent to the client for whatever reason. Since we changed course on our RBD iSCSI implementation, unfortunately the investigation into this high memory usage fell by the wayside.
>> >> >>
>> >> >> [1] http://git.opensvc.com/gitweb.cgi?p=multipath-tools/.git;a=blob;f=libmultipath/checkers/rbd.c;h=9ea0572f2b5bd41b80bf2601137b74f92bdc7278;hb=HEAD
>> >> >>
>> >> >> On Thu, Apr 27, 2017 at 5:26 AM, nick <nick@xxxxxxx> wrote:
>> >> >> > Hi Christian,
>> >> >> > thanks for your answer.
>> >> >> > The highest value I can see for a local storage VM in our infrastructure is a memory overhead of 39%. This is big, but the majority (>90%) of our local storage VMs are using less than 10% memory overhead.
>> >> >> > For ceph storage based VMs this looks quite different. The highest value I can currently see is 244% memory overhead. So that specific VM, with 3GB of allocated memory, is now using 10.3 GB RSS memory on the physical host. This is a really huge value. In general I can see that the majority of the ceph based VMs have more than 60% memory overhead.
>> >> >> >
>> >> >> > Maybe this is also a bug related to qemu+librbd. It would just be nice to know if other people are seeing those high values as well.
>> >> >> >
>> >> >> > Cheers
>> >> >> > Sebastian
>> >> >> >
>> >> >> > On Thursday, April 27, 2017 06:10:48 PM you wrote:
>> >> >> >> Hello,
>> >> >> >>
>> >> >> >> Definitely seeing about 20% overhead with Hammer as well, so not version specific from where I'm standing.
>> >> >> >>
>> >> >> >> While non-RBD storage VMs by and large tend to be closer to the specified size, I've seen them exceed it by a few % at times, too. For example a 4317968KB RSS one that ought to be 4GB.
>> >> >> >>
>> >> >> >> Regards,
>> >> >> >>
>> >> >> >> Christian
>> >> >> >>
>> >> >> >> On Thu, 27 Apr 2017 09:56:48 +0200 nick wrote:
>> >> >> >> > Hi,
>> >> >> >> > we are running a jewel ceph cluster which serves RBD volumes for our KVM virtual machines.
>> >> >> >> > Recently we noticed that our KVM machines use a lot more memory on the physical host system than what they should use. We collect the data with a python script which basically executes 'virsh dommemstat <virtual machine name>'. We also verified the results of the script against the memory stats in 'cat /proc/<kvm PID>/status' for each virtual machine, and the results are the same.
>> >> >> >> >
>> >> >> >> > Here is an excerpt for one physical host where all virtual machines have been running since yesterday (virtual machine names removed):
>> >> >> >> >
>> >> >> >> > """
>> >> >> >> > overhead    actual    percent_overhead  rss
>> >> >> >> > ----------  --------  ----------------  --------
>> >> >> >> > 423.8 MiB   2.0 GiB   20                2.4 GiB
>> >> >> >> > 460.1 MiB   4.0 GiB   11                4.4 GiB
>> >> >> >> > 471.5 MiB   1.0 GiB   46                1.5 GiB
>> >> >> >> > 472.6 MiB   4.0 GiB   11                4.5 GiB
>> >> >> >> > 681.9 MiB   8.0 GiB   8                 8.7 GiB
>> >> >> >> > 156.1 MiB   1.0 GiB   15                1.2 GiB
>> >> >> >> > 278.6 MiB   1.0 GiB   27                1.3 GiB
>> >> >> >> > 290.4 MiB   1.0 GiB   28                1.3 GiB
>> >> >> >> > 291.5 MiB   1.0 GiB   28                1.3 GiB
>> >> >> >> > 0.0 MiB     16.0 GiB  0                 13.7 GiB
>> >> >> >> > 294.7 MiB   1.0 GiB   28                1.3 GiB
>> >> >> >> > 135.6 MiB   1.0 GiB   13                1.1 GiB
>> >> >> >> > 0.0 MiB     2.0 GiB   0                 1.4 GiB
>> >> >> >> > 1.5 GiB     4.0 GiB   37                5.5 GiB
>> >> >> >> > """
>> >> >> >> >
>> >> >> >> > We are using the rbd client cache for our virtual machines, but it is set to only 128MB per machine. There is also only one rbd volume per virtual machine. We have seen more than 200% memory overhead per KVM machine on other physical machines. After a live migration of the virtual machine to another host the overhead is back to 0 and slowly increases back to high values.
>> >> >> >> >
>> >> >> >> > Here are our ceph.conf settings for the clients:
>> >> >> >> > """
>> >> >> >> > [client]
>> >> >> >> > rbd cache writethrough until flush = False
>> >> >> >> > rbd cache max dirty = 100663296
>> >> >> >> > rbd cache size = 134217728
>> >> >> >> > rbd cache target dirty = 50331648
>> >> >> >> > """
>> >> >> >> >
>> >> >> >> > We have noticed this behavior since we started using the jewel librbd libraries. We did not encounter this behavior when using the ceph infernalis librbd version. We also do not see this issue when using local storage instead of ceph.
>> >> >> >> >
>> >> >> >> > Some version information of the physical host which runs the KVM machines:
>> >> >> >> > """
>> >> >> >> > OS: Ubuntu 16.04
>> >> >> >> > kernel: 4.4.0-75-generic
>> >> >> >> > librbd: 10.2.7-1xenial
>> >> >> >> > """
>> >> >> >> >
>> >> >> >> > We did try to flush and invalidate the client cache via the ceph admin socket, but this did not change any memory usage values.
>> >> >> >> >
>> >> >> >> > Does anyone encounter similar issues or have an explanation for the high memory overhead?
>> >> >> >> >
>> >> >> >> > Best Regards
>> >> >> >> > Sebastian
>> >> >> >
>> >> >> > --
>> >> >> > Sebastian Nickel
>> >> >> > Nine Internet Solutions AG, Albisriederstr. 243a, CH-8047 Zuerich
>> >> >> > Tel +41 44 637 40 00 | Support +41 44 637 40 40 | www.nine.ch
>> >> >> > _______________________________________________
>> >> >> > ceph-users mailing list
>> >> >> > ceph-users@xxxxxxxxxxxxxx
>> >> >> > http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>> >> >
>> >> > --
>> >> > Sebastian Nickel
>> >> > Nine Internet Solutions AG, Albisriederstr. 243a, CH-8047 Zuerich
>> >> > Tel +41 44 637 40 00 | Support +41 44 637 40 40 | www.nine.ch
>> >
>> > --
>> > Sebastian Nickel
>> > Nine Internet Solutions AG, Albisriederstr. 243a, CH-8047 Zuerich
>> > Tel +41 44 637 40 00 | Support +41 44 637 40 40 | www.nine.ch
>
> --
> Sebastian Nickel
> Nine Internet Solutions AG, Albisriederstr. 243a, CH-8047 Zuerich
> Tel +41 44 637 40 00 | Support +41 44 637 40 40 | www.nine.ch

--
Jason

_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
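
For reference, the in-flight memory accounting Jason describes at the top of the thread can be written out explicitly. The following is a minimal back-of-the-envelope sketch in Python, assuming his example figures (4 MiB writes at a queue depth of 256); the real numbers depend on the guest workload and on the QEMU and librbd versions in use.

"""
# Rough estimate of the transient heap spike described above: the librbd
# cache does not bound in-flight IO, QEMU keeps a bounce-buffer copy, and
# librbd makes one more copy to take ownership of the data.
MiB = 1024 * 1024

block_size = 4 * MiB    # assumed guest write size
queue_depth = 256       # assumed number of writes in flight

in_flight = block_size * queue_depth   # data held for the librbd cache (~1 GiB)
bounce_buffer = in_flight              # QEMU bounce-buffer duplicate (~1 GiB)
librbd_copy = in_flight                # librbd ownership copy (~1 GiB)

total = in_flight + bounce_buffer + librbd_copy
print("transient heap spike: about %.1f GiB" % (total / float(1024 * MiB)))
"""

Because glibc typically keeps such freed heap around instead of returning it to the OS, a spike like this can show up as permanently increased RSS, which would match the long-lived overhead reported above.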
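The overhead table in the original post comes from comparing the memory a domain is supposed to use ('actual' from 'virsh dommemstat') with the RSS of the corresponding QEMU process. Below is a minimal sketch of that kind of collector, assuming the 'virsh' CLI is available and that the 'actual' and 'rss' fields are present (both reported in KiB); it is an illustration, not the script actually used.

"""
#!/usr/bin/env python3
# Sketch of an overhead collector: for every running libvirt domain, compare
# the balloon target ('actual') with the process RSS reported by dommemstat.
import subprocess

def dommemstat(domain):
    out = subprocess.check_output(["virsh", "dommemstat", domain]).decode()
    # Output is "name value" pairs, one per line, values in KiB.
    return {k: int(v) for k, v in (line.split() for line in out.splitlines() if line.strip())}

def main():
    domains = subprocess.check_output(["virsh", "list", "--name"]).decode().split()
    print("%-12s %-10s %-18s %s" % ("overhead", "actual", "percent_overhead", "rss"))
    for dom in domains:
        stats = dommemstat(dom)
        actual, rss = stats["actual"], stats["rss"]
        overhead = max(rss - actual, 0)
        print("%8.1f MiB %6.1f GiB %6d %10.1f GiB" % (
            overhead / 1024.0,                  # KiB -> MiB
            actual / (1024.0 * 1024.0),         # KiB -> GiB
            100 * overhead // actual,           # percent overhead
            rss / (1024.0 * 1024.0)))           # KiB -> GiB

if __name__ == "__main__":
    main()
"""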
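The cache flush/invalidate attempt mentioned in the original post goes through the librbd admin socket of the QEMU process. A hedged sketch follows; the socket path is a placeholder (it depends on the 'admin socket' option in the [client] section of ceph.conf), and the exact cache command names and arguments should be confirmed from the socket's own 'help' output before relying on them.

"""
# Sketch: list the commands exposed by a librbd client admin socket so the
# cache flush/invalidate commands can be located and invoked. The socket
# path below is hypothetical.
import subprocess

ASOK = "/var/run/ceph/ceph-client.admin.asok"   # placeholder path

def asok(*words):
    return subprocess.check_output(["ceph", "--admin-daemon", ASOK] + list(words)).decode()

# 'help' lists the commands registered on this socket; librbd registers
# per-image cache flush/invalidate entries when an RBD image is open.
print(asok("help"))

# Example calls, assuming a hypothetical open image 'rbd/vm-disk-1' and that
# 'help' shows matching command names:
# print(asok("rbd", "cache", "flush", "rbd/vm-disk-1"))
# print(asok("rbd", "cache", "invalidate", "rbd/vm-disk-1"))
"""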