Hi,

we are running a Jewel Ceph cluster which serves RBD volumes for our KVM virtual machines. Recently we noticed that our KVM machines use a lot more memory on the physical host system than they should.

We collect the data with a Python script which basically executes 'virsh dommemstat <virtual machine name>'. We also verified the results of the script against the memory stats in 'cat /proc/<kvm PID>/status' for each virtual machine, and the results are the same.

Here is an excerpt for one physical host where all virtual machines have been running since yesterday (virtual machine names removed):

"""
overhead     actual     percent_overhead   rss
----------   --------   ----------------   --------
423.8 MiB    2.0 GiB    20                 2.4 GiB
460.1 MiB    4.0 GiB    11                 4.4 GiB
471.5 MiB    1.0 GiB    46                 1.5 GiB
472.6 MiB    4.0 GiB    11                 4.5 GiB
681.9 MiB    8.0 GiB    8                  8.7 GiB
156.1 MiB    1.0 GiB    15                 1.2 GiB
278.6 MiB    1.0 GiB    27                 1.3 GiB
290.4 MiB    1.0 GiB    28                 1.3 GiB
291.5 MiB    1.0 GiB    28                 1.3 GiB
0.0 MiB      16.0 GiB   0                  13.7 GiB
294.7 MiB    1.0 GiB    28                 1.3 GiB
135.6 MiB    1.0 GiB    13                 1.1 GiB
0.0 MiB      2.0 GiB    0                  1.4 GiB
1.5 GiB      4.0 GiB    37                 5.5 GiB
"""

We are using the RBD client cache for our virtual machines, but it is set to only 128 MB per machine, and there is only one RBD volume per virtual machine. On other physical hosts we have seen more than 200% memory overhead per KVM machine. After a live migration of a virtual machine to another host the overhead drops back to 0 and then slowly climbs back to high values.

Here are our ceph.conf settings for the clients:

"""
[client]
rbd cache writethrough until flush = False
rbd cache max dirty = 100663296
rbd cache size = 134217728
rbd cache target dirty = 50331648
"""

We have noticed this behaviour since we started using the Jewel librbd libraries; we did not encounter it with the Infernalis librbd version. We also do not see the issue when using local storage instead of Ceph.

Some version information for the physical host which runs the KVM machines:

"""
OS: Ubuntu 16.04
kernel: 4.4.0-75-generic
librbd: 10.2.7-1xenial
"""

We did try to flush and invalidate the client cache via the Ceph admin socket, but this did not change the memory usage values.

Has anyone encountered similar issues, or does anyone have an explanation for the high memory overhead?

Best Regards
Sebastian
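PS: for anyone who wants to reproduce the numbers above, here is a minimal sketch of the collection logic (a simplified example, not our actual script). It assumes that 'virsh dommemstat' reports 'actual' and 'rss' in KiB and defines overhead as rss minus actual:

"""
#!/usr/bin/env python
# Simplified sketch: compare the memory assigned to a guest ('actual')
# with the RSS of its qemu process, both as reported by 'virsh dommemstat'.
import subprocess
import sys

def dommemstat(domain):
    """Return 'virsh dommemstat <domain>' output as a dict of KiB values."""
    out = subprocess.check_output(['virsh', 'dommemstat', domain]).decode()
    stats = {}
    for line in out.splitlines():
        parts = line.split()
        if len(parts) == 2:
            stats[parts[0]] = int(parts[1])
    return stats

def overhead(domain):
    """Overhead = qemu RSS minus the memory assigned to the guest (clamped at 0)."""
    stats = dommemstat(domain)
    actual_kib = stats['actual']   # balloon size in KiB
    rss_kib = stats['rss']         # resident set size in KiB (needs libvirt RSS reporting)
    over_kib = max(rss_kib - actual_kib, 0)
    return {
        'actual_mib': actual_kib / 1024.0,
        'rss_mib': rss_kib / 1024.0,
        'overhead_mib': over_kib / 1024.0,
        'percent_overhead': 100 * over_kib // actual_kib,
    }

if __name__ == '__main__':
    for name in sys.argv[1:]:
        o = overhead(name)
        print('%-20s actual %8.1f MiB  rss %8.1f MiB  overhead %8.1f MiB (%d%%)'
              % (name, o['actual_mib'], o['rss_mib'], o['overhead_mib'],
                 o['percent_overhead']))
"""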