Hi!
The /cgroup/* mount point is probably a RHEL 6 thing; recent distributions seem to use /sys/fs/cgroup as in your case (maybe because of systemd?). On RHEL 6 the mount points are configured in /etc/cgconfig.conf and /cgroup is the default.

I also saw your pull request on GitHub and I don’t think I’ll merge it, because creating the directory when the parent does not exist could mask the non-existence of cgroups or a different mount point - I think it’s better to fail and leave it up to the admin to modify the script. A more mature solution would probably be some sort of OS-specific integration (automatic cgclassify rules, init-scripted cgroup creation and such); once that support is in place, maintainers only need to integrate it. In newer distros a newer kernel (scheduler) with more NUMA awareness and other autotuning could do a better job than this script by default.

And if any Ceph devs are listening: I saw an issue on the Ceph tracker for cgroup classification (http://tracker.ceph.com/issues/12424) and I humbly advise you not to do that - it will either turn into something distro-specific or create an Inner Platform Effect on every distro, which downstream maintainers will need to replace with their own anyway. Of course, since Inktank is now somewhat part of Red Hat, it makes sense to integrate it into the RHOS, RHEV and Ceph packages for RHEL and make a profile for “tuned” or whatever does the tuning magic.

Btw, has anybody else tried it? What are your results? We still use it and it makes a big difference on NUMA systems - an even bigger one when mixed with KVM guests on the same hardware.

Thanks
Jan

> On 27 Jul 2015, at 13:23, Saverio Proto <zioproto@xxxxxxxxx> wrote:
> 
> Hello Jan,
> 
> I am testing your scripts, because we also want to test OSDs and VMs
> on the same server.
> 
> I am new to cgroups, so this might be a very newbie question.
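
A quick way to find where the cpuset hierarchy lives on a given distro, whichever layout it uses (a minimal sketch: the cgroup writes shown in the comments need root, and the “osd_node0” group name and core/node numbers are illustrative, not from our scripts):

```shell
#!/bin/sh
# Locate the cpuset controller regardless of distro layout:
# RHEL 6 typically mounts it at /cgroup/cpuset, newer distros
# at /sys/fs/cgroup/cpuset.
CPUSET_MNT=$(awk '$3 == "cgroup" && $4 ~ /cpuset/ {print $2; exit}' /proc/mounts 2>/dev/null)
CPUSET_MNT=${CPUSET_MNT:-/sys/fs/cgroup/cpuset}   # fall back to the modern default
echo "cpuset hierarchy: $CPUSET_MNT"

# With that in hand, pinning a pid into a per-NUMA-node cpuset would look
# like this (needs root; names and numbers are illustrative):
#   mkdir "$CPUSET_MNT/osd_node0"
#   echo 0-5 > "$CPUSET_MNT/osd_node0/cpuset.cpus"   # cores of node 0
#   echo 0   > "$CPUSET_MNT/osd_node0/cpuset.mems"   # memory of node 0
#   echo "$OSD_PID" > "$CPUSET_MNT/osd_node0/tasks"
```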
> In your script you always reference the file
> /cgroup/cpuset/libvirt/cpuset.cpus
> 
> but I have the file in /sys/fs/cgroup/cpuset/libvirt/cpuset.cpus
> 
> I am working on Ubuntu 14.04.
> 
> Does this difference come from something special in your setup, or
> from us working on different Linux distributions?
> 
> Thanks for the clarification.
> 
> Saverio
> 
> 
> 2015-06-30 17:50 GMT+02:00 Jan Schermer <jan@xxxxxxxxxxx>:
>> Hi all,
>> our script is available on GitHub:
>> 
>> https://github.com/prozeta/pincpus
>> 
>> I haven’t had much time to do a proper README, but I hope the
>> configuration is self-explanatory enough for now. What it does is pin
>> each OSD into the most “empty” cgroup assigned to a NUMA node.
>> 
>> Let me know how it works for you!
>> 
>> Jan
>> 
>> 
>> On 30 Jun 2015, at 10:50, Huang Zhiteng <winston.d@xxxxxxxxx> wrote:
>> 
>> On Tue, Jun 30, 2015 at 4:25 PM, Jan Schermer <jan@xxxxxxxxxxx> wrote:
>>> 
>>> Not having OSDs and KVMs compete against each other is one thing,
>>> but there are more reasons to do this:
>>> 
>>> 1) not moving processes and threads between cores as much (better
>>> cache utilization)
>>> 2) aligning processes with memory on NUMA systems (that means all
>>> modern dual-socket systems) - you don’t want your OSD running on CPU1
>>> with memory allocated to CPU2
>>> 3) the same goes for other resources like NICs or storage controllers -
>>> but that’s less important and not always practical to do
>>> 4) you can limit the scheduling domain on Linux if you limit the cpuset
>>> for your OSDs (I’m not sure how important this is, just best practice)
>>> 5) you can easily limit memory or CPU usage and set priorities, with
>>> much greater granularity than without cgroups
>>> 6) if you have HyperThreading enabled you get the most gain when the
>>> workloads on the two threads are dissimilar - so for the highest
>>> throughput you would pin the OSD to thread1 and KVM to thread2 on the
>>> same core.
>>> We’re not doing that because the latency and performance of the core
>>> can vary depending on what the other thread is doing. But it might be
>>> useful to someone.
>>> 
>>> Some workloads exhibit a >100% performance gain when everything aligns
>>> in a NUMA system, compared to SMP mode on the same hardware. You likely
>>> won’t notice it on light workloads, as the interconnects (QPI) are very
>>> fast and there’s a lot of bandwidth, but for stuff like big OLAP
>>> databases or other data-manipulation workloads there’s a huge
>>> difference. And with Ceph being CPU hungry and memory intensive, we’re
>>> seeing some big gains here just by co-locating the memory with the
>>> processes…
>> 
>> Could you elaborate a bit on this? I’m interested to learn in what
>> situations memory locality helps Ceph, and to what extent.
>>> 
>>> Jan
>>> 
>>> 
>>> On 30 Jun 2015, at 08:12, Ray Sun <xiaoquqi@xxxxxxxxx> wrote:
>>> 
>>> Sounds great - if there’s any update, please let me know.
>>> 
>>> Best Regards
>>> -- Ray
>>> 
>>> On Tue, Jun 30, 2015 at 1:46 AM, Jan Schermer <jan@xxxxxxxxxxx> wrote:
>>>> 
>>>> I promised you all our scripts for automatic cgroup assignment - they
>>>> are already in production and I just need to put them on GitHub; stay
>>>> tuned tomorrow :-)
>>>> 
>>>> Jan
>>>> 
>>>> 
>>>> On 29 Jun 2015, at 19:41, Somnath Roy <Somnath.Roy@xxxxxxxxxxx> wrote:
>>>> 
>>>> Presently, you have to do it by using a tool like ‘taskset’ or ‘numactl’…
>>>> 
>>>> Thanks & Regards
>>>> Somnath
>>>> 
>>>> From: ceph-users [mailto:ceph-users-bounces@xxxxxxxxxxxxxx] On Behalf
>>>> Of Ray Sun
>>>> Sent: Monday, June 29, 2015 9:19 AM
>>>> To: ceph-users@xxxxxxxxxxxxxx
>>>> Subject: How to use cgroup to bind ceph-osd to a specific
>>>> cpu core?
>>>> 
>>>> Cephers,
>>>> I want to bind each of my ceph-osd processes to a specific CPU core,
>>>> but I didn’t find any document explaining how; could anyone provide me
>>>> some detailed information? Thanks.
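
For a one-off, the taskset/numactl route mentioned above works without any cgroup setup. A minimal sketch (assuming util-linux taskset, numactl and pgrep are available; the pgrep pattern, paths and core/node numbers are illustrative):

```shell
#!/bin/sh
# Exit quietly if taskset is not installed on this box.
command -v taskset >/dev/null 2>&1 || { echo "taskset not installed"; exit 0; }

# Find the pid of a hypothetical osd.0; fall back to this shell's own pid
# so the command below can be tried harmlessly on any machine.
OSD_PID=$(pgrep -f 'ceph-osd -i 0' 2>/dev/null | head -n1)
OSD_PID=${OSD_PID:-$$}

taskset -p "$OSD_PID"       # show the process's current CPU affinity mask

# Pinning it to, say, cores 0-5 (one NUMA node) would be:
#   taskset -pc 0-5 "$OSD_PID"
#
# taskset cannot bind memory; numactl can do both when starting a process.
# Launching an OSD aligned with NUMA node 0 would look something like:
#   numactl --cpunodebind=0 --membind=0 /usr/bin/ceph-osd -i 0 -c /etc/ceph/ceph.conf
```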
>>>> 
>>>> Currently, my ceph is running like this:
>>>> 
>>>> root 28692 1 0 Jun23 ? 00:37:26 /usr/bin/ceph-mon -i seed.econe.com --pid-file /var/run/ceph/mon.seed.econe.com.pid -c /etc/ceph/ceph.conf --cluster ceph
>>>> root 40063 1 1 Jun23 ? 02:13:31 /usr/bin/ceph-osd -i 0 --pid-file /var/run/ceph/osd.0.pid -c /etc/ceph/ceph.conf --cluster ceph
>>>> root 42096 1 0 Jun23 ? 01:33:42 /usr/bin/ceph-osd -i 1 --pid-file /var/run/ceph/osd.1.pid -c /etc/ceph/ceph.conf --cluster ceph
>>>> root 43263 1 0 Jun23 ? 01:22:59 /usr/bin/ceph-osd -i 2 --pid-file /var/run/ceph/osd.2.pid -c /etc/ceph/ceph.conf --cluster ceph
>>>> root 44527 1 0 Jun23 ? 01:16:53 /usr/bin/ceph-osd -i 3 --pid-file /var/run/ceph/osd.3.pid -c /etc/ceph/ceph.conf --cluster ceph
>>>> root 45863 1 0 Jun23 ? 01:25:18 /usr/bin/ceph-osd -i 4 --pid-file /var/run/ceph/osd.4.pid -c /etc/ceph/ceph.conf --cluster ceph
>>>> root 47462 1 0 Jun23 ? 01:20:36 /usr/bin/ceph-osd -i 5 --pid-file /var/run/ceph/osd.5.pid -c /etc/ceph/ceph.conf --cluster ceph
>>>> 
>>>> Best Regards
>>>> -- Ray
>>>> 
>>>> _______________________________________________
>>>> ceph-users mailing list
>>>> ceph-users@xxxxxxxxxxxxxx
>>>> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>>> 
>> 
>> 
>> --
>> Regards
>> Huang Zhiteng