This is a short^H^H^H^H^H long mail to introduce / walk through some recent developments in libvirt to support native Linux hosted container virtualization, using the kernel capabilities the people on this list have been adding in recent releases. We've been working on this for a few months now, but not really publicised it before, and I figure the people working on container virt extensions for Linux might be interested in how it is being used.

For those who aren't familiar with libvirt, it provides a stable API for managing virtualization hosts and their guests. It started with a Xen driver, and over time has evolved to add support for QEMU, KVM, OpenVZ and, most recently of all, a driver we're calling "LXC", short for "LinuX Containers". The key point is that no matter what hypervisor you are using, there is a consistent set of APIs and a standardized configuration format for userspace management applications in the host (and remote secure RPC to the host).

The LXC driver is the result of a combined effort from a number of people in the libvirt community; most notably Dave Leskovec contributed the original code, and Dan Smith now leads its development, along with my own contributions to its architecture to better integrate with libvirt.

We have a couple of goals in this work. Overall, libvirt wants to be the de facto standard, open source management API for all virtualization platforms, and native Linux virtualization capabilities are a strong focus. The LXC driver is attempting to provide a general purpose management solution for two container virt use cases:

 - Application workload isolation
 - Virtual private servers

In the first use case we want to provide the ability to run an application in the primary host OS with partial restrictions on its resource / service access. It will still run with the same root directory as the host OS, but its filesystem namespace may have some additional private mount points present. It may have a private network namespace to restrict its connectivity, and it will ultimately have restrictions on its resource usage (eg memory, CPU time, CPU affinity, I/O bandwidth).

In the second use case, we want to provide a completely virtualized operating system in the container (running on the host kernel, of course), akin to the capabilities of OpenVZ / Linux-VServer. The container will have a totally private root filesystem, a private networking namespace, whatever other namespace isolation the kernel provides, and again resource restrictions. Some people like to think of this as 'a better chroot than chroot'.

In terms of technical implementation, at its core is direct usage of the new clone() flags. By default all containers get created with CLONE_NEWPID, CLONE_NEWNS, CLONE_NEWUTS, CLONE_NEWUSER and CLONE_NEWIPC. If a private network config was requested they also get CLONE_NEWNET. For the workload isolation case, after creating the container we just add a number of filesystem mounts in the container's private FS namespace. In the VPS case, we'll do a pivot_root() onto the new root directory, and then add any extra filesystem mounts the container config requested. The stdin/out/err of the process leader in the container is bound to the slave end of a pseudo-TTY, with libvirt owning the master end so it can provide a virtual text console into the guest container. Once the basic container setup is complete, libvirt execs the so-called 'init' process. Things are set up such that when the 'init' process exits, the container is terminated / cleaned up.
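To make that concrete, here is a minimal, self-contained sketch of the startup sequence just described. To be clear, this is an illustration of the technique, not the actual libvirt code (that lives in src/lxc_container.c); error handling is trimmed and the mount / pivot_root step is reduced to a comment.

/* Illustrative sketch of the container startup flow; NOT the real
 * libvirt implementation. Build with: gcc -o mini-lxc mini-lxc.c */
#define _GNU_SOURCE
#include <sched.h>
#include <signal.h>
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>
#include <sys/wait.h>

#define STACK_SIZE (4 * 1024 * 1024)

static int container_child(void *arg)
{
    const char *init_path = arg;

    /* The real driver sets up private filesystem mounts here, and
     * for the VPS case does a pivot_root() onto the new root. The
     * child's stdin/out/err have already been bound to the slave
     * end of a pseudo-TTY before this point. */

    execl(init_path, "init", (char *)NULL);
    perror("execl");            /* only reached if exec fails */
    return -1;
}

int main(void)
{
    /* Namespaces every container gets; CLONE_NEWNET is added only
     * if a private network config was requested. */
    int flags = CLONE_NEWPID | CLONE_NEWNS | CLONE_NEWUTS |
                CLONE_NEWUSER | CLONE_NEWIPC;
    char *stack = malloc(STACK_SIZE);

    /* The stack grows down on most architectures, so pass the top */
    pid_t pid = clone(container_child, stack + STACK_SIZE,
                      flags | SIGCHLD, (void *)"/sbin/init");
    if (pid < 0) {
        perror("clone");
        return 1;
    }

    /* The container lives exactly as long as its 'init'; when that
     * process exits, the container is terminated and cleaned up. */
    waitpid(pid, NULL, 0);
    free(stack);
    return 0;
}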
On the host side, the libvirt LXC driver creates what we call a 'controller' process for each container. This is done with a small binary, /usr/libexec/libvirt_lxc. This is the process which owns the master end of the pseudo-TTY, along with a second pseudo-TTY pair. When the host admin wants to interact with the container, they use the command 'virsh console CONTAINER-NAME'. The LXC controller process takes care of forwarding I/O between the two slave PTYs, one slave opened by virsh console, the other being the container's stdin/out/err (a sketch of this forwarding loop appears below, after the networking notes). If you kill the controller, then the container also dies. Basically you can think of the libvirt_lxc controller as serving the equivalent purpose to the 'qemu' command for full machine virtualization: it provides the interface between host and guest, in this case just the container setup and access to the text console, and perhaps more in the future.

For networking, libvirt provides two core concepts:

 - Shared physical device. A bridge containing one of your physical network interfaces on the host, along with one or more of the guest vnet interfaces. The container thus appears as if it is directly on the LAN.

 - Virtual network. A bridge containing only guest vnet interfaces, and NO physical device from the host. IPtables and forwarding provide routed (and optionally NATed) connectivity to the LAN for guests.

The latter is particularly useful for machines without a permanent wired ethernet connection (eg laptops using wifi), as it lets guests talk to each other even when there is no active host network. Both of these network setups are fully supported in the LXC driver in the presence of a suitably new host kernel.
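Returning to the controller for a moment, here is a hypothetical sketch of the sort of select() based loop that shuttles bytes between the two sides of the console. The real logic lives in src/lxc_controller.c; this just shows the shape of it.

/* Hypothetical sketch of the controller's forwarding job: copy
 * bytes between two fds, e.g. the PTY connected to 'virsh console'
 * and the PTY serving as the container's stdin/out/err. */
#include <sys/select.h>
#include <unistd.h>

static int forward_io(int fd_a, int fd_b)
{
    char buf[1024];
    fd_set readfds;
    int maxfd = (fd_a > fd_b ? fd_a : fd_b) + 1;

    for (;;) {
        FD_ZERO(&readfds);
        FD_SET(fd_a, &readfds);
        FD_SET(fd_b, &readfds);
        if (select(maxfd, &readfds, NULL, NULL, NULL) < 0)
            return -1;

        /* Whatever arrives on one side is written to the other */
        if (FD_ISSET(fd_a, &readfds)) {
            ssize_t n = read(fd_a, buf, sizeof(buf));
            if (n <= 0)         /* EOF or error ends the session */
                return 0;
            if (write(fd_b, buf, n) < 0)
                return -1;
        }
        if (FD_ISSET(fd_b, &readfds)) {
            ssize_t n = read(fd_b, buf, sizeof(buf));
            if (n <= 0)
                return 0;
            if (write(fd_a, buf, n) < 0)
                return -1;
        }
    }
}

When either side hits EOF, for example the container's 'init' exiting or the admin disconnecting the console, the loop ends and the controller can tear things down.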
That's a 100ft overview, and the current functionality is working quite well from an architectural/technical point of view, but there is plenty more work we still need to do to provide a system mature enough for real world production deployment:

 - Integration with cgroups. Although I talked about resource restrictions, we've not implemented any of this yet. In the most immediate timeframe we want to use cgroups' device ACL support to prevent the container having any ability to access device nodes other than the usual suspects of /dev/{null,full,zero,console}, and possibly /dev/urandom. The other important one is to provide a memory cap across the entire container. CPU based resource control is a lower priority at the moment.

 - Efficient query of resource utilization. We need to be able to get the cumulative CPU time of all the processes inside the container, without having to iterate over every PID's /proc/$PID/stat file. I'm not sure how we'll do this yet. We want this data both summed across all CPUs, and per-CPU.

 - devpts virtualization. libvirt currently just bind mounts the host's /dev/pts into the container. Clearly this isn't a serious implementation. We've been monitoring the devpts namespace patches, and these look like they will provide the capabilities we need for the full virtual private server use case.

 - Network sysfs virtualization. libvirt can't currently use the CLONE_NEWNET flag on most Linux distros, since currently released kernels mark this capability as conflicting with SYSFS in Kconfig. Again, we're looking forward to seeing this addressed in the next kernel.

 - UID/GID virtualization. While we spawn all containers as root, applications inside the container may switch to unprivileged UIDs. We don't (necessarily) want users in the host with equivalent UIDs to be able to kill processes inside the container.

It would also be desirable to allow unprivileged users to create containers, without needing root on the host, while allowing them to be root (and any other user) inside their container. I'm not aware of anyone working on this kind of thing yet?

There are probably more things Dan Smith is thinking of, but that list is a good starting point.

Finally, a 30 second overview of actually using LXC with libvirt to create a simple VPS using busybox in its root fs...

- Create a simple chroot environment using busybox:

  mkdir /root/mycontainer
  mkdir /root/mycontainer/bin
  mkdir /root/mycontainer/sbin
  cp /sbin/busybox /root/mycontainer/sbin
  for cmd in sh ls chmod rm cat vi
  do
    ln -s ../sbin/busybox /root/mycontainer/bin/$cmd
  done
  cat > /root/mycontainer/sbin/init <<EOF
  #!/sbin/busybox sh
  EOF
  chmod +x /root/mycontainer/sbin/init

- Create a simple libvirt configuration file for the container, defining the root filesystem, the network connection (bridged to br0 in this case), and the path to the 'init' binary (defaults to /sbin/init if omitted):

  # cat > mycontainer.xml <<EOF
  <domain type='lxc'>
    <name>mycontainer</name>
    <memory>500000</memory>
    <os>
      <type>exe</type>
      <init>/sbin/init</init>
    </os>
    <devices>
      <filesystem type='mount'>
        <source dir='/root/mycontainer'/>
        <target dir='/'/>
      </filesystem>
      <interface type='bridge'>
        <source bridge='br0'/>
        <mac address='00:11:22:34:34:34'/>
      </interface>
      <console type='pty'/>
    </devices>
  </domain>
  EOF

- Load the configuration into libvirt:

  # virsh --connect lxc:/// define mycontainer.xml
  # virsh --connect lxc:/// list --inactive
   Id Name                 State
  ----------------------------------
    - mycontainer          shutdown

- Start the container and query some information about it:

  # virsh --connect lxc:/// start mycontainer
  # virsh --connect lxc:/// list
   Id    Name                 State
  ----------------------------------
   28407 mycontainer          running

  # virsh --connect lxc:/// dominfo mycontainer
  Id:             28407
  Name:           mycontainer
  UUID:           8369f1ac-7e46-e869-4ca5-759d51478066
  OS Type:        exe
  State:          running
  CPU(s):         1
  Max memory:     500000 kB
  Used memory:    500000 kB

  NB, the CPU/memory info here is not enforced yet.

- Interact with the container:

  # virsh --connect lxc:/// console mycontainer

  NB, use Ctrl+] to exit when done.

- Query the live config, eg to discover what PTY its console is connected to:

  # virsh --connect lxc:/// dumpxml mycontainer
  <domain type='lxc' id='28407'>
    <name>mycontainer</name>
    <uuid>8369f1ac-7e46-e869-4ca5-759d51478066</uuid>
    <memory>500000</memory>
    <currentMemory>500000</currentMemory>
    <vcpu>1</vcpu>
    <os>
      <type arch='i686'>exe</type>
      <init>/sbin/init</init>
    </os>
    <clock offset='utc'/>
    <on_poweroff>destroy</on_poweroff>
    <on_reboot>restart</on_reboot>
    <on_crash>destroy</on_crash>
    <devices>
      <filesystem type='mount'>
        <source dir='/root/mycontainer'/>
        <target dir='/'/>
      </filesystem>
      <console type='pty' tty='/dev/pts/22'>
        <source path='/dev/pts/22'/>
        <target port='0'/>
      </console>
    </devices>
  </domain>

- Shutdown the container:

  # virsh --connect lxc:/// destroy mycontainer

There is lots more I could say, but hopefully this serves as a useful introduction to the LXC work in libvirt, and how it is making use of the kernel's container based virtualization support. For those interested in finding out more, all the source is in the libvirt CVS repo, the files being those named src/lxc_conf.c, src/lxc_container.c, src/lxc_controller.c and src/lxc_driver.c.
Releases can be downloaded from

  http://libvirt.org/downloads.html

or via the GIT mirror of our CVS repo:

  git clone git://git.et.redhat.com/libvirt.git

Regards,
Daniel

-- 
|: Red Hat, Engineering, London   -o-   http://people.redhat.com/berrange/ :|
|: http://libvirt.org  -o-  http://virt-manager.org  -o-  http://ovirt.org :|
|: http://autobuild.org       -o-       http://search.cpan.org/~danberr/ :|
|: GnuPG: 7D3B9505  -o-  F3C9 553F A1DA 4AC2 5648 23C1 B3DF F742 7D3B 9505 :|