This is a short^H^H^H^H^H long mail to introduce / walk through some recent developments in libvirt to support native Linux hosted container virtualization, using the kernel capabilities the people on this list have been adding in recent releases. We've been working on this for a few months now, but not really publicised it before, and I figure the people working on container virt extensions for Linux might be interested in how it is being used.

For those who aren't familiar with libvirt, it provides a stable API for managing virtualization hosts and their guests. It started with a Xen driver, and over time has evolved to add support for QEMU, KVM, OpenVZ and, most recently of all, a driver we're calling "LXC", short for "LinuX Containers". The key point is that no matter what hypervisor you are using, there is a consistent set of APIs and a standardized configuration format for userspace management applications in the host (and remote secure RPC to the host).

The LXC driver is the result of a combined effort from a number of people in the libvirt community; most notably Dave Leskovec contributed the original code, and Dan Smith now leads its development, along with my own contributions to its architecture to better integrate with libvirt.

We have a couple of goals in this work. Overall, libvirt wants to be the de facto standard, open source management API for all virtualization platforms, and native Linux virtualization capabilities are a strong focus. The LXC driver is attempting to provide a general purpose management solution for two container virt use cases:

 - Application workload isolation
 - Virtual private servers

In the first use case we want to provide the ability to run an application in the primary host OS with partial restrictions on its resource / service access. It will still run with the same root directory as the host OS, but its filesystem namespace may have some additional private mount points present. It may have a private network namespace to restrict its connectivity, and it will ultimately have restrictions on its resource usage (eg memory, CPU time, CPU affinity, I/O bandwidth).

In the second use case, we want to provide a completely virtualized operating system in the container (running on the host kernel, of course), akin to the capabilities of OpenVZ / Linux-VServer. The container will have a totally private root filesystem, a private networking namespace, whatever other namespace isolation the kernel provides, and again resource restrictions. Some people like to think of this as 'a better chroot than chroot'.

In terms of technical implementation, at its core is direct usage of the new clone() flags. By default all containers get created with CLONE_NEWPID, CLONE_NEWNS, CLONE_NEWUTS, CLONE_NEWUSER and CLONE_NEWIPC. If a private network config was requested they also get CLONE_NEWNET. For the workload isolation case, after creating the container we just add a number of filesystem mounts in the container's private FS namespace. In the VPS case, we'll do a pivot_root() onto the new root directory, and then add any extra filesystem mounts the container config requested. The stdin/out/err of the process leader in the container is bound to the slave end of a pseudo-TTY, with libvirt owning the master end so it can provide a virtual text console into the guest container. Once the basic container setup is complete, libvirt execs the so-called 'init' process. Things are set up such that when the 'init' process exits, the container is terminated / cleaned up.
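To make that concrete, here is a minimal, self-contained sketch of the startup sequence just described. To be clear, this is an illustration of the technique, not the actual libvirt code (that lives in src/lxc_container.c); error handling is trimmed and the mount / pivot_root step is reduced to a comment.

/* Illustrative sketch of the container startup flow; NOT the real
 * libvirt implementation. Build with: gcc -o mini-lxc mini-lxc.c */
#define _GNU_SOURCE
#include <sched.h>
#include <signal.h>
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>
#include <sys/wait.h>

#define STACK_SIZE (4 * 1024 * 1024)

static int container_child(void *arg)
{
    const char *init_path = arg;

    /* The real driver sets up private filesystem mounts here, and
     * for the VPS case does a pivot_root() onto the new root. The
     * child's stdin/out/err have already been bound to the slave
     * end of a pseudo-TTY before this point. */

    execl(init_path, "init", (char *)NULL);
    perror("execl");            /* only reached if exec fails */
    return -1;
}

int main(void)
{
    /* Namespaces every container gets; CLONE_NEWNET is added only
     * if a private network config was requested. */
    int flags = CLONE_NEWPID | CLONE_NEWNS | CLONE_NEWUTS |
                CLONE_NEWUSER | CLONE_NEWIPC;
    char *stack = malloc(STACK_SIZE);

    /* The stack grows down on most architectures, so pass the top */
    pid_t pid = clone(container_child, stack + STACK_SIZE,
                      flags | SIGCHLD, (void *)"/sbin/init");
    if (pid < 0) {
        perror("clone");
        return 1;
    }

    /* The container lives exactly as long as its 'init'; when that
     * process exits, the container is terminated and cleaned up. */
    waitpid(pid, NULL, 0);
    free(stack);
    return 0;
}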
On the host side, the libvirt LXC driver creates what we call a 'controller' process for each container. This is done with a small binary, /usr/libexec/libvirt_lxc. This is the process which owns the master end of the pseudo-TTY, along with a second pseudo-TTY pair. When the host admin wants to interact with the container, they use the command 'virsh console CONTAINER-NAME'. The LXC controller process takes care of forwarding I/O between the two slave PTYs, one slave opened by virsh console, the other being the container's stdin/out/err (a sketch of this forwarding loop appears below, after the networking notes). If you kill the controller, then the container also dies. Basically you can think of the libvirt_lxc controller as serving the equivalent purpose to the 'qemu' command for full machine virtualization: it provides the interface between host and guest, in this case just the container setup and access to the text console, and perhaps more in the future.

For networking, libvirt provides two core concepts:

 - Shared physical device. A bridge containing one of your physical network interfaces on the host, along with one or more of the guest vnet interfaces. The container thus appears as if it is directly on the LAN.

 - Virtual network. A bridge containing only guest vnet interfaces, and NO physical device from the host. IPtables and forwarding provide routed (and optionally NATed) connectivity to the LAN for guests.

The latter is particularly useful for machines without a permanent wired ethernet connection (eg laptops using wifi), as it lets guests talk to each other even when there is no active host network. Both of these network setups are fully supported in the LXC driver in the presence of a suitably new host kernel.
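Returning to the controller for a moment, here is a hypothetical sketch of the sort of select() based loop that shuttles bytes between the two sides of the console. The real logic lives in src/lxc_controller.c; this just shows the shape of it.

/* Hypothetical sketch of the controller's forwarding job: copy
 * bytes between two fds, e.g. the PTY connected to 'virsh console'
 * and the PTY serving as the container's stdin/out/err. */
#include <sys/select.h>
#include <unistd.h>

static int forward_io(int fd_a, int fd_b)
{
    char buf[1024];
    fd_set readfds;
    int maxfd = (fd_a > fd_b ? fd_a : fd_b) + 1;

    for (;;) {
        FD_ZERO(&readfds);
        FD_SET(fd_a, &readfds);
        FD_SET(fd_b, &readfds);
        if (select(maxfd, &readfds, NULL, NULL, NULL) < 0)
            return -1;

        /* Whatever arrives on one side is written to the other */
        if (FD_ISSET(fd_a, &readfds)) {
            ssize_t n = read(fd_a, buf, sizeof(buf));
            if (n <= 0)         /* EOF or error ends the session */
                return 0;
            if (write(fd_b, buf, n) < 0)
                return -1;
        }
        if (FD_ISSET(fd_b, &readfds)) {
            ssize_t n = read(fd_b, buf, sizeof(buf));
            if (n <= 0)
                return 0;
            if (write(fd_a, buf, n) < 0)
                return -1;
        }
    }
}

When either side hits EOF, for example the container's 'init' exiting or the admin disconnecting the console, the loop ends and the controller can tear things down.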
That's a 100ft overview, and the current functionality is working quite well from an architectural/technical point of view, but there is plenty more work we still need to do to provide a system mature enough for real world production deployment:

 - Integration with cgroups. Although I talked about resource restrictions, we've not implemented any of this yet. In the most immediate timeframe we want to use cgroups' device ACL support to prevent the container having any ability to access device nodes other than the usual suspects of /dev/{null,full,zero,console}, and possibly /dev/urandom. The other important one is to provide a memory cap across the entire container. CPU based resource control is a lower priority at the moment.

 - Efficient query of resource utilization. We need to be able to get the cumulative CPU time of all the processes inside the container, without having to iterate over every PID's /proc/$PID/stat file. I'm not sure how we'll do this yet. We want this data both summed across all CPUs, and per-CPU.

 - devpts virtualization. libvirt currently just bind mounts the host's /dev/pts into the container. Clearly this isn't a serious implementation. We've been monitoring the devpts namespace patches, and these look like they will provide the capabilities we need for the full virtual private server use case.

 - Network sysfs virtualization. libvirt can't currently use the CLONE_NEWNET flag on most Linux distros, since currently released kernels mark this capability as conflicting with SYSFS in Kconfig. Again, we're looking forward to seeing this addressed in the next kernel.

 - UID/GID virtualization. While we spawn all containers as root, applications inside the container may switch to unprivileged UIDs. We don't (necessarily) want users in the host with equivalent UIDs to be able to kill processes inside the container.

It would also be desirable to allow unprivileged users to create containers, without needing root on the host, while allowing them to be root (and any other user) inside their container. I'm not aware of anyone working on this kind of thing yet?

There are probably more things Dan Smith is thinking of, but that list is a good starting point.

Finally, a 30 second overview of actually using LXC with libvirt to create a simple VPS using busybox in its root fs...

- Create a simple chroot environment using busybox:

  mkdir /root/mycontainer
  mkdir /root/mycontainer/bin
  mkdir /root/mycontainer/sbin
  cp /sbin/busybox /root/mycontainer/sbin
  for cmd in sh ls chmod rm cat vi
  do
    ln -s ../sbin/busybox /root/mycontainer/bin/$cmd
  done
  cat > /root/mycontainer/sbin/init <<EOF
  #!/sbin/busybox sh
  EOF
  chmod +x /root/mycontainer/sbin/init

- Create a simple libvirt configuration file for the container, defining the root filesystem, the network connection (bridged to br0 in this case), and the path to the 'init' binary (defaults to /sbin/init if omitted):

  # cat > mycontainer.xml <<EOF
  <domain type='lxc'>
    <name>mycontainer</name>
    <memory>500000</memory>
    <os>
      <type>exe</type>
      <init>/sbin/init</init>
    </os>
    <devices>
      <filesystem type='mount'>
        <source dir='/root/mycontainer'/>
        <target dir='/'/>
      </filesystem>
      <interface type='bridge'>
        <source bridge='br0'/>
        <mac address='00:11:22:34:34:34'/>
      </interface>
      <console type='pty'/>
    </devices>
  </domain>
  EOF

- Load the configuration into libvirt:

  # virsh --connect lxc:/// define mycontainer.xml
  # virsh --connect lxc:/// list --inactive
   Id Name                 State
  ----------------------------------
    - mycontainer          shutdown

- Start the container and query some information about it:

  # virsh --connect lxc:/// start mycontainer
  # virsh --connect lxc:/// list
   Id    Name                 State
  ----------------------------------
   28407 mycontainer          running

  # virsh --connect lxc:/// dominfo mycontainer
  Id:             28407
  Name:           mycontainer
  UUID:           8369f1ac-7e46-e869-4ca5-759d51478066
  OS Type:        exe
  State:          running
  CPU(s):         1
  Max memory:     500000 kB
  Used memory:    500000 kB

  NB, the CPU/memory info here is not enforced yet.

- Interact with the container:

  # virsh --connect lxc:/// console mycontainer

  NB, use Ctrl+] to exit when done.

- Query the live config, eg to discover what PTY its console is connected to:

  # virsh --connect lxc:/// dumpxml mycontainer
  <domain type='lxc' id='28407'>
    <name>mycontainer</name>
    <uuid>8369f1ac-7e46-e869-4ca5-759d51478066</uuid>
    <memory>500000</memory>
    <currentMemory>500000</currentMemory>
    <vcpu>1</vcpu>
    <os>
      <type arch='i686'>exe</type>
      <init>/sbin/init</init>
    </os>
    <clock offset='utc'/>
    <on_poweroff>destroy</on_poweroff>
    <on_reboot>restart</on_reboot>
    <on_crash>destroy</on_crash>
    <devices>
      <filesystem type='mount'>
        <source dir='/root/mycontainer'/>
        <target dir='/'/>
      </filesystem>
      <console type='pty' tty='/dev/pts/22'>
        <source path='/dev/pts/22'/>
        <target port='0'/>
      </console>
    </devices>
  </domain>

- Shutdown the container:

  # virsh --connect lxc:/// destroy mycontainer

There is lots more I could say, but hopefully this serves as a useful introduction to the LXC work in libvirt, and how it is making use of the kernel's container based virtualization support. For those interested in finding out more, all the source is in the libvirt CVS repo, the files being those named src/lxc_conf.c, src/lxc_container.c, src/lxc_controller.c and src/lxc_driver.c.
Releases can be downloaded from

  http://libvirt.org/downloads.html

or via the GIT mirror of our CVS repo:

  git clone git://git.et.redhat.com/libvirt.git

Regards,
Daniel

-- 
|: Red Hat, Engineering, London   -o-   http://people.redhat.com/berrange/ :|
|: http://libvirt.org  -o-  http://virt-manager.org  -o-  http://ovirt.org :|
|: http://autobuild.org       -o-       http://search.cpan.org/~danberr/ :|
|: GnuPG: 7D3B9505  -o-  F3C9 553F A1DA 4AC2 5648 23C1 B3DF F742 7D3B 9505 :|