Re: [libvirt PATCH v3] cgroup/LXC: Do not condition availability of v2 by controllers

Pavel Hrdina <phrdina@xxxxxxxxxx> · Tue, 25 Oct 2022 11:33:56 +0200

On Mon, Oct 24, 2022 at 05:52:04PM +0200, Michal Koutný wrote:
> Hello.
> 
> (Sorry for a stitched mail below, I'm not subscribed to the ML so this is
> what I got from public archives. Please, keep me Cced.)
> 
> On 10/24/22 13:54, Pavel Hrdina wrote:
> > I don't like this at all and IMO this is incorrect fix of the issue you
> > are trying to address. In hybrid mode with systemd the cgroup v2
> > controller is not a real controller. It's something systemd uses for
> > process tracking and some other features. It is owned by systemd and we
> > should not touch it directly at all. 
> 
> I very much agree with this. Despite that I suggested the posted patch
> [1] because libvirt/lxc apparently violates systemd rules of exclusive
> access to cgroup hierarchies already. (At least that's how I understand
> the migration between libvirtd.service and target
> machine.slice/.../*.scope.) The patch is meant as a quick (and
> admittedly dirty) fix to the issue that users have been reporting with
> libvirt/lxc in the hybrid mode for several months.

Unfortunately this breaks a lot of things and we cannot use it even as
workaround. Libvirt code has few assumptions as that's how unified mode
works and when the cgroups v2 backend is enabled for hybrid topology we
assume that cpuacct, devices controllers are enabled. But in the case of
hybrid topology they are enabled by cgroups v1. Due to this change we no
longer limit access to devices and cpu accounting no longer works.

> > We need to use proper systemd APIs to make any changes to that
> > directory or if needed ask systemd to create cgroup with Delegate=yes
> > which in this case is probably also not the correct approach.
> 
> Yes, as I wrote in the commit message, there seems to be some
> inconsistency in what libvirt core vs libvirt/lxc uses to access
> cgroups. Using always systemd API would seem most consistent but I'd
> retain the current patch as a workaround provided it doesn't break
> things more (than the current state). 
> 
> On 10/24/22 13:28, Pavel Hrdina wrote:
> > What I've seen is that hybrid systemd environments have:
> > 
> >     /sys/fs/cgroup/unified
> >     /sys/fs/cgroup/memory
> >     /sys/fs/cgroup/blkio
> >     ...
> > 
> > but it doesn't mean that it is mounted in V1 cgroup filesystem.
> > 
> > In this case the /sys/fs/cgroup is tmpfs filesystem and there is nothing
> > mounted over V1.
> 
> As a more detailed explanation why the reversed order works:
> It relies on the tmpfs mount over the ro-path in the v1
> virCgroupBindMount so that ../unified mount directory can be created
> for the v2 mount.
> 
> (IIUC, this is related to the environment inside the container where
> libvirt/lxc attempts to prepare hybrid-like cgroup setup.)

I did a bit of digging and now I see what the issue is. The commit
message wording and explanation here was not enough for me to understand
the problem.

In the virCgroupV1BindMount we have a code to create /sys/fs/cgroup
directory before we start mounting the controllers, however, in the
virCgroupV2BindMount there is no code to do that as cgroups v2 in
unified mode are mounted directly at /sys/fs/cgroup.

But with the hybrid mode this doesn't work because we try to mount
/sys/fs/cgroup/unified without the /sys/fs/cgroup directory existing
and it fails.

This workaround works but I would still rather fix the root of the issue
by modifying both functions where they would create the necessary
directory only if it doesn't exist so there is no need to care about the
call order.

I'm going to send a patch to revert this one and we need to find yet
another solution to the issue.

Just a note for future patches, single patch should fix only one issue
so this patch should have been split into two patches.

Pavel

> HTH,
> Michal
> 
> [1] v1 in https://bugzilla.opensuse.org/show_bug.cgi?id=1183247#c35

Attachment:
signature.asc

Description: PGP signature