Re: [PATCH 5/6] qemu: Leave cpuset.mems in parent cgroup alone

Martin Kletzander <mkletzan@xxxxxxxxxx> · Tue, 23 Dec 2014 12:40:23 +0100

On Mon, Dec 22, 2014 at 10:45:33AM +0100, Martin Kletzander wrote:
On Wed, Dec 17, 2014 at 04:59:30PM -0700, Eric Blake wrote:
On 12/17/2014 08:06 AM, Martin Kletzander wrote:
On Wed, Dec 17, 2014 at 12:00:36AM -0700, Eric Blake wrote:
On 12/16/2014 11:51 PM, Eric Blake wrote:
On 12/15/2014 12:58 AM, Martin Kletzander wrote:
Instead of setting the value of cpuset.mems once when the domain starts
and then re-calculating the value every time we need to change the
child
cgroup values, leave the cgroup alone and rather set the child data
every time there is new cgroup created.  We don't leave any task in the
parent group anyway.  This will ease both current and future code.

Signed-off-by: Martin Kletzander <mkletzan@xxxxxxxxxx>
---
 src/qemu/qemu_cgroup.c | 67
++++++++++++++++++++++++++++++++++++++++++++++++--
 src/qemu/qemu_driver.c | 59
+++++++++++++++-----------------------------
 2 files changed, 85 insertions(+), 41 deletions(-)

This patch causes libvirtd to segfault on startup:

More particularly, I'm doing an upgrade from older libvirtd to this
version, while leaving a transient domain running that was started by
the older libvirtd.  Hope that helps you narrow in on the problem.

Weird - At the time I made the report, I ran 'git bisect' and reliably
reproduced the crash across multiple libvirtd restarts (a restart for
each new build while trying to nail down the culprit commit), as long as
I left that transient domain running.

I managed to reproduce this (or actually talk to a person who did) and
found out a bit more.  As you say, until the domain was running
everything was reproducible.  However, when I started it with the old
libvirt second time to find out the root cause, it just worked...

I tried that and it works for me.  And I tried various domains, both
heavily cgroup dependent and simple ones.

But now that I've rebooted, and then proceeded to do incremental builds
from both before and after the patch, I can't reproduce the crash.
Although my formula for creating my transient domain was the same both
yesterday and today, there may be some difference in the version of
libvirtd that was running at the time I created the domain that then
affected the XML affecting the libvirtd restarts.

Reverting 86759e and af2a1f0 was sufficient to get me going again
locally, but I'm not going to push my reversions until you've first had
a chance to address the regression.

#2  0x00007ffff7405673 in virCgroupHasEmptyTasks (cgroup=0x0,
controller=2)
    at util/vircgroup.c:3935

From this line it looks like priv->cgroup was not initialized.  I did
not add a check for that, so that may be the cause.  I'll send a patch
soon.

But I wonder how did you manage to do that, is that a session libvirtd
you restarted?  Otherwise how come virCgroupNewDetectMachine() didn't
fill the cgroup?

I'm not spotting anything obvious why it wasn't initialized at the time
I reproduced the crash, nor why I can't reproduce it now.  Hopefully
it's not a lurking time bomb.

It might be and it may cause more problems that just non-working
hot-plugging of vCPUs.  And it is not caused by these patches I
added.  The thing is that old libvirt starts a domain, but it
*somehow* (this is the part I was not able to find out yet) doesn't
add the task into the cgroup hierarchy for all cgroup controllers.
The /proc/$(pidof qemu-kvm)/cgroups file looks like this then:

10:hugetlb:/
9:perf_event:/machine.slice/machine-qemu\x2da.scope
8:blkio:/
7:net_cls:/machine.slice/machine-qemu\x2da.scope
6:freezer:/machine.slice/machine-qemu\x2da.scope
5:devices:/
4:memory:/
3:cpuacct,cpu:/
2:cpuset:/machine.slice/machine-qemu\x2da.scope/emulator
1:name=systemd:/machine.slice/machine-qemu\x2da.scope

And when new libvirtd is starting, it detects the cgroup mounts and
placements (in this example the placement for memory cgroup is "/")
and the it goes and validates them.  If the cgroup doesn't have a
placement that ends with machine-qemu-blah or libvirt-qemu or
whatever, the following condition in virCgroupValidateMachineGroup()
fails:

if (STRNEQ(tmp, name) &&
   STRNEQ(tmp, partname) &&
   STRNEQ(tmp, scopename))

[in here the variable tmp points after the last slash in the partition
string]

The function returns false, hence virCgroupNewDetectMachine() frees
the cgroup (and sets it to NULL) and returns success.  This means we
will still properly detect the machine but we won't have access to any
cgroup settings of that particular machine.

The patch I posted fixes it to the extent that the daemon doesn't
crash, but it doesn't fix the root cause (which is unknown).

I played with it a bit more and I've found out that a domain, which
had its own partition in all the cgroups controllers when started,
*lost* some of them after a while.  I might have been caused by
libvirtd restarting, but that's highly unlikely, I guess.

Even though inotify works for cgroup fs it does not show us who
accessed it.  I'll have to resort to using systemtap again.

If it happens again, I'll definitely report it; but for now, without a
reliable reproduction, it could easily have been something caused on my
end (since it is my dev machine, it may have been caused by a half-baked
patch on my end that I was testing at the time I created the transient
domain, where using only upstream patches wouldn't see the issue).  So
for now, don't worry too hard if you can't find it either.

There is a valid code path that allows the cgroup to be NULL, although
not easy to reach.  However, since it is valid, we added a patch that
just skips the cgroup restoration for cgroup == NULL.

If you happen to stumble upon a similar problem, I'd be happy to try
dealing with it.

Martin

--
libvir-list mailing list
libvir-list@xxxxxxxxxx
https://www.redhat.com/mailman/listinfo/libvir-list
Attachment:
signature.asc

Description: Digital signature
--
libvir-list mailing list
libvir-list@xxxxxxxxxx
https://www.redhat.com/mailman/listinfo/libvir-list