Re: [RFC PATCH 5/5] qemu: snapshot: Improve approach to deal with snapshot metadata

Eric Blake <eblake@xxxxxxxxxx> · Thu, 12 Jun 2014 17:58:32 -0600

[warning - long email ahead. tl;dr: this patch is not ready yet, because
I found a problem in the design]

On 06/12/2014 09:02 AM, Peter Krempa wrote:
> Until now we were changing information about the disk source via
> multiple steps of copying data. Now that we changed to a pointer to
> store the disk source we might use it to change the approach to track
> the data.
> 
> Additionally this will allow proper tracking of the backing chain.
> ---
>  src/qemu/qemu_driver.c | 108 +++++++++----------------------------------------
>  1 file changed, 19 insertions(+), 89 deletions(-)

It's also a nice diffstat :)

> @@ -12863,26 +12859,12 @@ qemuDomainSnapshotCreateSingleDiskActive(virQEMUDriverPtr driver,
>      if (virAsprintf(&device, "drive-%s", disk->info.alias) < 0)
>          goto cleanup;
> 
> -    /* XXX Here, we know we are about to alter disk->src->backingStore if
> -     * successful, so we nuke the existing chain so that future commands will
> -     * recompute it.  Better would be storing the chain ourselves rather than
> -     * reprobing, but this requires modifying domain_conf and our XML to fully
> -     * track the chain across libvirtd restarts.  */
> -    virStorageSourceClearBackingStore(disk->src);
> -

Yay - moving the code in the direction that the comments hinted we
should be moving.

>      if (virStorageFileInit(snap->src) < 0)
>          goto cleanup;
> 
>      if (qemuGetDriveSourceString(snap->src, NULL, &source) < 0)
>          goto cleanup;
> 
> -    if (VIR_STRDUP(newsource, snap->src->path) < 0)
> -        goto cleanup;
> -
> -    if (persistDisk &&
> -        VIR_STRDUP(persistSource, snap->src->path) < 0)
> -        goto cleanup;

One thing this code was trying to do is hoist all allocation up front,
so that if we hit OOM, we have not made any permanent changes and can
therefore safely roll back.

> @@ -12969,33 +12942,18 @@ qemuDomainSnapshotCreateSingleDiskActive(virQEMUDriverPtr driver,
>      /* Update vm in place to match changes.  */
>      need_unlink = false;
> 
> -    VIR_FREE(disk->src->path);
> -    virStorageNetHostDefFree(disk->src->nhosts, disk->src->hosts);
> -
> -    disk->src->path = newsource;
> -    disk->src->format = format;
> -    disk->src->type = snap->src->type;
> -    disk->src->protocol = snap->src->protocol;
> -    disk->src->nhosts = snap->src->nhosts;
> -    disk->src->hosts = newhosts;
> +    if (!(tmp = virStorageSourceCopy(snap->src, false)))
> +        goto cleanup;

But here, you've delayed the allocation until after we've already made
some changes.  You'll need to hoist the call to virStorageSourceCopy()
to occur in the same place as the code it is replacing.

> 
> -    newsource = NULL;
> -    newhosts = NULL;
> +    tmp->backingStore = disk->src;
> +    disk->src = tmp;

This line looks like it is in the correct location - we are merely
manipulating the linked list in place to insert the newly-created wrapper.

I do see a potential problem, though.  Lots of graphics below, in three
major demo sections:

==================
In the _existing_ code, we are doing in-place modification of _portions_
of the storage source, then calling the security/audit/lease manager on
the modified source.  Why?  Because the portions of the disk that we do
NOT modify include the labeling information, and that's in part because
historically we had multiple pieces of labeling information stored only
once at the disk level.  Graphically, our current behavior has a
lifecycle something like this, when booting a domain with a chain of 2
and snapshotting it into a chain of 3:

at parse time, we only learn the top level, and the fact that we need to
generate multiple labels:

disk: dev = vda
      src =
+-------------------+
+ path = mid        +
+ rwlabel = NULL    +
+ rolabel = NULL    +
+ backing = NULL    +
+ mid is unlabeled  +
+-------------------+

at domain start time, we then probe the whole chain, which learns more
file names, but does not populate label information in the backing
files; we also perform labeling:

disk: dev = vda
      src =
+-------------------+  +--------------------+
+ path = mid        +  + path = base        +
+ rwlabel = foo     +  + rwlabel = NULL     +
+ rolabel = bar     +  + rolabel = NULL     +
+ backing = ---------->+ backing = NULL     +
+ mid has label foo +  + base has label bar +
+-------------------+  +--------------------+

Now the snapshot comes along, and we are adding a new layer:

snapshot: dev = vda
      src =
+-------------------+
+ path = top        +
+ rwlabel = NULL    +
+ rolabel = NULL    +
+ backing = NULL    +
+ top is unlabeled  +
+-------------------+

so we label it - but with what label? The label was tied to 'mid'.
Hence, the in-place swapping, so that we have:

disk: dev = vda
      src =
+-------------------+
+ path = top        +
+ rwlabel = foo     +
+ rolabel = bar     +
+ backing = NULL    +
+ top has label foo +
+-------------------+

long enough to give us a label.  If successful, the code takes a
shortcut at this point, and just reprobes the chain, finally leaving us
with the desired chain (if unsuccessful, we undo the temporary changes
by restoring src to mid, and backing to point to base).  But note that
the reprobe does NOT know what label backing elements actually have;
what's more, mid is still labeled 'foo' for read-write privileges, even
though after the snapshot we could have safely downgraded it to label
'bar' for read-only privileges:

disk: dev = vda
      src =
+-------------------+  +-------------------+  +--------------------+
+ src = top         +  + src = mid         +  + src = base         +
+ rwlabel = foo     +  + rwlabel = NULL    +  + rwlabel = NULL     +
+ rolabel = bar     +  + rolabel = NULL    +  + rolabel = NULL     +
+ backing = ---------->+ backing = ---------->+ backing = NULL     +
+ top has label foo +  + mid has label ??  +  + base has label ??  +
+-------------------+  +-------------------+  +--------------------+

===============

But with your code, merely inserting (a copy of) the snapshot source at
the front of the chain would leave us with:

disk: dev = vda
      src =
+-------------------+  +-------------------+  +--------------------+
+ src = top         +  + src = mid         +  + src = base         +
+ rwlabel = NULL    +  + rwlabel = foo     +  + rwlabel = NULL     +
+ rolabel = NULL    +  + rolabel = bar     +  + rolabel = NULL     +
+ backing = ---------->+ backing = ---------->+ backing = NULL     +
+ top has label foo +  + mid has label foo +  + base has label bar +
+-------------------+  +-------------------+  +--------------------+

it has the benefit of remembering what we have previously labeled
things, but has the absolute failure of forgetting what labels we have
generated for the <disk> (since those labels are buried in the backing
chain instead of the top element of the chain).  For your code to work,
you have to ALSO transfer the security information into the top of the
chain, probably best done via a new helper function.

===============
Here's what I envision that we should be doing instead.  The existing
code is tracking two different labels at the top level source, and none
at any other layer.  Instead, we should be tracking both read-write and
read-only label templates at the <disk> level, then for any given
virStorageSourcePtr, have a notion only of a current label.  The
security manager only needs access to the disk label templates (the
generated read-write and read-only labels don't change over the life of
the disk), as well as the current label per disk, along with the ability
to swap that label (we basically have three label states: unlabeled,
read-only, and read-write - and use the security manager to switch
between those states).  So the same lifecycle now looks more like:

original parse:

disk: dev = vda
      rwlabel = NULL
      rolabel = NULL
      src =
+-------------------+
+ path = mid        +
+ backing = NULL    +
+ mid is unlabeled  +
+-------------------+

probe the backing chain, and generate labels:

disk: dev = vda
      rwlabel = foo
      rolabel = bar
      src =
+-------------------+  +--------------------+
+ path = mid        +  + path = base        +
+ backing = ---------->+ backing = NULL     +
+ mid has label foo +  + base has label bar +
+-------------------+  +--------------------+

Parse the snapshot, and pass THAT storage source pointer to the security
code, but reusing the label template from the disk, so that we get the
correct label in-place:

snapshot: dev = vda
      src =
+-------------------+
+ path = top        +
+ backing = NULL    +
+ top has label foo +
+-------------------+

Wire up the chain:

disk: dev = vda
      rwlabel = foo
      rolabel = bar
      src =
+-------------------+  +-------------------+  +--------------------+
+ src = top         +  + src = mid         +  + src = base         +
+ backing = ---------->+ backing = ---------->+ backing = NULL     +
+ top has label foo +  + mid has label foo +  + base has label bar +
+-------------------+  +-------------------+  +--------------------+

and finally revoke the read-write label on mid:

disk: dev = vda
      rwlabel = foo
      rolabel = bar
      src =
+-------------------+  +-------------------+  +--------------------+
+ src = top         +  + src = mid         +  + src = base         +
+ backing = ---------->+ backing = ---------->+ backing = NULL     +
+ top has label foo +  + mid has label bar +  + base has label bar +
+-------------------+  +-------------------+  +--------------------+

and if a read-only backing file is shared by more than one chain (or
even across more than one <domain>), now you can see why reference
counting becomes important, for knowing when it is safe to revoke
read-write or even to forbid granting read-write.

Furthermore, once we start allowing backing chain as user input instead
of crawling the chain ourselves, the per-source label can be overridden
for any element of the chain, instead of the current limitation of only
having an override label at the top level.

I don't think we are going to get to the idea setup of security
management overnight, so in the meantime, we need some crutches to make
sure that your code is not botching things.

=================

So, with that explanation out of the way, I'm resuming the review.  I
think we can get away with your method in the short term (the 2nd of the
3 sections above) IF you also manage to transfer labeling information to
the active layer of the chain, but this patch needs a respin because in
its current state it does not do that.

> 
>      if (persistDisk) {
> -        VIR_FREE(persistDisk->src->path);
> -        virStorageNetHostDefFree(persistDisk->src->nhosts,
> -                                 persistDisk->src->hosts);
> -
> -        persistDisk->src->path = persistSource;
> -        persistDisk->src->format = format;
> -        persistDisk->src->type = snap->src->type;
> -        persistDisk->src->protocol = snap->src->protocol;
> -        persistDisk->src->nhosts = snap->src->nhosts;
> -        persistDisk->src->hosts = persistHosts;
> +        if (!(tmp = virStorageSourceCopy(snap->src, false)))
> +            goto cleanup;

Again, this is a late allocation; we risk OOM after making changes that
can't be undone.  I'd rather have two temporary virStorageSourcePtr
objects allocated up front, and then plugged in at the end on success,
rather than risk late allocation failure.

> 
> @@ -13017,21 +12971,15 @@ qemuDomainSnapshotCreateSingleDiskActive(virQEMUDriverPtr driver,
>  static void
>  qemuDomainSnapshotUndoSingleDiskActive(virQEMUDriverPtr driver,
>                                         virDomainObjPtr vm,
> -                                       virDomainDiskDefPtr origdisk,
>                                         virDomainDiskDefPtr disk,
>                                         virDomainDiskDefPtr persistDisk,
>                                         bool need_unlink)

Arggh - I ran out of time to review the rest.  But hopefully this gives
you some ideas to think about.

-- 
Eric Blake   eblake redhat com    +1-919-301-3266
Libvirt virtualization library http://libvirt.org

Attachment:
signature.asc

Description: OpenPGP digital signature
--
libvir-list mailing list
libvir-list@xxxxxxxxxx
https://www.redhat.com/mailman/listinfo/libvir-list