Re: [PATCH 1/6] Add new API virDomainStreamDisk[Info] to header and drivers

Anthony Liguori <anthony@xxxxxxxxxxxxx> · Mon, 11 Apr 2011 17:06:54 -0500

On 04/11/2011 04:45 PM, Daniel P. Berrange wrote:
On Fri, Apr 08, 2011 at 02:26:48PM -0500, Anthony Liguori wrote:
On 04/08/2011 11:02 AM, Stefan Hajnoczi wrote:
On Fri, Apr 8, 2011 at 2:31 PM, Daniel P. Berrange <berrange@xxxxxxxxxx> wrote:

I have CCed Anthony and Kevin.  Anthony drove the QED image streaming
and Kevin will probably be interested in the idea of allocating raw
images as a background activity while QEMU runs.

    /*
     * @path: fully qualified filename of the virtual disk
     * @nregions: filled in with the number of @regions structs
     * @regions: filled with a list of allocated regions
     *
     * Query the extents of allocated regions within the
     * virtual disk file. The offsets in the list of regions
     * are not guaranteed to be sorted in any explicit order.
     */
    int virDomainBlockGetAllocationMap(virDomainPtr dom,
                                       const char *path,
                                       unsigned int *nregions,
                                       virDomainBlockRegionPtr *regions);
QEMU can provide this with its existing .bdrv_is_allocated() function.
  Kevin, do you have any thoughts on whether this API will work well?
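
To make the shape of such an allocation map concrete, here is a minimal sketch of turning per-sector "is allocated" answers into a list of regions. Everything in it is an assumption for illustration: `struct example_region` and `example_build_map()` are hypothetical names, the thread does not show the real layout behind virDomainBlockRegionPtr, and the bitmap merely stands in for whatever a block driver's allocation query would report.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical region descriptor; not the real libvirt structure. */
struct example_region {
    size_t start;   /* first allocated sector of the run */
    size_t length;  /* number of consecutive allocated sectors */
};

/*
 * Coalesce a per-sector allocation bitmap into runs of allocated sectors.
 * At most max_regions entries are written to out. The return value is the
 * total number of runs found, which may be larger than max_regions.
 */
size_t example_build_map(const bool *allocated, size_t nsectors,
                         struct example_region *out, size_t max_regions)
{
    size_t count = 0;
    size_t i = 0;

    assert(allocated != NULL || nsectors == 0);
    assert(out != NULL || max_regions == 0);

    while (i < nsectors) {
        if (!allocated[i]) {
            i++;            /* skip holes */
            continue;
        }
        size_t start = i;   /* beginning of an allocated run */
        while (i < nsectors && allocated[i])
            i++;
        if (count < max_regions) {
            out[count].start = start;
            out[count].length = i - start;
        }
        count++;
    }
    return count;
}
```

This sketch happens to emit regions in ascending order, whereas the proposed API comment makes no ordering promise, so a caller written against the real API should not rely on that.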
I think the trouble with this API proposal is that it's overloading
concepts.

Sparse is not the same thing as CoW to a backing file.
I don't like to use the term "sparse", since that implies a specific disk
format (raw file with holes). Rather I use the term 'thin provisioned'
to refer to any disk format where not all physical sectors have
yet been allocated. A thin-provisioned disk can trivially be thought
of as a disk with a backing file whose sectors are all filled with
zeros.

It's not so black and white today.

Imagine that you had a qcow2 file, and you "streamed" it such that it
was no longer "thin provisioned".  As soon as the guest starts issuing
trim/discards, QEMU could conceivably start defragmenting the image and
truncating it, resulting in a sparse file.

The only time the concept of "fully allocated" really makes sense is for
a raw image on a simple file system.  Once you start dealing with
things like btrfs and deduplication, all of those useful guarantees are
thrown out the window.

I think the real question is, why do you care about what physical 
sectors reside where?  What problem are you trying to solve?

For instance, when you expose streaming, the result is still a
sparse file.  So you'd have a rather curious API where you called to
"allocate" a region in the file which resulted in having a sparse
file which you then called again to make it non sparse.  But AFAICT,
the API doesn't really tell you these details.
Copy-on-read streaming does not imply that the result is still
thin-provisioned. That is a policy decision by the management
application.

I think your notion of thin-provision doesn't quite map to how things 
work today.  Unless you're in a very constrained environment, you're 
always thin provisioned.

Having two related APIs to expand a copy-on-read image and then to
fill in a sparse file is certainly a reasonable thing to do.  I
think trying to make a single API that does both without having a
flag that basically makes it two APIs is going to be cumbersome.
On the contrary, having a single API makes life *simpler*. It doesn't
require any special flag to distinguish the two use cases, since they
are fundamentally the same thing. Some examples, which include the
implicit "all zeros" backing file that every disk has, should illustrate
this:

  - Make a brand new thin-provisioned disk, no backing store,
    fully allocated

    |0|0|0|0|0|0|0|0|0|
    | | | | | | | | | |   ->      |0|0|0|0|0|0|0|0|0|

  - Make a brand new thin-provisioned disk, no backing store,
    1/2 allocated

    |0|0|0|0|0|0|0|0|0|          |0|0|0|0|0|0|0|0|0|
    | | | | | | | | | |   ->      |0|0|0|0|0| | | | |

  - Make an existing, thin-provisioned disk, no backing store,
    fully allocated

    |0|0|0|0|0|0|0|0|0|
    |X| |X|X| | |X| |X|   ->      |X|0|X|X|0|0|X|0|X|

  - Make an existing, thin-provisioned disk, no backing store,
    1/2 allocated

    |0|0|0|0|0|0|0|0|0|          |0|0|0|0|0|0|0|0|0|
    |X| |X|X| | |X| |X|   ->      |X|0|X|X|0| |X| |X|

  - Make a brand new thin-provisioned disk, with backing store,
    independent of backing store, but still thin:

    |0|0|0|0|0|0|0|0|0|
    |X| |X|X| | |X| |X|          |0|0|0|0|0|0|0|0|0|
    | | | | | | | | | |   ->      |X| |X|X| | |X| |X|

  - Make an existing thin-provisioned disk, with backing store,
    independent of backing store, but still thin

    |0|0|0|0|0|0|0|0|0|
    |X| |X|X| | |X| |X|          |0|0|0|0|0|0|0|0|0|
    |Y|Y|Y| | | | | | |   ->      |Y|Y|Y|X| | |X| |X|

  - Make an existing thin-provisioned disk, with backing store,
    independent of backing store, fully allocated

    |0|0|0|0|0|0|0|0|0|
    |X| |X|X| | |X| |X|
    |Y|Y|Y| | | | | | |   ->      |Y|Y|Y|X|0|0|X|0|X|

  - Make a brand new thin-provisioned disk, with 2 backing stores,
    independent of backing stores & fully allocated:

    |0|0|0|0|0|0|0|0|0|
    | | |Z|Z| | | |Z| |
    |X| |X| | | |X| |X|
    |Y|Y| |Y| | | | | |   ->      |Y|Y|X|Y|0|0|X|Z|X|

etc, etc for many more example scenarios. Copy-on-read streaming is really
not a special case - it is just one of many example scenarios, all of
which can be managed via the pair of APIs mentioned earlier.
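
The layering rule behind these diagrams can be sketched in a few lines of C. This is only a toy model under stated assumptions: each sector is one character, a space means "unallocated", the topmost layer that has data for a sector wins, and the implicit all-zeros backing file is represented by writing '0'. The names `example_flatten()` and `zero_fill_upto` are hypothetical and appear in neither libvirt nor QEMU.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Toy model: one character per sector, a space marks an unallocated sector. */
#define EXAMPLE_UNALLOCATED ' '

/*
 * Collapse a chain of images into a single standalone image.
 * layers[0] is the active image; higher indexes are progressively older
 * backing stores. Sectors that no layer provides are allocated as '0'
 * (taken from the implicit all-zeros backing file) when their index is
 * below zero_fill_upto, and stay unallocated otherwise. out must have
 * room for nsectors + 1 characters.
 */
void example_flatten(const char *const *layers, size_t nlayers,
                     size_t nsectors, size_t zero_fill_upto, char *out)
{
    assert(out != NULL);

    /* Start from a completely thin image. */
    memset(out, EXAMPLE_UNALLOCATED, nsectors);
    out[nsectors] = '\0';

    for (size_t s = 0; s < nsectors; s++) {
        /* The topmost layer holding data for this sector wins. */
        for (size_t l = 0; l < nlayers; l++) {
            if (layers[l][s] != EXAMPLE_UNALLOCATED) {
                out[s] = layers[l][s];
                break;
            }
        }
        /* Optionally allocate the hole from the all-zeros layer. */
        if (out[s] == EXAMPLE_UNALLOCATED && s < zero_fill_upto)
            out[s] = '0';
    }
}
```

Setting zero_fill_upto to 0 gives the "independent of backing store, but still thin" cases, setting it to the sector count gives the "fully allocated" cases, and intermediate values give the partially allocated ones. Note that this models only the layering described in the examples and says nothing about how the host file system actually stores the result.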

It's just not this simple with modern file systems unfortunately.

The problem is you're mixing a filesystem concept (sparseness) with a
purely QEMU concept (backing file).  Streaming is the process of merging 
a backing file into the current image without disrupting the backing 
file.  When it is completed and the two are fully merged, the current 
image no longer has a dependency on the backing file.

It's essentially a reverse snapshot merge and is probably closer to
snapshot merging conceptually than to image sparseness.

Regards,

Anthony Liguori

Regards,
Daniel
