On 12.09.2014 14:38, Stefan Hajnoczi wrote:
Max: Unrelated to this performance issue but I notice that the qcow2 metadata overlap check is high in the host CPU profile. Have you had any thoughts about optimizing the check? Stefan
In fact, I have done so (albeit only briefly). Instead of gathering all the information in the overlap function itself, we could either have a generic list of typed ranges (e.g. "cluster 0: header", "clusters 1 to 5: L1 table", etc.) or a not-really-bitmap (with 4 bits per entry specifying the cluster type (header, L1 table, free or data cluster, etc.)).
The disadvantage of the former would be that in its simplest form we'd have to run through the whole list to find out whether a cluster is already reserved for metadata or not. We could easily optimize this by keeping the list in order and then performing a binary search.
The disadvantage of the latter would obviously be its memory size. For a 1 TB image with 64 kB clusters, it would be 8 MB in size. Could be considered acceptable, but I deem it too large. The advantage would be constant access time, of course.
We could combine both approaches, that is, using the bitmap as a cache: Whenever a cluster is overlap checked, the corresponding bitmap range (or "bitmap window") is requested; if that is not available, it is generated from the range list and then put into the cache.
The remaining question is how large the range list would be in memory. Basically, its size would be comparable to an RLE version of the bitmap. In contrast to a raw RLE version, however, we'd have to add the start cluster to each entry in order to be able to perform binary search and we'd omit free and/or data clusters. So, we'd have 4 bits for the cluster type, let's say 12 bits for the cluster count and of course 64 bits for the first cluster index. Or, for maximum efficiency, we'd have 64 - 9 - 1 = 54 bits for the cluster index, 4 bits for the type and then 6 bits for the cluster count. The first variant gives us 10 bytes per metadata range, the second 8. Considering one refcount block can handle cluster_size / 2 entries and one L2 table can handle cluster_size / 8 entries, we have (for images with a cluster size of 64 kB) a ratio of about 1/32768 refcount blocks per cluster and 1/8192 L2 tables per cluster. I guess we therefore have a metadata ratio of about 1/6000. At the worst, each metadata cluster requires its own range list entry, which for 10 bytes per entry means less than 30 kB for the list of a 1 TB image with 64 kB clusters. I think that's acceptable.
We could compress that list even more by making it a real RLE version of the bitmap, removing the cluster index from each entry; remember that for this mixed range list/bitmap approach we no longer need to be able to perform exact binary search but only need to be able to quickly seek to the beginning of a bitmap window. This can be achieved by forcing breaks in the range list at every window border and keeping track of those offsets along with the corresponding bitmap window index. When we want to generate a bitmap window, we look up the start offset in the range list (constant time), then generate it (linear to window size) and can then perform constant-time lookups for each overlap checks in that window.
I think that could greatly speed things up and also allow us to always perform range checks on data structures not kept in memory (inactive L1 and L2 tables). The only question now remaining to me is whether that caching is actually feasible or whether binary search into the range list (which then would have to include the cluster index for each entry) would be faster than generating bitmap windows which might suffer from ping-pong effects.
Max -- To unsubscribe from this list: send the line "unsubscribe kvm" in the body of a message to majordomo@xxxxxxxxxxxxxxx More majordomo info at http://vger.kernel.org/majordomo-info.html