Re: [PATCH v3] kvm tools: Add QCOW level2 caching support

Pekka Enberg <penberg@xxxxxxxxxx> · Thu, 2 Jun 2011 23:37:15 +0300

On Thu, Jun 2, 2011 at 11:01 PM, Prasad Joshi <prasadjoshi124@xxxxxxxxx> wrote:
> QCOW uses two tables level1 (L1) table and level2 (L2) table. The L1 table
> points to offset of L2 table. When a QCOW image is probed, the L1 table is
> cached in the memory to avoid reading it from disk on every access. This
> caching improves the performance.
>
> The similar performance improvement can be observed when L2 tables are cached.
> It is impossible to cache all of the L2 tables because of the memory
> constraint. The patch adds L2 table caching capability for up to 128 L2 tables,
> it uses combination of RB tree and List to manage the L2 cached tables. The
> link list implementation helps in building simple LRU structure and RB tree
> helps to search cached table efficiently
>
> The performance numbers are below, the machine was started with following
> command line arguments
>
> $ ./kvm run -d /home/prasad/VMDisks/Ubuntu10.10_64_cilk_qemu.qcow \
>> --params "root=/dev/vda1" -m 1024
>
> Without QCOW caching
> ====================
> $ bonnie++ -d tmp/ -c 10 -s 2048
> Writing a byte at a time...done
> Writing intelligently...done
> Rewriting...done
> Reading a byte at a time...done
> Reading intelligently...done
> start 'em...done...done...done...done...done...
> Create files in sequential order...done.
> Stat files in sequential order...done.
> Delete files in sequential order...done.
> Create files in random order...done.
> Stat files in random order...done.
> Delete files in random order...done.
> Version  1.96       ------Sequential Output------ --Sequential Input- --Random-
> Concurrency  10     -Per Chr- --Block-- -Rewrite- -Per Chr- --Block-- --Seeks--
> Machine        Size K/sec %CP K/sec %CP K/sec %CP K/sec %CP K/sec %CP  /sec %CP
> prasad-virtual-m 2G  1043  99 555406  74 227605  55  5360  99 489080  68 +++++ +++
> Latency             24646us   48544us   57893us    6686us    3595us   21026us
> Version  1.96       ------Sequential Create------ --------Random Create--------
> prasad-virtual-mach -Create-- --Read--- -Delete-- -Create-- --Read--- -Delete--
>              files  /sec %CP  /sec %CP  /sec %CP  /sec %CP  /sec %CP  /sec %CP
>                 16 +++++ +++ +++++ +++ +++++ +++ +++++ +++ +++++ +++ +++++ +++
> Latency               343us    1175us     327us     555us      48us      82us
> 1.96,1.96,prasad-virtual-machine,10,1307043085,2G,,1043,99,555406,74,227605,55,
> 5360,99,489080,68,+++++,+++,16,,,,,+++++,+++,+++++,+++,+++++,+++,+++++,+++,
> +++++,+++,+++++,+++,24646us,48544us,57893us,6686us,3595us,21026us,343us,1175us,
> 327us,555us,48us,82us
>
> With QCOW caching
> =================
> $ bonnie++ -d tmp/ -c 10 -s 2048
> Writing a byte at a time...done
> Writing intelligently...done
> Rewriting...done
> Reading a byte at a time...done
> Reading intelligently...done
> start 'em...done...done...done...done...done...
> Create files in sequential order...done.
> Stat files in sequential order...done.
> Delete files in sequential order...done.
> Create files in random order...done.
> Stat files in random order...done.
> Delete files in random order...done.
> Version  1.96       ------Sequential Output------ --Sequential Input- --Random-
> Concurrency  10     -Per Chr- --Block-- -Rewrite- -Per Chr- --Block-- --Seeks--
> Machine        Size K/sec %CP K/sec %CP K/sec %CP K/sec %CP K/sec %CP  /sec %CP
> prasad-virtual-m 2G  1033  99 467899  64 182387  41  5422 100 338294  48 +++++ +++
> Latency             21549us   60585us   65723us    6331us   30014us   19994us
> Version  1.96       ------Sequential Create------ --------Random Create--------
> prasad-virtual-mach -Create-- --Read--- -Delete-- -Create-- --Read--- -Delete--
>              files  /sec %CP  /sec %CP  /sec %CP  /sec %CP  /sec %CP  /sec %CP
>                 16 +++++ +++ +++++ +++ +++++ +++ +++++ +++ +++++ +++ +++++ +++
> Latency               478us    1142us     344us     402us      72us      98us
> 1.96,1.96,prasad-virtual-machine,10,1307042839,2G,,1033,99,467899,64,182387,41,
> 5422,100,338294,48,+++++,+++,16,,,,,+++++,+++,+++++,+++,+++++,+++,+++++,+++,
> +++++,+++,+++++,+++,21549us,60585us,65723us,6331us,30014us,19994us,478us,1142us,
> 344us,402us,72us,98us
>
> Summary of performance numbers
> ==============================
> There is not much difference with sequential character operations are
> performed, the code with caching performed better by small margin. The caching
> code performance raised by 18% to 24% with sequential block output and by 44%
> for sequentail block input. Which is understandable as the Level2 table will
> always be cached after a write operation. Random seek operation worked slower
> with caching code.

I see performance _degradation_ in the raw data:

Before:

> Concurrency  10     -Per Chr- --Block-- -Rewrite- -Per Chr- --Block-- --Seeks--
> Machine        Size K/sec %CP K/sec %CP K/sec %CP K/sec %CP K/sec %CP  /sec %CP
> prasad-virtual-m 2G  1043  99 555406  74 227605  55  5360  99 489080  68 +++++ +++

After:

> prasad-virtual-m 2G  1033  99 467899  64 182387  41  5422 100 338294  48 +++++ +++

So that's drop from 555406 to 467899 K/sec for sequential block writes
and drop from 489080 K/sec to 338294 K/sec for sequential block reads.

Random seek latency shows _improvement_:

> Latency             24646us   48544us   57893us    6686us    3595us   21026us

> Latency             21549us   60585us   65723us    6331us   30014us   19994us

Am I reading the Bonnie report wrong or did you mix up the 'before'
and 'after' data?

Assuming the data is just mixed up, I'm not completely happy that
we're making random seeks almost 5% slower. Any ideas why that's
happening?

> +static int search_table(struct qcow *q, u64 **table, u64 offset)
> +{
> +       struct qcow_l2_cache *c;
> +
> +       *table = NULL;
> +
> +       c = search(&q->root, offset);
> +       if (!c)
> +               return -1;
> +
> +       /* Update the LRU state */
> +       list_del_init(&c->list);
> +       list_add_tail(&c->list, &q->lru_list);

Why not use list_move() here? The "update the LRU state" comment is
pretty useless. It would be more important to explain the reader how
the list is ordered (i.e. least recently used are at the head of the
list).

> @@ -17,6 +19,16 @@
>  #define QCOW2_OFLAG_COMPRESSED (1LL << 62)
>  #define QCOW2_OFLAG_MASK       (QCOW2_OFLAG_COPIED|QCOW2_OFLAG_COMPRESSED)
>
> +#define MAX_CACHE_NODES         128

Did you test the results with MAX_CACHE_NODES set to 1? Did you get
similar results than with no caching at all? I also wonder if we could
get away with even smaller cache than 128.

>  struct qcow_table {
>        u32                     table_size;
>        u64                     *l1_table;
> @@ -26,6 +38,11 @@ struct qcow {
>        void                    *header;
>        struct qcow_table       table;
>        int                     fd;
> +
> +       /* Level2 caching data structures */
> +       struct rb_root          root;
> +       struct list_head        lru_list;
> +       int                     no_cached;

I've said this many times: please don't invent strange new names.
Really, "no_cached" reads out like it's a boolean! Use 'nr_cached'
instead.

                        Pekka
--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at  http://vger.kernel.org/majordomo-info.html