On 06/20/2018 02:52 PM, Peter Xu wrote:
On Mon, Jun 04, 2018 at 05:55:18PM +0800, guangrong.xiao@xxxxxxxxx wrote:
From: Xiao Guangrong <xiaoguangrong@xxxxxxxxxxx>
The current implementation of compression and decompression is very
hard to enable in production. We noticed that too many wait/wake
transitions go to kernel space, and CPU usage is very low even if the
system is really free.
Not sure how other people feel; to me this information suits better as
the cover letter. For the commit message, I would prefer to know
something like: what this thread model can do; how the APIs are
designed and used; what the limitations are; etc. After all, until this
patch nothing is using the new model yet, so these numbers are a bit
misleading.
Yes, I completely agree with you; I will remove it from the changelog.
Signed-off-by: Xiao Guangrong <xiaoguangrong@xxxxxxxxxxx>
---
migration/Makefile.objs | 1 +
migration/threads.c | 265 ++++++++++++++++++++++++++++++++++++++++++++++++
migration/threads.h | 116 +++++++++++++++++++++
Again, this model seems to be suitable for scenarios even outside
migration. So I'm not sure whether you'd like to generalize it (I
still see e.g. constants and comments related to migration, but there
aren't many) and put it into util/.
Sure, that's good to me. :)
3 files changed, 382 insertions(+)
create mode 100644 migration/threads.c
create mode 100644 migration/threads.h
diff --git a/migration/Makefile.objs b/migration/Makefile.objs
index c83ec47ba8..bdb61a7983 100644
--- a/migration/Makefile.objs
+++ b/migration/Makefile.objs
@@ -7,6 +7,7 @@ common-obj-y += qemu-file-channel.o
common-obj-y += xbzrle.o postcopy-ram.o
common-obj-y += qjson.o
common-obj-y += block-dirty-bitmap.o
+common-obj-y += threads.o
common-obj-$(CONFIG_RDMA) += rdma.o
diff --git a/migration/threads.c b/migration/threads.c
new file mode 100644
index 0000000000..eecd3229b7
--- /dev/null
+++ b/migration/threads.c
@@ -0,0 +1,265 @@
+#include "threads.h"
+
+/* retry to see if there is an available request before actually going to wait. */
+#define BUSY_WAIT_COUNT 1000
+
+static void *thread_run(void *opaque)
+{
+ ThreadLocal *self_data = (ThreadLocal *)opaque;
+ Threads *threads = self_data->threads;
+ void (*handler)(ThreadRequest *data) = threads->thread_request_handler;
+ ThreadRequest *request;
+ int count, ret;
+
+ for ( ; !atomic_read(&self_data->quit); ) {
+ qemu_event_reset(&self_data->ev);
+
+ count = 0;
+ while ((request = ring_get(self_data->request_ring)) ||
+ count < BUSY_WAIT_COUNT) {
+ /*
+ * Busy-wait for a while before going to sleep so that the user
+ * needn't enter kernel space to wake up the consumer threads.
+ *
+ * That indeed wastes some CPU resources; however, it can
+ * significantly improve the case where a request becomes
+ * available soon.
+ */
+ if (!request) {
+ cpu_relax();
+ count++;
+ continue;
+ }
+ count = 0;
+
+ handler(request);
+
+ do {
+ ret = ring_put(threads->request_done_ring, request);
+ /*
+ * request_done_ring has enough room to contain all
+ * requests; however, theoretically, it can still
+ * fail if the ring's indexes overflow, which would
+ * happen if more than 2^32 requests are
Could you elaborate why this ring_put() could fail, and why failure is
somehow related to 2^32 overflow?
Firstly, I don't understand why it will fail.
As we explained in the previous mail:
| Without it we can easily observe a "strange" behavior where the thread
| fails to put the result into the global ring even though we allocated
| enough room for the global ring (its capacity >= total requests). That
| is because these two indexes can be updated at any time; consider that
| multiple get and put operations can finish between reading ring->in and
| ring->out, so ring->in can very possibly pass the value read from
| ring->out.
|
| With this code, the negative case only happens if these two indexes
| (32 bits) overflow to the same value, which can help us catch potential
| bugs in the code.
Meanwhile, AFAIU your ring can even live well with that 2^32 overflow.
Or did I misunderstand?
Please refer to the code:
+ if (__ring_is_full(ring, in, out)) {
+ if (atomic_read(&ring->in) == in &&
+ atomic_read(&ring->out) == out) {
+ return -ENOBUFS;
+ }
As we allocated enough room for this global ring, the only case in which
putting data can fail is that the two indexes have overflowed to the
same value. That requires roughly 2^32 get/put operations to happen on
the other threads and the main thread while this thread is reading these
two indexes.
+ * handled between two calls of threads_wait_done().
+ * So we do retry to make the code more robust.
+ *
+ * It is an unlikely case for migration, as a RAM block is
+ * unlikely to have more than 16T (2^32 pages) of memory.
(Some migration-related comments here; maybe we can remove them.)
Okay, I will consider that to make it more general.
+Threads *threads_create(unsigned int threads_nr, const char *name,
+ ThreadRequest *(*thread_request_init)(void),
+ void (*thread_request_uninit)(ThreadRequest *request),
+ void (*thread_request_handler)(ThreadRequest *request),
+ void (*thread_request_done)(ThreadRequest *request))
+{
+ Threads *threads;
+ int ret;
+
+ threads = g_malloc0(sizeof(*threads) + threads_nr * sizeof(ThreadLocal));
+ threads->threads_nr = threads_nr;
+ threads->thread_ring_size = THREAD_REQ_RING_SIZE;
(If we're going to generalize this thread model, maybe you'd consider
allowing the ring size to be specified as well?)
Good point, will do it.
+ threads->total_requests = threads->thread_ring_size * threads_nr;
+
+ threads->name = name;
+ threads->thread_request_init = thread_request_init;
+ threads->thread_request_uninit = thread_request_uninit;
+ threads->thread_request_handler = thread_request_handler;
+ threads->thread_request_done = thread_request_done;
+
+ ret = init_requests(threads);
+ if (ret) {
+ g_free(threads);
+ return NULL;
+ }
+
+ init_thread_data(threads);
+ return threads;
+}
+
+void threads_destroy(Threads *threads)
+{
+ uninit_thread_data(threads);
+ uninit_requests(threads, threads->total_requests);
+ g_free(threads);
+}
+
+ThreadRequest *threads_submit_request_prepare(Threads *threads)
+{
+ ThreadRequest *request;
+ unsigned int index;
+
+ index = threads->current_thread_index % threads->threads_nr;
Why round-robin rather than simply finding an idle thread (one that
still has valid free requests) and putting the request onto that?
I ask since I don't see much difficulty in achieving that; meanwhile,
with round-robin I'm not sure whether it can happen that one thread gets
stuck for some reason (e.g., scheduling?) while the rest of the threads
are idle. Would threads_submit_request_prepare() then be stuck on that
hanging thread?
Your concern is reasonable indeed; however, round-robin is the simplest
algorithm for pushing a request to the threads without figuring out the
least-loaded thread one by one, and it keeps the main thread fast
enough.
And I think it generally works reasonably well on a load-balanced
system. Furthermore, we think the proper configuration is that if the
user uses N threads for compression, he should make sure the system has
enough CPU resources to run those N threads.
We can improve it after this basic framework gets merged, by using a
more advanced distribution approach if we see that it's needed in the
future.
diff --git a/migration/threads.h b/migration/threads.h
new file mode 100644
index 0000000000..eced913065
--- /dev/null
+++ b/migration/threads.h
@@ -0,0 +1,116 @@
+#ifndef QEMU_MIGRATION_THREAD_H
+#define QEMU_MIGRATION_THREAD_H
+
+/*
+ * Multithreads abstraction
+ *
+ * This is the abstraction layer for multithreads management which is
+ * used to speed up migration.
+ *
+ * Note: currently only one producer is allowed.
+ *
+ * Copyright(C) 2018 Tencent Corporation.
+ *
+ * Author:
+ * Xiao Guangrong <xiaoguangrong@xxxxxxxxxxx>
+ *
+ * This work is licensed under the terms of the GNU LGPL, version 2.1 or later.
+ * See the COPYING.LIB file in the top-level directory.
+ */
+
+#include "qemu/osdep.h"
I was told (more than once) that we should not include "osdep.h" in
headers. :) I'd suggest you include it in the source file instead.
Okay, good to know it. :)
+#include "hw/boards.h"
Why do we need this header?
Well, I need to figure out the right header files to include for the
declarations we use. :)
+
+#include "ring.h"
+
+/*
+ * the request representation, which contains the internally used
+ * metadata; it can be embedded into the user's self-defined data
+ * struct, and the user can use container_of() to get the self-defined
+ * data back
+ */
+struct ThreadRequest {
+ QSLIST_ENTRY(ThreadRequest) node;
+ unsigned int thread_index;
+};
+typedef struct ThreadRequest ThreadRequest;
+
+struct Threads;
+
+struct ThreadLocal {
+ QemuThread thread;
+
+ /* the event used to wake up the thread */
+ QemuEvent ev;
+
+ struct Threads *threads;
+
+ /* local request ring which is filled by the user */
+ Ring *request_ring;
+
+ /* the index of the thread */
+ int self;
+
+ /* thread is useless and needs to exit */
+ bool quit;
+};
+typedef struct ThreadLocal ThreadLocal;
+
+/*
+ * the main data struct represents multithreads which is shared by
+ * all threads
+ */
+struct Threads {
+ const char *name;
+ unsigned int threads_nr;
+ /* the request is pushed to the thread with round-robin manner */
+ unsigned int current_thread_index;
+
+ int thread_ring_size;
+ int total_requests;
+
+ /* the request is pre-allocated and linked in the list */
+ int free_requests_nr;
+ QSLIST_HEAD(, ThreadRequest) free_requests;
+
+ /* the constructor of request */
+ ThreadRequest *(*thread_request_init)(void);
+ /* the destructor of request */
+ void (*thread_request_uninit)(ThreadRequest *request);
+ /* the handler of the request which is called in the thread */
+ void (*thread_request_handler)(ThreadRequest *request);
+ /*
+ * the handler to process the result which is called in the
+ * user's context
+ */
+ void (*thread_request_done)(ThreadRequest *request);
+
+ /* the threads push results to this ring, so it has multiple producers */
+ Ring *request_done_ring;
+
+ ThreadLocal per_thread_data[0];
+};
+typedef struct Threads Threads;
Not sure whether we can move the Threads/ThreadLocal definitions into
the source file; then we'd expose only the struct declarations, along
with the APIs.
Yup, that's better indeed. Thank you, Peter!