Re: [PATCH v3 2/5] util: introduce threaded workqueue

Paolo Bonzini <pbonzini@xxxxxxxxxx> · Tue, 4 Dec 2018 18:16:36 +0100

On 04/12/18 16:49, Christophe de Dinechin wrote:
>>  Linux and QEMU's own qht work just fine with compile-time directives.
> 
> Wouldn’t it work fine without any compile-time directive at all?

Yes, that's what I meant.  Though there are certainly cases in which the
difference without proper cacheline alignment is an order of magnitude
less throughput or something like that; it would certainly be noticeable.

>> I don't think lock-free lists are easier.  Bitmaps smaller than 64
>> elements are both faster and easier to manage.
> 
> I believe that this is only true if you use a linked list for both freelist
> management and for thread notification (i.e. to replace the bitmaps).
> However, if you use an atomic list only for the free list, and keep
> bitmaps for signaling, then performance is at least equal, often better.
> Plus you get the added benefit of having a thread-safe API, i.e.
> something that is truly lock-free.
> 
> I did a small experiment to test / prove this. Last commit on branch:
> https://github.com/c3d/recorder/commits/181122-xiao_guangdong_introduce-threaded-workqueue
> Take with a grain of salt, microbenchmarks are always suspect ;-)
> 
> The code in “thread_test.c” includes Xiao’s code with two variations,
> plus some testing code lifted from the flight recorder library.
> 1. The FREE_LIST variation (sl_test) is what I would like to propose.
> 2. The BITMAP variation (bm_test) is the baseline
> 3. The DOUBLE_LIST variation (ll_test) is the slow double-list approach
> 
> To run it, you need to do “make opt-test”, then run “test_script”
> which outputs a CSV file. The summary of my findings testing on
> a ThinkPad, a Xeon machine and a MacBook is here:
> https://imgur.com/a/4HmbB9K
> 
> Overall, the proposed approach:
> 
> - makes the API thread safe and lock free, addressing the one
> drawback that Xiao was mentioning.
> 
> - delivers up to 30% more requests on the Macbook, while being
> “within noise” (sometimes marginally better) for the other two.
> I suspect an optimization opportunity found by clang, because
> the Macbook delivers really high numbers.
> 
> - spends less time blocking when all threads are busy, which
> accounts for the higher number of client loops.
> 
> If you think that makes sense, then either Xiao can adapt the code
> from the branch above, or I can send a follow-up patch.

Having a follow-up patch would be best I think.  Thanks for
experimenting with this, it's always fun stuff. :)

Paolo