On 12/5/18 1:16 AM, Paolo Bonzini wrote:
On 04/12/18 16:49, Christophe de Dinechin wrote:
Linux and QEMU's own qht work just fine with compile-time directives.
Wouldn’t it work fine without any compile-time directive at all?
Yes, that's what I meant. Though there are certainly cases in which the
difference without proper cacheline alignment is an order of magnitude
less throughput or something like that; it would certainly be noticeable.
I don't think lock-free lists are easier. Bitmaps smaller than 64
elements are both faster and easier to manage.
I believe that this is only true if you use a linked list for both freelist
management and for thread notification (i.e. to replace the bitmaps).
However, if you use an atomic list only for the free list, and keep
bitmaps for signaling, then performance is at least equal, often better.
Plus you get the added benefit of having a thread-safe API, i.e.
something that is truly lock-free.
I did a small experiment to test / prove this. Last commit on branch:
https://github.com/c3d/recorder/commits/181122-xiao_guangdong_introduce-threaded-workqueue
Take with a grain of salt, microbenchmarks are always suspect ;-)
The code in “thread_test.c” includes Xiao’s code with two variations,
plus some testing code lifted from the flight recorder library.
1. The FREE_LIST variation (sl_test) is what I would like to propose.
2. The BITMAP variation (bm_test) is the baseline
3. The DOUBLE_LIST variation (ll_test) is the slow double-list approach
To run it, you need to do “make opt-test”, then run “test_script”
which outputs a CSV file. The summary of my findings testing on
a ThinkPad, a Xeon machine and a MacBook is here:
https://imgur.com/a/4HmbB9K
Overall, the proposed approach:
- makes the API thread safe and lock free, addressing the one
drawback that Xiao was mentioning.
- delivers up to 30% more requests on the Macbook, while being
“within noise” (sometimes marginally better) for the other two.
I suspect an optimization opportunity found by clang, because
the Macbook delivers really high numbers.
- spends less time blocking when all threads are busy, which
accounts for the higher number of client loops.
If you think that makes sense, then either Xiao can adapt the code
from the branch above, or I can send a follow-up patch.
Having a follow-up patch would be best I think. Thanks for
experimenting with this, it's always fun stuff. :)
Yup, Christophe, please post the follow-up patches and add yourself
to the author list if you like. I am looking forward to it. :)
Thanks!