On 04/12/18 16:49, Christophe de Dinechin wrote:
>> Linux and QEMU's own qht work just fine with compile-time directives.
>
> Wouldn’t it work fine without any compile-time directive at all?

Yes, that's what I meant.  Though there are certainly cases in which
the difference without proper cacheline alignment is an order of
magnitude less throughput or something like that; it would certainly
be noticeable.

>> I don't think lock-free lists are easier.  Bitmaps smaller than 64
>> elements are both faster and easier to manage.
>
> I believe that this is only true if you use a linked list for both freelist
> management and for thread notification (i.e. to replace the bitmaps).
> However, if you use an atomic list only for the free list, and keep
> bitmaps for signaling, then performance is at least equal, often better.
> Plus you get the added benefit of having a thread-safe API, i.e.
> something that is truly lock-free.
>
> I did a small experiment to test / prove this. Last commit on branch:
> https://github.com/c3d/recorder/commits/181122-xiao_guangdong_introduce-threaded-workqueue
> Take with a grain of salt, microbenchmarks are always suspect ;-)
>
> The code in “thread_test.c” includes Xiao’s code with two variations,
> plus some testing code lifted from the flight recorder library.
> 1. The FREE_LIST variation (sl_test) is what I would like to propose.
> 2. The BITMAP variation (bm_test) is the baseline.
> 3. The DOUBLE_LIST variation (ll_test) is the slow double-list approach.
>
> To run it, you need to do “make opt-test”, then run “test_script”,
> which outputs a CSV file. The summary of my findings testing on
> a ThinkPad, a Xeon machine and a MacBook is here:
> https://imgur.com/a/4HmbB9K
>
> Overall, the proposed approach:
>
> - makes the API thread safe and lock free, addressing the one
>   drawback that Xiao was mentioning.
>
> - delivers up to 30% more requests on the MacBook, while being
>   “within noise” (sometimes marginally better) for the other two.
>   I suspect an optimization opportunity found by clang, because
>   the MacBook delivers really high numbers.
>
> - spends less time blocking when all threads are busy, which
>   accounts for the higher number of client loops.
>
> If you think that makes sense, then either Xiao can adapt the code
> from the branch above, or I can send a follow-up patch.

Having a follow-up patch would be best, I think.  Thanks for
experimenting with this, it's always fun stuff. :)

Paolo
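
[For readers of the archive: a minimal sketch of the free-list idea
discussed above, not the code from Christophe's branch or Xiao's
series.  It shows a Treiber-stack style free list built on C11
atomics, with request slots padded to an assumed 64-byte cache line;
the struct layout and names are illustrative assumptions.  Thread
signaling would stay on the existing bitmaps; only free-slot
recycling goes through the atomic list, so acquiring and releasing a
slot never take a lock.  A production version would also have to
deal with ABA on the pop path, which this sketch ignores.]

    #include <stdatomic.h>
    #include <stddef.h>

    #define CACHE_LINE_SIZE 64          /* assumed, not probed at build time */

    struct request {
        struct request *next;           /* free-list link, unused while in flight */
        /* ... request payload ... */
    } __attribute__((aligned(CACHE_LINE_SIZE)));

    struct free_list {
        _Atomic(struct request *) head; /* top of the Treiber stack */
    };

    /* Return a completed slot to the free list (lock-free push). */
    static void free_list_push(struct free_list *fl, struct request *req)
    {
        struct request *old = atomic_load_explicit(&fl->head,
                                                   memory_order_relaxed);
        do {
            req->next = old;
        } while (!atomic_compare_exchange_weak_explicit(&fl->head, &old, req,
                                                        memory_order_release,
                                                        memory_order_relaxed));
    }

    /* Grab a free slot, or return NULL if none is available (lock-free pop).
     * Note: ignores ABA, acceptable only under assumptions stated above. */
    static struct request *free_list_pop(struct free_list *fl)
    {
        struct request *old = atomic_load_explicit(&fl->head,
                                                   memory_order_acquire);
        while (old &&
               !atomic_compare_exchange_weak_explicit(&fl->head, &old,
                                                      old->next,
                                                      memory_order_acquire,
                                                      memory_order_acquire)) {
            /* old was reloaded by the failed CAS; retry with the new head */
        }
        return old;
    }

[Both operations reduce to a single CAS loop, which is the point of
the FREE_LIST variation: the producer and workers keep the cheap
per-bit signaling, but slot recycling no longer needs a lock.]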