Re: [PATCH net-next] net/smc: Optimize the search method of reused buf_desc

Li Qiang <liqiang64@xxxxxxxxxx> · Mon, 4 Nov 2024 16:47:43 +0800

On 2024/11/4 16:13, Dust Li wrote:
> On 2024-11-02 14:43:52, Li Qiang wrote:
>>
>>
>> On 2024/11/1 18:52, Dust Li wrote:
>>> On 2024-11-01 16:23:42, liqiang wrote:
>>>> connections based on redis-benchmark (test in smc loopback-ism mode):
>>> ...
>>> ```
>>
>> I tested with nginx; the test commands are:
>> # server
>> smc_run nginx
>>
>> # client
>> smc_run wrk -t <2,4,8,16,32,64> -c 200 -H "Connection: close" http://127.0.0.1
>>
>> Requests/sec
>> --------+---------------+---------------+
>> req/s	| without patch	| with patch	|
>> --------+---------------+---------------+
>> -t 2	|6924.18	|7456.54	|
>> --------+---------------+---------------+
>> -t 4	|8731.68	|9660.33	|
>> --------+---------------+---------------+
>> -t 8	|11363.22	|13802.08	|
>> --------+---------------+---------------+
>> -t 16	|12040.12	|18666.69	|
>> --------+---------------+---------------+
>> -t 32	|11460.82	|17017.28	|
>> --------+---------------+---------------+
>> -t 64	|11018.65	|14974.80	|
>> --------+---------------+---------------+
>>
>> Transfer/sec
>> --------+---------------+---------------+
>> trans/s	| without patch	| with patch	|
>> --------+---------------+---------------+
>> -t 2	|24.72MB	|26.62MB	|
>> --------+---------------+---------------+
>> -t 4	|31.18MB	|34.49MB	|
>> --------+---------------+---------------+
>> -t 8	|40.57MB	|49.28MB	|
>> --------+---------------+---------------+
>> -t 16	|42.99MB	|66.65MB	|
>> --------+---------------+---------------+
>> -t 32	|40.92MB	|60.76MB	|
>> --------+---------------+---------------+
>> -t 64	|39.34MB	|53.47MB	|
>> --------+---------------+---------------+
>>
>>>
>>>>
>>>>    1. On the current version:
>>>>        [x.832733] smc_buf_get_slot cost:602 ns, walk 10 buf_descs
>>>>        [x.832860] smc_buf_get_slot cost:329 ns, walk 12 buf_descs
>>>>        [x.832999] smc_buf_get_slot cost:479 ns, walk 17 buf_descs
>>>>        [x.833157] smc_buf_get_slot cost:679 ns, walk 13 buf_descs
>>>>        ...
>>>>        [x.045240] smc_buf_get_slot cost:5528 ns, walk 196 buf_descs
>>>>        [x.045389] smc_buf_get_slot cost:4721 ns, walk 197 buf_descs
>>>>        [x.045537] smc_buf_get_slot cost:4075 ns, walk 198 buf_descs
>>>>        [x.046010] smc_buf_get_slot cost:6476 ns, walk 199 buf_descs
>>>>
>>>>    2. With this patch applied:
>>>>        [x.180857] smc_buf_get_slot_free cost:75 ns
>>>>        [x.181001] smc_buf_get_slot_free cost:147 ns
>>>>        [x.181128] smc_buf_get_slot_free cost:97 ns
>>>>        [x.181282] smc_buf_get_slot_free cost:132 ns
>>>>        [x.181451] smc_buf_get_slot_free cost:74 ns
>>>>
>>>> It can be seen from the data that it takes about 5~6us to traverse 200 
>>>
>>> Based on your data, I'm afraid the short-lived connection
>>> test won't show much benefit, since the time to complete an
>>> SMC-R connection should be several orders of magnitude larger
>>> than 100ns.
>>
>> Sorry, I didn't explain my test data well before.
>>
>> The main function optimized by this patch is as follows:
>>
>> ```
>> struct smc_buf_desc *smc_buf_get_slot(...)
>> {
>> 	struct smc_buf_desc *buf_slot;
>>        down_read(lock);
>>        list_for_each_entry(buf_slot, buf_list, list) {
>>                if (cmpxchg(&buf_slot->used, 0, 1) == 0) {
>>                        up_read(lock);
>>                        return buf_slot;
>>                }
>>        }
>>        up_read(lock);
>>        return NULL;
>> }
>> ```
>> ...
>>
>> The optimized code is as follows:
>>
>> ```
>> static struct smc_buf_desc *smc_buf_get_slot_free(struct llist_head *buf_llist)
>> {
>>        struct smc_buf_desc *buf_free;
>>        struct llist_node *llnode;
>>
>>        if (llist_empty(buf_llist))
>>                return NULL;
>>        // lock-less link list don't need an lock
>          ^^^ the kernel uses /* */ for comments

OK, I will change it. :-)

> 
>>        llnode = llist_del_first(buf_llist);
>>        buf_free = llist_entry(llnode, struct smc_buf_desc, llist);
> 
> If two CPUs both pass the llist_empty() check, only one CPU can get llnode;
> shouldn't the other one get NULL?

You are right. The earlier llist_empty() check is redundant and can be
removed. The function should be changed to:
```
static struct smc_buf_desc *smc_buf_get_slot_free(struct llist_head *buf_llist)
{
	struct smc_buf_desc *buf_free;
	struct llist_node *llnode;

	/* a lock-less linked list does not need a lock here */
	llnode = llist_del_first(buf_llist);
	if (!llnode)
		return NULL;
	buf_free = llist_entry(llnode, struct smc_buf_desc, llist);
	WRITE_ONCE(buf_free->used, 1);
	return buf_free;
}
```

If only one node is left in the linked list, multiple CPUs will compete
on the CAS instruction in llist_del_first(). In the end, only one consumer
gets the node, and the other consumers get a NULL pointer.
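
As a rough user-space illustration of that competition, a CAS-based pop
could look like the sketch below. This is not the kernel's llist code: it
uses C11 atomics, and struct node, struct stack, stack_push(), stack_pop()
and get_slot_free() are names made up for this example. It also does not
deal with ABA hazards, which a real multi-consumer pop has to consider.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stddef.h>

/* Hypothetical stand-ins for a buffer descriptor and a lock-less list head. */
struct node {
	struct node *next;
	int used;
};

struct stack {
	_Atomic(struct node *) first;
};

/* Push one node onto the head of the list with a CAS loop. */
static void stack_push(struct stack *s, struct node *n)
{
	struct node *old = atomic_load(&s->first);

	assert(n != NULL);
	do {
		n->next = old;
		/* On failure, old is refreshed with the current head. */
	} while (!atomic_compare_exchange_weak(&s->first, &old, n));
}

/*
 * Pop the first node. Returns NULL when the list is empty, which is also
 * what a consumer sees after losing the race for the last node.
 */
static struct node *stack_pop(struct stack *s)
{
	struct node *old = atomic_load(&s->first);

	while (old != NULL) {
		/* Only one caller can swing the head from old to old->next. */
		if (atomic_compare_exchange_weak(&s->first, &old, old->next))
			return old;
	}
	return NULL;
}

/* Same shape as the revised function: pop, check for NULL, mark as used. */
static struct node *get_slot_free(struct stack *s)
{
	struct node *n = stack_pop(s);

	if (!n)
		return NULL;
	n->used = 1;
	return n;
}
```

With a single node on the list, the first get_slot_free() call returns
that node with used set to 1, and any later call returns NULL.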

Thank you!

> 
>>        WRITE_ONCE(buf_free->used, 1);
>>        return buf_free;
>> }
>> ```

-- 
Best regards,
Li Qiang