Currently, request->csd has type struct __call_single_data.

call_single_data_t is defined in include/linux/smp.h:

	/* Use __aligned() to avoid to use 2 cache lines for 1 csd */
	typedef struct __call_single_data call_single_data_t
		__aligned(sizeof(struct __call_single_data));

As the comment above the typedef suggests, having this struct split
between 2 cachelines means that 2 cachelines instead of 1 have to be
fetched / invalidated / bounced when the cpu receiving the request gets
to run the requested function. This is usually bad for performance, due
to one extra memory access and one extra cacheline in use.

Changing request->csd was previously attempted in commit 966a967116e6
("smp: Avoid using two cache lines for struct call_single_data"), but at
the time the union that contains csd was positioned near the top of
struct request, only below a struct list_head, and the alignment created
holes adding up to 24 extra bytes in struct request.

The struct size was restored by commit 4ccafe032005 ("block: unalign
call_single_data in struct request"), but that allowed csd to be split
between 2 cachelines again.

As an example, on a 64-bit machine with
	CONFIG_BLK_RQ_ALLOC_TIME=y
	CONFIG_BLK_WBT=y
	CONFIG_BLK_DEV_INTEGRITY=y
	CONFIG_BLK_INLINE_ENCRYPTION=y

pahole outputs:

struct request {
	[...]
	union {
		struct __call_single_data csd;	/*   240    32 */
		u64 fifo_time;			/*   240     8 */
	};					/*   240    32 */
	[...]
}

With this config, any cacheline size between 32 and 256 bytes causes csd
to be split between 2 cachelines: csd->node (16 bytes) in the first
cacheline, and csd->func (8 bytes) & csd->info (8 bytes) in the second.

During blk_mq_complete_send_ipi(), csd->func and csd->info get changed,
and when it calls __smp_call_single_queue(), csd->node gets changed.

On the cpu which got the request, csd->func and csd->info get read by
__flush_smp_call_function_queue() and csd->node gets changed by
csd_unlock(), meaning that both cachelines containing csd get accessed.
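
The split described above can be checked with a little arithmetic. The
following is only an illustrative sketch, not kernel code: both helper
names are hypothetical, and it assumes struct request itself starts on a
cacheline boundary.

```c
#include <assert.h>
#include <stddef.h>

/*
 * Hypothetical helper: number of cachelines touched by an object of
 * `size` bytes placed at byte `offset`, assuming the enclosing struct
 * starts on a cacheline boundary.
 */
static size_t cachelines_spanned(size_t offset, size_t size, size_t cacheline)
{
	assert(cacheline > 0 && size > 0);
	return (offset + size - 1) / cacheline - offset / cacheline + 1;
}

/*
 * Hypothetical helper: round `offset` up to the next multiple of
 * `align` (which must be a power of two), which is the effect an
 * alignment requirement has on a member's offset.
 */
static size_t align_up(size_t offset, size_t align)
{
	return (offset + align - 1) & ~(align - 1);
}
```

With offset 240 and size 32, cachelines_spanned() gives 2 for cacheline
sizes 32, 64, 128 and 256, and 1 for 512. Rounding the offset up to a
multiple of 32 with align_up() gives 256, where the same 32 bytes fit in
a single cacheline for every one of those sizes.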

To avoid this, it would be necessary to change request->csd back to
call_single_data_t, which may end up increasing the struct size.
(In the example above, it grew from 288 to 320 bytes, i.e. by 32 bytes.)

In order to keep call_single_data_t and avoid increasing the struct's
size, move request->csd to the end of the struct.
The rationale for this strategy is that, for cachelines >= 32 bytes, no
extra cacheline is ever used for struct request:
- If request->csd is already 32-byte aligned, there is no change in the
  object.
- If request->csd is not 32-byte aligned, and part of it is in a
  different cacheline, the whole csd is moved to that cacheline.
- If request->csd is not 32-byte aligned, but it is entirely contained
  in a single cacheline (> 32 bytes), aligning it to 32 just moves it a
  few bytes forward within that cacheline.
(In the example above, the change kept the struct's size at 288 bytes.)

Convert request->csd to call_single_data_t and move it to the end of
struct request, so csd is never split between cachelines and does not
use any extra cacheline.

Signed-off-by: Leonardo Bras <leobras@xxxxxxxxxx>
---
 include/linux/blk-mq.h | 10 +++++-----
 1 file changed, 5 insertions(+), 5 deletions(-)

diff --git a/include/linux/blk-mq.h b/include/linux/blk-mq.h
index 06caacd77ed6..50ef86172621 100644
--- a/include/linux/blk-mq.h
+++ b/include/linux/blk-mq.h
@@ -189,16 +189,16 @@ struct request {
 		} flush;
 	};
 
-	union {
-		struct __call_single_data csd;
-		u64 fifo_time;
-	};
-
 	/*
 	 * completion callback.
 	 */
 	rq_end_io_fn *end_io;
 	void *end_io_data;
+
+	union {
+		call_single_data_t csd;
+		u64 fifo_time;
+	};
 };
 
 static inline enum req_op req_op(const struct request *req)
-- 
2.40.1