Re: [PATCH RFC 09/13] x86/irq: Install posted MSI notification handler

Thomas Gleixner <tglx@xxxxxxxxxxxxx> · Fri, 08 Dec 2023 12:52:49 +0100

On Thu, Dec 07 2023 at 20:46, Jacob Pan wrote:
> On Wed, 06 Dec 2023 20:50:24 +0100, Thomas Gleixner <tglx@xxxxxxxxxxxxx>
> wrote:
>> I don't understand what the whole copy business is about. It's
>> absolutely not required.
>
> My thinking is the following:
> The PIR cache line is contended between the CPU and the IOMMU, where the
> CPU can access PIR much faster. Nevertheless, when the IOMMU does an atomic
> swap of the PID (PIR included), the L1 cache line gets evicted. A subsequent
> CPU read or xchg then has to deal with an invalid, cold cache line.
>
> By making a copy of PIR as quickly as possible and clearing PIR with xchg,
> we minimize the chance that the IOMMU does an atomic swap in the middle.
> Therefore, we get fewer L1D misses.
>
> In the code above, it does read, xchg, and call_irq_handler() in a loop
> to handle the four 64-bit PIR words one at a time. The IOMMU has a greater
> chance to do an atomic xchg on the PIR cache line while call_irq_handler()
> is running. Therefore, it causes more L1D misses.

That makes sense and if we go there it wants to be documented.
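
Something along these lines, as a rough userspace sketch of the two-phase
idea rather than the actual handler. struct pir_sketch, pir_snapshot(),
pir_dispatch() and record_vector() are made-up names, C11 atomics stand in
for the kernel's xchg(), and __builtin_ctzll() assumes GCC or Clang:

```c
#include <stdatomic.h>
#include <stdbool.h>
#include <stdint.h>

#define PIR_WORDS 4	/* 256 PIR bits as four 64-bit words */

/* Hypothetical stand-in for the PIR part of the descriptor (one cache line). */
struct pir_sketch {
	_Alignas(64) _Atomic uint64_t pir[PIR_WORDS];
};

typedef void (*vector_handler_t)(unsigned int vector, void *ctx);

/*
 * Phase 1: copy and clear all PIR words back to back, so the contended
 * cache line is touched in one short burst. Returns true if any bit was set.
 */
static bool pir_snapshot(struct pir_sketch *pid, uint64_t copy[PIR_WORDS])
{
	bool pending = false;

	for (int i = 0; i < PIR_WORDS; i++) {
		copy[i] = 0;
		/* Assumption: a plain read first avoids a locked op on empty words. */
		if (!atomic_load_explicit(&pid->pir[i], memory_order_relaxed))
			continue;
		copy[i] = atomic_exchange(&pid->pir[i], 0);
		pending |= copy[i] != 0;
	}
	return pending;
}

/*
 * Phase 2: run the handlers from the private copy only. The shared PIR
 * cache line is not accessed here. Returns the number of vectors handled.
 */
static unsigned int pir_dispatch(const uint64_t copy[PIR_WORDS],
				 vector_handler_t handler, void *ctx)
{
	unsigned int handled = 0;

	for (int i = 0; i < PIR_WORDS; i++) {
		uint64_t bits = copy[i];

		while (bits) {
			unsigned int bit = __builtin_ctzll(bits);

			bits &= bits - 1;	/* clear lowest set bit */
			handler(i * 64 + bit, ctx);
			handled++;
		}
	}
	return handled;
}

/* Example handler: records each delivered vector in a 256-bit bitmap. */
static void record_vector(unsigned int vector, void *ctx)
{
	uint64_t *seen = ctx;

	seen[vector / 64] |= 1ULL << (vector % 64);
}
```

The point is that all locked operations on the contended line happen back
to back, and the handlers then run off the private copy. Bits posted while
the handlers run stay in PIR until the next snapshot; how to pick those up
is not covered here.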

> Without PIR copy:
>
> DMA memfill bandwidth: 4.944 Gbps
> Performance counter stats for './run_intr.sh 512 30':
>
>     77,313,298,506      L1-dcache-loads                                               (79.98%)
>          8,279,458      L1-dcache-load-misses     #    0.01% of all L1-dcache accesses  (80.03%)
>     41,654,221,245      L1-dcache-stores                                              (80.01%)
>             10,476      LLC-load-misses           #    0.31% of all LL-cache accesses  (79.99%)
>          3,332,748      LLC-loads                                                     (80.00%)
>
>       30.212055434 seconds time elapsed
>
>        0.002149000 seconds user
>       30.183292000 seconds sys
>
> With PIR copy:
> DMA memfill bandwidth: 5.029 Gbps
> Performance counter stats for './run_intr.sh 512 30':
>
>     78,327,247,423      L1-dcache-loads                                               (80.01%)
>          7,762,311      L1-dcache-load-misses     #    0.01% of all L1-dcache accesses  (80.01%)
>     42,203,221,466      L1-dcache-stores                                              (79.99%)
>             23,691      LLC-load-misses           #    0.67% of all LL-cache accesses  (80.01%)
>          3,561,890      LLC-loads                                                     (80.00%)
>
>       30.201065706 seconds time elapsed
>
>        0.005950000 seconds user
>       30.167885000 seconds sys

Interesting, though I'm not really convinced that this DMA memfill
microbenchmark resembles real workloads.
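
For scale, the deltas between the two runs quoted above are small on the
L1D side: roughly 1.7% more bandwidth and about 6% fewer L1D load misses,
while LLC load misses more than double (10,476 vs. 23,691). A trivial
helper to redo that arithmetic, with rel_change_pct() being a made-up name
for illustration:

```c
/*
 * Relative change from 'before' to 'after' in percent.
 * Positive means 'after' is larger. 'before' must be non-zero.
 */
static double rel_change_pct(double before, double after)
{
	return (after - before) / before * 100.0;
}

/* Numbers copied from the two perf runs quoted in this mail. */
static double bandwidth_change(void)
{
	return rel_change_pct(4.944, 5.029);		/* Gbps */
}

static double l1d_miss_change(void)
{
	return rel_change_pct(8279458.0, 7762311.0);	/* L1-dcache-load-misses */
}

static double llc_miss_change(void)
{
	return rel_change_pct(10476.0, 23691.0);	/* LLC-load-misses */
}
```

These are single runs with counter multiplexing at around 80%, so the
percentages are indicative at best.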

Did you test with something realistic, e.g. storage or networking, too?

Thanks,

        tglx