Re: ICE + XSK ZC - page faults on 6.1 LTS when process exits?

Alasdair McWilliam <alasdair.mcwilliam@xxxxxxxxxxx> · Thu, 5 Sep 2024 13:50:07 +0100

On 04/09/2024 11:30, Maciej Fijalkowski wrote:
> On Mon, Sep 02, 2024 at 04:09:33PM +0000, Alasdair McWilliam wrote:
>> Good evening,
>>
>> Looks like commit a62c50545b4d is the culprit.
>>
>> I've produced a production-grade build of kernel 6.1.95 with commit
>> a62c50545b4d backed out. Seems I can no longer trigger the fault. I can
>> kill -9 the process while pushing 50Gbps / 14Mpps and the process is
>> just restarted and resumes like it should.
>>
>> I'm going to back out the same commit from 6.1.106 for our production
>> builds and verify that fixes the issue there too.
>>
>> Can you advise if this will be reversed in future commits, or if you
>> have an alternate fix in the wings?
> 
> We've been working recently on somewhat related issues and it looks like
> not every commit from [0] has been backported.
> 
> $ git log --oneline v6.1.103..v6.1.104 drivers/net/ethernet/intel/ice/
> 5a80b682e3e1 ice: add missing WRITE_ONCE when clearing ice_rx_ring::xdp_prog
> 8782f0fcb19d ice: replace synchronize_rcu with synchronize_net
> 15115033f056 ice: don't busy wait for Rx queue disable in ice_qp_dis()
> 3dbc58774e58 ice: respect netif readiness in AF_XDP ZC related ndo's
> 
> can you apply the rest of it on top of 6.1.107 and see the result?

The first one I tried, d59227179949 ("ice: modify error handling when
setting XSK pool in ndo_bpf"), doesn't apply cleanly to 6.1.107.

It looks to have been written against code from around 6.8 or 6.9,
where the structure of routines like ice_qp_ena() has changed. That
change appears to have happened around a292ba981324 ("ice: make
ice_vsi_cfg_txq() static").

Should I try and apply a292ba981324 as well?
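
For anyone wanting to repeat the check, here is a minimal sketch of how
one might test whether a given commit applies to a stable checkout
without touching the tree. The helper name applies_cleanly is
hypothetical, and the sketch assumes a git checkout of the target tree
(e.g. v6.1.107) that also has the mainline commit objects fetched. Note
that "git apply --check" is stricter than "git cherry-pick", since it
does no three-way merge, so a failure here only means the patch context
does not match as-is.

```shell
# Hypothetical helper, not part of git or the kernel tree.
# Returns 0 if the diff of commit $1 applies to the current checkout
# as-is, non-zero otherwise. Nothing is modified: "git apply --check"
# only tests whether the patch would apply.
# Run it from the top of the source tree so every path is considered.
applies_cleanly() {
    git format-patch -1 --stdout "$1" | git apply --check 2>/dev/null
}
```

Run as "applies_cleanly d59227179949 && echo ok" from the top of the
6.1.107 tree; a non-zero status suggests a prerequisite commit is
missing or the patch needs a manual backport.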