> On Nov 11, 2023, at 7:42 AM, Jeff Layton <jlayton@xxxxxxxxxx> wrote: > > On Fri, 2023-11-10 at 11:28 -0500, Chuck Lever wrote: >> I've chased down the reported failures with krb5i/p. Essentially >> the DRC was broken for krb5i/p because of the extra crap those >> security flavors stick before and after the NFS headers. > > Huh, so the reported failures were due to retransmissions? Good question. This cover letter leaves out some of the story. I noticed test failures when running the git regression suite with 12 threads on a sec=krb5i mount point, proto=rdma. The failure was 100% reproducible. I saw some retransmits in the mountstats error counters. The initial trigger of the retransmissions is GSS sequence window underruns, which are rather frequent with this test set-up. The server is mandated to drop the request and connection after each detected underrun. At first I thought that getting rid of the underruns would fix the issue, but then realized that that's not possible to do completely. Then I realized that a simple retransmit is not supposed to result in vastly different application behavior (at least most of the time) and so dug into that. The hole I crawled out of at the end was that the DRC was busted. And not just one bug, but three separate issues. Once all three are addressed, the test passes on krb5i and krb5p. >> These patches do address bugs in stable kernels, but they will need >> to be massaged, tested, and applied by hand to get there. Voting now >> open on whether it's worth the trouble for us to do it now, or wait >> until someone complains about a particular LTS problem. I haven't been able to successfully build a kernel older than about v5.18 on my Fedora 38 systems. I'm thinking that, in general, we could test stable backports using kdevops with older Fedora vagrant boxes. But I haven't gotten the full kdevops environment to run here yet, and thus don't have a git regression workflow. And I guess we don't have RDMA working in this environment yet either. (siw might not work well or at all on older kernels). Maybe the connection losses could be triggered instead with disconnect injection. >> --- >> >> Chuck Lever (3): >> NFSD: Fix "start of NFS reply" pointer passed to nfsd_cache_update() >> NFSD: Update nfsd_cache_append() to use xdr_stream >> NFSD: Fix checksum mismatches in the duplicate reply cache >> >> >> fs/nfsd/cache.h | 4 +-- >> fs/nfsd/nfscache.c | 87 +++++++++++++++++++++++++++------------------- >> fs/nfsd/nfssvc.c | 14 ++++++-- >> 3 files changed, 65 insertions(+), 40 deletions(-) >> >> -- >> Chuck Lever >> > > Wow, nice work tracking those down. I'm sure that wasn't easy! The hardest part was not giving up. > Reviewed-by: Jeff Layton <jlayton@xxxxxxxxxx> Much appreciated! -- Chuck Lever