kernel panic caused by recent changes in fs/cifs

Chenglong Tang <chenglongtang@xxxxxxxxxx> · Thu, 11 Apr 2024 16:32:17 -0700

Hi, developers,

This is Chenglong Tang from the Google Container Optimized OS team. We
recently received a kernel panic report from a customer involving
cifs.

I am resending this email with more details from our investigation
over the past few days, and because the previous one was sent to the
wrong address.

Here are the steps we followed to try to reproduce the issue:

1. Deploy a GKE cluster with the same version as the customer,
   v1.26.13-gke.1144000.
2. Deploy the SMB CSI Driver DaemonSet:
   https://github.com/kubernetes-csi/csi-driver-smb/blob/master/docs/install-csi-driver-v1.12.0.md
3. Create an SMB server on the same cluster:
   https://github.com/kubernetes-csi/csi-driver-smb/tree/master/deploy/example/smb-provisioner
4. Create a client Pod that creates a PersistentVolume on the Samba
   server from (3):
   https://github.com/kubernetes-csi/csi-driver-smb/blob/master/deploy/example/deployment.yaml
5. Wait and check whether any kernel panics appear in the VM instance
   logs, or whether the node running the client Pod goes into the
   NotReady state.

Unfortunately, neither outcome in (5) occurred, so we were unable to
reproduce the issue locally. I am not sure why it shows up in the
customer's environment.

The DaemonSet for the CSI SMB node uses:

image: gcr.io/iaas-gcr-reg-prd-ad3d/vendor/sig-storage/smbplugin:v1.12.0

and the CSI SMB Controller uses:

image: gcr.io/iaas-gcr-reg-prd-ad3d/vendor/sig-storage/smbplugin:v1.12.0
image: gcr.io/iaas-gcr-reg-prd-ad3d/vendor/sig-storage/livenessprobe:v2.11.0
image: gcr.io/iaas-gcr-reg-prd-ad3d/vendor/sig-storage/csi-provisioner:v3.6.0

Based on this, it looks like this is a fairly outdated version of the
SMB CSI Controller, since 1.12.0 was released on Aug 11, 2023.

However, our systems should be fault tolerant and the kernel should
not panic or crash, so we should investigate this, as many other
customers might run into the same issue.

We are not able to move to newer kernel versions (we are currently on
v5.10 and v5.15).

The customer reports that the problem appeared after the kernel update
from v5.15.133 to v5.15.146.

Here are the recent kernel changes in cifs:

ded3cfd smb: client: fix OOB in smbCalcSize() by Paulo Alcantara · 4 months ago
bfd18c0 smb: client: fix OOB in SMB2_query_info_init() by Paulo Alcantara · 4 months ago
f47e3f6 smb: client: fix OOB in smb2_query_reparse_point() by Paulo Alcantara · 4 months ago
fd3951b smb: client: fix NULL deref in asn1_ber_decoder() by Paulo Alcantara · 4 months ago
6bbeb39 ksmbd: fix wrong name of SMB2_CREATE_ALLOCATION_SIZE by Namjae Jeon · 4 months ago
e5071ae smb: client: fix potential NULL deref in parse_dfs_referrals() by Paulo Alcantara · 4 months ago
d2bafe8 cifs: Fix non-availability of dedup breaking generic/304 by David Howells · 4 months ago
bb08df4 smb3: fix caching of ctime on setxattr by Steve French · 5 months ago
b4329a3 smb3: fix touch -h of symlink by Steve French · 6 months ago
4968c2a cifs: fix check of rc in function generate_smb3signingkey by Ekaterina Esina · 5 months ago
8d725bf cifs: spnego: add ';' in HOST_KEY_LEN by Anastasia Belova · 5 months ago
8e3cdab smb3: correct places where ENOTSUPP is used instead of preferred EOPNOTSUPP by Steve French · 7 months ago

Our initial diagnosis is that the problem is caused by memory
corruption. Decoding the "Code:" bytes from the panic with
scripts/decodecode gives the following:

chenglongtang@vm01:~/cos/src/third_party/kernel/v5.15$ echo "Code: d7
49 89 f4 48 89 fb e8 1c 42 fb d3 48 83 f8 03 72 5f 49 89 c5 8a 03 3c
5c 74 04 3c 2f 75 52 49 8b 3c 24 48 8b 05 9e d6 04 00 <48> 8b 30 e8 56
45 fb d3 85 c0 75 64 48 89 df be c0 0c 00 00 e8 55" |
scripts/decodecode
Code: d7 49 89 f4 48 89 fb e8 1c 42 fb d3 48 83 f8 03 72 5f 49 89 c5
8a 03 3c 5c 74 04 3c 2f 75 52 49 8b 3c 24 48 8b 05 9e d6 04 00 <48> 8b
30 e8 56 45 fb d3 85 c0 75 64 48 89 df be c0 0c 00 00 e8 55
All code
========
   0:   d7                      xlat   %ds:(%rbx)
   1:   49 89 f4                mov    %rsi,%r12
   4:   48 89 fb                mov    %rdi,%rbx
   7:   e8 1c 42 fb d3          call   0xffffffffd3fb4228
   c:   48 83 f8 03             cmp    $0x3,%rax
  10:   72 5f                   jb     0x71
  12:   49 89 c5                mov    %rax,%r13
  15:   8a 03                   mov    (%rbx),%al
  17:   3c 5c                   cmp    $0x5c,%al
  19:   74 04                   je     0x1f
  1b:   3c 2f                   cmp    $0x2f,%al
  1d:   75 52                   jne    0x71
  1f:   49 8b 3c 24             mov    (%r12),%rdi
  23:   48 8b 05 9e d6 04 00    mov    0x4d69e(%rip),%rax        # 0x4d6c8
  2a:*  48 8b 30                mov    (%rax),%rsi              <-- trapping instruction
  2d:   e8 56 45 fb d3          call   0xffffffffd3fb4588       # if (unlikely(strcmp(cp->charset, cache_cp->charset))) {
  32:   85 c0                   test   %eax,%eax
  34:   75 64                   jne    0x9a
  36:   48 89 df                mov    %rbx,%rdi
  39:   be c0 0c 00 00          mov    $0xcc0,%esi
  3e:   e8                      .byte 0xe8
  3f:   55                      push   %rbp
The trap happened just before the call to 0xffffffffd3fb4588 (strcmp):
mov (%rax),%rsi loads the second argument of strcmp, which is
cache_cp->charset. From the backtrace I posted, we can see that rax is
ffffffff00000000, so the pointer being dereferenced was corrupted and
that access is what faulted.
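
To make the register value concrete, here is a small user-space C
sketch (not kernel code) that splits the reported rax value into its
two 32-bit halves and checks whether it is a canonical x86-64 address.
The helper names are made up for illustration, and the canonical check
assumes 48-bit virtual addresses (4-level paging), which may not match
every configuration.

```c
#include <stdbool.h>
#include <stdint.h>

/* Value of rax as reported in the backtrace. */
#define REPORTED_RAX UINT64_C(0xffffffff00000000)

/* Hypothetical helper: upper 32 bits of a 64-bit value. */
static inline uint32_t upper_half(uint64_t v)
{
    return (uint32_t)(v >> 32);
}

/* Hypothetical helper: lower 32 bits of a 64-bit value. */
static inline uint32_t lower_half(uint64_t v)
{
    return (uint32_t)(v & UINT64_C(0xffffffff));
}

/*
 * Assumption: 48-bit virtual addresses. An address is canonical when
 * bits 63..47 are all zero or all one.
 */
static inline bool is_canonical_48(uint64_t v)
{
    uint64_t top = v >> 47; /* the 17 bits 63..47 */

    return top == 0 || top == UINT64_C(0x1ffff);
}
```

With REPORTED_RAX, upper_half() is 0xffffffff, lower_half() is 0, and
is_canonical_48() is true. So the value is in the kernel half of the
address space but has an all-zero low half, which is at least
consistent with a kernel pointer that was partially overwritten. It
does not prove that by itself.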

We checked that cache_cp cannot be NULL, and cache_cp->charset should
have a value, because load_nls_default will always return a non-NULL
value. That is why we believe the cause is some change related to
memory allocation/free that corrupted the static pointer or the table
it points to.
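
For anyone following along, the check around the trapping instruction
can be modelled in user space roughly as below. This is a sketch
reconstructed from the disassembly and the annotated strcmp line, not
a copy of the kernel source: the struct, variable and function names
are hypothetical stand-ins, and the NULL check on path is an addition
for user-space safety.

```c
#include <stddef.h>
#include <string.h>

/* Stand-in for the kernel's struct nls_table; only the first member,
 * charset, matters for this check. */
struct nls_table_model {
    const char *charset;
};

static struct nls_table_model utf8_model = { "utf8" };

/* Stand-in for the static cache_cp pointer. */
static struct nls_table_model *cache_cp_model = &utf8_model;

/*
 * Returns -1 for a path that is too short or does not start with
 * '\\' or '/', 0 when the charsets match, 1 when they differ.
 */
static int path_charset_check(const char *path,
                              const struct nls_table_model *cp)
{
    if (!path || strlen(path) < 3 || (*path != '\\' && *path != '/'))
        return -1;

    /*
     * Two loads feed strcmp's second argument: first the global
     * pointer (rax in the dump), then its charset member through
     * that pointer. The second load is the one that trapped.
     */
    return strcmp(cp->charset, cache_cp_model->charset) != 0;
}
```

In this model, a bad value in cache_cp_model itself is enough to fault
on the second load before strcmp is ever called, which matches where
the trap is reported.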

I have not been able to narrow this down any further and am still
working on it. Please feel free to share your thoughts with us.

Best,

Chenglong