Re: General protection faults in cifs_reconnect with Azure Files

Luke Carrier <luke@xxxxxxxxxx> · Sat, 6 Jul 2019 19:23:27 +0100

Hi Pavel and Steve,

We've been able to reproduce the cifs_reconnect oops in all 5.2-rc7
through rc4 and 5.2-rc1 and again in 5.1.

We're going to try 5.0.27 with the nohandlecache option (since 5.1
suffers from the cifs_reconnect fault) and will report back.

Thanks for your help, much appreciated.
Luke

On Sat, 6 Jul 2019 at 17:49, Pavel Shilovsky <piastryyy@xxxxxxxxx> wrote:
>
> сб, 6 июл. 2019 г. в 05:32, Luke Carrier <luke@xxxxxxxxxx>:
> >
> > Hi Pavel,
> >
> > Thanks for the prompt response.
> >
> > Under the 5.1.0 kernel we observed slightly different oopses, again
> > during reconnection:
> >
> > [   64.118926] CIFS VFS: Send error in SessSetup = -11
> > [  107.638988] CIFS VFS: Send error in SessSetup = -11
> > [  138.487142] CIFS VFS: Send error in SessSetup = -11
> > [  148.599239] CIFS VFS: Send error in SessSetup = -11
> > [  186.685967] CIFS VFS: Send error in SessSetup = -11
> > [  193.450272] CIFS VFS: Send error in SessSetup = -11
> > [  364.918259] INFO: task kworker/3:2:271 blocked for more than 120 seconds.
> > [  364.924765]       Not tainted 5.1.0-050100-generic #201905052130
> > [  364.931458] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs"
> > disables this message.
> > [  364.939006] kworker/3:2     D    0   271      2 0x80000000
> > [  364.939043] Workqueue: cifsiod smb2_reconnect_server [cifs]
> > [  364.939049] Call Trace:
> > [  364.939057]  __schedule+0x2d3/0x840
> > [  364.939061]  ? wake_up_klogd+0x34/0x40
> > [  364.939063]  schedule+0x2c/0x70
> > [  364.939066]  schedule_preempt_disabled+0xe/0x10
> > [  364.939071]  __mutex_lock.isra.10+0x2e4/0x4c0
> > [  364.939074]  ? printk+0x58/0x6f
> > [  364.939079]  __mutex_lock_slowpath+0x13/0x20
> > [  364.939081]  mutex_lock+0x2c/0x30
> > [  364.939098]  smb2_reconnect.part.22+0xea/0x7c0 [cifs]
> > [  364.939103]  ? lock_timer_base+0x6b/0x90
> > [  364.939116]  ? cifs_put_tcon.part.45+0x162/0x1e0 [cifs]
> > [  364.939149]  smb2_reconnect_server+0x190/0x2a0 [cifs]
> > [  364.939154]  process_one_work+0x20f/0x410
> > [  364.939155]  worker_thread+0x34/0x400
> > [  364.939157]  kthread+0x120/0x140
> > [  364.939158]  ? process_one_work+0x410/0x410
> > [  364.939160]  ? __kthread_parkme+0x70/0x70
> > [  364.939161]  ret_from_fork+0x35/0x40
> > [  364.939184] INFO: task php-fpm7.2:1913 blocked for more than 120 seconds.
> > [  364.944595]       Not tainted 5.1.0-050100-generic #201905052130
> > [  364.949987] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs"
> > disables this message.
> > [  364.957078] php-fpm7.2      D    0  1913   1325 0x00000000
> > [  364.957079] Call Trace:
> > [  364.957083]  __schedule+0x2d3/0x840
> > [  364.957086]  ? __d_alloc+0x122/0x1d0
> > [  364.957090]  schedule+0x2c/0x70
> > [  364.957094]  d_alloc_parallel+0x3df/0x490
> > [  364.957096]  ? __switch_to_asm+0x34/0x70
> > [  364.957100]  ? wake_up_q+0x80/0x80
> > [  364.957105]  __lookup_slow+0x71/0x150
> > [  364.957107]  lookup_slow+0x3a/0x60
> > [  364.957108]  walk_component+0x1bf/0x330
> > [  364.957110]  ? link_path_walk.part.42+0x61/0x540
> > [  364.957112]  path_lookupat.isra.45+0x6d/0x220
> > [  364.957114]  filename_lookup.part.61+0xa0/0x170
> > [  364.957118]  ? __check_object_size+0x166/0x192
> > [  364.957121]  ? strncpy_from_user+0x56/0x1b0
> > [  364.957123]  user_path_at_empty+0x3e/0x50
> > [  364.957125]  vfs_statx+0x76/0xe0
> > [  364.957126]  __do_sys_newstat+0x3d/0x70
> > [  364.957129]  ? handle_mm_fault+0xe1/0x210
> > [  364.957133]  ? __do_page_fault+0x259/0x4b0
> > [  364.957136]  __x64_sys_newstat+0x16/0x20
> > [  364.957141]  do_syscall_64+0x5a/0x110
> > [  364.957145]  entry_SYSCALL_64_after_hwframe+0x44/0xa9
> > [  364.957165] RIP: 0033:0x7f3a2ff5f775
> > [  364.957171] Code: Bad RIP value.
> > [  364.957171] RSP: 002b:00007ffdada63d48 EFLAGS: 00000246 ORIG_RAX:
> > 0000000000000004
> > [  364.957173] RAX: ffffffffffffffda RBX: 0000000000000002 RCX: 00007f3a2ff5f775
> > [  364.957173] RDX: 00007ffdada63ed0 RSI: 00007ffdada63ed0 RDI: 00007f3a16eeff78
> > [  364.957174] RBP: 00007f3a16eeff78 R08: 000000000000ffff R09: 0000000000000020
> > [  364.957174] R10: 00007f3a2ffdd2e0 R11: 0000000000000246 R12: 0000000000000000
> > [  364.957175] R13: 00007ffdada63ed0 R14: 00000000fffffffb R15: 0000000000000000
> > [  364.957179] INFO: task php-fpm7.2:1926 blocked for more than 120 seconds.
> > [  364.963025]       Not tainted 5.1.0-050100-generic #201905052130
> > [  364.968886] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs"
> > disables this message.
> > [  364.976107] php-fpm7.2      D    0  1926   1325 0x00000000
> > [  364.976109] Call Trace:
> > [  364.976113]  __schedule+0x2d3/0x840
> > [  364.976115]  schedule+0x2c/0x70
> > [  364.976119]  schedule_preempt_disabled+0xe/0x10
> > [  364.976123]  __mutex_lock.isra.10+0x2e4/0x4c0
> > [  364.976127]  ? del_timer_sync+0x39/0x40
> > [  364.976130]  __mutex_lock_slowpath+0x13/0x20
> > [  364.976132]  mutex_lock+0x2c/0x30
> > [  364.976188]  smb2_reconnect.part.22+0xea/0x7c0 [cifs]
> > [  364.976210]  ? wait_woken+0x80/0x80
> > [  364.976223]  smb2_plain_req_init+0x3e/0x270 [cifs]
> > [  364.976238]  SMB2_open_init+0x69/0x760 [cifs]
> > [  364.976249]  ? cifs_get_inode_info+0x28f/0xb30 [cifs]
> > [  364.976263]  open_shroot+0x21e/0x4a0 [cifs]
> > [  364.976277]  smb2_query_path_info+0x133/0x1f0 [cifs]
> > [  364.976291]  ? smb2_query_path_info+0x133/0x1f0 [cifs]
> > [  364.976294]  ? _cond_resched+0x19/0x30
> > [  364.976297]  ? kmem_cache_alloc_trace+0x150/0x1d0
> > [  364.976308]  cifs_get_inode_info+0x2df/0xb30 [cifs]
> > [  364.976319]  ? build_path_from_dentry_optional_prefix+0xc4/0x400 [cifs]
> > [  364.976351]  cifs_revalidate_dentry_attr+0xd5/0x360 [cifs]
> > [  364.976362]  cifs_getattr+0x5a/0x1a0 [cifs]
> > [  364.976365]  vfs_getattr_nosec+0x98/0xc0
> > [  364.976366]  vfs_getattr+0x36/0x40
> > [  364.976368]  vfs_statx+0x8d/0xe0
> > [  364.976369]  __do_sys_newstat+0x3d/0x70
> > [  364.976370]  __x64_sys_newstat+0x16/0x20
> > [  364.976372]  do_syscall_64+0x5a/0x110
> > [  364.976374]  entry_SYSCALL_64_after_hwframe+0x44/0xa9
> > [  364.976375] RIP: 0033:0x7f3a2ff5f775
> > [  364.976377] Code: Bad RIP value.
> > [  364.976378] RSP: 002b:00007ffdada62ec8 EFLAGS: 00000246 ORIG_RAX:
> > 0000000000000004
> > [  364.976379] RAX: ffffffffffffffda RBX: 00007f3a16eeff78 RCX: 00007f3a2ff5f775
> > [  364.976379] RDX: 00007ffdada62ef0 RSI: 00007ffdada62ef0 RDI: 00007ffdada62f80
> > [  364.976380] RBP: 0000000000000021 R08: 0000000000c3dfc8 R09: 000000000000017b
> > [  364.976381] R10: 00007ffdada60e10 R11: 0000000000000246 R12: 00007ffdada62f81
> > [  364.976381] R13: 0000000000000000 R14: 00007ffdada62f99 R15: 00007ffdada62f80
> > [  364.976384] INFO: task php-fpm7.2:1933 blocked for more than 120 seconds.
> > [  364.981953]       Not tainted 5.1.0-050100-generic #201905052130
> > [  364.986915] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs"
> > disables this message.
> > [  364.993572] php-fpm7.2      D    0  1933   1325 0x80000000
> > [  364.993574] Call Trace:
> > [  364.993578]  __schedule+0x2d3/0x840
> > [  364.993598]  schedule+0x2c/0x70
> > [  364.993600]  schedule_preempt_disabled+0xe/0x10
> > [  364.993601]  __mutex_lock.isra.10+0x2e4/0x4c0
> > [  364.993603]  __mutex_lock_slowpath+0x13/0x20
> > [  364.993605]  mutex_lock+0x2c/0x30
> > [  364.993617]  cifs_mark_open_files_invalid+0x5b/0xa0 [cifs]
> > [  364.993633]  smb2_reconnect.part.22+0x131/0x7c0 [cifs]
> > [  364.993638]  ? kmem_cache_alloc_trace+0x150/0x1d0
> > [  364.993641]  ? char2uni+0x29/0x70 [nls_utf8]
> > [  364.993658]  smb2_plain_req_init+0x3e/0x270 [cifs]
> > [  364.993673]  SMB2_open_init+0x69/0x760 [cifs]
> > [  364.993687]  ? cifs_convert_path_to_utf16+0x7a/0xb0 [cifs]
> > [  364.993701]  smb2_compound_op+0x189/0x1930 [cifs]
> > [  364.993706]  ? prep_new_page+0x99/0x130
> > [  364.993709]  ? get_page_from_freelist+0xeed/0x1350
> > [  364.993711]  ? __alloc_pages_nodemask+0x16a/0x330
> > [  364.993724]  smb2_query_path_info+0xc8/0x1f0 [cifs]
> > [  364.993737]  ? smb2_query_path_info+0xc8/0x1f0 [cifs]
> > [  364.993740]  ? _cond_resched+0x19/0x30
> > [  364.993741]  ? kmem_cache_alloc_trace+0x150/0x1d0
> > [  364.993752]  cifs_get_inode_info+0x2df/0xb30 [cifs]
> > [  364.993753]  ? kmem_cache_alloc+0x15c/0x1c0
> > [  364.993795]  ? build_path_from_dentry_optional_prefix+0x12f/0x400 [cifs]
> > [  364.993819]  cifs_lookup+0xf3/0x7b0 [cifs]
> > [  364.993823]  __lookup_slow+0x9b/0x150
> > [  364.993825]  lookup_slow+0x3a/0x60
> > [  364.993826]  walk_component+0x1bf/0x330
> > [  364.993828]  ? link_path_walk.part.42+0x61/0x540
> > [  364.993830]  path_lookupat.isra.45+0x6d/0x220
> > [  364.993833]  filename_lookup.part.61+0xa0/0x170
> > [  364.993835]  ? __check_object_size+0x166/0x192
> > [  364.993837]  ? strncpy_from_user+0x56/0x1b0
> > [  364.993839]  user_path_at_empty+0x3e/0x50
> > [  364.993841]  vfs_statx+0x76/0xe0
> > [  364.993843]  __do_sys_newstat+0x3d/0x70
> > [  364.993845]  __x64_sys_newstat+0x16/0x20
> > [  364.993847]  do_syscall_64+0x5a/0x110
> > [  364.993850]  entry_SYSCALL_64_after_hwframe+0x44/0xa9
> > [  364.993851] RIP: 0033:0x7f3a2ff5f775
> > [  364.993862] Code: Bad RIP value.
> > [  364.993862] RSP: 002b:00007ffdada63d48 EFLAGS: 00000246 ORIG_RAX:
> > 0000000000000004
> > [  364.993864] RAX: ffffffffffffffda RBX: 0000000000000002 RCX: 00007f3a2ff5f775
> > [  364.993865] RDX: 00007ffdada63ed0 RSI: 00007ffdada63ed0 RDI: 00007f3a16eeff78
> > [  364.993866] RBP: 00007f3a16eeff78 R08: 000000000000ffff R09: 0000000000000020
> > [  364.993867] R10: 00007f3a2ffdd2e0 R11: 0000000000000246 R12: 0000000000000000
> > [  364.993867] R13: 00007ffdada63ed0 R14: 00000000fffffffb R15: 0000000000000000
> >
> > We had been experiencing different faults under 4.18 (the current stable in
> > Ubuntu 18.04 LTS), but had never seen a call trace that directly implicated
> > memory management. If it's worth sharing these oopses I can, but I suspect
> > it's better to work backwards through the Ubuntu mainline kernel images to
> > try and isolate this fault than to dig through fairly ancient kernels?
> >
> > Cheers,
> > Luke
> >
>
> Hi Luke,
>
> The issue you are observing on the 5.1 kernel seems similar to the
> issue in the parallel thread in the mailing list. So, this might be a
> deadlock problem. Please try to mount with "nohandlecache" mount
> option to see if it helps.
>
> It also worth to mention, that the problem reported on 5.2-rc kernel
> was a different one and we still don't know what may be a reason for
> that.
>
> About digging old kernels - I agree, better not to do it. It is easier
> to debug the current kernel and to work backward through the images.
>
> --
> Best regards,
> Pavel Shilovsky