Re: [PATCH 2/2 v2] scsi: target: tcmu: Fix crash in tcmu_flush_dcache_range on ARM

Bodo Stroesser <bstroesser@xxxxxxxxxxxxxx> · Thu, 18 Jun 2020 17:41:39 +0200

On 2020-06-18 17:00, Mike Christie wrote:
On 6/18/20 8:16 AM, Bodo Stroesser wrote:
This patch fixes the following crash
(see 
https://urldefense.com/v3/__https://bugzilla.kernel.org/show_bug.cgi?id=208045__;!!GqivPVa7Brio!O7JgZIL3VPAzIqwJfZjL48y8M90K3HXv3pUVeoCzZ-vXovCpx5g7xMg-z5aAiVXVtkfE$ 
)

  Process iscsi_trx (pid: 7496, stack limit = 0x0000000010dd111a)
  CPU: 0 PID: 7496 Comm: iscsi_trx Not tainted 4.19.118-0419118-generic
         #202004230533
  Hardware name: Greatwall QingTian DF720/F601, BIOS 601FBE20 Sep 26 2019
  pstate: 80400005 (Nzcv daif +PAN -UAO)
  pc : flush_dcache_page+0x18/0x40
  lr : is_ring_space_avail+0x68/0x2f8 [target_core_user]
  sp : ffff000015123a80
  x29: ffff000015123a80 x28: 0000000000000000
  x27: 0000000000001000 x26: ffff000023ea5000
  x25: ffffcfa25bbe08b8 x24: 0000000000000078
  x23: ffff7e0000000000 x22: ffff000023ea5001
  x21: ffffcfa24b79c000 x20: 0000000000000fff
  x19: ffff7e00008fa940 x18: 0000000000000000
  x17: 0000000000000000 x16: ffff2d047e709138
  x15: 0000000000000000 x14: 0000000000000000
  x13: 0000000000000000 x12: ffff2d047fbd0a40
  x11: 0000000000000000 x10: 0000000000000030
  x9 : 0000000000000000 x8 : ffffc9a254820a00
  x7 : 00000000000013b0 x6 : 000000000000003f
  x5 : 0000000000000040 x4 : ffffcfa25bbe08e8
  x3 : 0000000000001000 x2 : 0000000000000078
  x1 : ffffcfa25bbe08b8 x0 : ffff2d040bc88a18
  Call trace:
   flush_dcache_page+0x18/0x40
   is_ring_space_avail+0x68/0x2f8 [target_core_user]
   queue_cmd_ring+0x1f8/0x680 [target_core_user]
   tcmu_queue_cmd+0xe4/0x158 [target_core_user]
   __target_execute_cmd+0x30/0xf0 [target_core_mod]
   target_execute_cmd+0x294/0x390 [target_core_mod]
   transport_generic_new_cmd+0x1e8/0x358 [target_core_mod]
   transport_handle_cdb_direct+0x50/0xb0 [target_core_mod]
   iscsit_execute_cmd+0x2b4/0x350 [iscsi_target_mod]
   iscsit_sequence_cmd+0xd8/0x1d8 [iscsi_target_mod]
   iscsit_process_scsi_cmd+0xac/0xf8 [iscsi_target_mod]
   iscsit_get_rx_pdu+0x404/0xd00 [iscsi_target_mod]
   iscsi_target_rx_thread+0xb8/0x130 [iscsi_target_mod]
   kthread+0x130/0x138
   ret_from_fork+0x10/0x18
  Code: f9000bf3 aa0003f3 aa1e03e0 d503201f (f9400260)
  ---[ end trace 1e451c73f4266776 ]---

The solution is based on patch:

   "scsi: target: tcmu: Optimize use of flush_dcache_page"

which restricts the use of tcmu_flush_dcache_range() to
addresses from vmalloc'ed areas only.

This patch now replaces the virt_to_page() call in
tcmu_flush_dcache_range() - which is wrong for vmalloced addrs -
by vmalloc_to_page().

The patch was tested on ARM with kernel 4.19.118 and 5.7.2

Signed-off-by: Bodo Stroesser <bstroesser@xxxxxxxxxxxxxx>
Tested-by: JiangYu <lnsyyj@xxxxxxxxxxx>
Tested-by: Daniel Meyerholt <dxm523@xxxxxxxxx>
---
  drivers/target/target_core_user.c | 2 +-
  1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/drivers/target/target_core_user.c 
b/drivers/target/target_core_user.c
index a65e9671ae7a..835d3001cb0e 100644
--- a/drivers/target/target_core_user.c
+++ b/drivers/target/target_core_user.c
@@ -601,7 +601,7 @@ static inline void tcmu_flush_dcache_range(void 
*vaddr, size_t size)
      size = round_up(size+offset, PAGE_SIZE);
      while (size) {
-        flush_dcache_page(virt_to_page(start));
+        flush_dcache_page(vmalloc_to_page(start));
          start += PAGE_SIZE;
          size -= PAGE_SIZE;
      }

For this bug we only need to change the flush in is_ring_space_avail 
right? It's what is accessing the mb which is vmalloc'd.
No, is_ring_space_avail was just the first caller of
tcmu_flush_dcache_range for vmalloc'ed pages, so it crashed there.

The entiere address range exposed to userspace via uio consists of
two parts:
1) mb + command ring are vmalloc'ed in a single vzalloc call
   during initialization of a tcmu device.
2) the data area which is allocated page by page calling
   alloc_page()

The second part is handled by (scatter|gather)_data_area. For the
calls from these routines I think usage of virt_to_page in
tcmu_flush_dcache_range was fine. But patch number 1 of this
series replaced these called with direct calls to flush_dcache_page.

So all remaining calls to tcmu_flush_dcache_range after patch 1
are calls for vmalloc'ed addresses. Therefore after patch 1 we can
safely replace virt_to_page with vmalloc_to_page.

So it turned out that patch 1, which in the beginning was thought
to be an optimization, now also simplifies the crash fix.


Is it reccommended to call vmalloc_to_page on a page we get with 
alloc_page and is the mm then always going to do the right thing for us 
(I do not know the mm code well and just quickly scanned the 
Documentation and comments, but could not find anything), or could we 
hit similar issues where we are using the wrong call on different types 
of allocated memory down the line?