Re: [PATCH v2 0/6] mm/devm_memremap_pages: Fix page release race

Jane Chu <jane.chu@xxxxxxxxxx> · Thu, 16 May 2019 09:45:41 -0700

    Hi,

I'm able to reproduce the panic below by running two sets of ndctl
commands that actually serve legitimate purpose in parallel (unlike
the brute force experiment earlier), each set in a indefinite loop.
This time it takes about an hour to panic.  But I gather the cause
is probably the same: I've overlapped ndctl commands on the same
region.  

Could we add a check in nd_ioctl(), such that if there is
an ongoing ndctl command on a region, subsequent ndctl request
will fail immediately with something to the effect of EAGAIN?
The rationale being that kernel should protect itself against
user mistakes.

Also, sensing the subject fix is for a different problem, and has been
verified, I'm happy to see it in upstream, so we have a better 
code base to digger deeper in terms of how the destructive ndctl
commands interacts to typical mission critical applications, include
but not limited to rdma.

thanks,
-jane

    On 5/14/2019 2:18 PM, Jane Chu wrote:

    On
      5/14/2019 12:04 PM, Dan Williams wrote:

      On Tue, May 14, 2019 at 11:53 AM Jane Chu
        <jane.chu@xxxxxxxxxx> wrote:

        On 5/13/2019 12:22 PM, Logan Gunthorpe
          wrote:

          On 2019-05-08 11:05 a.m., Logan Gunthorpe wrote:

          On 2019-05-07 5:55 p.m., Dan Williams wrote:

          Changes since v1 [1]:

          - Fix a NULL-pointer deref crash in pci_p2pdma_release()
          (Logan)

          - Refresh the p2pdma patch headers to match the format of
          other p2pdma

              patches (Bjorn)

          - Collect Ira's reviewed-by

          [1]:
https://lore.kernel.org/lkml/155387324370.2443841.574715745262628837.stgit@xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx/

          This series looks good to me:

          Reviewed-by: Logan Gunthorpe <logang@xxxxxxxxxxxx>

          However, I haven't tested it yet but I intend to later this
          week.

          I've tested libnvdimm-pending which includes this series on my
          setup and

          everything works great.

          Just wondering in a difference scenario where pmem pages are
          exported to

          a KVM guest, and then by mistake the user issues "ndctl
          destroy-namespace -f",

          will the kernel wait indefinitely until the user figures out
          to kill the guest

          and release the pmem pages?

        It depends on whether the pages are pinned. Typically DAX memory

        mappings assigned to a guest are not pinned in the host and can
        be

        invalidated at any time. The pinning only occurs with VFIO and

        device-assignment which isn't the common case, especially since
        that

        configuration is blocked by fsdax. However, with devdax, yes you
        can

        arrange for the system to go into an indefinite wait.

        This somewhat ties back to the get_user_pages() vs DAX debate.
        The

        indefinite stall issue with device-assignment could be addressed
        with

        a requirement to hold a lease and expect that a lease revocation
        event

        may escalate to SIGKILL in response to 'ndctl
        destroy-namespace'. The

        expectation with device-dax is that it is already a raw
        interface with

        pointy edges and caveats, but I would not be opposed to
        introducing a

        lease semantic.

      Thanks for the quick response Dan.

      I am not convinced that the get_user_pages() vs FS-DAX dilemma is
      a perfect

      comparison to "ndctl destroy-namespace -f" vs namespace-is-busy
      dilemma.

      Others might disagree with me, I thought that there is no risk of
      panic

      if we fail "ndctl destroy-namespace -f" to honor a clean shutdown
      of the

      user application. Also, both actions are on the same host, so in
      theory

      the admin could shutdown the application before attempt a
      destructive

      action.

      By allowing 'opposite' actions in competition with each other at
      fine

      granularity, there is potential for panic in general, not
      necessarily with

      pinned page I guess.  I just ran an experiment and panic'd the
      system.

      So, as Optane DCPMEM is generally for server/cloud deployment, and
      as

      RAS is a priority for server over administrative commands, to
      allow

      namespace management command to panic kernel is not an option?

      Here is my stress experiment -

        Start out with ./create_nm.sh to create as many 48G devdax
      namespaces

      as possible. Once that's completed, firing up 6 actions in quick

      successions in below order:

        -> ndctl destroy-namespace all -f

        -> ./create_nm.sh

        -> ndctl destroy-namespace all -f

        -> ./create_nm.sh

        -> ndctl destroy-namespace all -f

        -> ./create_nm.sh

      ==========  console message =======

      Kernel 5.1.0-rc7-next-20190501-libnvdimm-pending on an x86_64

      ban25uut130 login: [ 1620.866813] BUG: kernel NULL pointer
      dereference, address: 0000000000000020

      [ 1620.874585] #PF: supervisor read access in kernel mode

      [ 1620.880319] #PF: error_code(0x0000) - not-present page

      [ 1620.886052] PGD 0 P4D 0

      [ 1620.888879] Oops: 0000 [#1] SMP NOPTI

      [ 1620.892964] CPU: 19 PID: 5611 Comm: kworker/u130:3 Tainted:
      G        W         5.1.0-rc7-next-20190501-libnvdimm-pending #5

      [ 1620.905389] Hardware name: Oracle Corporation ORACLE SERVER
      X8-2L/ASM,MTHRBD,2U, BIOS 52020101 05/07/2019

      [ 1620.916069] Workqueue: events_unbound async_run_entry_fn

      [ 1620.921997] RIP: 0010:klist_put+0x1b/0x6c

      [ 1620.926471] Code: 48 8b 43 08 5b 41 5c 41 5d 41 5e 41 5f 5d c3
      55 48 89 e5 41 56 41 89 f6 41 55 41 54 53 4c 8b 27 48 89 fb 49 83
      e4 fe 4c 89 e7 <4d> 8b 6c 24 20 e8 3a d4 01 00 45 84 f6 74
      10 48 8b 03 a8 01 74 02

      [ 1620.947427] RSP: 0018:ffffb1a5e6727da0 EFLAGS: 00010246

      [ 1620.953258] RAX: ffff956796604c00 RBX: ffff956796604c28 RCX:
      0000000000000000

      [ 1620.961223] RDX: ffff955000c2c4d8 RSI: 0000000000000001 RDI:
      0000000000000000

      [ 1620.969185] RBP: ffffb1a5e6727dc0 R08: 0000000000000002 R09:
      ffffffffbb54b3c0

      [ 1620.977150] R10: ffffb1a5e6727d40 R11: fefefefefefefeff R12:
      0000000000000000

      [ 1620.985116] R13: ffff94d18dcfd000 R14: 0000000000000001 R15:
      ffff955000caf140

      [ 1620.993081] FS:  0000000000000000(0000)
      GS:ffff95679f4c0000(0000) knlGS:0000000000000000

      [ 1621.002113] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033

      [ 1621.008524] CR2: 0000000000000020 CR3: 0000009fa100a005 CR4:
      00000000007606e0

      [ 1621.016487] DR0: 0000000000000000 DR1: 0000000000000000 DR2:
      0000000000000000

      [ 1621.024450] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7:
      0000000000000400

      [ 1621.032413] PKRU: 55555554

      [ 1621.035433] Call Trace:

      [ 1621.038161]  klist_del+0xe/0x10

      [ 1621.041667]  device_del+0x8a/0x2c9

      [ 1621.045463]  ? __switch_to_asm+0x34/0x70

      [ 1621.049840]  ? __switch_to_asm+0x40/0x70

      [ 1621.054220]  device_unregister+0x44/0x4f

      [ 1621.058603]  nd_async_device_unregister+0x22/0x2d [libnvdimm]

      [ 1621.065016]  async_run_entry_fn+0x47/0x15a

      [ 1621.069588]  process_one_work+0x1a2/0x2eb

      [ 1621.074064]  worker_thread+0x1b8/0x26e

      [ 1621.078239]  ? cancel_delayed_work_sync+0x15/0x15

      [ 1621.083490]  kthread+0xf8/0xfd

      [ 1621.086897]  ? kthread_destroy_worker+0x45/0x45

      [ 1621.091954]  ret_from_fork+0x1f/0x40

      [ 1621.095944] Modules linked in: xt_REDIRECT xt_nat xt_CHECKSUM
      iptable_mangle xt_MASQUERADE xt_conntrack ipt_REJECT
      nf_reject_ipv4 tun bridge stp llc ebtable_filter ebtables
      ip6table_filter iptable_filter scsi_transport_iscsi ip6table_nat
      ip6_tables iptable_nat nf_nat nf_conntrack nf_defrag_ipv6
      nf_defrag_ipv4 vfat fat skx_edac intel_powerclamp coretemp
      kvm_intel kvm irqbypass crct10dif_pclmul crc32_pclmul
      ghash_clmulni_intel iTCO_wdt iTCO_vendor_support aesni_intel
      ipmi_si crypto_simd cryptd glue_helper ipmi_devintf
      ipmi_msghandler sg pcspkr dax_pmem_compat device_dax dax_pmem_core
      i2c_i801 pcc_cpufreq lpc_ich ioatdma wmi nfsd auth_rpcgss nfs_acl
      lockd grace sunrpc ip_tables xfs libcrc32c nd_pmem nd_btt sr_mod
      cdrom sd_mod mgag200 drm_kms_helper syscopyarea crc32c_intel
      sysfillrect sysimgblt fb_sys_fops ttm megaraid_sas drm igb ahci
      libahci ptp libata pps_core dca i2c_algo_bit nfit libnvdimm uas
      usb_storage dm_mirror dm_region_hash dm_log dm_mod

      [ 1621.189449] CR2: 0000000000000020

      [ 1621.193169] ---[ end trace 7c3f7029ef24aa5a ]---

      [ 1621.305383] RIP: 0010:klist_put+0x1b/0x6c

      [ 1621.309860] Code: 48 8b 43 08 5b 41 5c 41 5d 41 5e 41 5f 5d c3
      55 48 89 e5 41 56 41 89 f6 41 55 41 54 53 4c 8b 27 48 89 fb 49 83
      e4 fe 4c 89 e7 <4d> 8b 6c 24 20 e8 3a d4 01 00 45 84 f6 74
      10 48 8b 03 a8 01 74 02

      [ 1621.330809] RSP: 0018:ffffb1a5e6727da0 EFLAGS: 00010246

      [ 1621.336642] RAX: ffff956796604c00 RBX: ffff956796604c28 RCX:
      0000000000000000

      [ 1621.344606] RDX: ffff955000c2c4d8 RSI: 0000000000000001 RDI:
      0000000000000000

      [ 1621.352570] RBP: ffffb1a5e6727dc0 R08: 0000000000000002 R09:
      ffffffffbb54b3c0

      [ 1621.360533] R10: ffffb1a5e6727d40 R11: fefefefefefefeff R12:
      0000000000000000

      [ 1621.368496] R13: ffff94d18dcfd000 R14: 0000000000000001 R15:
      ffff955000caf140

      [ 1621.376460] FS:  0000000000000000(0000)
      GS:ffff95679f4c0000(0000) knlGS:0000000000000000

      [ 1621.385490] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033

      [ 1621.391902] CR2: 0000000000000020 CR3: 0000009fa100a005 CR4:
      00000000007606e0

      [ 1621.399867] DR0: 0000000000000000 DR1: 0000000000000000 DR2:
      0000000000000000

      [ 1621.407830] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7:
      0000000000000400

      [ 1621.415793] PKRU: 55555554

      [ 1621.418814] Kernel panic - not syncing: Fatal exception

      [ 1621.424740] Kernel Offset: 0x39000000 from 0xffffffff81000000
      (relocation range: 0xffffffff80000000-0xffffffffbfffffff)

      [ 1621.550711] ---[ end Kernel panic - not syncing: Fatal
      exception ]---

      Thanks!

      -jane

      _______________________________________________

      Linux-nvdimm mailing list

      Linux-nvdimm@xxxxxxxxxxxx

      https://lists.01.org/mailman/listinfo/linux-nvdimm