On 28/07/2022 16:54, Robert Beckett wrote:
On 28/07/2022 15:03, Tvrtko Ursulin wrote:
On 28/07/2022 09:01, Patchwork wrote:
[snip]
Possible regressions
* igt@gem_mmap_offset@clear:
  o shard-iclb: PASS
    <https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_11946/shard-iclb6/igt@gem_mmap_offset@xxxxxxxxxx>
    -> INCOMPLETE
    <https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_106589v6/shard-iclb1/igt@gem_mmap_offset@xxxxxxxxxx>
What was supposed to be a simple patch.. a storm of errors like:
yeah, them's the breaks sometimes ....
DMAR: ERROR: DMA PTE for vPFN 0x3d00000 already set (to 2fd7ff003 not 2fd7ff003)
------------[ cut here ]------------
WARNING: CPU: 6 PID: 1254 at drivers/iommu/intel/iommu.c:2278 __domain_mapping.cold.93+0x32/0x39
Modules linked in: vgem drm_shmem_helper snd_hda_codec_hdmi snd_hda_codec_realtek snd_hda_cod>
CPU: 6 PID: 1254 Comm: gem_mmap_offset Not tainted 5.19.0-rc8-Patchwork_106589v6-g0e9c43d76a14+ #>
Hardware name: Intel Corporation Ice Lake Client Platform/IceLake U DDR4 SODIMM PD RVP TLC, BIOS >
RIP: 0010:__domain_mapping.cold.93+0x32/0x39
Code: fe 48 c7 c7 28 32 37 82 4c 89 5c 24 08 e8 e4 61 fd ff 8b 05 bf 8e c9 00 4c 8b 5c 24 08 85 c>
RSP: 0000:ffffc9000037f9c0 EFLAGS: 00010202
RAX: 0000000000000004 RBX: ffff8881117b4000 RCX: 0000000000000001
RDX: 0000000000000000 RSI: ffffffff82320b25 RDI: 00000000ffffffff
RBP: 0000000000000001 R08: 0000000000000000 R09: c0000000ffff7fff
R10: 0000000000000001 R11: 00000000002fd7ff R12: 00000002fd7ff003
R13: 0000000000076c01 R14: ffff8881039ee800 R15: 0000000003d00000
FS: 00007f2863c1d700(0000) GS:ffff88849fd00000(0000) knlGS:0000000000000000
CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 00007f2692c53000 CR3: 000000011c440006 CR4: 0000000000770ee0
PKRU: 55555554
Call Trace:
<TASK>
intel_iommu_map_pages+0xb7/0xe0
__iommu_map+0xe0/0x310
__iommu_map_sg+0xa2/0x140
iommu_dma_map_sg+0x2ef/0x4e0
__dma_map_sg_attrs+0x64/0x70
dma_map_sg_attrs+0x5/0x20
i915_gem_gtt_prepare_pages+0x56/0x70 [i915]
shmem_get_pages+0xe3/0x360 [i915]
____i915_gem_object_get_pages+0x32/0x100 [i915]
__i915_gem_object_get_pages+0x8d/0xa0 [i915]
vm_fault_gtt+0x3d0/0x940 [i915]
? ptlock_alloc+0x15/0x40
? rt_mutex_debug_task_free+0x91/0xa0
__do_fault+0x30/0x180
do_fault+0x1c4/0x4c0
__handle_mm_fault+0x615/0xbe0
handle_mm_fault+0x75/0x1c0
do_user_addr_fault+0x1e7/0x670
exc_page_fault+0x62/0x230
asm_exc_page_fault+0x22/0x30
No idea. Maybe try the CI kernel config on your Tigerlake?
I have an idea of what could be happening:
The warning is due to a PTE already existing at that slot. We can see from the warning that the old and new values are identical, which indicates that the same page has already been mapped to the same IOVA before.
This map/shrink loop will keep remapping the same sg list, shrinking on failure in the hope of freeing up IOVA space:
https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/drivers/gpu/drm/i915/i915_gem_gtt.c?h=v5.19-rc8#n32
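Reduced to a toy, the shape of that loop is roughly the following (the names below are placeholders made up for illustration, not the real i915 entry points; like dma_map_sg_attrs(), the toy mapper returns 0 on failure):

/* Toy model of the map/shrink retry pattern; not the real i915 code. */
#include <errno.h>
#include <stdbool.h>

/* pretend DMA mapping: returns the number of entries mapped, 0 on failure */
int toy_map_sg(void *sgl, int nents);
/* pretend shrinker: returns true if it freed anything, i.e. a retry is worthwhile */
bool toy_shrink(void);

int toy_prepare_pages(void *sgl, int nents)
{
	do {
		if (toy_map_sg(sgl, nents))
			return 0;	/* everything mapped */
		/* mapping failed: shrink, then retry the *same* sg list */
	} while (toy_shrink());

	return -ENOSPC;
}

The important detail for what follows is that every retry presents the exact same sg list, and therefore the exact same pages, to the DMA/IOMMU layer.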
If we now look at the Intel IOMMU driver's mapping function:
https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/drivers/iommu/intel/iommu.c?h=v5.19-rc8#n2248
If that -ENOMEM loop-breaking return is hit (presumably from running out of PTE space, though I have not delved deeper), the error propagates back up the stack and dma_map_sg_attrs() eventually returns 0 to indicate failure. That in turn triggers a shrink and a retry.
The problem is that the IOMMU driver does not undo its partial mapping on error. So the next time around, it maps the same page to the same address, producing the same PTE encoding, which would give exactly the warning observed.
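Here is a minimal, self-contained sketch of the suspected sequence (everything below is made up for illustration and is not the intel-iommu code): the mapper bails out part way with -ENOMEM, leaves the PTEs it already wrote in place, and the retry of the same sg then trips the "already set" check on an identical value, just like the DMAR message above.

/*
 * Toy model only: a flat "page table" indexed by vPFN, a mapper that
 * fails part way through WITHOUT undoing its partial work, and a caller
 * that retries the same mapping after a (pretend) shrink.
 */
#include <errno.h>
#include <stdint.h>
#include <stdio.h>

#define NR_PTES	8
#define FAIL_AT	4	/* pretend we run out of PTE space here */

static uint64_t pte[NR_PTES];	/* toy page table, one slot per vPFN */

static int toy_domain_mapping(unsigned long first_vpfn, int nr, uint64_t first_pteval)
{
	for (int i = 0; i < nr; i++) {
		unsigned long vpfn = first_vpfn + i;
		uint64_t val = first_pteval + ((uint64_t)i << 12);

		if (pte[vpfn]) {
			printf("toy DMAR warning: PTE for vPFN %lu already set (to %llx not %llx)\n",
			       vpfn, (unsigned long long)pte[vpfn], (unsigned long long)val);
			return -EINVAL;
		}
		if (i == FAIL_AT)
			return -ENOMEM;	/* bail out; PTEs 0..i-1 are left behind */
		pte[vpfn] = val;
	}
	return 0;
}

int main(void)
{
	/* first attempt fails part way through and is not rolled back */
	printf("first attempt:  %d\n", toy_domain_mapping(0, NR_PTES, 0x2fd7ff003ULL));

	/* the map/shrink loop retries the same sg -> same vPFN, same PTE value */
	printf("second attempt: %d\n", toy_domain_mapping(0, NR_PTES, 0x2fd7ff003ULL));
	return 0;
}

Running this prints the collision on the second attempt, with the old and new values identical, matching the "(to 2fd7ff003 not 2fd7ff003)" in the log above.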
I would need to get some time to try to repro and debug to confirm, but this looks like it might be exposing an IOMMU driver issue, now that the changed segment sizes have altered our mapping pattern.
I'll see if I can get some time allotted to debug it further, but for now I don't have the bandwidth, so this may need to go on hold until I or someone else can find time to look into it.
Yeah, that's understandable. I also don't have any free bandwidth at the moment, unfortunately.
+ Christoph FYI, as per above, swiotlb API usage removal is currently a
bit stuck until we find someone with some spare time to debug this further.
Regards,
Tvrtko