On 28/07/2022 16:54, Robert Beckett wrote:
On 28/07/2022 15:03, Tvrtko Ursulin wrote:
On 28/07/2022 09:01, Patchwork wrote:
[snip]
Possible regressions
* igt@gem_mmap_offset@clear:
  o shard-iclb: PASS
    <https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_11946/shard-iclb6/igt@gem_mmap_offset@xxxxxxxxxx>
    -> INCOMPLETE
    <https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_106589v6/shard-iclb1/igt@gem_mmap_offset@xxxxxxxxxx>
What was supposed to be a simple patch.. a storm of errors like:
yeah, them's the breaks sometimes ....
DMAR: ERROR: DMA PTE for vPFN 0x3d00000 already set (to 2fd7ff003 not 2fd7ff003)
------------[ cut here ]------------
WARNING: CPU: 6 PID: 1254 at drivers/iommu/intel/iommu.c:2278 __domain_mapping.cold.93+0x32/0x39
Modules linked in: vgem drm_shmem_helper snd_hda_codec_hdmi snd_hda_codec_realtek snd_hda_cod>
CPU: 6 PID: 1254 Comm: gem_mmap_offset Not tainted 5.19.0-rc8-Patchwork_106589v6-g0e9c43d76a14+ #>
Hardware name: Intel Corporation Ice Lake Client Platform/IceLake U DDR4 SODIMM PD RVP TLC, BIOS >
RIP: 0010:__domain_mapping.cold.93+0x32/0x39
Code: fe 48 c7 c7 28 32 37 82 4c 89 5c 24 08 e8 e4 61 fd ff 8b 05 bf 8e c9 00 4c 8b 5c 24 08 85 c>
RSP: 0000:ffffc9000037f9c0 EFLAGS: 00010202
RAX: 0000000000000004 RBX: ffff8881117b4000 RCX: 0000000000000001
RDX: 0000000000000000 RSI: ffffffff82320b25 RDI: 00000000ffffffff
RBP: 0000000000000001 R08: 0000000000000000 R09: c0000000ffff7fff
R10: 0000000000000001 R11: 00000000002fd7ff R12: 00000002fd7ff003
R13: 0000000000076c01 R14: ffff8881039ee800 R15: 0000000003d00000
FS: 00007f2863c1d700(0000) GS:ffff88849fd00000(0000) knlGS:0000000000000000
CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 00007f2692c53000 CR3: 000000011c440006 CR4: 0000000000770ee0
PKRU: 55555554
Call Trace:
<TASK>
intel_iommu_map_pages+0xb7/0xe0
__iommu_map+0xe0/0x310
__iommu_map_sg+0xa2/0x140
iommu_dma_map_sg+0x2ef/0x4e0
__dma_map_sg_attrs+0x64/0x70
dma_map_sg_attrs+0x5/0x20
i915_gem_gtt_prepare_pages+0x56/0x70 [i915]
shmem_get_pages+0xe3/0x360 [i915]
____i915_gem_object_get_pages+0x32/0x100 [i915]
__i915_gem_object_get_pages+0x8d/0xa0 [i915]
vm_fault_gtt+0x3d0/0x940 [i915]
? ptlock_alloc+0x15/0x40
? rt_mutex_debug_task_free+0x91/0xa0
__do_fault+0x30/0x180
do_fault+0x1c4/0x4c0
__handle_mm_fault+0x615/0xbe0
handle_mm_fault+0x75/0x1c0
do_user_addr_fault+0x1e7/0x670
exc_page_fault+0x62/0x230
asm_exc_page_fault+0x22/0x30
No idea. Maybe try the CI kernel config on your Tigerlake?
I have an idea of what could be happening:
The warning is due to a PTE already existing at that slot. We can see from the warning that the old and new values are identical, which indicates that the same page has already been mapped to the same IOVA before.
This map/shrink loop will keep remapping the same sg list, shrinking on failure in the hope of freeing up IOVA space:
https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/drivers/gpu/drm/i915/i915_gem_gtt.c?h=v5.19-rc8#n32
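Reduced to a toy, the shape of that loop is roughly the following (the names below are placeholders made up for illustration, not the real i915 entry points; like dma_map_sg_attrs(), the toy mapper returns 0 on failure):

/* Toy model of the map/shrink retry pattern; not the real i915 code. */
#include <errno.h>
#include <stdbool.h>

/* pretend DMA mapping: returns the number of entries mapped, 0 on failure */
int toy_map_sg(void *sgl, int nents);
/* pretend shrinker: returns true if it freed anything, i.e. a retry is worthwhile */
bool toy_shrink(void);

int toy_prepare_pages(void *sgl, int nents)
{
	do {
		if (toy_map_sg(sgl, nents))
			return 0;	/* everything mapped */
		/* mapping failed: shrink, then retry the *same* sg list */
	} while (toy_shrink());

	return -ENOSPC;
}

The important detail for what follows is that every retry presents the exact same sg list, and therefore the exact same pages, to the DMA/IOMMU layer.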
If we now look at the Intel IOMMU driver's mapping function:
https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/drivers/iommu/intel/iommu.c?h=v5.19-rc8#n2248
If that -ENOMEM loop-breaking return is hit (presumably from running out of PTE space, though I have not delved deeper), the error propagates back up the stack and dma_map_sg_attrs() eventually returns 0 to indicate failure. That in turn triggers a shrink and a retry.
The problem is that the IOMMU driver does not undo its partial mapping on error. So the next time around, it maps the same page to the same address, producing the same PTE encoding, which would give exactly the warning observed.
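Here is a minimal, self-contained sketch of the suspected sequence (everything below is made up for illustration and is not the intel-iommu code): the mapper bails out part way with -ENOMEM, leaves the PTEs it already wrote in place, and the retry of the same sg then trips the "already set" check on an identical value, just like the DMAR message above.

/*
 * Toy model only: a flat "page table" indexed by vPFN, a mapper that
 * fails part way through WITHOUT undoing its partial work, and a caller
 * that retries the same mapping after a (pretend) shrink.
 */
#include <errno.h>
#include <stdint.h>
#include <stdio.h>

#define NR_PTES	8
#define FAIL_AT	4	/* pretend we run out of PTE space here */

static uint64_t pte[NR_PTES];	/* toy page table, one slot per vPFN */

static int toy_domain_mapping(unsigned long first_vpfn, int nr, uint64_t first_pteval)
{
	for (int i = 0; i < nr; i++) {
		unsigned long vpfn = first_vpfn + i;
		uint64_t val = first_pteval + ((uint64_t)i << 12);

		if (pte[vpfn]) {
			printf("toy DMAR warning: PTE for vPFN %lu already set (to %llx not %llx)\n",
			       vpfn, (unsigned long long)pte[vpfn], (unsigned long long)val);
			return -EINVAL;
		}
		if (i == FAIL_AT)
			return -ENOMEM;	/* bail out; PTEs 0..i-1 are left behind */
		pte[vpfn] = val;
	}
	return 0;
}

int main(void)
{
	/* first attempt fails part way through and is not rolled back */
	printf("first attempt:  %d\n", toy_domain_mapping(0, NR_PTES, 0x2fd7ff003ULL));

	/* the map/shrink loop retries the same sg -> same vPFN, same PTE value */
	printf("second attempt: %d\n", toy_domain_mapping(0, NR_PTES, 0x2fd7ff003ULL));
	return 0;
}

Running this prints the collision on the second attempt, with the old and new values identical, matching the "(to 2fd7ff003 not 2fd7ff003)" in the log above.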
I would need to get some time to try to repro and debug to confirm, but this looks like it might be exposing an IOMMU driver issue, now that the changed segment sizes have altered our mapping pattern.
I'll see if I can get some time allotted to debug it further, but for now I don't have the bandwidth, so this may need to go on hold until I or someone else can find time to look into it.
Yeah, that's understandable. I also don't have any free bandwidth at the moment, unfortunately.
+ Christoph FYI, as per above, swiotlb API usage removal is currently a
bit stuck until we find someone with some spare time to debug this further.
Regards,
Tvrtko