Re: [PATCH libdrm 0/4] Dynamicly disable suites and tets.

[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

 





On 11/10/2017 10:48 AM, Christian König wrote:
Am 10.11.2017 um 16:36 schrieb Andrey Grodzovsky:


On 11/10/2017 07:17 AM, Christian König wrote:
Series is Acked-by: Christian König <christian.koenig@xxxxxxx>.

Please note that I think your OOM killer test shows quite a bug we currently have in the kernel driver.

A single allocation of 1TB shouldn't trigger the OOM killer, but rather be reacted immediately.

Maybe we should add a second test which does incremental 1GB allocations but still keep this tests ? With this test i get a callstack as bellow + crash of the test suite with general protection fault - As normal behavior I would have expected just some errno returning from the amdgpu_bo_alloc which we could check in the test.

Yeah, totally agree. And when this works correctly we should really enable this test case by default as well.

When I implemented scattered eviction I completely removed the check which limited the BO size. That was probably a bit to extreme.

We still need to check the size here so that we don't create a BO larger than what makes sense for the domain it should be stored in.

Patch attached, tested with the DRM tester, call stack is gone but I still get SIGSEV and tester crashes, attaching debugger shows SIGSEV recived when the tester is still in the IOCTL -
r = drmCommandWriteRead(dev->fd, DRM_AMDGPU_GEM_CREATE,
                &args, sizeof(args));

dmesg

[ 104.608664 < 16.227791>] [drm:amdgpu_bo_do_create [amdgpu]] *ERROR* BO size 1000000000000 > total memory in domain: 1073741824 [ 104.608911 < 0.000247>] [drm:amdgpu_bo_do_create [amdgpu]] *ERROR* BO size 1000000000000 > total memory in domain: 3221225472 [ 104.609168 < 0.000257>] [drm:amdgpu_gem_object_create [amdgpu]] *ERROR* Failed to allocate GEM object (1000000000000, 6, 4096, -12) [ 104.609301 < 0.000133>] traps: lt-amdgpu_test[1142] general protection ip:7f21c9ed6007 sp:7ffe08ae1e30 error:0 in libdrm_amdgpu.so.1.0.0[7f21c9ed2000+b000]


Thanks,
Andrey


Regards,
Christian.


Thanks,
Andrey

[169053.128981 <72032.811683>] ------------[ cut here ]------------
[169053.129006 < 0.000025>] WARNING: CPU: 0 PID: 22883 at mm/page_alloc.c:3883 __alloc_pages_slowpath+0xf03/0x14e0 [169053.129007 < 0.000001>] Modules linked in: amdgpu chash ttm drm_kms_helper drm i2c_algo_bit fb_sys_fops syscopyarea sysfillrect sysimgblt edac_mce_amd kvm_amd kvm irqbypass crct10dif_pclmul crc32_pclmul snd_hda_codec_realtek ghash_clmulni_intel snd_hda_codec_generic pcbc snd_hda_codec_hdmi snd_hda_intel aesni_intel snd_hda_codec aes_x86_64 snd_hda_core crypto_simd glue_helper snd_hwdep rfkill_gpio cryptd snd_pcm snd_seq_midi snd_seq_midi_event serio_raw snd_rawmidi snd_seq cdc_ether usbnet snd_seq_device joydev fam15h_power k10temp r8152 snd_timer mii i2c_piix4 rtsx_pci_ms snd memstick soundcore shpchp 8250_dw i2c_designware_platform i2c_designware_core mac_hid binfmt_misc nfsd auth_rpcgss nfs_acl lockd grace sunrpc parport_pc ppdev lp parport autofs4 rtsx_pci_sdmmc psmouse rtsx_pci sdhci_pci ahci sdhci libahci
[169053.129084 <    0.000077>]  video i2c_hid hid_generic usbhid hid
[169053.129096 < 0.000012>] CPU: 0 PID: 22883 Comm: lt-amdgpu_test Tainted: G W 4.14.0-rc3+ #1 [169053.129097 < 0.000001>] Hardware name: AMD Gardenia/Gardenia, BIOS RGA1101C 07/20/2015 [169053.129099 < 0.000002>] task: ffff880048803d80 task.stack: ffff880064688000 [169053.129103 < 0.000004>] RIP: 0010:__alloc_pages_slowpath+0xf03/0x14e0 [169053.129105 < 0.000002>] RSP: 0018:ffff88006468f108 EFLAGS: 00010246 [169053.129108 < 0.000003>] RAX: 0000000000000000 RBX: 00000000014000c0 RCX: ffffffff81279065 [169053.129109 < 0.000001>] RDX: dffffc0000000000 RSI: 000000000000000f RDI: ffffffff82609000 [169053.129111 < 0.000002>] RBP: ffff88006468f328 R08: 0000000000000000 R09: ffffffffffff8576 [169053.129113 < 0.000002>] R10: 000000005c2044e7 R11: 0000000000000000 R12: ffff88006468f3d8 [169053.129114 < 0.000001>] R13: ffff880048803d80 R14: 000000000140c0c0 R15: 000000000000000f [169053.129117 < 0.000003>] FS: 00007f707863b700(0000) GS:ffff88006ce00000(0000) knlGS:0000000000000000 [169053.129119 < 0.000002>] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 [169053.129120 < 0.000001>] CR2: 0000000001250000 CR3: 00000000644cf000 CR4: 00000000001406f0
[169053.129122 <    0.000002>] Call Trace:
[169053.129131 <    0.000009>]  ? __module_address+0x145/0x190
[169053.129135 <    0.000004>]  ? is_bpf_text_address+0xe/0x20
[169053.129140 <    0.000005>]  ? __kernel_text_address+0x12/0x40
[169053.129144 <    0.000004>]  ? unwind_get_return_address+0x36/0x50
[169053.129150 <    0.000006>]  ? memcmp+0x5b/0x90
[169053.129152 <    0.000002>]  ? warn_alloc+0x250/0x250
[169053.129156 <    0.000004>]  ? get_page_from_freelist+0x147/0x10f0
[169053.129160 <    0.000004>]  ? save_stack_trace+0x1b/0x20
[169053.129164 <    0.000004>]  ? kasan_kmalloc+0xad/0xe0
[169053.129186 <    0.000022>]  ? ttm_bo_mem_space+0x79/0x6b0 [ttm]
[169053.129196 <    0.000010>]  ? ttm_bo_validate+0x178/0x220 [ttm]
[169053.129200 <    0.000004>] __alloc_pages_nodemask+0x3c4/0x400
[169053.129203 <    0.000003>]  ? __alloc_pages_slowpath+0x14e0/0x14e0
[169053.129205 <    0.000002>]  ? __save_stack_trace+0x66/0xd0
[169053.129209 <    0.000004>]  ? rb_insert_color+0x32/0x3e0
[169053.129213 <    0.000004>]  ? do_syscall_64+0xea/0x280
[169053.129217 <    0.000004>] alloc_pages_current+0x75/0x110
[169053.129221 <    0.000004>]  kmalloc_order+0x1f/0x80
[169053.129223 <    0.000002>] kmalloc_order_trace+0x24/0xa0
[169053.129226 <    0.000003>]  __kmalloc+0x264/0x280
[169053.129383 <    0.000157>] amdgpu_vram_mgr_new+0x11b/0x3b0 [amdgpu]
[169053.129391 < 0.000008>] ? reservation_object_reserve_shared+0x64/0xf0
[169053.129401 <    0.000010>] ttm_bo_mem_space+0x196/0x6b0 [ttm]
[169053.129478 <    0.000077>]  ? add_hole+0x20a/0x220 [drm]
[169053.129489 <    0.000011>] ttm_bo_validate+0x178/0x220 [ttm]
[169053.129498 <    0.000009>]  ? ttm_bo_evict_mm+0x70/0x70 [ttm]
[169053.129508 <    0.000010>]  ? ttm_check_swapping+0xf6/0x110 [ttm]
[169053.129541 <    0.000033>]  ? drm_vma_offset_add+0x5b/0x80 [drm]
[169053.129572 <    0.000031>]  ? drm_vma_offset_add+0x68/0x80 [drm]
[169053.129584 <    0.000012>] ttm_bo_init_reserved+0x546/0x630 [ttm]
[169053.129716 <    0.000132>] amdgpu_bo_do_create+0x28b/0x630 [amdgpu]
[169053.129816 < 0.000100>] ? amdgpu_fill_buffer+0x580/0x580 [amdgpu] [169053.129952 < 0.000136>] ? amdgpu_ttm_placement_from_domain+0x320/0x320 [amdgpu]
[169053.129956 <    0.000004>]  ? try_to_wake_up+0xbe/0x720
[169053.130054 <    0.000098>] amdgpu_bo_create+0x85/0x400 [amdgpu]
[169053.130153 < 0.000099>] ? amdgpu_bo_do_create+0x630/0x630 [amdgpu]
[169053.130155 <    0.000002>]  ? wake_up_process+0x15/0x20
[169053.130158 <    0.000003>]  ? insert_work+0xf3/0x110
[169053.130257 < 0.000099>] amdgpu_gem_object_create+0x101/0x190 [amdgpu] [169053.130356 < 0.000099>] ? amdgpu_gem_object_free+0xe0/0xe0 [amdgpu] [169053.130360 < 0.000004>] ? tty_insert_flip_string_fixed_flag+0xab/0x110 [169053.130468 < 0.000108>] amdgpu_gem_create_ioctl+0x364/0x460 [amdgpu] [169053.130695 < 0.000227>] ? amdgpu_gem_object_close+0x320/0x320 [amdgpu]
[169053.130767 <    0.000072>]  ? drm_dev_printk+0x120/0x120 [drm]
[169053.130840 <    0.000073>]  ? __wake_up_common_lock+0xe9/0x170
[169053.130989 < 0.000149>] ? amdgpu_gem_object_close+0x320/0x320 [amdgpu]
[169053.131061 <    0.000072>]  drm_ioctl_kernel+0xae/0xf0 [drm]
[169053.131115 <    0.000054>]  drm_ioctl+0x466/0x520 [drm]
[169053.131238 < 0.000123>] ? amdgpu_gem_object_close+0x320/0x320 [amdgpu]
[169053.131291 <    0.000053>]  ? drm_getunique+0xf0/0xf0 [drm]
[169053.131426 <    0.000135>]  amdgpu_drm_ioctl+0x78/0xd0 [amdgpu]
[169053.131451 <    0.000025>]  do_vfs_ioctl+0x12e/0x860
[169053.131466 <    0.000015>]  ? apparmor_file_permission+0x1a/0x20
[169053.131489 <    0.000023>]  ? ioctl_preallocate+0x130/0x130
[169053.131503 <    0.000014>]  ? rw_verify_area+0x78/0x140
[169053.131520 <    0.000017>]  ? vfs_write+0x1a2/0x260
[169053.131544 <    0.000024>]  ? syscall_trace_enter+0x1fd/0x520
[169053.131568 <    0.000024>]  ? sched_clock+0x9/0x10
[169053.131584 <    0.000016>]  ? exit_to_usermode_loop+0xc0/0xc0
[169053.131607 <    0.000023>]  ? __fget_light+0xa7/0xc0
[169053.131631 <    0.000024>]  SyS_ioctl+0x79/0x90
[169053.131651 < 0.000020>] ? __context_tracking_exit.part.4+0x53/0xc0
[169053.131672 <    0.000021>]  ? do_vfs_ioctl+0x860/0x860
[169053.131683 <    0.000011>]  do_syscall_64+0xea/0x280
[169053.131708 <    0.000025>] entry_SYSCALL64_slow_path+0x25/0x25
[169053.131720 <    0.000012>] RIP: 0033:0x7f70778eef07
[169053.131740 < 0.000020>] RSP: 002b:00007ffc509d13d8 EFLAGS: 00000202 ORIG_RAX: 0000000000000010 [169053.131756 < 0.000016>] RAX: ffffffffffffffda RBX: 000000000000001e RCX: 00007f70778eef07 [169053.131778 < 0.000022>] RDX: 00007ffc509d1490 RSI: 00000000c0206440 RDI: 0000000000000004 [169053.131798 < 0.000020>] RBP: 00007ffc509d1410 R08: 000000000124c660 R09: 0000000000000000 [169053.131815 < 0.000017>] R10: 000000000000006e R11: 0000000000000202 R12: 000000000124b530 [169053.131835 < 0.000020>] R13: 00007ffc509d1800 R14: 0000000000000000 R15: 0000000000000000 [169053.131854 < 0.000019>] Code: 89 85 c8 fe ff ff e9 5d fc ff ff 8d 42 ff 45 31 f6 c6 85 d0 fe ff ff 01 89 85 c8 fe ff ff e9 45 fc ff ff 41 89 c5 e9 10 fc ff ff <0f> ff e9 ba f1 ff ff 0f ff 89 d8 25 ff ff f7 ff 89 85 8c fe ff
[169053.131933 <    0.000079>] ---[ end trace 8253dc1e92579724 ]---
[169053.132622 < 0.000689>] [drm:amdgpu_gem_object_create [amdgpu]] *ERROR* Failed to allocate GEM object (1000000000000, 6, 4096, -12) [169053.132877 < 0.000255>] traps: lt-amdgpu_test[22883] general protection ip:7f7077ff6007 sp:7ffc509d13e0 error:0 in libdrm_amdgpu.so.1.0.0[7f7077ff2000+b000]



Instead I expected that we need to do multiple 1GB allocations to trigger the next problem that our TTM code doesn't imply a global limit.

Regards,
Christian.

Am 10.11.2017 um 05:29 schrieb Andrey Grodzovsky:
THe following patch series intoroduce dynamic tests dusabling/enabling
in amdgpu  tester using Cunit API. Today test suits that
don't apply to specific HW just return success w/o executing while
single tests that can't be executed properly are commented out.

Suits are diasbled based on hooks they provide (e.g incompatible
ASIC or missing blocks) while single tests are diasbled explicitly since this is usually due to some bug preventing from the tester or the system to handle
the test w/o crashing or killing the tester.

Inside this series also a minor cleanup and new test for memory over allocation.

Andrey Grodzovsky (4):
   amdgpu: Add functions to disable suites and tests.
   amdgpu: Use new suite/test disabling functionality.
   amdgpu: Move memory alloc tests in bo suite.
   amdgpu: Add memory over allocation test.

tests/amdgpu/amdgpu_test.c | 169 +++++++++++++++++++++++++++++++++++++-----
  tests/amdgpu/amdgpu_test.h    |  46 ++++++++++++
  tests/amdgpu/basic_tests.c    |  49 ------------
  tests/amdgpu/bo_tests.c       |  69 +++++++++++++++++
  tests/amdgpu/deadlock_tests.c |   8 +-
  tests/amdgpu/uvd_enc_tests.c  |  81 ++++++++------------
  tests/amdgpu/vce_tests.c      |  65 ++++++++--------
  tests/amdgpu/vcn_tests.c      |  74 ++++++++----------
  8 files changed, 363 insertions(+), 198 deletions(-)





>From 183abb730742cfc5b61374a3af575ca9ca4d731d Mon Sep 17 00:00:00 2001
From: Andrey Grodzovsky <andrey.grodzovsky@xxxxxxx>
Date: Fri, 10 Nov 2017 18:35:56 -0500
Subject: drm/amdgpu: Implement BO size validation.

Validates BO size against each requested domain's total memory.

Signed-off-by: Andrey Grodzovsky <andrey.grodzovsky@xxxxxxx>
---
 drivers/gpu/drm/amd/amdgpu/amdgpu_object.c | 39 ++++++++++++++++++++++++++++++
 1 file changed, 39 insertions(+)

diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_object.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_object.c
index a937c49..2737631 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_object.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_object.c
@@ -281,6 +281,42 @@ void amdgpu_bo_free_kernel(struct amdgpu_bo **bo, u64 *gpu_addr,
 		*cpu_addr = NULL;
 }
 
+/* Validate bo size is bit bigger then the request domain */
+static inline bool amdgpu_bo_validate_bo_size(struct amdgpu_device *adev,
+					  unsigned long size, u32 domain)
+{
+	struct ttm_mem_type_manager *man = NULL;
+
+	if (domain & AMDGPU_GEM_DOMAIN_VRAM) {
+		man = &adev->mman.bdev.man[TTM_PL_VRAM];
+
+		if (man && size < (man->size << PAGE_SHIFT))
+			return true;
+	}
+
+	if (domain & AMDGPU_GEM_DOMAIN_GTT) {
+		man = &adev->mman.bdev.man[TTM_PL_TT];
+
+		if (man && size < (man->size << PAGE_SHIFT))
+			return true;
+	}
+
+	if (domain & AMDGPU_GEM_DOMAIN_CPU) {
+		man = &adev->mman.bdev.man[TTM_PL_SYSTEM];
+
+		if (man && size < (man->size << PAGE_SHIFT))
+			return true;
+	}
+
+	if (man) {
+		DRM_ERROR("BO size %lu > total memory in domain: %llu\n", size,
+						      man->size << PAGE_SHIFT);
+		return false;
+	}
+
+	return true;
+}
+
 static int amdgpu_bo_do_create(struct amdgpu_device *adev,
 			       unsigned long size, int byte_align,
 			       bool kernel, u32 domain, u64 flags,
@@ -299,6 +335,9 @@ static int amdgpu_bo_do_create(struct amdgpu_device *adev,
 	page_align = roundup(byte_align, PAGE_SIZE) >> PAGE_SHIFT;
 	size = ALIGN(size, PAGE_SIZE);
 
+	if (!amdgpu_bo_validate_bo_size(adev, size, domain))
+		return -ENOMEM;
+
 	if (kernel) {
 		type = ttm_bo_type_kernel;
 	} else if (sg) {
-- 
2.7.4

_______________________________________________
dri-devel mailing list
dri-devel@xxxxxxxxxxxxxxxxxxxxx
https://lists.freedesktop.org/mailman/listinfo/dri-devel

[Index of Archives]     [Linux DRI Users]     [Linux Intel Graphics]     [Linux USB Devel]     [Video for Linux]     [Linux Audio Users]     [Yosemite News]     [Linux Kernel]     [Linux SCSI]     [XFree86]     [Linux USB Devel]     [Video for Linux]     [Linux Audio Users]     [Linux Kernel]     [Linux SCSI]     [XFree86]
  Powered by Linux