On Tue, 2025-03-04 at 13:10 +0000, Tvrtko Ursulin wrote:
There has repeatedly been quite a bit of apprehension when any change to
the DRM scheduler is proposed, the two main reasons being, firstly, that
the code base is considered fragile, not well understood and not very
well documented, and secondly, the lack of systematic testing outside the
vendor-specific test suites and/or test farms.
This series is an attempt to dislodge this status quo by adding some unit
tests using the kunit framework.
The general approach is that there is a mock "hardware" backend which can
be controlled from tests, which in turn allows exercising various
scheduler code paths.
Only some simple basic tests get added in the series, and hopefully it is
easy to understand what the tests are doing.
An obligatory "screenshot" for reference:
[14:29:37] ============ drm_sched_basic_tests (3 subtests) ============
[14:29:38] [PASSED] drm_sched_basic_submit
[14:29:38] ================== drm_sched_basic_test ===================
[14:29:38] [PASSED] A queue of jobs in a single entity
[14:29:38] [PASSED] A chain of dependent jobs across multiple entities
[14:29:38] [PASSED] Multiple independent job queues
[14:29:38] [PASSED] Multiple inter-dependent job queues
[14:29:38] ============== [PASSED] drm_sched_basic_test ===============
[14:29:38] [PASSED] drm_sched_basic_entity_cleanup
[14:29:38] ============== [PASSED] drm_sched_basic_tests ==============
[14:29:38] ======== drm_sched_basic_timeout_tests (1 subtest) =========
[14:29:40] [PASSED] drm_sched_basic_timeout
[14:29:40] ========== [PASSED] drm_sched_basic_timeout_tests ==========
[14:29:40] ======= drm_sched_basic_priority_tests (2 subtests) ========
[14:29:42] [PASSED] drm_sched_priorities
[14:29:42] [PASSED] drm_sched_change_priority
[14:29:42] ========= [PASSED] drm_sched_basic_priority_tests ==========
[14:29:42] ====== drm_sched_basic_modify_sched_tests (1 subtest) ======
[14:29:43] [PASSED] drm_sched_test_modify_sched
[14:29:43] ======= [PASSED] drm_sched_basic_modify_sched_tests ========
[14:29:43] ============================================================
[14:29:43] Testing complete. Ran 10 tests: passed: 10
[14:29:43] Elapsed time: 13.330s total, 0.001s configuring, 4.005s building, 9.276s running
Yo,
so I tried to test all this in QEMU and I am encountering some
explosions when I activate the scheduler tests. With just the DRM tests
enabled, it boots fine.
I'm using a kernel on relatively current drm-misc-next: 44d2f310f008
I apply your series, then
make defconfig
make menuconfig # switch on kunit framework and scheduler tests
install everything + initramfs
Booting then produces the errors below. With only the DRM kunit tests
enabled, it works fine.
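
For anyone wanting to reproduce without going through menuconfig, the same selection could presumably be expressed as a config fragment along the following lines. This is only a sketch: CONFIG_DRM_SCHED_KUNIT_TEST is an assumed name for the option the series adds, and CONFIG_DRM_KUNIT_TEST is assumed to be the switch for the existing DRM tests, so both should be checked against the Kconfig in the applied tree.

```
# Hypothetical fragment; option names are assumptions, verify against Kconfig.
CONFIG_KUNIT=y
CONFIG_DRM=y
# Existing DRM kunit tests (boot fine on their own per the report above)
CONFIG_DRM_KUNIT_TEST=y
# Scheduler tests added by this series (assumed symbol name)
CONFIG_DRM_SCHED_KUNIT_TEST=y
```

Dropping the last line should give the "DRM tests only" configuration for comparison.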
Excerpt of the first fault:
[ 1.040513] # kunit_device: pass:3 fail:0 skip:0 total:3
[ 1.040867] # Totals: pass:3 fail:0 skip:0 total:3
[ 1.041296] ok 7 kunit_device
[ 1.041936] KTAP version 1
[ 1.042186] # Subtest: kunit_fault
[ 1.042517] # module: kunit_test
[ 1.042517] 1..1
[ 1.043147] BUG: kernel NULL pointer dereference, address: 0000000000000000
[ 1.043765] #PF: supervisor write access in kernel mode
[ 1.044189] #PF: error_code(0x0002) - not-present page
[ 1.044617] PGD 0 P4D 0
[ 1.044818] Oops: Oops: 0002 [#1] PREEMPT SMP PTI
[ 1.045380] CPU: 7 UID: 0 PID: 214 Comm: kunit_try_catch Tainted: G N 6.14.0-rc4-00387-g33e4632926a0 #8
[ 1.046262] Tainted: [N]=TEST
[ 1.046521] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.16.3-2.fc40 04/01/2014
[ 1.047224] RIP: 0010:kunit_test_null_dereference+0x37/0x80
[ 1.047706] Code: 80 b5 49 c7 c0 50 7f 56 b4 ba 01 00 00 00 65 48 8b 04 25 28 00 00 00 48 89 44 24 08 31 c0 48 8d 4c 24 07 48 c7 c6 80 8a 26 b5 <c7> 04 25 00 00 00 00 00 00 00 00 48 c7 87 70 01 00 00 a6 e9 8c b5
[ 1.049204] RSP: 0000:ffffa609807c7ec8 EFLAGS: 00010246
[ 1.049642] RAX: 0000000000000000 RBX: ffff91d982623000 RCX: ffffa609807c7ecf
[ 1.050213] RDX: 0000000000000001 RSI: ffffffffb5268a80 RDI: ffffa60980013c68
[ 1.050799] RBP: ffff91d98105afc0 R08: ffffffffb4567f50 R09: ffffffffb5807ce8
[ 1.051375] R10: 0000000000000000 R11: 0000000000000001 R12: ffff91d98105afc0
[ 1.051941] R13: ffff91d983c749c0 R14: ffffffffb45685e0 R15: ffff91d982623000
[ 1.052543] FS: 0000000000000000(0000) GS:ffff91e48f9c0000(0000) knlGS:0000000000000000
[ 1.053187] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 1.053649] CR2: 0000000000000000 CR3: 00000004cee30000 CR4: 00000000000006f0
[ 1.054214] Call Trace:
[ 1.054427] <TASK>
[ 1.054597] ? __die+0x1e/0x60
[ 1.054844] ? page_fault_oops+0x17b/0x4a0
[ 1.055174] ? search_extable+0x26/0x30
[ 1.055482] ? kunit_test_null_dereference+0x37/0x80
[ 1.055888] ? search_module_extables+0x14/0x50
[ 1.056255] ? exc_page_fault+0x6b/0x150
[ 1.056571] ? asm_exc_page_fault+0x26/0x30
[ 1.056898] ? __pfx_kunit_generic_run_threadfn_adapter+0x10/0x10
[ 1.057387] ? __pfx_kunit_fail_assert_format+0x10/0x10
[ 1.057799] ? kunit_test_null_dereference+0x37/0x80
[ 1.058195] ? __kthread_parkme+0x33/0x80
[ 1.058523] kunit_generic_run_threadfn_adapter+0x1c/0x40
[ 1.058949] kthread+0xe9/0x1f0
[ 1.059206] ? __pfx_kthread+0x10/0x10
[ 1.059513] ret_from_fork+0x2f/0x50
[ 1.059798] ? __pfx_kthread+0x10/0x10
[ 1.060095] ret_from_fork_asm+0x1a/0x30
[ 1.060421] </TASK>
[ 1.060597] Modules linked in:
[ 1.060841] CR2: 0000000000000000
[ 1.061104] ---[ end trace 0000000000000000 ]---
[ 1.061481] RIP: 0010:kunit_test_null_dereference+0x37/0x80
I attach my kernel config and the full log file.
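
To see which KTAP suite was active when the first fault hit in a saved boot log, a small helper along these lines might be handy. It is a minimal sketch, not part of the series: the function name first_fault_subtest is made up for illustration, and it assumes the log contains "# Subtest: <name>" lines ahead of the "BUG:" line, as in the excerpt above.

```shell
# Print the KTAP subtest that was active at the first "BUG:" line.
# Reads the kernel log from the given file(s), or from stdin if none given.
# Hypothetical helper; the name and output strings are illustrative only.
first_fault_subtest() {
    awk '
        # Remember the most recent "# Subtest: <name>" header seen so far.
        /# Subtest: / { sub(/.*# Subtest: /, ""); current = $0 }
        # Stop at the first kernel BUG line and report the remembered name.
        /BUG: / {
            if (current == "") current = "(none)"
            print current
            found = 1
            exit
        }
        # END still runs after exit, so only report when nothing was found.
        END { if (!found) print "(no fault found)" }
    ' "$@"
}
```

Called on a file holding the excerpt above, for example `first_fault_subtest dmesg.log` (file name hypothetical), it would report kunit_fault as the suite running at the time of the oops.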
What's odd is that the fault does not seem to be directly related to
sched, yet it only occurs with the sched tests enabled.
Could you try to reproduce this, Tvrtko?