On Tue, Mar 25, 2014 at 11:23:46AM +0000, Dmitry Malkin wrote: > Hi, > will you accept bash scripts for reloading driver/X/FLR for intel-gpu-tools to uncover exists and future bugs besides those patches? i-g-t/tests/drv_module_reload we have already, so basic coverage is there. It it's run in our nightly test runs on drm-intel-nightly. But that scripts tries very hard not to cause a race so more fancy stuff is still needed. But until we have a more solid driver I don't think there's much point for crazier scripts, it'll just cause headaches. -Daniel > ________________________________________ > From: Daniel Vetter <daniel.vetter@xxxxxxxx> on behalf of Daniel Vetter <daniel@xxxxxxxx> > Sent: Tuesday, March 25, 2014 2:43 PM > To: Dmitry Malkin > Cc: Daniel Vetter; dri-devel@xxxxxxxxxxxxxxxxxxxxx > Subject: Re: [DRM] drm_get_connector_name internal static string buffer causes a race > > On Tue, Mar 25, 2014 at 08:08:23AM +0000, Dmitry Malkin wrote: > > Hello, Daniel, > > > > Thank you for response. I've got a couple questions for you: > > > > 0. What do you think about making integration test with continuous reloading i915 driver and X server (with FLR between iteration)? > > Simplified example for ubuntu (root required): > > Module reloading is know to be horribly racy atm. Don't do that in any > kind of stress-test situation before fixing up piles of bugs ;-) > > Will mean: You'll uncover at _lot_ more than just this. Patches for all > this highly welcome though. > > Thanks, Daniel > > > > > #!/bin/bash > > service lightdm stop > > rmmod snd_hda_intel > > echo -n "0000:00:02.0" > /sys/bus/pci/devices/0000\:00\:02.0/driver/unbind > > rmmod i915 > > echo 1 > /sys/bus/pci/devices/0000\:00\:02.0/reset > > modprobe i915 > > service lightdm start > > > > This will uncover next problems: > > * sysfs add/remove i2c connectors (parent/child warning) > > * drm static buffer races > > * NX bit violation crash (see dump below) > > * and bunch of DMAR problems > > > > > > 1. Could you point me out git/branch with FIXME comments? > > > > 2. About kfree() problem: what if put string buffer onto stack of caller for drm_get_connector_name and drm_get_encoder_name? > > It will be auto-removed and there is will be the patch about adding new param for these functions. > > (yes the patch will be big and awful to read) > > > > ======================= NX bit violation crash ======================================== > > Mar 20 21:53:29 haswell01 kernel: [13447.100849] Console: switching to colour dummy device 80x25 > > Mar 20 21:53:29 haswell01 kernel: [13447.100950] drm_kms_helper: drm: unregistered panic notifier > > Mar 20 21:53:29 haswell01 kernel: [13447.117785] waiting module removal not supported: please upgrade > > Mar 20 21:53:29 haswell01 kernel: [13447.117880] [drm] Module unloaded > > Mar 20 21:53:29 haswell01 kernel: [13447.118244] waiting module removal not supported: please upgrade > > Mar 20 21:53:29 haswell01 kernel: [13447.118360] waiting module removal not supported: please upgradewaiting module removal not supported: please upgrade > > Mar 20 21:53:39 haswell01 kernel: [13447.118590] waiting module removal not supported: please upgrade<6>[13457.263808] [drm] Initialized drm 1.1.0 20060810 > > Mar 20 21:53:39 haswell01 kernel: [13457.333992] kernel tried to execute NX-protected page - exploit attempt? (uid: 0) > > Mar 20 21:53:39 haswell01 kernel: [13457.333996] BUG: unable to handle kernel paging request at ffff8803eb27fcb0 > > Mar 20 21:53:39 haswell01 kernel: [13457.333998] IP: [<ffff8803eb27fcb0>] 0xffff8803eb27fcb0 > > Mar 20 21:53:39 haswell01 kernel: [13457.334001] PGD 2fd9067 PUD 3e68c3063 PMD 404902063 PTE 80000003eb27f163 > > Mar 20 21:53:39 haswell01 kernel: [13457.334004] Oops: 0011 [#1] SMP > > Mar 20 21:53:39 haswell01 kernel: [13457.334006] Modules linked in: i915(+) video drm_kms_helper drm i2c_algo_bit snd_hda_codec_hdmi rfcomm bnep bluetooth nls_iso8859_1 x86_pkg_temp_thermal intel_powerclamp coretemp kvm_intel snd_hda_codec_realtek kvm snd_hda_codec snd_hwdep snd_pcm snd_page_alloc snd_seq_midi snd_seq_midi_event snd_rawmidi snd_seq hid_generic crct10dif_pclmul snd_seq_device usbhid crc32_pclmul snd_timer ghash_clmulni_intel snd aesni_intel psmouse hid mei_me aes_x86_64 lrw ppdev gf128mul mei glue_helper parport_pc ablk_helper lp cryptd soundcore parport serio_raw lpc_ich mac_hid e1000e ahci ptp libahci pps_core [last unloaded: video] > > Mar 20 21:53:39 haswell01 kernel: [13457.334031] CPU: 0 PID: 4974 Comm: modprobe Not tainted 3.13.6 #10 > > Mar 20 21:53:39 haswell01 kernel: [13457.334032] Hardware name: /DQ87PG, BIOS PGQ8710H.86A.0144.2014.0113.1604 01/13/2014 > > Mar 20 21:53:39 haswell01 kernel: [13457.334034] task: ffff88002e6c5fc0 ti: ffff880406394000 task.ti: ffff880406394000 > > Mar 20 21:53:39 haswell01 kernel: [13457.334035] RIP: 0010:[<ffff8803eb27fcb0>] [<ffff8803eb27fcb0>] 0xffff8803eb27fcb0 > > Mar 20 21:53:39 haswell01 kernel: [13457.334037] RSP: 0018:ffff880406395af0 EFLAGS: 00010282 > > Mar 20 21:53:39 haswell01 kernel: [13457.334038] RAX: ffff8803eb27f4b0 RBX: ffff8803f6016000 RCX: ffffffffa0238630 > > Mar 20 21:53:39 haswell01 kernel: [13457.334039] RDX: ffff8803eb27fcb0 RSI: ffffffffa03ba3c4 RDI: ffff8803f6016000 > > Mar 20 21:53:39 haswell01 kernel: [13457.334040] RBP: ffff880406395b10 R08: 0000000000000000 R09: ffff88041ea172e0 > > Mar 20 21:53:39 haswell01 kernel: [13457.334042] R10: ffffea00101d7200 R11: 0000000000000000 R12: 0000000000000000 > > Mar 20 21:53:39 haswell01 kernel: [13457.334043] R13: ffffffffa03ba3c4 R14: ffffffffa03dd100 R15: 0000000000000000 > > Mar 20 21:53:39 haswell01 kernel: [13457.334044] FS: 00007f3f392d0740(0000) GS:ffff88041ea00000(0000) knlGS:0000000000000000 > > Mar 20 21:53:39 haswell01 kernel: [13457.334046] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 > > Mar 20 21:53:39 haswell01 kernel: [13457.334047] CR2: ffff8803eb27fcb0 CR3: 00000003e7358000 CR4: 00000000001407f0 > > Mar 20 21:53:39 haswell01 kernel: [13457.334048] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000 > > Mar 20 21:53:39 haswell01 kernel: [13457.334050] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400 > > Mar 20 21:53:39 haswell01 kernel: [13457.334051] Stack: > > Mar 20 21:53:39 haswell01 kernel: [13457.334052] ffffffffa0213cb2 ffff8803f6016000 ffff880408199000 ffff880408199098 > > Mar 20 21:53:39 haswell01 kernel: [13457.334054] ffff880406395b60 ffffffffa0215b92 0000000000000004 ffff880408199140 > > Mar 20 21:53:39 haswell01 kernel: [13457.334057] ffffffffa03b9920 ffff880408199000 0000000000000000 ffffffffa03dd000 > > Mar 20 21:53:39 haswell01 kernel: [13457.334059] Call Trace: > > Mar 20 21:53:39 haswell01 kernel: [13457.334067] [<ffffffffa0213cb2>] ? drm_dev_register+0xa2/0x1e0 [drm] > > Mar 20 21:53:39 haswell01 kernel: [13457.334073] [<ffffffffa0215b92>] drm_get_pci_dev+0x92/0x140 [drm] > > Mar 20 21:53:39 haswell01 kernel: [13457.334082] [<ffffffffa033d67c>] i915_pci_probe+0x3c/0x90 [i915] > > Mar 20 21:53:39 haswell01 kernel: [13457.334086] [<ffffffff813982b5>] local_pci_probe+0x45/0xa0 > > Mar 20 21:53:39 haswell01 kernel: [13457.334088] [<ffffffff81399555>] ? pci_match_device+0xc5/0xd0 > > Mar 20 21:53:39 haswell01 kernel: [13457.334090] [<ffffffff81399679>] pci_device_probe+0xd9/0x130 > > Mar 20 21:53:39 haswell01 kernel: [13457.334093] [<ffffffff81484795>] driver_probe_device+0x125/0x3b0 > > Mar 20 21:53:39 haswell01 kernel: [13457.334095] [<ffffffff81484af3>] __driver_attach+0x93/0xa0 > > Mar 20 21:53:39 haswell01 kernel: [13457.334098] [<ffffffff81484a60>] ? __device_attach+0x40/0x40 > > Mar 20 21:53:39 haswell01 kernel: [13457.334100] [<ffffffff81482703>] bus_for_each_dev+0x63/0xa0 > > Mar 20 21:53:39 haswell01 kernel: [13457.334102] [<ffffffff8148414e>] driver_attach+0x1e/0x20 > > Mar 20 21:53:39 haswell01 kernel: [13457.334104] [<ffffffff81483d30>] bus_add_driver+0x180/0x250 > > Mar 20 21:53:39 haswell01 kernel: [13457.334106] [<ffffffffa03fe000>] ? 0xffffffffa03fdfff > > Mar 20 21:53:39 haswell01 kernel: [13457.334109] [<ffffffff81485174>] driver_register+0x64/0xf0 > > Mar 20 21:53:39 haswell01 kernel: [13457.334110] [<ffffffffa03fe000>] ? 0xffffffffa03fdfff > > Mar 20 21:53:39 haswell01 kernel: [13457.334113] [<ffffffff81397c4c>] __pci_register_driver+0x4c/0x50 > > Mar 20 21:53:39 haswell01 kernel: [13457.334117] [<ffffffffa0215d5a>] drm_pci_init+0x11a/0x130 [drm] > > Mar 20 21:53:39 haswell01 kernel: [13457.334119] [<ffffffffa03fe000>] ? 0xffffffffa03fdfff > > Mar 20 21:53:39 haswell01 kernel: [13457.334127] [<ffffffffa03fe066>] i915_init+0x66/0x68 [i915] > > Mar 20 21:53:39 haswell01 kernel: [13457.334130] [<ffffffff8100214a>] do_one_initcall+0xfa/0x1b0 > > Mar 20 21:53:39 haswell01 kernel: [13457.334133] [<ffffffff810589e3>] ? set_memory_nx+0x43/0x50 > > Mar 20 21:53:39 haswell01 kernel: [13457.334137] [<ffffffff810e0630>] load_module+0x2050/0x26f0 > > Mar 20 21:53:39 haswell01 kernel: [13457.334139] [<ffffffff810dbfa0>] ? store_uevent+0x40/0x40 > > Mar 20 21:53:39 haswell01 kernel: [13457.334141] [<ffffffff810e0e46>] SyS_finit_module+0x86/0xb0 > > Mar 20 21:53:39 haswell01 kernel: [13457.334144] [<ffffffff8171823f>] tracesys+0xe1/0xe6 > > Mar 20 21:53:39 haswell01 kernel: [13457.334145] Code: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 20 00 00 00 00 00 00 00 00 00 00 00 <f0> b1 ca 81 ff ff ff ff b0 f4 27 eb 03 88 ff ff ff ff ff 7f 00 > > Mar 20 21:53:39 haswell01 kernel: [13457.334164] RIP [<ffff8803eb27fcb0>] 0xffff8803eb27fcb0 > > Mar 20 21:53:39 haswell01 kernel: [13457.334166] RSP <ffff880406395af0> > > Mar 20 21:53:39 haswell01 kernel: [13457.334167] CR2: ffff8803eb27fcb0 > > Mar 20 21:53:39 haswell01 kernel: [13457.334168] ---[ end trace e2598b1fc83a65fd ]--- > > > > > > ________________________________________ > > From: Daniel Vetter <daniel.vetter@xxxxxxxx> on behalf of Daniel Vetter <daniel@xxxxxxxx> > > Sent: Tuesday, March 25, 2014 12:31 AM > > To: Dmitry Malkin > > Cc: dri-devel@xxxxxxxxxxxxxxxxxxxxx > > Subject: Re: [DRM] drm_get_connector_name internal static string buffer causes a race > > > > On Mon, Mar 24, 2014 at 12:06:21PM +0000, Dmitry Malkin wrote: > > > > > > Hello guys, > > > > > > I've been playing with reloading intel gfx driver (i915) in a cycle, for a while, > > > and at some point I've found a non-deterministic kernel crash with a highly-variable > > > iteration dependency -- 2 to 200 driver reload iterations. > > > > > > The apparent race is over the shared internal string buffer in drm_get_connector_name(). > > > It is mostly harmless, due to its results being mostly used for log output, but in at least one > > > case -- drm_sysfs_connector_add() -- this leads to a more critical error. > > > > > > Race scenario: > > > > > > - drm_sysfs_connector_add() > > > - drm_get_connector_name() > > > vs. > > > - anything that generates log messages involving DRM connectors > > > - drm_get_connector_name() > > > > > > (and many other from drm_crtc.c) shares with caller const char* to internal static char buffer. > > > If something call it from other thread, while main thread strore and use returned pointer it may overwrite connector name. > > > > > > Here are we go: registering HDMI connecter (drm_sysfs_connector_add store and use pointer from drm_get_connector_name) > > > and the same time got VGA connector name down through the stack. (the second thread is upowerd who watch continuously sysfs) > > > > Yeah, in my recent kerneldoc series I've noticed this too and added FIXME > > comments. There's a lot more than just those you've pointed out. The > > problem is that fixing these will be an awful lot of work since you need > > to add proper cleanup code (calling kfree()) to all the required places. > > > > So a full subsystem wide code audit for every single use-site of these. > > -Daniel > > -- > Daniel Vetter > Software Engineer, Intel Corporation > +41 (0) 79 365 57 48 - http://blog.ffwll.ch -- Daniel Vetter Software Engineer, Intel Corporation +41 (0) 79 365 57 48 - http://blog.ffwll.ch _______________________________________________ dri-devel mailing list dri-devel@xxxxxxxxxxxxxxxxxxxxx http://lists.freedesktop.org/mailman/listinfo/dri-devel