Op 18-12-12 17:12, Markus Trippelsdorf schreef: > On 2012.12.18 at 16:24 +0100, Maarten Lankhorst wrote: >> Op 18-12-12 14:38, Markus Trippelsdorf schreef: >>> On 2012.12.18 at 12:20 +0100, Michel Dänzer wrote: >>>> On Mon, 2012-12-17 at 23:55 +0100, Markus Trippelsdorf wrote: >>>>> On 2012.12.17 at 23:25 +0100, Markus Trippelsdorf wrote: >>>>>> On 2012.12.17 at 17:00 -0500, Alex Deucher wrote: >>>>>>> On Mon, Dec 17, 2012 at 4:48 PM, Markus Trippelsdorf >>>>>>> <markus@xxxxxxxxxxxxxxx> wrote: >>>>>>>> On 2012.12.17 at 16:32 -0500, Alex Deucher wrote: >>>>>>>>> On Mon, Dec 17, 2012 at 1:27 PM, Markus Trippelsdorf >>>>>>>>> <markus@xxxxxxxxxxxxxxx> wrote: >>>>>>>>>> As soon as I open the following website: >>>>>>>>>> http://www.boston.com/bigpicture/2012/12/2012_year_in_pictures_part_i.html >>>>>>>>>> >>>>>>>>>> my Radeon RS780 stalls (GPU lockup) leaving the machine unusable: >>>>>>>>> Is this a regression? Most likely a 3D driver bug unless you are only >>>>>>>>> seeing it with specific kernels. What browser are you using and do >>>>>>>>> you have hw accelerated webgl, etc. enabled? If so, what version of >>>>>>>>> mesa are you using? >>>>>>>> This is a regression, because it is caused by yesterdays merge of >>>>>>>> drm-next by Linus. IOW I only see this bug when running a >>>>>>>> v3.7-9432-g9360b53 kernel. >>>>>>> Can you bisect? I'm guessing it may be related to the new DMA rings. Possibly: >>>>>>> http://git.kernel.org/?p=linux/kernel/git/torvalds/linux.git;a=commitdiff;h=2d6cc7296d4ee128ab0fa3b715f0afde511f49c2 >>>>>> Yes, the commit above causes the issue. >>>>>> >>>>>> 2d6cc72 GPU lockups >>>>> With 2d6cc72 reverted I get: >>>>> >>>>> Dec 17 23:09:35 x4 kernel: ------------[ cut here ]------------ >>>> Probably a separate issue, can you bisect this one as well? >>> Yes. Git-bisect points to: >>> >>> 85b144f860176ec18db927d6d9ecdfb24d9c6483 is the first bad commit >>> commit 85b144f860176ec18db927d6d9ecdfb24d9c6483 >>> Author: Maarten Lankhorst <maarten.lankhorst@xxxxxxxxxxxxx> >>> Date: Thu Nov 29 11:36:54 2012 +0000 >>> >>> drm/ttm: call ttm_bo_cleanup_refs with reservation and lru lock >>> held, v3 >>> >>> (Please note that this bug is a little bit harder to reproduce. But >>> when you scroll up and down for ~10 seconds on the webpage mentioned >>> above it will trigger the oops. >>> So while I'm not 100% sure that the issue is caused by exactly this >>> commit, the vicinity should be right) >>> >> Those dmesg warnings sound suspicious, looks like something is going >> very wrong there. >> >> Can you revert the one before it? "drm/radeon: allow move_notify to be >> called without reservation" Reservation should be held at this point, >> that commit got in accidentally. >> >> I doubt not holding a reservation is causing it though, I don't really >> see how that commit could cause it however, so can you please double >> check it never happened before that point, and only started at that >> commit? >> >> also slap in a BUG_ON(!ttm_bo_is_reserved(bo)) in >> ttm_bo_cleanup_refs_and_unlock for good measure, and a >> BUG_ON(spin_trylock(&bdev->fence_lock)); to ttm_bo_wait. >> >> I really don't see how that specific commit can be wrong though, so >> awaiting your results first before I try to dig more into it. > I just reran git-bisect just on your commits (from 1a1494def to 97a875cbd) > and I landed on the same commit as above: > > commit 85b144f86 (drm/ttm: call ttm_bo_cleanup_refs with reservation and lru lock held, v3) > > So now I'm pretty sure it's specifically this commit that started the > issue. > > With your supposed debugging BUG_ONs added I still get: > > Dec 18 17:01:15 x4 kernel: ------------[ cut here ]------------ > Dec 18 17:01:15 x4 kernel: WARNING: at include/linux/kref.h:42 radeon_fence_ref+0x2c/0x40() > Dec 18 17:01:15 x4 kernel: Hardware name: System Product Name > Dec 18 17:01:15 x4 kernel: Pid: 157, comm: X Not tainted 3.7.0-rc7-00520-g85b144f-dirty #174 > Dec 18 17:01:15 x4 kernel: Call Trace: > Dec 18 17:01:15 x4 kernel: [<ffffffff81058c84>] ? warn_slowpath_common+0x74/0xb0 > Dec 18 17:01:15 x4 kernel: [<ffffffff8129273c>] ? radeon_fence_ref+0x2c/0x40 > Dec 18 17:01:15 x4 kernel: [<ffffffff8125e95c>] ? ttm_bo_cleanup_refs_and_unlock+0x18c/0x2d0 > Dec 18 17:01:15 x4 kernel: [<ffffffff8125f17c>] ? ttm_mem_evict_first+0x1dc/0x2a0 > Dec 18 17:01:15 x4 kernel: [<ffffffff81264452>] ? ttm_bo_man_get_node+0x62/0xb0 > Dec 18 17:01:15 x4 kernel: [<ffffffff8125f4ce>] ? ttm_bo_mem_space+0x28e/0x340 > Dec 18 17:01:15 x4 kernel: [<ffffffff8125fb0c>] ? ttm_bo_move_buffer+0xfc/0x170 > Dec 18 17:01:15 x4 kernel: [<ffffffff810de172>] ? kmem_cache_alloc+0xb2/0xc0 > Dec 18 17:01:15 x4 kernel: [<ffffffff8125fc15>] ? ttm_bo_validate+0x95/0x110 > Dec 18 17:01:15 x4 kernel: [<ffffffff8125ff7c>] ? ttm_bo_init+0x2ec/0x3b0 > Dec 18 17:01:15 x4 kernel: [<ffffffff8129419a>] ? radeon_bo_create+0x18a/0x200 > Dec 18 17:01:15 x4 kernel: [<ffffffff81293e80>] ? radeon_bo_clear_va+0x40/0x40 > Dec 18 17:01:15 x4 kernel: [<ffffffff812a5342>] ? radeon_gem_object_create+0x92/0x160 > Dec 18 17:01:15 x4 kernel: [<ffffffff812a575c>] ? radeon_gem_create_ioctl+0x6c/0x150 > Dec 18 17:01:15 x4 kernel: [<ffffffff812a529f>] ? radeon_gem_object_free+0x2f/0x40 > Dec 18 17:01:15 x4 kernel: [<ffffffff81246b60>] ? drm_ioctl+0x420/0x4f0 > Dec 18 17:01:15 x4 kernel: [<ffffffff812a56f0>] ? radeon_gem_pwrite_ioctl+0x20/0x20 > Dec 18 17:01:15 x4 kernel: [<ffffffff810f53a4>] ? do_vfs_ioctl+0x2e4/0x4e0 > Dec 18 17:01:15 x4 kernel: [<ffffffff810e5588>] ? vfs_read+0x118/0x160 > Dec 18 17:01:15 x4 kernel: [<ffffffff810f55ec>] ? sys_ioctl+0x4c/0xa0 > Dec 18 17:01:15 x4 kernel: [<ffffffff810e5851>] ? sys_read+0x51/0xa0 > Dec 18 17:01:15 x4 kernel: [<ffffffff814b0612>] ? system_call_fastpath+0x16/0x1b so the kref to fence is null here. This should be impossible and indicates a bug in refcounting somewhere, or possibly memory corruption. Lets first look where things could go wrong.. sync_obj member requires fence_lock to be taken, but radeon code in general doesn't do that, hm.. I think radeon_cs_sync_rings needs to take fence_lock during the iteration, then taking on a refcount to the fence, and radeon_crtc_page_flip and radeon_move_blit are lacking refcount on fence_lock as well. But that would probably still not explain why it crashes in radeon_vm_bo_invalidate shortly after, so it seems just as likely that it's operating on freed memory there or something. But none of the code touches refcounting for that bo, and I really don't see how I messed up anything there. I seem to be able to reproduce it if I add a hack though, can you test if you get the exact same issues if you apply this patch? I call it "aggressively evict MRU buffer, and never call ddestroy", and for me it triggers by merely starting X. :-) diff --git a/drivers/gpu/drm/ttm/ttm_bo.c b/drivers/gpu/drm/ttm/ttm_bo.c index 0bf66f9..9a8f0d8 100644 --- a/drivers/gpu/drm/ttm/ttm_bo.c +++ b/drivers/gpu/drm/ttm/ttm_bo.c @@ -512,6 +512,7 @@ static void ttm_bo_cleanup_refs_or_queue(struct ttm_buffer_object *bo) spin_lock(&glob->lru_lock); ret = ttm_bo_reserve_locked(bo, false, true, false, 0); + goto skip; spin_lock(&bdev->fence_lock); (void) ttm_bo_wait(bo, false, false, true); if (!ret && !bo->sync_obj && 0) { @@ -529,6 +530,7 @@ static void ttm_bo_cleanup_refs_or_queue(struct ttm_buffer_object *bo) sync_obj = driver->sync_obj_ref(bo->sync_obj); spin_unlock(&bdev->fence_lock); +skip: if (!ret) { atomic_set(&bo->reserved, 0); wake_up_all(&bo->event_queue); @@ -542,8 +544,7 @@ static void ttm_bo_cleanup_refs_or_queue(struct ttm_buffer_object *bo) driver->sync_obj_flush(sync_obj); driver->sync_obj_unref(&sync_obj); } - schedule_delayed_work(&bdev->wq, - ((HZ / 100) < 1) ? 1 : HZ / 100); + schedule_delayed_work(&bdev->wq, HZ * 100); } /** @@ -699,8 +700,7 @@ static void ttm_bo_delayed_workqueue(struct work_struct *work) container_of(work, struct ttm_bo_device, wq.work); if (ttm_bo_delayed_delete(bdev, false)) { - schedule_delayed_work(&bdev->wq, - ((HZ / 100) < 1) ? 1 : HZ / 100); + schedule_delayed_work(&bdev->wq, HZ * 100); } } @@ -743,8 +743,7 @@ EXPORT_SYMBOL(ttm_bo_lock_delayed_workqueue); void ttm_bo_unlock_delayed_workqueue(struct ttm_bo_device *bdev, int resched) { if (resched) - schedule_delayed_work(&bdev->wq, - ((HZ / 100) < 1) ? 1 : HZ / 100); + schedule_delayed_work(&bdev->wq, HZ * 100); } EXPORT_SYMBOL(ttm_bo_unlock_delayed_workqueue); @@ -815,12 +814,15 @@ static int ttm_mem_evict_first(struct ttm_bo_device *bdev, retry: spin_lock(&glob->lru_lock); - if (list_empty(&man->lru)) { - spin_unlock(&glob->lru_lock); - return -EBUSY; - } + if (list_empty(&bdev->ddestroy)) { + if (list_empty(&man->lru)) { + spin_unlock(&glob->lru_lock); + return -EBUSY; + } + bo = list_entry(man->lru.prev, struct ttm_buffer_object, lru); + } else + bo = list_entry(bdev->ddestroy.prev, struct ttm_buffer_object, ddestroy); - bo = list_first_entry(&man->lru, struct ttm_buffer_object, lru); kref_get(&bo->list_kref); if (!list_empty(&bo->ddestroy)) { _______________________________________________ dri-devel mailing list dri-devel@xxxxxxxxxxxxxxxxxxxxx http://lists.freedesktop.org/mailman/listinfo/dri-devel