Re: investigation into spice kms crash

Hi David,

Just to add another data point: two of my colleagues are seeing the exact same issue, but it's happening on a very underutilized gigabit LAN connected to a very underutilized bare-metal server. It occurs about once or twice per day for one of them (he swears it mostly happens when he opens Firefox by clicking a link in Thunderbird), and about once per week for the other. Both are on identical guest OSes hosted on the same server.

This started happening only after upgrading both the guest OS and the server OS from Ubuntu 13.04 to 14.04.

Is there anything I can do to help track this down?

Here is a sample trace:
Oct 15 13:48:43 timetrex1 kernel: [108682.060088] WARNING: CPU: 0 PID: 1369 at /build/buildd/linux-3.13.0/drivers/gpu/drm/qxl/qxl_ttm.c:420 qxl_sync_obj_wait+0x172/0x1f0 [qxl]()
Oct 15 13:48:43 timetrex1 kernel: [108682.060090] sync obj 301 still has outstanding releases 0 0 0 262144 1
Oct 15 13:48:43 timetrex1 kernel: [108682.060091] Modules linked in: arc4 md4 nls_utf8 cifs kvm_intel snd_hda_intel kvm snd_hda_codec rfcomm bnep bluetooth snd_hwdep crct10dif_pclmul snd_pcm crc32_pclmul ghash_clmulni_intel aesni_intel snd_page_alloc snd_seq_midi snd_seq_midi_event aes_x86_64 snd_rawmidi snd_seq snd_seq_device snd_timer lrw gf128mul snd soundcore glue_helper ablk_helper cryptd serio_raw i2c_piix4 mac_hid nfsd auth_rpcgss nfs_acl nfs lockd parport_pc ppdev lp sunrpc parport fscache qxl ttm drm_kms_helper psmouse drm floppy
Oct 15 13:48:43 timetrex1 kernel: [108682.060112] CPU: 0 PID: 1369 Comm: Xorg Not tainted 3.13.0-34-generic #60-Ubuntu
Oct 15 13:48:43 timetrex1 kernel: [108682.060114] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS Bochs 01/01/2011
Oct 15 13:48:43 timetrex1 kernel: [108682.060115]  0000000000000009 ffff8800b8c598b0 ffffffff8171bd94 ffff8800b8c598f8
Oct 15 13:48:43 timetrex1 kernel: [108682.060117]  ffff8800b8c598e8 ffffffff810676cd 0000000000000001 ffff88003595c740
Oct 15 13:48:43 timetrex1 kernel: [108682.060119]  0000000000000001 ffff88003595c758 0000000000000000 ffff8800b8c59948
Oct 15 13:48:43 timetrex1 kernel: [108682.060121] Call Trace:
Oct 15 13:48:43 timetrex1 kernel: [108682.060139]  [<ffffffff8171bd94>] dump_stack+0x45/0x56
Oct 15 13:48:43 timetrex1 kernel: [108682.060148]  [<ffffffff810676cd>] warn_slowpath_common+0x7d/0xa0
Oct 15 13:48:43 timetrex1 kernel: [108682.060150]  [<ffffffff8106773c>] warn_slowpath_fmt+0x4c/0x50
Oct 15 13:48:43 timetrex1 kernel: [108682.060155]  [<ffffffff8171f853>] ? schedule_hrtimeout_range+0x13/0x20
Oct 15 13:48:43 timetrex1 kernel: [108682.060158]  [<ffffffffa00ca012>] qxl_sync_obj_wait+0x172/0x1f0 [qxl]
Oct 15 13:48:43 timetrex1 kernel: [108682.060169]  [<ffffffffa00a7781>] ttm_bo_wait+0x91/0x190 [ttm]
Oct 15 13:48:43 timetrex1 kernel: [108682.060173]  [<ffffffffa00a9214>] ttm_bo_evict+0x54/0x340 [ttm]
Oct 15 13:48:43 timetrex1 kernel: [108682.060176]  [<ffffffffa00a96a8>] ttm_mem_evict_first+0x1a8/0x220 [ttm]
Oct 15 13:48:43 timetrex1 kernel: [108682.060180]  [<ffffffffa00a99b0>] ttm_bo_mem_space+0x290/0x340 [ttm]
Oct 15 13:48:43 timetrex1 kernel: [108682.060238]  [<ffffffffa00a9dca>] ttm_bo_move_buffer+0xba/0x130 [ttm]
Oct 15 13:48:43 timetrex1 kernel: [108682.060242]  [<ffffffffa00a9f01>] ttm_bo_validate+0xc1/0x130 [ttm]
Oct 15 13:48:43 timetrex1 kernel: [108682.060246]  [<ffffffffa00aa1d5>] ttm_bo_init+0x265/0x420 [ttm]
Oct 15 13:48:43 timetrex1 kernel: [108682.060249]  [<ffffffffa00cc04e>] qxl_bo_create+0x14e/0x1d0 [qxl]
Oct 15 13:48:43 timetrex1 kernel: [108682.060253]  [<ffffffffa00cbdb0>] ? qxl_fbdev_qobj_is_fb+0x30/0x30 [qxl]
Oct 15 13:48:43 timetrex1 kernel: [108682.060256]  [<ffffffffa00cc777>] qxl_gem_object_create+0x57/0x100 [qxl]
Oct 15 13:48:43 timetrex1 kernel: [108682.060261]  [<ffffffffa00cc86d>] qxl_gem_object_create_with_handle+0x4d/0x100 [qxl]
Oct 15 13:48:43 timetrex1 kernel: [108682.060264]  [<ffffffffa00cf8ca>] qxl_alloc_ioctl+0x3a/0xa0 [qxl]
Oct 15 13:48:43 timetrex1 kernel: [108682.060271]  [<ffffffff8117ce78>] ? __vma_link_file+0x48/0x80
Oct 15 13:48:43 timetrex1 kernel: [108682.060292]  [<ffffffffa001cc22>] drm_ioctl+0x502/0x630 [drm]
Oct 15 13:48:43 timetrex1 kernel: [108682.060299]  [<ffffffff8104f28f>] ? kvm_clock_read+0x1f/0x30
Oct 15 13:48:43 timetrex1 kernel: [108682.060316]  [<ffffffff8101b7e9>] ? sched_clock+0x9/0x10
Oct 15 13:48:43 timetrex1 kernel: [108682.060322]  [<ffffffff8109d1ad>] ? sched_clock_local+0x1d/0x80
Oct 15 13:48:43 timetrex1 kernel: [108682.060327]  [<ffffffff811cfd10>] do_vfs_ioctl+0x2e0/0x4c0
Oct 15 13:48:43 timetrex1 kernel: [108682.060329]  [<ffffffff8109ddf4>] ? vtime_account_user+0x54/0x60
Oct 15 13:48:43 timetrex1 kernel: [108682.060331]  [<ffffffff811cff71>] SyS_ioctl+0x81/0xa0
Oct 15 13:48:43 timetrex1 kernel: [108682.060336]  [<ffffffff8172c97f>] tracesys+0xe1/0xe6
Oct 15 13:48:43 timetrex1 kernel: [108682.060337] ---[ end trace 5f3bb71d5aebd12e ]---
Oct 15 13:48:43 timetrex1 kernel: [108682.060339] [TTM] Failed to expire sync object before buffer eviction
Oct 15 13:48:43 timetrex1 kernel: [108682.060351] qxl 0000:00:02.0: object_init failed for (262144, 0x00000001)
Oct 15 13:48:43 timetrex1 kernel: [108682.060353] [drm:qxl_gem_object_create] *ERROR* Failed to allocate GEM object (258964, 1, 4096, -12)
Oct 15 13:48:43 timetrex1 kernel: [108682.060355] [drm:qxl_alloc_ioctl] *ERROR* qxl_alloc_ioctl: failed to create gem ret=-12


On 10/14/2014 08:07 AM, David Mansfield wrote:
Hi All,

I've been poking around trying to find the cause of an intermittent
spice kms crash, which I think is the same as these three bugzilla bugs:

https://bugzilla.redhat.com/show_bug.cgi?id=1038275
https://bugzilla.redhat.com/show_bug.cgi?id=1050034
https://bugzilla.redhat.com/show_bug.cgi?id=1039063

and an email I sent on 11 Sep. 2014 (Subject: qxl dri related kernel oops).

The code in question seems to be looping for about 150-300ms (in
qxl_ttm.c:qxl_sync_obj_wait) and if I've understood the architecture
correctly, it's waiting for the "host" side of QXL to release some objects.

Also, in qxl_fence.c, there is this comment:

    For some reason every so often qxl hw fails to release, things go wrong.

In particular, my colleague hits this semi-regularly when using spice
over a mobile tether: high jitter, high latency, and periodic network
dropouts.

Does the "qxl hw" have a discard policy when the client is "missing" and
is the 150-300ms timing loop in the guest based on anything?



--
Mike
_______________________________________________
Spice-devel mailing list
Spice-devel@xxxxxxxxxxxxxxxxxxxxx
http://lists.freedesktop.org/mailman/listinfo/spice-devel


