Re: [Bug 204407] New: Bad page state in process Xorg

Andrew Morton <akpm@xxxxxxxxxxxxxxxxxxxx> · Fri, 2 Aug 2019 13:23:06 -0700

(switched to email.  Please respond via emailed reply-to-all, not via the
bugzilla web interface).

On Thu, 01 Aug 2019 22:34:16 +0000 bugzilla-daemon@xxxxxxxxxxxxxxxxxxx wrote:

> https://bugzilla.kernel.org/show_bug.cgi?id=204407
> 
>             Bug ID: 204407
>            Summary: Bad page state in process Xorg
>            Product: Memory Management
>            Version: 2.5
>     Kernel Version: 5.3.0-rc2-00013
>           Hardware: All
>                 OS: Linux
>               Tree: Mainline
>             Status: NEW
>           Severity: normal
>           Priority: P1
>          Component: Page Allocator
>           Assignee: akpm@xxxxxxxxxxxxxxxxxxxx
>           Reporter: petr@xxxxxxxxxxxxxx
>         Regression: No
> 
> Created attachment 284081
>   --> https://bugzilla.kernel.org/attachment.cgi?id=284081&action=edit
> dmesg
> 
> I've upgraded from 5.3-rc1 to 5.3-rc2, and when I started X server, system
> became unhappy:
> 
> [259701.387365] BUG: Bad page state in process Xorg  pfn:2a300
> [259701.393593] page:ffffea0000a8c000 refcount:0 mapcount:-128
> mapping:0000000000000000 index:0x0
> [259701.402832] flags: 0x2000000000000000()
> [259701.407426] raw: 2000000000000000 ffffffff822ab778 ffffea0000a8f208
> 0000000000000000
> [259701.415900] raw: 0000000000000000 0000000000000003 00000000ffffff7f
> 0000000000000000
> [259701.424373] page dumped because: nonzero mapcount
> [259701.429847] Modules linked in: af_packet xt_REDIRECT nft_compat x_tables
> nft_counter nft_chain_nat nf_nat nf_conntrack nf_defrag_ipv4 nf_tables
> nfnetlink ppdev parport fuse autofs4 binfmt_misc uinput twofish_generic
> twofish_avx_x86_64 twofish_x86_64_3way twofish_x86_64 twofish_common
> camellia_generic camellia_aesni_avx_x86_64 camellia_x86_64 serpent_avx_x86_64
> serpent_sse2_x86_64 serpent_generic blowfish_generic blowfish_x86_64
> blowfish_common cast5_avx_x86_64 cast5_generic cast_common des_generic cmac
> xcbc rmd160 af_key xfrm_algo rpcsec_gss_krb5 nfsv4 nls_iso8859_2 cifs libarc4
> nfsv3 nfsd auth_rpcgss nfs_acl nfs lockd grace fscache sunrpc ipv6 crc_ccitt
> nf_defrag_ipv6 snd_hda_codec_hdmi pktcdvd coretemp hwmon intel_rapl_common
> iosf_mbi x86_pkg_temp_thermal snd_hda_codec_realtek snd_hda_codec_generic
> ledtrig_audio snd_hda_intel snd_hda_codec crct10dif_pclmul crc32_pclmul
> snd_hwdep crc32c_intel snd_hda_core ghash_clmulni_intel snd_pcm_oss uas
> iTCO_wdt aesni_intel e1000e
> [259701.429873]  snd_mixer_oss iTCO_vendor_support aes_x86_64 snd_pcm
> crypto_simd ptp mei_me dcdbas lpc_ich sr_mod cryptd usb_storage snd_timer
> glue_helper mfd_core input_leds pps_core tpm_tis cdrom i2c_i801 snd mei
> tpm_tis_core sg tpm [last unloaded: parport_pc]
> [259701.539387] CPU: 10 PID: 4860 Comm: Xorg Tainted: G                T
> 5.3.0-rc2-64-00013-g03f05a670a3d #69
> [259701.549382] Hardware name: Dell Inc. Precision T3610/09M8Y8, BIOS A16
> 02/05/2018
> [259701.549382] Call Trace:
> [259701.549382]  dump_stack+0x46/0x60
> [259701.549382]  bad_page.cold.28+0x81/0xb4
> [259701.549382]  __free_pages_ok+0x236/0x240
> [259701.549382]  __ttm_dma_free_page+0x2f/0x40
> [259701.549382]  ttm_dma_unpopulate+0x29b/0x370
> [259701.549382]  ttm_tt_destroy.part.6+0x44/0x50
> [259701.549382]  ttm_bo_cleanup_memtype_use+0x29/0x70
> [259701.549382]  ttm_bo_put+0x225/0x280
> [259701.549382]  ttm_bo_vm_close+0x10/0x20
> [259701.549382]  remove_vma+0x20/0x40
> [259701.549382]  __do_munmap+0x2da/0x420
> [259701.549382]  __vm_munmap+0x66/0xc0
> [259701.549382]  __x64_sys_munmap+0x22/0x30
> [259701.549382]  do_syscall_64+0x5e/0x1a0
> [259701.549382]  ? prepare_exit_to_usermode+0x75/0xa0
> [259701.549382]  entry_SYSCALL_64_after_hwframe+0x44/0xa9
> [259701.549382] RIP: 0033:0x7f504d0ec1d7
> [259701.549382] Code: 10 e9 67 ff ff ff 0f 1f 44 00 00 48 8b 15 b1 6c 0c 00 f7
> d8 64 89 02 48 c7 c0 ff ff ff ff e9 6b ff ff ff b8 0b 00 00 00 0f 05 <48> 3d 01
> f0 ff ff 73 01 c3 48 8b 0d 89 6c 0c 00 f7 d8 64 89 01 48
> [259701.549382] RSP: 002b:00007ffe529db138 EFLAGS: 00000206 ORIG_RAX:
> 000000000000000b
> [259701.549382] RAX: ffffffffffffffda RBX: 0000564a5eabce70 RCX:
> 00007f504d0ec1d7
> [259701.549382] RDX: 00007ffe529db140 RSI: 0000000000400000 RDI:
> 00007f5044b65000
> [259701.549382] RBP: 0000564a5eafe460 R08: 000000000000000b R09:
> 000000010283e000
> [259701.549382] R10: 0000000000000001 R11: 0000000000000206 R12:
> 0000564a5e475b08
> [259701.549382] R13: 0000564a5e475c80 R14: 00007ffe529db190 R15:
> 0000000000000c80
> [259701.707238] Disabling lock debugging due to kernel taint

I assume the above is misbehaviour in the DRM code?

> 
> Also - maybe related, maybe not - I've got three userspace crashes earlier on
> this kernel (but never before):
> 
> [77154.886836] iscons.py[12441]: segfault at 2c ip 00000000080cf0b5 sp
> 00000000f773fb60 error 4 in python[8048000+11a000]
> [77154.898376] Code: 02 0f 84 4a 2e 00 00 8b 4d 08 8b bd 04 ff ff ff 8b 59 38
> 8b 57 20 8b 7b 10 85 ff 0f 84 ee 22 00 00 8b 8d 04 ff ff ff 8b 59 08 <8b> 43 2c
> 85 c0 0f 84 3c e3 ff ff 8b 51 34 8b 71 38 8b 79 3c ff 00
> [119529.983163] in.telnetd[616]: segfault at 0 ip 0000555fdfa09a05 sp
> 00007ffd8fc05380 error 4 in in.telnetd[555fdfa06000+b000]
> [119529.995783] Code: 8d 3d c2 76 00 00 e8 9a 2b 00 00 e9 25 fe ff ff be f2 00
> 00 00 48 8d 3d ac 76 00 00 e8 84 2b 00 00 eb 96 48 8b 05 63 ee 00 00 <0f> b6 00
> e9 fa fe ff ff 44 89 ee 48 8d 3d 8c 76 00 00 e8 64 2b 00
> [120884.183003] iscons.py[10779]: segfault at 2c ip 00000000080d0dc2 sp
> 00000000f7702b60 error 4 in python[8048000+11a000]
> [120884.195182] Code: 4c 24 08 89 44 24 04 89 3c 24 e8 d9 0b 01 00 8b 55 f0 8b
> 7d e8 8b 4d ec 8b 85 04 ff ff ff 89 55 84 89 7d 8c 89 4d 88 8b 50 08 <8b> 7a 2c
> 85 ff 0f 84 3c 10 00 00 8b bd 04 ff ff ff 89 f8 8b 57 34
> 
> I've investigated in.telnetd in detail as I was worried if there is some 0-day
> being used on my system - and as far as I could tell, problem is that part of
> the .bss turned to zeroes after process did fork/exec - there was NULL in the
> variable that cannot have NULL as variable is set to non-NULL value during
> in.telnetd initialization.
> 
> iscons crashes are from NULL pointer dereference too, and iscons does lot of
> fork/exec as well.

hm, that does sound unrelated.