Re: [PATCH] parisc: Try to fix random segmentation faults in package builds


 



On 2024-05-30 01:00, matoro wrote:
On 2024-05-29 12:33, John David Anglin wrote:
On 2024-05-29 11:54 a.m., matoro wrote:
On 2024-05-09 13:10, John David Anglin wrote:
On 2024-05-08 4:52 p.m., John David Anglin wrote:
with no accompanying stack trace and then the BMC would restart the whole machine automatically. These were infrequent enough that the segfaults were the bigger problem, but after applying this patch on top of 6.8, this changed the dynamic.  It seems to occur during builds with varying I/O loads.  For example, I was able to build gcc fine, with no segfaults, but I was unable to build perl, a much smaller build, without crashing the machine. I did not observe any segfaults over the day or 2 I ran this patch, but that's not an unheard-of stretch of time even without it, and I am being forced to revert because of the panics.
Looks like there is a problem with 6.8.  I'll do some testing with it.
So far, I haven't seen any panics with 6.8.9, but I have seen some random segmentation faults in the gcc testsuite.  I looked at one ld fault in some detail. 18 contiguous words in the elf_link_hash_entry struct were zeroed, starting with the last word in the bfd_link_hash_entry struct, causing the fault.
The section pointer was zeroed.

18 words is a rather strange number of words to corrupt and corruption doesn't seem related
to object structure.  In any case, it is not page related.

It's really hard to tell how this happens.  The corrupt object was at a slightly different location
than it is when ld is run under gdb.  Can't duplicate in gdb.

Dave

Dave, not sure how much testing you have done with current mainline kernels, but I've had to temporarily give up on 6.8 and 6.9 for now, as most heavy builds quickly hit that kernel panic. 6.6 does not seem to have the problem though.  The patch from this thread does not seem to have made a difference one way or the other w.r.t. segfaults.
My latest patch is looking good.  I have 6 days of testing on c8000 (1 GHz PA8800) with 6.8.10 and 6.8.11, and I haven't had any random segmentation faults.  System has been building debian packages.  In addition, it has been building and testing gcc.  It's on its third gcc build and check with patch.

The latest version uses lpa_user() with fallback to page table search in flush_cache_page_if_present() to obtain physical page address. It revises copy_to_user_page() and copy_from_user_page() to flush kernel mapping with tmpalias flushes.  copy_from_user_page() was missing kernel mapping flush.  flush_cache_vmap() and flush_cache_vunmap() are moved into cache.c.  TLB is now flushed before cache flush to inhibit move-in in these routines. flush_cache_vmap() now handles small VM_IOREMAP flushes instead of flushing
entire cache.  This latter change is an optimization.
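
The lookup order and flush order described above can be modelled in user space. This is only a rough sketch and not code from the patch: every name in it (struct model, model_lookup_phys(), model_flush_page_if_present()) is hypothetical, the two arrays stand in for the lpa_user() probe and the page table search, and a physical address of 0 is assumed to mean "no translation".

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical user-space model; none of these names are kernel APIs. */
#define MODEL_PAGE_SHIFT 12
#define MODEL_NPAGES 4

enum step { STEP_NONE, STEP_TLB_FLUSH, STEP_CACHE_FLUSH };

struct model {
    uint64_t fast_xlate[MODEL_NPAGES]; /* stands in for lpa_user(); 0 = probe failed */
    uint64_t page_table[MODEL_NPAGES]; /* stands in for the page table; 0 = not mapped */
    enum step log[2];                  /* order of the flush steps taken */
    size_t nlog;
};

/* Try the fast probe first, then fall back to the page table search. */
static uint64_t model_lookup_phys(const struct model *m, uint64_t vaddr)
{
    size_t idx = (size_t)(vaddr >> MODEL_PAGE_SHIFT);
    uint64_t phys;

    if (idx >= MODEL_NPAGES)
        return 0;
    phys = m->fast_xlate[idx];
    if (phys == 0)
        phys = m->page_table[idx];
    return phys;
}

/* Flush only if a physical page was found: TLB first, then cache. */
static int model_flush_page_if_present(struct model *m, uint64_t vaddr)
{
    uint64_t phys = model_lookup_phys(m, vaddr);

    m->nlog = 0;
    if (phys == 0)
        return 0;                          /* nothing mapped, nothing to flush */
    m->log[m->nlog++] = STEP_TLB_FLUSH;    /* first, to inhibit move-in */
    m->log[m->nlog++] = STEP_CACHE_FLUSH;  /* second */
    assert(m->nlog == 2);
    return 1;
}
```

In this model a page whose fast probe fails but which has a page table entry is still flushed, and a page with neither is skipped.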

If random faults are still present, I believe we will have to give up trying to optimize flush_cache_mm() and flush_cache_range() and
flush the whole cache in these routines.

Some work would be needed to backport my current patch to longterm kernels because of folio changes in 6.8.

Dave

Thanks a ton, Dave. I've applied this on top of 6.9.2 and I think I'm seeing improvement too! No panics yet. I have a couple of weeks' worth of package testing to catch up on, so I'll report if I see anything!

I've seen a few warnings in my dmesg while testing, although I didn't see any immediately corresponding failures. Any danger?

[Sun Jun  2 18:46:29 2024] ------------[ cut here ]------------
[Sun Jun  2 18:46:29 2024] WARNING: CPU: 0 PID: 26808 at arch/parisc/kernel/cache.c:624 flush_cache_page_if_present+0x1a4/0x330
[Sun Jun  2 18:46:29 2024] Modules linked in: raw_diag tcp_diag inet_diag netlink_diag unix_diag nfnetlink overlay loop nfsv4 dns_resolver nfs lockd grace sunrpc netfs autofs4 binfmt_misc sr_mod ohci_pci cdrom ehci_pci ohci_hcd ehci_hcd tg3 pata_cmd64x usbcore ipmi_si hwmon usb_common libata libphy ipmi_devintf nls_base ipmi_msghandler
[Sun Jun  2 18:46:29 2024] CPU: 0 PID: 26808 Comm: bash Tainted: G W 6.9.3-gentoo-parisc64 #1
[Sun Jun  2 18:46:29 2024] Hardware name: 9000/800/rp3440

[Sun Jun  2 18:46:29 2024]      YZrvWESTHLNXBCVMcbcbcbcbOGFRQPDI
[Sun Jun  2 18:46:29 2024] PSW: 00001000000001101111100100001111 Tainted: G W
[Sun Jun  2 18:46:29 2024] r00-03  000000ff0806f90f 000000004106b280 00000000402090bc 000000005160c6a0
[Sun Jun  2 18:46:29 2024] r04-07  0000000040f99a80 00000000f96da000 00000001659a2360 000000000800000f
[Sun Jun  2 18:46:29 2024] r08-11  0000000c0063f89c 0000000000000000 000000004ce09e9c 000000005160c5a8
[Sun Jun  2 18:46:29 2024] r12-15  000000004ce09eb0 00000000414ebd70 0000000041687768 0000000041646830
[Sun Jun  2 18:46:29 2024] r16-19  00000000516333c0 0000000001200000 00000001c36be780 0000000000000003
[Sun Jun  2 18:46:29 2024] r20-23  0000000000001a46 000000000f584000 ffffffffc0000000 000000000000000f
[Sun Jun  2 18:46:29 2024] r24-27  0000000000000000 000000000800000f 000000004ce09ea0 0000000040f99a80
[Sun Jun  2 18:46:29 2024] r28-31  0000000000000000 000000005160c720 000000005160c750 0000000000000000
[Sun Jun  2 18:46:29 2024] sr00-03  00000000052be800 00000000052be800 0000000000000000 00000000052be800
[Sun Jun  2 18:46:29 2024] sr04-07  0000000000000000 0000000000000000 0000000000000000 0000000000000000
[Sun Jun  2 18:46:29 2024] IASQ: 0000000000000000 0000000000000000 IAOQ: 0000000040209104 0000000040209108
[Sun Jun  2 18:46:29 2024]  IIR: 03ffe01f    ISR: 0000000010240000  IOR: 0000003382609ea0
[Sun Jun  2 18:46:29 2024]  CPU:        0   CR30: 00000000516333c0 CR31: fffffff0f0e05ee0
[Sun Jun  2 18:46:29 2024]  ORIG_R28: 000000005160c7b0
[Sun Jun  2 18:46:29 2024]  IAOQ[0]: flush_cache_page_if_present+0x1a4/0x330
[Sun Jun  2 18:46:29 2024]  IAOQ[1]: flush_cache_page_if_present+0x1a8/0x330
[Sun Jun  2 18:46:29 2024]  RP(r2): flush_cache_page_if_present+0x15c/0x330
[Sun Jun  2 18:46:29 2024] Backtrace:
[Sun Jun  2 18:46:29 2024]  [<000000004020afb8>] flush_cache_mm+0x1a8/0x1c8
[Sun Jun  2 18:46:29 2024]  [<000000004023cf3c>] copy_mm+0x2a8/0xfd0
[Sun Jun  2 18:46:29 2024]  [<0000000040241040>] copy_process+0x1684/0x26e8
[Sun Jun  2 18:46:29 2024]  [<0000000040242218>] kernel_clone+0xcc/0x754
[Sun Jun  2 18:46:29 2024]  [<0000000040242908>] __do_sys_clone+0x68/0x80
[Sun Jun  2 18:46:29 2024]  [<0000000040242d14>] sys_clone+0x30/0x60
[Sun Jun  2 18:46:29 2024]  [<0000000040203fbc>] syscall_exit+0x0/0x10

[Sun Jun  2 18:46:29 2024] ---[ end trace 0000000000000000 ]---



