On 2024-05-30 01:00, matoro wrote:
On 2024-05-29 12:33, John David Anglin wrote:
On 2024-05-29 11:54 a.m., matoro wrote:
On 2024-05-09 13:10, John David Anglin wrote:
On 2024-05-08 4:52 p.m., John David Anglin wrote:
with no accompanying stack trace and then the BMC would restart the
whole machine automatically. These were infrequent enough that the
segfaults were the bigger problem, but applying this patch on top of 6.8
changed the dynamic. The panics seem to occur during builds with
varying I/O loads. For example, I was able to build gcc fine, with no
segfaults, but I was unable to build perl, a much smaller build,
without crashing the machine. I did not observe any segfaults over the
day or two I ran this patch, but that's not an unheard-of stretch of
time even without it, and I am being forced to revert because of the panics.
Looks like there is a problem with 6.8. I'll do some testing with it.
So far, I haven't seen any panics with 6.8.9 but I have seen some random
segmentation faults in the gcc testsuite. I looked at one ld fault in some
detail. 18 contiguous words in the elf_link_hash_entry struct were zeroed
starting with the last word in the bfd_link_hash_entry struct causing the
fault. The section pointer was zeroed.
18 words is a rather strange number of words to corrupt and corruption
doesn't seem related to object structure. In any case, it is not page related.
It's really hard to tell how this happens. The corrupt object was at a
slightly different location than it is when ld is run under gdb. Can't
duplicate in gdb.
Dave
Dave, not sure how much testing you have done with current mainline
kernels, but I've had to temporarily give up on 6.8 and 6.9 for now, as
most heavy builds quickly hit that kernel panic. 6.6 does not seem to have
the problem though. The patch from this thread does not seem to have made
a difference one way or the other w.r.t. segfaults.
My latest patch is looking good. I have 6 days of testing on c8000 (1 GHz
PA8800) with 6.8.10 and 6.8.11, and I haven't had any random segmentation
faults. System has been building debian packages. In addition, it has
been building and testing gcc. It's on its third gcc build and check with
patch.
The latest version uses lpa_user() with fallback to page table search in
flush_cache_page_if_present() to obtain physical page address.
It revises copy_to_user_page() and copy_from_user_page() to flush kernel
mapping with tmpalias flushes. copy_from_user_page() was missing kernel
mapping flush. flush_cache_vmap() and flush_cache_vunmap() are moved into
cache.c. TLB is now flushed before cache flush to inhibit move-in in these
routines. flush_cache_vmap() now handles small VM_IOREMAP flushes instead of
flushing entire cache. This latter change is an optimization.
If random faults are still present, I believe we will have to give up
trying to optimize flush_cache_mm() and flush_cache_range() and
flush the whole cache in these routines.
Some work would be needed to backport my current patch to longterm kernels
because of folio changes in 6.8.
Dave
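The lookup strategy described above for flush_cache_page_if_present() (try
lpa_user() first, fall back to a page table search) follows a common
fast-path/slow-path pattern. What follows is only a toy userspace model of
that pattern, not the kernel code. The names fast_lookup(),
walk_page_table() and physical_page(), the dict-based "TLB" and "page
table", the 4 KiB page size, and the convention that a miss is reported as 0
are all assumptions made for illustration.

```python
PAGE_SHIFT = 12  # assumption: 4 KiB pages in this toy model
PAGE_MASK = (1 << PAGE_SHIFT) - 1


def fast_lookup(tlb, vaddr):
    """Hypothetical stand-in for a hardware translation probe.

    Returns the physical address, or 0 when no translation is cached.
    """
    pfn = tlb.get(vaddr >> PAGE_SHIFT)
    if pfn is None:
        return 0
    return (pfn << PAGE_SHIFT) | (vaddr & PAGE_MASK)


def walk_page_table(page_table, vaddr):
    """Hypothetical slow path: consult a flat, toy page table."""
    pfn = page_table.get(vaddr >> PAGE_SHIFT)
    if pfn is None:
        return 0
    return (pfn << PAGE_SHIFT) | (vaddr & PAGE_MASK)


def physical_page(tlb, page_table, vaddr):
    """Try the fast probe first; fall back to the table walk on a miss."""
    paddr = fast_lookup(tlb, vaddr)
    if paddr == 0:
        paddr = walk_page_table(page_table, vaddr)
    return paddr & ~PAGE_MASK  # page-aligned address, 0 if unmapped
```

In this model a caller would only flush when physical_page() returns a
non-zero address, which is the "if present" part of the routine's name.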
Thanks a ton Dave, I've applied this on top of 6.9.2 and also think I'm
seeing improvement! No panics yet. I have a couple of weeks' worth of package
testing to catch up on, so I'll report if I see anything!
I've seen a few warnings in my dmesg while testing, although I didn't see any
immediately corresponding failures. Any danger?
[Sun Jun 2 18:46:29 2024] ------------[ cut here ]------------
[Sun Jun 2 18:46:29 2024] WARNING: CPU: 0 PID: 26808 at arch/parisc/kernel/cache.c:624 flush_cache_page_if_present+0x1a4/0x330
[Sun Jun 2 18:46:29 2024] Modules linked in: raw_diag tcp_diag inet_diag netlink_diag unix_diag nfnetlink overlay loop nfsv4 dns_resolver nfs lockd grace sunrpc netfs autofs4 binfmt_misc sr_mod ohci_pci cdrom ehci_pci ohci_hcd ehci_hcd tg3 pata_cmd64x usbcore ipmi_si hwmon usb_common libata libphy ipmi_devintf nls_base ipmi_msghandler
[Sun Jun 2 18:46:29 2024] CPU: 0 PID: 26808 Comm: bash Tainted: G W 6.9.3-gentoo-parisc64 #1
[Sun Jun 2 18:46:29 2024] Hardware name: 9000/800/rp3440
[Sun Jun 2 18:46:29 2024] YZrvWESTHLNXBCVMcbcbcbcbOGFRQPDI
[Sun Jun 2 18:46:29 2024] PSW: 00001000000001101111100100001111 Tainted: G W
[Sun Jun 2 18:46:29 2024] r00-03 000000ff0806f90f 000000004106b280 00000000402090bc 000000005160c6a0
[Sun Jun 2 18:46:29 2024] r04-07 0000000040f99a80 00000000f96da000 00000001659a2360 000000000800000f
[Sun Jun 2 18:46:29 2024] r08-11 0000000c0063f89c 0000000000000000 000000004ce09e9c 000000005160c5a8
[Sun Jun 2 18:46:29 2024] r12-15 000000004ce09eb0 00000000414ebd70 0000000041687768 0000000041646830
[Sun Jun 2 18:46:29 2024] r16-19 00000000516333c0 0000000001200000 00000001c36be780 0000000000000003
[Sun Jun 2 18:46:29 2024] r20-23 0000000000001a46 000000000f584000 ffffffffc0000000 000000000000000f
[Sun Jun 2 18:46:29 2024] r24-27 0000000000000000 000000000800000f 000000004ce09ea0 0000000040f99a80
[Sun Jun 2 18:46:29 2024] r28-31 0000000000000000 000000005160c720 000000005160c750 0000000000000000
[Sun Jun 2 18:46:29 2024] sr00-03 00000000052be800 00000000052be800 0000000000000000 00000000052be800
[Sun Jun 2 18:46:29 2024] sr04-07 0000000000000000 0000000000000000 0000000000000000 0000000000000000
[Sun Jun 2 18:46:29 2024] IASQ: 0000000000000000 0000000000000000 IAOQ: 0000000040209104 0000000040209108
[Sun Jun 2 18:46:29 2024] IIR: 03ffe01f ISR: 0000000010240000 IOR: 0000003382609ea0
[Sun Jun 2 18:46:29 2024] CPU: 0 CR30: 00000000516333c0 CR31: fffffff0f0e05ee0
[Sun Jun 2 18:46:29 2024] ORIG_R28: 000000005160c7b0
[Sun Jun 2 18:46:29 2024] IAOQ[0]: flush_cache_page_if_present+0x1a4/0x330
[Sun Jun 2 18:46:29 2024] IAOQ[1]: flush_cache_page_if_present+0x1a8/0x330
[Sun Jun 2 18:46:29 2024] RP(r2): flush_cache_page_if_present+0x15c/0x330
[Sun Jun 2 18:46:29 2024] Backtrace:
[Sun Jun 2 18:46:29 2024] [<000000004020afb8>] flush_cache_mm+0x1a8/0x1c8
[Sun Jun 2 18:46:29 2024] [<000000004023cf3c>] copy_mm+0x2a8/0xfd0
[Sun Jun 2 18:46:29 2024] [<0000000040241040>] copy_process+0x1684/0x26e8
[Sun Jun 2 18:46:29 2024] [<0000000040242218>] kernel_clone+0xcc/0x754
[Sun Jun 2 18:46:29 2024] [<0000000040242908>] __do_sys_clone+0x68/0x80
[Sun Jun 2 18:46:29 2024] [<0000000040242d14>] sys_clone+0x30/0x60
[Sun Jun 2 18:46:29 2024] [<0000000040203fbc>] syscall_exit+0x0/0x10
[Sun Jun 2 18:46:29 2024] ---[ end trace 0000000000000000 ]---
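For anyone comparing several of these warnings, the backtrace entries can be
pulled out mechanically. Below is a minimal sketch that assumes the
"[<address>] symbol+0xoffset/0xsize" layout seen in the trace above.
parse_frames() and FRAME_RE are hypothetical helper names, not part of any
kernel tooling, and the sample text is copied from the first two backtrace
lines.

```python
import re

# Matches entries such as "[<000000004020afb8>] flush_cache_mm+0x1a8/0x1c8".
FRAME_RE = re.compile(
    r"\[<(?P<addr>[0-9a-f]+)>\]\s+"
    r"(?P<sym>[\w.]+)\+0x(?P<off>[0-9a-f]+)/0x(?P<size>[0-9a-f]+)"
)


def parse_frames(log_text):
    """Return (address, symbol, offset, size) for each backtrace entry."""
    frames = []
    for m in FRAME_RE.finditer(log_text):
        frames.append(
            (int(m["addr"], 16), m["sym"], int(m["off"], 16), int(m["size"], 16))
        )
    return frames


sample = """\
[Sun Jun 2 18:46:29 2024] [<000000004020afb8>] flush_cache_mm+0x1a8/0x1c8
[Sun Jun 2 18:46:29 2024] [<000000004023cf3c>] copy_mm+0x2a8/0xfd0
"""

for addr, sym, off, size in parse_frames(sample):
    # off is the position inside the function, size the function's length.
    print(f"{sym}: {off:#x} bytes into a {size:#x}-byte function at {addr:#x}")
```

Read bottom to top, the trace above shows the warning being reached from a
clone() system call through copy_process(), copy_mm() and flush_cache_mm().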