On 2024-05-29 12:33, John David Anglin wrote:
On 2024-05-29 11:54 a.m., matoro wrote:
On 2024-05-09 13:10, John David Anglin wrote:
On 2024-05-08 4:52 p.m., John David Anglin wrote:
with no accompanying stack trace, after which the BMC would restart the
whole machine automatically. These panics were infrequent enough that the
segfaults were the bigger problem, but applying this patch on top of 6.8
changed the dynamic. The panics now seem to occur during builds with
varying I/O loads: for example, I was able to build gcc fine, with no
segfaults, but I was unable to build perl, a much smaller build, without
crashing the machine. I did not observe any segfaults over the day or two
I ran this patch, but that's not an unheard-of stretch of time even
without it, and I am being forced to revert because of the panics.
Looks like there is a problem with 6.8. I'll do some testing with it.
So far, I haven't seen any panics with 6.8.9, but I have seen some random
segmentation faults in the gcc testsuite. I looked at one ld fault in some
detail: 18 contiguous words in the elf_link_hash_entry struct were zeroed,
starting with the last word of the embedded bfd_link_hash_entry struct,
and this caused the fault. The section pointer was among the words zeroed.
18 is a rather strange number of words to corrupt, and the corruption
doesn't seem related to object structure. In any case, it is not page
related.
It's really hard to tell how this happens. The corrupt object sits at a
slightly different location when ld is run under gdb, so I can't
duplicate the fault there.
Dave
Dave, not sure how much testing you have done with current mainline
kernels, but I've had to give up on 6.8 and 6.9 for now, as most heavy
builds quickly hit that kernel panic. 6.6 does not seem to have the
problem, though. The patch from this thread does not seem to have made
a difference one way or the other w.r.t. the segfaults.
My latest patch is looking good. I have six days of testing on a c8000
(1 GHz PA8800) with 6.8.10 and 6.8.11, and I haven't had any random
segmentation faults. The system has been building Debian packages; in
addition, it has been building and testing gcc, and it is on its third
gcc build and check with the patch.
The latest version uses lpa_user(), with a fallback to a page table
search, in flush_cache_page_if_present() to obtain the physical page
address.
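Roughly, the idea looks like this (a simplified sketch, not the patch
itself; I'm assuming lpa_user() returns 0 when the TLB holds no
translation, and get_ptep() stands in for the page table walk, with
locking and pte_unmap() elided):

static void flush_cache_page_if_present(struct vm_area_struct *vma,
                                        unsigned long vmaddr)
{
        unsigned long physaddr;
        pte_t *ptep;
        pte_t pte;

        /* Fast path: execute the LPA instruction with the user space
           id. It yields the physical address for vmaddr, or 0 when
           the TLB holds no translation. */
        physaddr = lpa_user(vmaddr);

        /* Fallback: do a software page table search. */
        if (!physaddr) {
                ptep = get_ptep(vma->vm_mm, vmaddr);
                if (!ptep)
                        return;
                pte = ptep_get(ptep);
                if (!pte_present(pte))
                        return;
                physaddr = PFN_PHYS(pte_pfn(pte));
        }

        /* Tmpalias flush of the user and kernel mappings of the page. */
        __flush_cache_page(vma, vmaddr, physaddr);
}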
It revises copy_to_user_page() and copy_from_user_page() to flush the
kernel mapping with tmpalias flushes; copy_from_user_page() was
previously missing the kernel mapping flush.
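The shape of that change, as a sketch (assuming
flush_kernel_dcache_page_addr() is the tmpalias-based kernel flush used;
the actual patch may differ in details such as aligning the address to
the page):

void copy_from_user_page(struct vm_area_struct *vma, struct page *page,
                         unsigned long user_vaddr, void *dst, void *src,
                         unsigned long len)
{
        /* Flush the user mapping first so the page is coherent before
           we read it through the kernel mapping. */
        flush_cache_page_if_present(vma, user_vaddr);
        memcpy(dst, src, len);
        /* The step that was missing: purge the lines the copy pulled
           in through the kernel mapping, so a stale alias of this page
           cannot linger in the cache. */
        flush_kernel_dcache_page_addr(src);
}

copy_to_user_page() is analogous, with the kernel flush covering dst
after the data is written.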
flush_cache_vmap() and flush_cache_vunmap() are moved into cache.c, and
the TLB is now flushed before the cache flush in these routines to
inhibit cache move-in. flush_cache_vmap() now handles small VM_IOREMAP
mappings with a targeted flush instead of flushing the entire cache;
this last change is an optimization.
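As a sketch, the new flush_cache_vmap() logic is roughly as follows
(threshold and helper names taken from the existing parisc code; the
actual patch may treat the non-ioremap cases differently):

void flush_cache_vmap(unsigned long start, unsigned long end)
{
        struct vm_struct *vm;

        /* Flush the TLB first: with no translation in place, nothing
           can move lines for this range back into the cache while we
           are flushing it. */
        flush_tlb_kernel_range(start, end);

        /* For large ranges, a whole-cache flush is still cheaper. */
        if (end - start >= parisc_cache_flush_threshold) {
                flush_cache_all();
                return;
        }

        /* Small VM_IOREMAP mappings now get a targeted range flush. */
        vm = find_vm_area((void *)start);
        if (vm && (vm->flags & VM_IOREMAP)) {
                flush_kernel_dcache_range_asm(start, end);
                return;
        }

        /* Everything else keeps the conservative whole-cache flush. */
        flush_cache_all();
}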
If random faults are still present, I believe we will have to give up
trying to optimize flush_cache_mm() and flush_cache_range(), and simply
flush the whole cache in these routines.
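For reference, that fallback would reduce both routines to something
like this (illustration only):

void flush_cache_mm(struct mm_struct *mm)
{
        /* No per-page precision: write back and invalidate everything.
           Correct, but costly on every fork and exec. */
        flush_cache_all();
}

void flush_cache_range(struct vm_area_struct *vma,
                       unsigned long start, unsigned long end)
{
        flush_cache_all();
}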
Some work would be needed to backport my current patch to longterm kernels
because of folio changes in 6.8.
Dave
Thanks a ton Dave, I've applied this on top of 6.9.2 and also think I'm
seeing improvement! No panics yet. I have a couple of weeks' worth of
package testing to catch up on, so I'll report back if I see anything!