Re: [PATCH] parisc: Try to fix random segmentation faults in package builds

On 2024-05-29 12:33, John David Anglin wrote:
On 2024-05-29 11:54 a.m., matoro wrote:
On 2024-05-09 13:10, John David Anglin wrote:
On 2024-05-08 4:52 p.m., John David Anglin wrote:
with no accompanying stack trace and then the BMC would restart the whole machine automatically. These were infrequent enough that the segfaults were the bigger problem, but applying this patch on top of 6.8 changed the dynamic. The panics seem to occur during builds with varying I/O loads. For example, I was able to build gcc fine, with no segfaults, but I was unable to build perl, a much smaller build, without crashing the machine. I did not observe any segfaults over the day or two I ran this patch, but that's not an unheard-of stretch of time even without it, and I am being forced to revert because of the panics.
Looks like there is a problem with 6.8.  I'll do some testing with it.
So far, I haven't seen any panics with 6.8.9, but I have seen some random segmentation faults in the gcc testsuite. I looked at one ld fault in some detail. 18 contiguous words in the elf_link_hash_entry struct were zeroed, starting with the last word in the bfd_link_hash_entry struct, and this caused the fault.
The section pointer was zeroed.

18 words is a rather strange number of words to corrupt, and the corruption doesn't seem to be
related to the object structure. In any case, it is not page related.

It's really hard to tell how this happens.  The corrupt object was at a slightly different location
than it is when ld is run under gdb.  Can't duplicate in gdb.

Dave

Dave, not sure how much testing you have done with current mainline kernels, but I've had to temporarily give up on 6.8 and 6.9 for now, as most heavy builds quickly hit that kernel panic. 6.6 does not seem to have the problem though.  The patch from this thread does not seem to have made a difference one way or the other w.r.t. segfaults.
My latest patch is looking good. I have 6 days of testing on a c8000 (1 GHz PA8800) with 6.8.10 and 6.8.11, and I haven't had any random segmentation faults. The system has been building Debian packages. In addition, it has been building and testing gcc. It's on its third gcc build and check with the patch.

The latest version uses lpa_user() with fallback to a page table search in flush_cache_page_if_present() to obtain the physical page address. It revises copy_to_user_page() and copy_from_user_page() to flush the kernel mapping with tmpalias flushes. copy_from_user_page() was missing the kernel mapping flush. flush_cache_vmap() and flush_cache_vunmap() are moved into cache.c. The TLB is now flushed before the cache flush in these routines to inhibit move-in. flush_cache_vmap() now handles small VM_IOREMAP flushes instead of flushing the entire cache. This latter change is an optimization.
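
The ordering described in this paragraph (look up the physical address, fall back to a slower search if the lookup fails, then purge the TLB before flushing the cache) can be illustrated with a small userspace model. This is only a sketch of the control flow, not the patch: every `sim_` name below is a hypothetical stand-in, none of them is a kernel interface, and the 4 KiB page size is an assumption made for the example.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical userspace model; nothing here is a kernel interface. */
#define SIM_PAGE_SHIFT 12   /* assumed 4 KiB pages */
#define SIM_PAGE_MASK  (~(((uint64_t)1 << SIM_PAGE_SHIFT) - 1))
#define SIM_NO_PHYS    UINT64_MAX

struct sim_mapping {
    uint64_t vaddr;     /* page-aligned virtual address */
    uint64_t paddr;     /* physical address backing it */
    bool probe_ok;      /* whether the fast probe can translate it */
};

struct sim_mm {
    const struct sim_mapping *maps;
    size_t nmaps;
    char log[8];        /* sequence of events, one char each */
    size_t nlog;
};

static void sim_log(struct sim_mm *mm, char event)
{
    assert(mm->nlog + 1 < sizeof mm->log);
    mm->log[mm->nlog++] = event;
    mm->log[mm->nlog] = '\0';
}

static const struct sim_mapping *sim_find(const struct sim_mm *mm,
                                          uint64_t vaddr)
{
    for (size_t i = 0; i < mm->nmaps; i++)
        if (mm->maps[i].vaddr == (vaddr & SIM_PAGE_MASK))
            return &mm->maps[i];
    return NULL;
}

/* Stand-in for a fast address probe that may fail. */
static uint64_t sim_probe(const struct sim_mm *mm, uint64_t vaddr)
{
    const struct sim_mapping *m = sim_find(mm, vaddr);
    return (m && m->probe_ok) ? m->paddr : SIM_NO_PHYS;
}

/* Stand-in for the slower page table search used as a fallback. */
static uint64_t sim_table_walk(const struct sim_mm *mm, uint64_t vaddr)
{
    const struct sim_mapping *m = sim_find(mm, vaddr);
    return m ? m->paddr : SIM_NO_PHYS;
}

/* Returns true if the page was present and "flushed". */
bool sim_flush_page_if_present(struct sim_mm *mm, uint64_t vaddr)
{
    uint64_t paddr = sim_probe(mm, vaddr);
    char how = 'L';                 /* 'L': fast probe succeeded */

    if (paddr == SIM_NO_PHYS) {
        paddr = sim_table_walk(mm, vaddr);
        how = 'W';                  /* 'W': fell back to the table walk */
    }
    if (paddr == SIM_NO_PHYS)
        return false;               /* not present: nothing to flush */

    sim_log(mm, how);
    sim_log(mm, 'T');               /* purge the TLB entry first ... */
    sim_log(mm, 'C');               /* ... then flush the cache lines */
    return true;
}
```

In this model the event log reads `LTC` when the fast probe succeeds, `WTC` when the table walk is needed, and stays empty when the page is not mapped. The point of the `T` before `C` ordering is the one given above: with the translation gone, lines cannot be moved back in through it while the flush is in progress.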

If random faults are still present, I believe we will have to give up trying to optimize flush_cache_mm() and flush_cache_range() and
flush the whole cache in these routines.

Some work would be needed to backport my current patch to longterm kernels because of folio changes in 6.8.

Dave

Thanks a ton Dave, I've applied this on top of 6.9.2 and also think I'm seeing improvement! No panics yet. I have a couple of weeks' worth of package testing to catch up on, so I'll report if I see anything!



