On 2024-05-29 12:33, John David Anglin wrote:
On 2024-05-29 11:54 a.m., matoro wrote:
On 2024-05-09 13:10, John David Anglin wrote:
On 2024-05-08 4:52 p.m., John David Anglin wrote:
with no accompanying stack trace, after which the BMC would restart the
whole machine automatically. These panics were infrequent enough that the
segfaults were the bigger problem, but applying this patch on top of 6.8
changed the dynamic. The panics now seem to occur during builds with
varying I/O loads: for example, I was able to build gcc fine, with no
segfaults, but I was unable to build perl, a much smaller build, without
crashing the machine. I did not observe any segfaults over the day or two
I ran this patch, but that's not an unheard-of stretch of time even
without it, and I am being forced to revert because of the panics.
Looks like there is a problem with 6.8. I'll do some testing with it.
So far, I haven't seen any panics with 6.8.9, but I have seen some random
segmentation faults in the gcc testsuite. I looked at one ld fault in some
detail: 18 contiguous words in the elf_link_hash_entry struct were zeroed,
starting with the last word of the embedded bfd_link_hash_entry struct,
and this caused the fault. The section pointer was among the words zeroed.
18 is a rather strange number of words to corrupt, and the corruption
doesn't seem related to object structure. In any case, it is not page
related.
It's really hard to tell how this happens. The corrupt object sits at a
slightly different location when ld is run under gdb, so I can't
duplicate the fault there.
Dave
Dave, not sure how much testing you have done with current mainline
kernels, but I've had to give up on 6.8 and 6.9 for now, as most heavy
builds quickly hit that kernel panic. 6.6 does not seem to have the
problem, though. The patch from this thread does not seem to have made
a difference one way or the other w.r.t. the segfaults.
My latest patch is looking good. I have six days of testing on a c8000
(1 GHz PA8800) with 6.8.10 and 6.8.11, and I haven't had any random
segmentation faults. The system has been building Debian packages; in
addition, it has been building and testing gcc, and it is on its third
gcc build and check with the patch.
The latest version uses lpa_user(), with a fallback to a page table
search, in flush_cache_page_if_present() to obtain the physical page
address.
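Roughly, the idea looks like this (a simplified sketch, not the patch
itself; I'm assuming lpa_user() returns 0 when the TLB holds no
translation, and get_ptep() stands in for the page table walk, with
locking and pte_unmap() elided):

static void flush_cache_page_if_present(struct vm_area_struct *vma,
                                        unsigned long vmaddr)
{
        unsigned long physaddr;
        pte_t *ptep;
        pte_t pte;

        /* Fast path: execute the LPA instruction with the user space
           id. It yields the physical address for vmaddr, or 0 when
           the TLB holds no translation. */
        physaddr = lpa_user(vmaddr);

        /* Fallback: do a software page table search. */
        if (!physaddr) {
                ptep = get_ptep(vma->vm_mm, vmaddr);
                if (!ptep)
                        return;
                pte = ptep_get(ptep);
                if (!pte_present(pte))
                        return;
                physaddr = PFN_PHYS(pte_pfn(pte));
        }

        /* Tmpalias flush of the user and kernel mappings of the page. */
        __flush_cache_page(vma, vmaddr, physaddr);
}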
It revises copy_to_user_page() and copy_from_user_page() to flush the
kernel mapping with tmpalias flushes; copy_from_user_page() was
previously missing the kernel mapping flush.
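The shape of that change, as a sketch (assuming
flush_kernel_dcache_page_addr() is the tmpalias-based kernel flush used;
the actual patch may differ in details such as aligning the address to
the page):

void copy_from_user_page(struct vm_area_struct *vma, struct page *page,
                         unsigned long user_vaddr, void *dst, void *src,
                         unsigned long len)
{
        /* Flush the user mapping first so the page is coherent before
           we read it through the kernel mapping. */
        flush_cache_page_if_present(vma, user_vaddr);
        memcpy(dst, src, len);
        /* The step that was missing: purge the lines the copy pulled
           in through the kernel mapping, so a stale alias of this page
           cannot linger in the cache. */
        flush_kernel_dcache_page_addr(src);
}

copy_to_user_page() is analogous, with the kernel flush covering dst
after the data is written.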
flush_cache_vmap() and flush_cache_vunmap() are moved into cache.c, and
the TLB is now flushed before the cache flush in these routines to
inhibit cache move-in. flush_cache_vmap() now handles small VM_IOREMAP
mappings with a targeted flush instead of flushing the entire cache;
this last change is an optimization.
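As a sketch, the new flush_cache_vmap() logic is roughly as follows
(threshold and helper names taken from the existing parisc code; the
actual patch may treat the non-ioremap cases differently):

void flush_cache_vmap(unsigned long start, unsigned long end)
{
        struct vm_struct *vm;

        /* Flush the TLB first: with no translation in place, nothing
           can move lines for this range back into the cache while we
           are flushing it. */
        flush_tlb_kernel_range(start, end);

        /* For large ranges, a whole-cache flush is still cheaper. */
        if (end - start >= parisc_cache_flush_threshold) {
                flush_cache_all();
                return;
        }

        /* Small VM_IOREMAP mappings now get a targeted range flush. */
        vm = find_vm_area((void *)start);
        if (vm && (vm->flags & VM_IOREMAP)) {
                flush_kernel_dcache_range_asm(start, end);
                return;
        }

        /* Everything else keeps the conservative whole-cache flush. */
        flush_cache_all();
}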
If random faults are still present, I believe we will have to give up
trying to optimize flush_cache_mm() and flush_cache_range(), and simply
flush the whole cache in these routines.
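For reference, that fallback would reduce both routines to something
like this (illustration only):

void flush_cache_mm(struct mm_struct *mm)
{
        /* No per-page precision: write back and invalidate everything.
           Correct, but costly on every fork and exec. */
        flush_cache_all();
}

void flush_cache_range(struct vm_area_struct *vma,
                       unsigned long start, unsigned long end)
{
        flush_cache_all();
}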
Some work would be needed to backport my current patch to longterm kernels
because of folio changes in 6.8.
Dave
Thanks a ton Dave, I've applied this on top of 6.9.2 and also think I'm
seeing improvement! No panics yet. I have a couple of weeks' worth of
package testing to catch up on, so I'll report back if I see anything!