On 2017-12-08, at 6:22 AM, Mikulas Patocka wrote:
>
> On Wed, 1 Feb 2017, John David Anglin wrote:
>
>> On 2017-02-01 3:10 PM, Mikulas Patocka wrote:
>>>> I'm not 100% convinced that 4.9 is fully stable and that the patch
>>>> is the reason for the crashes you see.
>>>> What kind of crashes do you see? Userspace or kernel ?
>>> Userspace crashes. Random crashes or internal errors in gcc when compiling
>>> the kernel. I once had "aptitude" crash.
>> The userspace crashes are present in 4.8 and 4.9 as well. For example, this
>> build failed due to an OS problem:
>> https://buildd.debian.org/status/fetch.php?pkg=kdenlive&arch=hppa&ver=16.12.1-2&stamp=1485956026&raw=0
>>
>> Probably, 10% or more of large packages fail to build because of this. Note
>> that this only occurs on machines (e.g., c8000) that only support equivalent
>> aliases. We don't see this on the parisc buildd which has two PA8600 CPUs.
>>
>> My current theory is that the following functions are buggy:
>>
>> /* vmap range flushes and invalidates.  Architecturally, we don't need
>>  * the invalidate, because the CPU should refuse to speculate once an
>>  * area has been flushed, so invalidate is left empty */
>> static inline void flush_kernel_vmap_range(void *vaddr, int size)
>> {
>> 	unsigned long start = (unsigned long)vaddr;
>>
>> 	flush_kernel_dcache_range_asm(start, start + size);
>> }
>>
>> static inline void invalidate_kernel_vmap_range(void *vaddr, int size)
>> {
>> 	unsigned long start = (unsigned long)vaddr;
>> 	void *cursor = vaddr;
>>
>> 	for ( ; cursor < vaddr + size; cursor += PAGE_SIZE) {
>> 		struct page *page = vmalloc_to_page(cursor);
>>
>> 		if (test_and_clear_bit(PG_dcache_dirty, &page->flags))
>> 			flush_kernel_dcache_page(page);
>> 	}
>> 	flush_kernel_dcache_range_asm(start, start + size);
>> }
>
> BTW. if you flush a cache line, then - according to the pa-risc
> specification - the page stays in the TLB and the CPU can fetch anything
> that is in the TLB speculatively. So, such a flush could really have no
> effect.
>
> The kernel should first flush TLB for the affected range and then flush
> the data using the tmpalias mapping.

I agree.  Flushing using the tmpalias mapping handles cache move-in
correctly, but at the moment we only have routines to flush whole pages.

I think the big problem is that we don't create translations for non-access
TLB misses correctly.  See the top of page F-11.  We should set the access
rights to 0 or 1 to prevent I-cache move-in, and the T bit to 1 to prevent
D-cache move-in.  As things stand, we set up the TLB entry for non-access
exceptions the same as we do for normal access exceptions.  As a result,
cache flushes may themselves cause a problem.

I had always wondered why this code is backwards:

void flush_kernel_dcache_page_addr(void *addr)
{
	unsigned long flags;

	flush_kernel_dcache_page_asm(addr);
	purge_tlb_start(flags);
	pdtlb_kernel(addr);
	purge_tlb_end(flags);
}

I did try reversing the order yesterday and it seemed to increase the
number of random segmentation faults.  As it stands, there is a bit of a
race between the cache flush and the TLB purge.

Dave

--
John David Anglin	dave.anglin@xxxxxxxx
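
P.S. For concreteness, "reversing the order" above means something like the
following minimal, untested sketch: purge the kernel TLB entry first, so the
CPU should no longer be able to speculatively move lines for the page back
in, and only then flush the data cache.  The _reversed name is only for
illustration; the helpers are the ones used in the quoted
flush_kernel_dcache_page_addr().

void flush_kernel_dcache_page_addr_reversed(void *addr)
{
	unsigned long flags;

	/* Drop the kernel TLB entry for this page first, under the
	 * usual TLB purge locking. */
	purge_tlb_start(flags);
	pdtlb_kernel(addr);
	purge_tlb_end(flags);

	/* Then flush the page's D-cache lines.  Note this flush will now
	 * take a non-access TLB miss for addr, which is exactly where the
	 * miss-handler problem described above comes into play. */
	flush_kernel_dcache_page_asm(addr);
}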