Re: [RFC] m68k: Fix dead code in bus_error030()

Finn Thain <fthain@xxxxxxxxxxxxxxxxxxx> · Sat, 10 Mar 2018 15:54:16 +1100 (AEDT)

On Sat, 10 Mar 2018, Michael Schmitz wrote:

It's a hardware exception, not a software exception. The bus error is 
generated by signals from one of Apple's ASICs. This logic circuit 
effectively interfaces the SCSI bus with the system bus, via the SCSI 
controller, for performance. But that's hardly relevant. I'm more 
interested in the bug in mainline, not the bug in my RFC patch.

Well, the bug in mainline is what allows PDMA to work on 030,

No it isn't. I proposed an RFC patch that made PDMA fail. To me that means 
that the patch has a problem. It doesn't say anything about the existing 
code.

by a happy coincidence. I was simply wondering what MMU status bits we'd 
likely see if the PDMA ASIC generates an exception.

Well, we know that 0 == (mmusr & (MMU_I|MMU_WP|MMU_B|MMU_L|MMU_S)). Do we 
need to know the state of MMU_M and MMU_T also? I assume that the MMU 
status is exactly the same for a SCSI MMIO access regardless of whether or 
not it happens to fault: the exception (when it happens) tells you 
something about the SCSI bus, not the MMU.
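For illustration only, the mask test above can be written out as a small standalone C sketch. The bit values are assumptions that I believe match the MMU_* definitions in arch/m68k/kernel/traps.c, and mmusr_no_fault_bits() is a hypothetical helper, not a kernel function.

```c
#include <stdbool.h>
#include <stdint.h>

/* 68030 MMU status register bits. Values are assumed to match the
 * definitions in arch/m68k/kernel/traps.c; check them against the tree. */
#define MMU_B  0x8000 /* bus error */
#define MMU_L  0x4000 /* limit violation */
#define MMU_S  0x2000 /* supervisor violation */
#define MMU_WP 0x0800 /* write-protected */
#define MMU_I  0x0400 /* invalid descriptor */
#define MMU_M  0x0200 /* ATC entry modified */
#define MMU_T  0x0040 /* transparent translation */

/* Hypothetical helper: true when none of the five fault-related bits is
 * set. MMU_M and MMU_T are deliberately left out of the mask, so their
 * state does not affect the result. */
static bool mmusr_no_fault_bits(uint16_t mmusr)
{
	return (mmusr & (MMU_I | MMU_WP | MMU_B | MMU_L | MMU_S)) == 0;
}
```

With this mask, mmusr_no_fault_bits(MMU_M | MMU_T) is still true, which is why the state of those two bits remains an open question.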

- What are the implications of the existing logic error?

We might miss handling (MMU_B|MMU_L|MMU_S) && (ssw & RM) (should be 
harmless),

I think that leads to a recursive fault which would kill the machine 
instead of just the user process that caused the fault, but I don't 
have code to confirm this.

Too dangerous to consider, then.

Dangerous enough to want verification with some exception handler tests, I 
think.

config RMW_INSNS
        bool "Use read-modify-write instructions"
        depends on ADVANCED
        ---help---
          This allows to use certain instructions that work with 
          indivisible read-modify-write bus cycles. While this is faster 
          than the workaround of disabling interrupts, it can conflict 
          with DMA ( = direct memory access)

Makes sense. For RMW accesses the CPU has to lock out other bus masters 
like the DMA controller (or any device that might modify memory somehow).

          on many Amiga systems, and it is also said to destabilize 
          other machines. It is very likely that this will cause serious 
          problems on any Amiga or Atari Medusa if set. The only 
          configuration where it should work are 68030-based Ataris, 
          where it apparently improves performance. But you've been 
          warned! Unless you really know what you are doing, say N. Try 
          Y only if you're quite adventurous.

The comment about Atari Medusa (040 IIRC) might no longer be correct 
after Roman fixed the recursive fault on 040.

I've always played it safe and followed this advice. But I agree that 
there are probably systems besides 68030-based Ataris where RMW is just 
fine. But who knows.

But it still might be true for 030 Amiga, where (as I understand the 
statement) DMA operations may interfere with RMW bus cycles which might 
cause such exceptions.

We could use #ifdef CONFIG_RMW_INSNS in the bus_error030() implementation 
but it won't help against user mode RMW faults, so all of this seems to 
beg the same question: "Why a special case for RMW faults?".

- Should the dead code be deleted because the live algorithm cannot 
be improved upon? (The present algorithm works fine for PDMA for 
example.)

The end result might be the same with the current code (except for 
PDMA working): signal to user process (maybe the wrong one; weird 
access forces SEGV), or panic. The main difference is we don't fix up 
the exception from process exception tables. I believe that is what 
makes PDMA work, i.e. fixes up the PDMA bus fault?

I don't follow. Deleting dead code means no difference. Everything 
works the same (and so PDMA keeps working, but that's a red herring).

What I meant is that the end result of the code as-is would be the same 
as fixing the logic to take the default branch, as we all agree should 
have happened.

I don't see much agreement.

You appear to be in agreement with some part of the RFC patch (?) If so, 
let's respond to that message.

The end result of deleting the dead code is, of course, current 
behaviour, which appears to be just fine and changing that would require 
a lot of testing.

Another valid way of looking at it is that the end result of deleting the 
dead code is either 1) the loss of the only signpost in the source code 
that points to the bug or 2) a claim that there is no bug, just dead code.

Faced with those two possibilities, I made code live and sent an RFC 
instead.
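As a rough sketch of why the final branch is dead, assume the data fault path in bus_error030() tests mmusr with a chain of the shape "if (mmusr & (MMU_I | MMU_WP)) ... else if (!(mmusr & MMU_I)) ... else ...". That shape is my reading of mainline, not a quote, and the enum and function names below are hypothetical. Under that assumption, an exhaustive check over all 16-bit values shows the last branch is never taken.

```c
#include <stdint.h>

/* Bit values assumed to match arch/m68k/kernel/traps.c. */
#define MMU_WP 0x0800 /* write-protected */
#define MMU_I  0x0400 /* invalid descriptor */

/* Hypothetical labels for the three branches of the assumed chain. */
enum branch { BRANCH_PAGE_FAULT, BRANCH_NO_INVALID, BRANCH_DEFAULT };

/* Mirrors only the assumed condition chain, not the handler bodies. */
static enum branch classify(uint16_t mmusr)
{
	if (mmusr & (MMU_I | MMU_WP))
		return BRANCH_PAGE_FAULT;
	else if (!(mmusr & MMU_I))
		return BRANCH_NO_INVALID;
	else
		/* Reaching here needs MMU_I set, but any value with MMU_I
		 * set was already caught by the first test. */
		return BRANCH_DEFAULT;
}

/* Counts how many of the 65536 possible mmusr values reach the default
 * branch. */
static unsigned int count_default_hits(void)
{
	unsigned int hits = 0;
	uint32_t v;

	for (v = 0; v <= 0xFFFF; v++)
		if (classify((uint16_t)v) == BRANCH_DEFAULT)
			hits++;
	return hits;
}
```

If the real chain differs from this assumption, the count may differ too. The sketch says nothing about which branch ought to handle which fault, only that the third one cannot be reached as written.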

So perhaps try send_fault_signal() in the default branch, which will 
also run die_if_kernel() if need be.

I tried that (Stan tested it). It turns out that usermode instruction 
faults can also traverse that branch so /sbin/init just crashed.

If they happen together with a data fault, that branch will be used. If 
exception fixup was successful, i.e. send_fault_signal() == -1, you 
should not need to signal but continue to instruction fault handling?

AIUI, send_fault_signal() == -1 can only happen in supervisor mode, which 
is not relevant here. (It is relevant to the PDMA crash but that remains a 
red herring.) I still feel that there's little point in pursuing this 
unless people (maintainers) agree that there's a bug to be fixed.
