On Mon, May 23, 2016 at 02:57:30PM -0400, Joshua Kinard wrote:

> NAK, this issue looks completely different to IP30/IP27. In this case, it
> looks like the hardware is detecting the case where multiple TLB entries match
> and it's killing the machine to avoid hardware damage. I don't want to know
> how the SGI systems handle this scenario (does the R10000 do a TLB shutdown??).

The R10000 detects duplicate entries when writing to the TLB and
invalidates the previous entry.  That is, there will never be duplicate
entries in the TLB and of course no TLB shutdown.  That's the theory.  I'm
wondering how well that works when the entries have different page sizes.
And Aaro doesn't always get machine checks, so it's not as if a duplicate
entry is written every time.

> On IP30, using THP usually results in instruction bus errors (IBE), after a set
> time, depending on the machine's configuration (<2GB RAM, virtually instant on
> userland init; >2GB RAM, might survive for a few minutes, even getting all the
> way to runlevel 3 randomly).
>
> IP27 was somewhat similar to IP30, in that THP usually results in IBEs after a
> few seconds of hitting userland bringup (bash is pretty quick at triggering an
> IBE), but I haven't tried experimenting with varying the amount of RAM in that
> machine, due to the fragility of pulling the nodeboards out constantly. I also
> haven't tried THP since refactoring/rewriting the IP27 code back in Feb to see
> if I magically fixed it...

  Ralf
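
To make the mixed page size question concrete, here is a toy software model of the "invalidate the previous entry on write" behaviour described above. It is only a sketch: every name in it (toy_tlb, toy_tlb_write and so on) is hypothetical, the table size is arbitrary, and it assumes the duplicate check compares full address ranges. It is neither kernel code nor a description of what the R10000 actually does; real entries map even/odd page pairs through a PageMask, and that is left out here.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define TOY_TLB_ENTRIES 8	/* arbitrary size chosen for the sketch */

struct toy_tlb_entry {
	uint64_t va;		/* start of the mapped range, aligned to size */
	uint64_t size;		/* page size in bytes, a power of two */
	bool valid;
};

struct toy_tlb {
	struct toy_tlb_entry e[TOY_TLB_ENTRIES];
};

/* True if entry a maps any address inside [va, va + size). */
bool toy_overlap(const struct toy_tlb_entry *a, uint64_t va, uint64_t size)
{
	return a->va < va + size && va < a->va + a->size;
}

/*
 * Write an entry at idx.  Any other valid entry that overlaps the new
 * range is invalidated first, so at most one entry can match an address.
 * Returns the number of entries that were invalidated.
 */
int toy_tlb_write(struct toy_tlb *t, size_t idx, uint64_t va, uint64_t size)
{
	int killed = 0;

	assert(idx < TOY_TLB_ENTRIES);
	for (size_t i = 0; i < TOY_TLB_ENTRIES; i++) {
		if (i != idx && t->e[i].valid &&
		    toy_overlap(&t->e[i], va, size)) {
			t->e[i].valid = false;
			killed++;
		}
	}
	t->e[idx] = (struct toy_tlb_entry){ .va = va, .size = size, .valid = true };
	return killed;
}

/* Count valid entries matching addr; more than one would be a duplicate. */
int toy_tlb_matches(const struct toy_tlb *t, uint64_t addr)
{
	int n = 0;

	for (size_t i = 0; i < TOY_TLB_ENTRIES; i++) {
		const struct toy_tlb_entry *p = &t->e[i];

		if (p->valid && addr >= p->va && addr - p->va < p->size)
			n++;
	}
	return n;
}
```

In this model, writing a 4K entry at 0x403000 and then a 1M entry at 0x400000 invalidates the small entry, and a lookup of 0x403000 matches exactly one entry. If the check instead compared only same-sized entries, both would stay valid and the lookup would match twice, which is the mixed page size case being wondered about.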