Hi Dave,

On Tue, 10 Feb 2015 08:24:20 +1100 Dave Chinner wrote:
> On Mon, Feb 09, 2015 at 09:47:01AM +0100, Bruno Prémont wrote:
> > On Fri, 6 Feb 2015 09:15:16 +1100 Dave Chinner wrote:
> > > On Thu, Feb 05, 2015 at 03:10:07PM +0100, Bruno Prémont wrote:
> > > > New crash, new trace, this time on 3.18.2.
> > > > It looks like this time a NULL dereference happened prior to touched memory poison being detected.
> > > >
> > > > Once again it's during normal system operation (no mount/umount activity)
> > >
> > > Can you rebuild the kernel with CONFIG_XFS_WARN=y and see if that
> > > throws any interesting messages into logs?
> >
> > Will try and see
> >
> > > However:
> > >
> > > > [1900390.261491] =============================================================================
> > > > [1900390.272989] BUG task_struct (Tainted: G D W ): Poison overwritten
> > > > [1900390.283021] -----------------------------------------------------------------------------
> > > > [1900390.283021]
> > > > [1900390.297056] INFO: 0xffff880213d651b3-0xffff880213d651b3. First byte 0x6d instead of 0x6b
> > > > [1900390.309044] INFO: Slab 0xffffea00084f5800 objects=16 used=16 fp=0x (null) flags=0x8000000000004080
> > > > [1900390.323087] INFO: Object 0xffff880213d64ba0 @offset=19360 fp=0xffff880213d61e40
> > > > [1900390.323087]
> > > > [1900390.336988] Bytes b4 ffff880213d64b90: 60 2d d6 13 02 88 ff ff 5a 5a 5a 5a 5a 5a 5a 5a `-......ZZZZZZZZ
> > > > [1900390.350988] Object ffff880213d64ba0: 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b kkkkkkkkkkkkkkkk
> > > > [1900390.364943] Object ffff880213d64bb0: 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b kkkkkkkkkkkkkkkk
> > > ....
> > > > [1900391.674636] Object ffff880213d651b0: 6b 6b 6b 6d 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b kkkmkkkkkkkkkkkk
> > >                                                      ^^
> > >
> > > There's a single bit that has been flipped in the task_struct slab.
> > > So more than just XFS is seeing memory corruption - this is in core
> > > kernel structure slab caches. I'm not sure, either, how XFS could
> > > cause corruption in this slab.
> > >
> > > So, I'd be checking all the previous memory corruptions to see if
> > > they are single bit errors, and if there is any pattern to the
> > > addresses at which they occur. The above bit flip makes me think
> > > "hardware issue" and everything else stems from that...
> >
> > System has ECC RAM so faulty RAM looks less probable (no complaint seen
> > by kernel nor recorded by firmware).
>
> Sure, but that's not the only hardware in the memory path so single
> bit errors can occur elsewhere as data moved across the bus or sits
> in cpu caches. and if you're not using an IOMMU then it could even
> be hardware writing to memory incorrectly...
>
> > All previous crashes for which I have some logs were dereference after
> > free but not attempt to allocate memory from a modified poison in free
> > slabs.
> >
> > Though what does that single bit represent in that area if it was
> > used/modified after free?
>
> It means that there's either a use after free, or you have a
> hardware problem. being in the task struct slab, if it's a use after
> free then it's unlikely to be an XFS problem.

What I meant is: which field of task_struct does the affected byte/bit
belong to? Knowing that would help tell whether this could be a
write-after-free (of a task_struct) or not.

> FWIW, can you post the output of "grep PARAVIRT <kernel config
> file>"?

grep does not find any match (full config, prior to enabling XFS_WARN
attached).

Cheers,
Bruno
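
For reference, the numbers in the SLUB report quoted above can be turned into a byte offset within the object with a few lines of Python. This is only a rough sketch. It assumes that the "Object" address printed by SLUB is the start of the task_struct itself, and that 0x6b is the free-object poison byte, as the "instead of 0x6b" message suggests. The helper name describe_corruption is made up for illustration. Mapping the offset to an actual field still needs the struct layout of this exact kernel build (for example from pahole run against a vmlinux with debug info), which is not available in this thread.

```python
# Values copied from the SLUB report quoted in this thread.
OBJECT_START = 0xffff880213d64ba0   # "INFO: Object 0xffff880213d64ba0 @offset=19360"
BAD_BYTE_ADDR = 0xffff880213d651b3  # "INFO: 0x...51b3-0x...51b3" (a single byte)
EXPECTED = 0x6b                     # poison byte the report expected
FOUND = 0x6d                        # byte actually found


def describe_corruption(obj_start, bad_addr, expected, found):
    """Hypothetical helper: locate the bad byte and list the differing bits.

    Returns (offset into the object, XOR mask, list of differing bit numbers).
    Assumes obj_start is the first byte of the object, with no debug
    metadata in front of it.
    """
    offset = bad_addr - obj_start
    mask = expected ^ found
    bits = [bit for bit in range(8) if mask & (1 << bit)]
    return offset, mask, bits


offset, mask, bits = describe_corruption(OBJECT_START, BAD_BYTE_ADDR,
                                         EXPECTED, FOUND)
print("offset into object: %d (0x%x)" % (offset, offset))
print("xor mask: 0x%02x, differing bits: %s" % (mask, bits))
```

With the values from the report this gives an offset of 1555 (0x613) bytes into the object and an XOR mask of 0x06 against the poison value. The offset can then be compared against the task_struct layout of the running kernel to see which member covers that byte.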
Attachment:
xfs.config
Description: Binary data
_______________________________________________ xfs mailing list xfs@xxxxxxxxxxx http://oss.sgi.com/mailman/listinfo/xfs