Re: permanent XFS volume corruption

Eric Sandeen <sandeen@xxxxxxxxxxx> · Mon, 15 May 2017 11:52:37 -0500

On 5/15/17 4:22 AM, Jan Beulich wrote:
>>>> On 12.05.17 at 17:11, <sandeen@xxxxxxxxxxx> wrote:
> 
>>
>> On 5/12/17 10:04 AM, Eric Sandeen wrote:
>>> On 5/12/17 9:09 AM, Jan Beulich wrote:
>>>>>>> On 12.05.17 at 15:56, <sandeen@xxxxxxxxxxx> wrote:
>>>>> On 5/12/17 1:26 AM, Jan Beulich wrote:
>>>>>> So on the earlier instance, where I did run actual repairs (and
>>>>>> indeed multiple of them), the problem re-surfaces every time
>>>>>> I mount the volume again.
>>>>> Ok, what is the exact sequence there, from repair to re-corruption?
>>>> Simply mount the volume after repairing (with or without an
>>>> intermediate reboot) and access respective pieces of the fs
>>>> again. As said, with /var/run affected on that first occasion,
>>>> I couldn't even cleanly boot again without seeing the
>>>> corruption re-surface.
>>>
>>> Mount under what kernel, and access in what way?  I'm looking for a
>>> recipe to reproduce what you've seen using the metadump you've provided.
>>>
>>> However:
>>>
>>> With further testing I see that xfs_repair v3.1.8 /does not/
>>> entirely fix the fs; if I run 3.1.8 and then run upstream repair, it
>>> finds and fixes more bad flags on inode 764 (lib/xenstored/tdb) that 3.1.8
>>> didn't touch.  The verifiers in an upstream kernel may keep tripping
>>> over that until newer repair fixes it...
>>
>> (Indeed just running xfs_repair 3.1.8 finds the same corruption over and 
>> over)
>>
>> Please try a newer xfs_repair, and see if it resolves your problem.
> 
> It seems to have improved the situation (on the first system I had
> the issue on), but leaves me with at least "Operation not permitted"
> upon init scripts (or me manually) rm-ing (or mv-ing) /var/run/*.pid
> (or mv-ing even /var/run itself). I'm not sure how worried I need to
> be, but this surely doesn't look overly healthy yet. The kernel
> warnings are all gone, though.

xfs_repair simply makes the filesystem consistent; it doesn't perform
any other magic.  :)  The corruption we saw was related to incorrect
flags set on an inode - in some cases flags like immutable, which can
affect access to the file.

I'm not sure we've made much progress on the root cause of whatever set
those extra flags*, but all repair will do is make them sane from a
filesystem consistency POV, not from an OS operation POV.

Check the files in question with lsattr, and see if there are unexpected
flags still set.
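
As a rough illustration of that check (a sketch only, not something
from this thread): the Python below parses lsattr output and reports
entries that carry the immutable ('i') or append-only ('a') flag, either
of which can cause "Operation not permitted" on rm or mv.  The helper
names are made up for this example, and it assumes the usual lsattr
layout of a flags field, a space, then the path.

```python
import subprocess

# Flags that commonly block rm/mv; letters as documented in chattr(1).
BLOCKING_FLAGS = {"i": "immutable", "a": "append-only"}


def parse_lsattr_line(line):
    """Split one lsattr output line into (set of flag letters, path)."""
    flags_field, _, path = line.strip().partition(" ")
    letters = {c for c in flags_field if c != "-"}
    return letters, path.strip()


def blocking_flags(lines):
    """Return {path: [flag names]} for entries with 'i' or 'a' set."""
    found = {}
    for line in lines:
        if not line.strip():
            continue
        letters, path = parse_lsattr_line(line)
        names = [BLOCKING_FLAGS[c] for c in sorted(letters) if c in BLOCKING_FLAGS]
        if names:
            found[path] = names
    return found


def check(path):
    """Run `lsattr -d` on a single path and report blocking flags."""
    result = subprocess.run(
        ["lsattr", "-d", path], capture_output=True, text=True, check=True
    )
    return blocking_flags(result.stdout.splitlines())
```

Calling check() on /var/run, or on one of the .pid files under it,
returns an empty dict if neither flag is set.  If a flag does turn up
and is not wanted, chattr -i (or -a) run as root clears it.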

-Eric

* but backing up towards root cause, you said this all started when a 4.11
kernel crashed, and the log replayed?  What kind of crash, what caused it,
what were the messages?

> Jan
