Re: [Bisected] Corruption of root fs during git bisect of drm system hang

Markus Trippelsdorf <markus@xxxxxxxxxxxxxxx> · Fri, 19 Jul 2013 18:32:20 +0200

On 2013.07.19 at 11:02 -0500, Eric Sandeen wrote:
> On 7/19/13 7:51 AM, Markus Trippelsdorf wrote:
> > On 2013.07.19 at 14:41 +0200, Stefan Ring wrote:
> >>> I've bisected this issue to the following commit:
> >>>
> >>>  commit cca9f93a52d2ead50b5da59ca83d5f469ee4be5f
> >>>  Author: Dave Chinner <dchinner@xxxxxxxxxx>
> >>>  Date:   Thu Jun 27 16:04:49 2013 +1000
> >>>
> >>>      xfs: don't do IO when creating an new inode
> >>>
> >>> Reverting this commit on top of the Linus tree "solves" all problems for
> >>> me. IOW I no longer lose my KDE and LibreOffice config files during a
> >>> crash. Log recovery now works fine and xfs_repair shows no issues.
> >>>
> >>> So users of 3.11.0-rc1 beware. Only run this version if you have
> >>> up-to-date backups handy.
> 
> Are you certain about that bisection point?  All that does is
> say:  When we allocate a new inode, assign it a random generation
> number, rather than reading it from disk & incrementing the
> older generation number, AFAICS.  So it simply avoids a read IO.

Yes, I'm sure.
As I wrote above, I also double-checked by reverting the commit on top of
the current Linus tree.
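
For readers following along, the change described in the quoted text can be
modelled roughly as below. This is only a userspace sketch of the idea, not
the kernel code: the names disk_generations, stats, alloc_gen_old and
alloc_gen_new are made up for illustration, and the 32-bit width of the
generation number is an assumption.

```python
import random

# Hypothetical stand-in for inode generation numbers already on disk,
# keyed by inode number. Not real XFS data structures.
disk_generations = {128: 7, 129: 41}

# Counts simulated read IOs so the difference between the two paths is visible.
stats = {"read_ios": 0}


def alloc_gen_old(ino):
    """Model of the old behaviour: read the stale inode, bump its generation."""
    stats["read_ios"] += 1  # the read IO that the commit removes
    old_gen = disk_generations.get(ino, 0)
    return (old_gen + 1) & 0xFFFFFFFF  # assumed 32-bit wraparound


def alloc_gen_new(ino, rng=random):
    """Model of the new behaviour: pick a random generation, no read IO."""
    return rng.getrandbits(32)


if __name__ == "__main__":
    print("old path:", alloc_gen_old(128), "reads:", stats["read_ios"])
    print("new path:", alloc_gen_new(128), "reads:", stats["read_ios"])
```

The model only shows that the new path skips the read. It says nothing about
why skipping it would lead to the inconsistencies seen after log replay.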

> I wonder if simply changing IO patterns on the SSD changes how
> it's doing caching & destaging <handwave>.

No. The corruption also happens on my conventional (spinning) drives.

> >> What I miss in this thread is a distinction between filesystem
> >> corruption on the one hand and a few zeroed files on the other. The
> >> latter may be a nuisance, but it is expected behavior, while the
> >> former should never happen, period, if I'm not mistaken.
> > 
> > Well, it is natural that fs developers at first try to blame userspace.
> 
> I disagree with that, we just need to be clear about your scenarios,
> and what integrity guarantees should apply.
> 
> > Unfortunately it turned out that in this case there is filesystem
> > corruption. (Fortunately this normally happens only very rarely on rc1
> > kernels).
> 
> Corruption is when you get back data that you did not write,
> or metadata which is inconsistent or unreadable even after a proper
> log replay.
> 
> Corruption is _not_ unsynced, buffered data that was lost on a
> crash or poweroff.
> 
> But I might not have followed the thread properly, and I might
> misunderstand your situation.
> 
> When you experience this lost file [data] scenario, was it after an
> orderly reboot, or after a crash and/or system reset?

To reproduce this issue, simply boot into your desktop, hit sysrq-c and
reboot. Log replay completes without error messages, yet the filesystem
is left in an inconsistent state and many small config files are lost.
There are also undeletable files. You need to run xfs_repair manually to
bring the filesystem back to normal.

When cca9f93a52d is reverted, you don't lose your config files and the
filesystem is OK after log replay. xfs_repair reports no issues at all.
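
The check after each crash can be scripted along these lines. It is a
minimal sketch: the device path is a placeholder, the helper names are made
up, and it assumes the filesystem is unmounted (for a root fs, run it from a
rescue system). It relies on the documented behaviour of xfs_repair -n, which
makes no changes and exits with 1 if corruption was detected and 0 if not.

```python
import subprocess


def repair_check_command(device):
    """Build the no-modify xfs_repair invocation for a device."""
    return ["xfs_repair", "-n", device]


def interpret_repair_status(returncode):
    """Map the exit status of `xfs_repair -n` to a short verdict."""
    if returncode == 0:
        return "clean"
    if returncode == 1:
        return "corruption detected"
    # Anything else is treated as inconclusive (assumption, e.g. the tool
    # could not run at all).
    return "inconclusive"


def check_filesystem(device):
    """Run the check; needs root and an unmounted filesystem."""
    result = subprocess.run(repair_check_command(device))
    return interpret_repair_status(result.returncode)


if __name__ == "__main__":
    # "/dev/sdXN" is a placeholder; substitute the device under test.
    print(check_filesystem("/dev/sdXN"))
```

With the commit reverted, the expectation from the results above is a
"clean" verdict after every crash and log replay.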

-- 
Markus
