Re: Got "Internal error XFS_WANT_CORRUPTED_GOTO". Filesystem needs reformatting to correct issue.

"Carlos E. R." <carlos.e.r@xxxxxxxxxxxx> · Fri, 4 Jul 2014 23:32:26 +0200 (CEST)

[This email has been delayed while I thought about where to upload the
metadata file; see near the end.]

On Thursday, 2014-07-03 at 13:39 -0400, Brian Foster wrote:
On Thu, Jul 03, 2014 at 05:00:47AM +0200, Carlos E. R. wrote:

Ok, so there's a lot going on. I was mainly curious to see what was
causing lingering preallocations, but it could be anything extending a
file multiple times.

Right.

AFAIK, xfsdump cannot carry over a filesystem corruption, right?

I think that's accurate, though it might complain/fail in the act of
dumping an fs that is corrupted. The behavior here suggests there might
not be on disk corruption, however.

At least, not a detectable one.

If I don't do that backup-format-restore cycle, problems appear soon and it
crashes within a day. This is what I got after booting (the first event):

<0.1> 2014-03-15 03:53:47 Telcontar kernel - - - [  301.857523] XFS: Internal error XFS_WANT_CORRUPTED_RETURN at line 350 of file /home/abuild/rpmbuild/BUILD/kernel-desktop-3.11.10/linux-3.11/fs/xfs/xfs_all

And some hours later:

<0.1> 2014-03-15 22:20:34 Telcontar kernel - - - [20151.298345] XFS: Internal error XFS_WANT_CORRUPTED_GOTO at line 1602 of file /home/abuild/rpmbuild/BUILD/kernel-desktop-3.11.10/linux-3.11/fs/xfs/xfs_allo

It was here that I decided to backup-format-restore instead.

Maybe next time I can take an image with dd before doing anything else (it
takes about 80 minutes), or simply run xfs_metadump, which should be
faster. And I might not have 500 GiB of free space for a dd copy by then,
anyway.

xfs_metadump should be faster. It will grab the metadata only and
obfuscate filenames so as to hide sensitive information.
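For reference, a minimal sketch of that capture step as a shell function. It assumes the /home filesystem is unmounted (or mounted read-only or frozen) first, as xfs_metadump requires, and the function name, device and output path are hypothetical placeholders:

```shell
#!/bin/sh
# Hypothetical helper: take a metadata-only image of an XFS filesystem
# and compress it with xz. Device and output paths are placeholders.
capture_metadump() {
    dev="$1"    # placeholder: the XFS device, unmounted or read-only
    out="$2"    # placeholder: where to write the metadump
    # -g prints progress; file names are obfuscated by default
    # (passing -o would turn obfuscation off)
    xfs_metadump -g "$dev" "$out" || return 1
    # -k keeps the uncompressed dump next to the new "$out.xz"
    xz -k "$out"
}

# Usage, as root:
#   capture_metadump /dev/sdXN /data/storage_d/home.metadump
```

The dump holds metadata only, no file contents; xfs_mdrestore can later turn it back into a sparse filesystem image for inspection.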

OK, I have a Post-it note on the monitor so that I remember; my notes
are normally stored on the home partition :-)

But the obfuscation is not complete; I can still recognize file names:

00008DC0   .leeme.kfPTgt . ....... .2aujzfJ.%;u. .   .0...
00008DF0    .pepe_after_gnome.tar.bz2.vcTJ8c.@.. . .......
00008E20   .amyN3xYjaldFXYpeUry. 3;&.K.. ..  .0... !.pepe_j
00008E50   ust_created.tar.bz2.JlyD0W .. .@....... .NGb0URO
00008E80   C0Bh9cHwp-hBh.6wMS .. .p  . ... ..registro.0DPzS
00008EB0   G  .. . ....... .8n-.w$.9. .. .   .8... +.suse_u
00008EE0   pgrade_to_102_pkglist-bis.txt.tcFUKq. . .......
00008F10   #B-XqcrWP4cqsw77yv8UsYbcCa-D76q..(#.. ..  .8...
00008F40   '.suse_upgrade_to_102_pkglist.txt.0KTuDa  7.. .8

I only had a quick look with 'mc'; the dump is too large to inspect it
all.

Question.

As this always happens on resume from hibernation, and given the message
"Corruption of in-memory data detected", could it be that thawing restores
memory incorrectly from swap? I thought the procedure included some
checksum, but I don't know for sure.

Not sure, though if so I would think that might be a more common source
of problems.

And it only affects my /home partition, although that may be the busiest
one.

To me, there are two problems:

 1) The corruption itself.
 2) That xfs_repair fails to repair the filesystem. In fact, I believe
    it does not detect it!

To me, #2 is the worst, and it is what makes me do the backup, format,
restore cycle for recovery. An occasional kernel crash is somewhat
acceptable :-}

Well, it could be that the "corruption" is gone at the point of a
remount. E.g., something becomes inconsistent in memory, the fs detects
it and shuts down before going any further. That's actually a positive.
;)

That also means it's probably not necessary to do a full backup,
reformat and restore sequence as part of your routine here. xfs_repair
should scour through all of the allocation metadata and yell if it finds
something like free blocks allocated to a file.
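A minimal sketch of such a check, assuming the filesystem is unmounted first; the function name and device are hypothetical placeholders. It relies on xfs_repair's no-modify mode (-n), which reports problems without writing anything and, in that mode, exits with status 1 if it detected corruption and 0 if it did not:

```shell
#!/bin/sh
# Hypothetical helper: read-only consistency check of an unmounted XFS
# filesystem, either a block device or an image stored in a regular file.
check_xfs_readonly() {
    dev="$1"    # placeholder: unmounted XFS device or image file
    if [ -f "$dev" ]; then
        set -- -f "$dev"    # -f: target is a regular file, not a device
    else
        set -- "$dev"
    fi
    # -n: no-modify mode, nothing is written to the filesystem
    if xfs_repair -n "$@"; then
        echo "no corruption reported for $dev"
    else
        echo "xfs_repair -n reported problems (or failed to run) on $dev" >&2
        return 1
    fi
}

# Usage, as root, with /home unmounted:
#   check_xfs_readonly /dev/sdXN
```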

No: if I don't do the backup-format-restore, it happens again within a day.
There is something lingering. Unless that was just chance... :-?

It is true that during that day I hibernated more often than necessary,
to see whether it would happen again, and it did.

I'm curious if something like an 'rm -rf *' on the metadump
would catch any other corruptions or if this is indeed limited to
something associated with recent (pre)allocations.

Sorry, run 'rm -rf *' where???

On the metadump... mainly just to see whether freeing all of the used
blocks in the fs triggered any other errors (i.e., a brute force way to
check for further corruptions).
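Read that way, the idea is to work on a scratch copy, not on the backup directory: restore the metadump into a sparse image, loop-mount it, delete everything, and then see whether the kernel log or a no-modify xfs_repair complains. A minimal sketch, assuming root, xfsprogs installed and an empty mount point; the function name and paths are hypothetical:

```shell
#!/bin/sh
# Hypothetical helper: restore a metadump to a scratch image and delete
# every file in it, so that all their extents are freed through the
# allocation btrees (a brute-force exercise of that code).
free_everything_on_metadump() {
    dump="$1"   # placeholder: file written by xfs_metadump
    img="$2"    # placeholder: scratch image to create (sparse)
    mnt="$3"    # placeholder: empty directory to use as mount point
    xfs_mdrestore "$dump" "$img" || return 1
    mount -o loop "$img" "$mnt" || return 1
    # -mindepth 1 keeps the mount point itself; unlike rm -r "$mnt"/*,
    # this also removes top-level dot files.
    find "$mnt" -mindepth 1 -delete
    rc=$?
    umount "$mnt" || return 1
    # XFS errors hit during deletion would appear in the kernel log
    # (dmesg); a no-modify repair pass then rechecks the image.
    xfs_repair -n -f "$img" && [ "$rc" -eq 0 ]
}

# Usage, as root:
#   free_everything_on_metadump home.metadump /tmp/home.img /mnt/scratch
```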

Sorry, but I fail to see how to do it. I may be thick, or I lack the context.

If I run:

Telcontar:/data/storage_d/old_backup # ls -lh
total 604G
drwxr-xr-x 22 root root  4.0K Mar  8 20:30 home
drwxr-xr-x  3 root root    16 Sep 25  2010 home1
drwxr-xr-x  2 root root     6 Jul  3 02:36 mount
-rw-r--r--  1 root root    45 Jul  3 04:25 procedure
-rw-r--r--  1 root root  388M Jul  3 02:42 tgtfile
-rw-r--r--  1 root root   11M Jul  3 02:50 tgtfile2.xz
-rw-r--r--  1 root users 489G Mar 16 05:42 xfs_copy_home
-rw-r--r--  1 root root  489G Jul  3 04:40 xfs_copy_home_workonit
-rw-r--r--  1 root users  39G Mar 16 05:49 xfsdump__home
-rw-r--r--  1 root users  39G Mar 16 05:57 xfsdump__home1
Telcontar:/data/storage_d/old_backup # rm -rf *

that would destroy my entire backup!

If you mean:

 rm -rf tgtfile

I fail to see what that would accomplish, except to remove a file that is actually on a different partition, not home.

However, I can do:

Telcontar:/data/storage_d/old_backup # mount -v xfs_copy_home_workonit mount/
mount: /dev/loop0 mounted on /data/storage_d/old_backup/mount.
Telcontar:/data/storage_d/old_backup # cd mount
Telcontar:/data/storage_d/old_backup/mount # time rm -r /data/storage_d/old_backup/mount/*

real    2m45.380s
user    0m0.265s
sys     0m6.878s
Telcontar:/data/storage_d/old_backup/mount #
Telcontar:/data/storage_d/old_backup/mount # ls -la
total 4
drwxr-xr-x 2 root root    6 Jul  4 01:56 .
drwxr-xr-x 5 root root 4096 Jul  3 04:25 ..
Telcontar:/data/storage_d/old_backup/mount #
Telcontar:/data/storage_d/old_backup/mount # df -h .
Filesystem      Size  Used Avail Use% Mounted on
/dev/loop0      489G   33M  489G   1% /data/storage_d/old_backup/mount
Telcontar:/data/storage_d/old_backup/mount #

And I do not see anything in the log, only that it mounted cleanly.

Meanwhile, I have made an xfs_metadump of the image and compressed it with
xz. It is 10834536 bytes. What do I do with it? I'm not sure I can email
that, much less to a mailing list.

Do you still have a bugzilla system where I can upload it? I had an account
at <http://oss.sgi.com/bugzilla/>, created in 2010. I don't know if it still
runs :-?

I have an active bugzilla account at <http://oss.sgi.com/bugzilla/>, and I'm
logged in there now. I haven't checked whether I can create a bug, as I'm not
sure what parameters to use (product, component, whom to assign it to). I
think that would be the most appropriate place.

Meanwhile, I have uploaded the file to my Google Drive account, so I can
share it with anybody on request. It is not public: I need to add a
Gmail address to the list of people who can read the file.

Alternatively, I could email the file off-list to people who ask for it,
not in a single email but in chunks of at most 1.5 MB per email.
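A minimal sketch of that chunking with the standard split and cat tools; the function names are hypothetical. Attachments are base64-encoded in transit, which grows them by about a third, so this assumes 1 MiB pieces (split -b 1m) to stay under a 1.5 MB per-message limit; the 10834536-byte file would become 11 pieces:

```shell
#!/bin/sh
# Hypothetical helpers for mailing a large file in pieces.

# Split FILE into FILE.part-aa, FILE.part-ab, ... of at most 1 MiB each.
split_for_mail() {
    split -b 1m "$1" "$1.part-"
}

# Rebuild the original from its pieces. The shell expands the glob in
# lexical order, which matches split's aa, ab, ... suffix order.
join_from_mail() {
    base="$1"   # the name that was given to split_for_mail
    out="$2"    # where to write the reassembled file
    cat "$base".part-* > "$out"
}

# Usage:
#   split_for_mail home.metadump.xz                    # sender
#   join_from_mail home.metadump.xz home.metadump.xz   # receiver, all parts saved
#   xz -t home.metadump.xz                             # verify the archive
```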

I think http://bugzilla.redhat.com should allow you to file a bug and
attach the file.

Sorry, I don't have an account there...

I do have one at openSUSE, though, and it does allow me to attach files, up
to a limit. If the file is too big, it can be split into pieces. But I
will not use it unless you people say that you have an account there.

If a bugzilla is to be used, the most appropriate one would be SGI's, IMHO,
if they still support this project.

-- 
Cheers,
       Carlos E. R.
       (from 13.1 x86_64 "Bottle" at Telcontar)


_______________________________________________
xfs mailing list
xfs@xxxxxxxxxxx
http://oss.sgi.com/mailman/listinfo/xfs