On Sat, Jul 12, 2014 at 02:30:45AM +0200, Carlos E. R. wrote:
> -----BEGIN PGP SIGNED MESSAGE-----
> Hash: SHA1
> 
> On Saturday, 2014-07-05 at 08:28 -0400, Brian Foster wrote:
> >On Fri, Jul 04, 2014 at 11:32:26PM +0200, Carlos E. R. wrote:
> >
> >>If I don't do that backup-format-restore, I get issues soon, and it crashes
> >>within a day - I got after booting (the first event):
> >>
> >
> >I echo Dave's previous question... within a day of doing what? Just
> >using the system or doing more hibernation cycles?
> 
> It is in the long post with the logs I posted.
> 
> The first time it crashed, I rebooted, got some errors I probably did not
> see, managed to mount the device, and I used the machine normally, doing
> several hibernation cycles. On one of these, it crashed, within the day.
> 

That still suggests something could be going on at runtime during the
hibernation or wakeup cycle. Identifying some kind of runtime error or
metadata inconsistency without involving hibernation would be a smoking
gun for a general corruption. So far we have no evidence of reproduction
without hibernation and no evidence of a persistent corruption. That
doesn't rule out something going on on-disk, but it certainly suggests a
runtime corruption during hibernation/wake is more likely.

> As explained in this part of the previous post:
> 
> >><0.1> 2014-03-15 03:53:47 Telcontar kernel - - - [ 301.857523] XFS: Internal error XFS_WANT_CORRUPTED_RETURN at line 350 of file /home/abuild/rpmbuild/BUILD/kernel-desktop-3.11.10/linux-3.11/fs/xfs/xfs_all
> >>
> >>And some hours later:
> >>
> >><0.1> 2014-03-15 22:20:34 Telcontar kernel - - - [20151.298345] XFS: Internal error XFS_WANT_CORRUPTED_GOTO at line 1602 of file /home/abuild/rpmbuild/BUILD/kernel-desktop-3.11.10/linux-3.11/fs/xfs/xfs_allo
> >>
> >>
> >>It was here that I decided to backup-format-restore instead.
> 
> >>>That also means it's probably not necessary to do a full backup,
> >>>reformat and restore sequence as part of your routine here. xfs_repair
> >>>should scour through all of the allocation metadata and yell if it finds
> >>>something like free blocks allocated to a file.
> >>
> >>No, if I don't backup-format-restore it happens again within a day. There is
> >>something lingering. Unless that was just chance... :-?
> >>
> >>It is true that during that day I hibernated several times more than needed
> >>to see if it happened again - and it did.
> >>
> >
> >This depends on what causes this to happen, not how frequently it happens.
> >Does it continue to happen along with hibernation, or do you start
> >seeing these kinds of errors during normal use?
> 
> Except for the first time this happened, the sequence is this:
> 
> I use the machine for weeks, without incident, booting once, then hibernating
> at least once per day. I finally reboot when I have to apply some system
> update, or something special.
> 
> Till one day, this "thing" happens. It happens immediately after coming
> out from hibernation, and puts the affected partition, always /home, in
> read-only mode. When it happens, I reboot, repair the partition manually
> if needed, then I back up the files, format the partition, and replace all
> the files from the backup just made, with xfsdump. Well, this last time, I
> used rsync instead.
> 
> It has happened "only" four times:
> 
> 2014-03-15 03:35:17
> 2014-03-15 22:20:34
> 2014-04-17 22:47:08
> 2014-06-29 12:32:18
> 
> >If the latter, that could suggest something broken on disk.
> 
> That was my first thought, because it started happening after replacing the
> hard disk, but also after a kernel update. But I have tested that disk
> several times, with smartctl and with the manufacturer's test tool, and
> nothing came out.

I was referring to a potential on-disk corruption, but that's good to
know as well.
> 
> >If the
> >former, that could simply suggest the fs (perhaps on-disk) has made it
> >into some kind of state that makes this easier to reproduce, for
> >whatever reason. It could be timing, location of metadata,
> >fragmentation, or anything really for that matter, but it doesn't
> >necessarily mean corruption (even though it doesn't rule it out).
> >Perhaps the clean regeneration of everything by a from-scratch recovery
> >simply makes this more difficult to reproduce until the fs naturally
> >becomes more aged/fragmented, for example.
> >
> >This probably makes a pristine, pre-repair metadump of the reproducing
> >fs more interesting. I could try some of my previous tests against a
> >restore of that metadump.
> 
> Well, I suggest that, unless you can find something in the metadata (I just
> sent you the link via email from google), we wait till the next event. I
> will at that time take a snapshot of the intact metadata. But this can take
> a month or two to happen again, if the pattern holds.

That would be a good idea. I'll take a look at the metadump when I have a
chance. If there is nothing out of the ordinary, the next best option is
to metadump the fs that reproduces the behavior. I could retry some of my
previous vm hibernation tests against that.

As mentioned previously, once you have a more reliably reproducing state,
that's also a good opportunity to see if you can narrow down which of the
things you have running against the fs appear to trigger this.

> 
> >I was somewhat thinking out loud when originally discussing this topic. I was
> >suggesting to run this against a restored metadump, not the primary
> >dataset or a backup.
> >
> >The metadump creates an image of the metadata of the source fs in a file
> >(no data is copied). This metadump image can be restored at will via
> >'xfs_mdrestore.'
> >This allows restoring to a file, mounting the file
> >loopback, and performing experiments or investigation on the fs
> >generally as it existed when the shutdown was reproducible.
> 
> Ah... I see.
> 
> >So basically:
> >
> >- xfs_mdrestore <mdimgfile> <tmpfileimg>
> >- mount <tmpfileimg> /mnt
> >- rm -rf /mnt/*
> >
> >... was what I was suggesting. <tmpfileimg> can be recreated from the
> >metadump image afterwards to get back to square one.
> 
> I see.
> 
> Well, I tried this on a copy of the 'dd' image days ago, and nothing
> happened. I guess the procedure above would be the same.

A dd of the raw block device will preserve the metadata, so yeah, that's
effectively the same test. If there were an obvious free space corruption,
the fs probably would have shut down. I can retry the same test via the
metadump on a debug kernel as well.

Brian

> 
> >>I have an active bugzilla account at <http://oss.sgi.com/bugzilla/>, I'm
> >>logged in there now. I haven't checked whether I can create a bug, not
> >>being sure what parameters to use (product, component, whom to assign it
> >>to). I think that would be the most appropriate place.
> >>
> >>Meanwhile, I have uploaded the file to my google drive account, so I can
> >>share it with anybody on request - ie, it is not public, I need to add a
> >>gmail address to the list of people that can read the file.
> >>
> >>Alternatively, I could just email the file to people asking for it, offlist,
> >>but not in a single email, in chunks limited to 1.5 MB per email.
> >>
> >
> >Either the bugzilla or the google drive option works OK for me.
> 
> It's here:
> 
> <https://drive.google.com/file/d/0Bx2OgfTa-XC9UDBnQzZIMTVyN0k/edit?usp=sharing>
> 
> Whoever wants to read it has to tell me the address to add to it; access is
> not public.
> 
> - -- Cheers,
> Carlos E. R.
> (from 13.1 x86_64 "Bottle" at Telcontar)
> 
> -----BEGIN PGP SIGNATURE-----
> Version: GnuPG v2.0.22 (GNU/Linux)
> 
> iEYEARECAAYFAlPAgb0ACgkQtTMYHG2NR9U/FQCgjtwuDC0HTSG3i7DrEV8+qZeT
> 6mUAn0FGf42SsU1WeRx/AAk4X2oqV4Bc
> =pASJ
> -----END PGP SIGNATURE-----

_______________________________________________
xfs mailing list
xfs@xxxxxxxxxxx
http://oss.sgi.com/mailman/listinfo/xfs
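
[Editor's note: the restore-and-delete experiment Brian outlines upthread can be sketched as a small script. The image name (home.md), scratch file, and mount point below are placeholders, not taken from the thread; actually executing the commands needs root and xfsprogs, so the sketch defaults to a dry run that only prints each step.]

```shell
#!/bin/sh
# Sketch of the metadump restore-and-delete test described in the thread.
# Placeholders (not from the thread): home.md, /tmp/home.img, /mnt.
# DRY_RUN=1 (the default) prints the commands instead of running them;
# the real run requires root and the xfsprogs tools.
DRY_RUN=${DRY_RUN:-1}

run() {
    if [ "$DRY_RUN" = "1" ]; then
        echo "would run: $*"
    else
        "$@"
    fi
}

run xfs_mdrestore home.md /tmp/home.img  # restore the metadata-only image to a file
run mount -o loop /tmp/home.img /mnt     # loopback-mount the restored fs
run rm -rf /mnt/*                        # exercise the free-space/allocation metadata
run umount /mnt
# Re-running xfs_mdrestore over the same scratch file gets back to square one.
```

Run with DRY_RUN=0 (as root, with the real metadump image in place) to perform the actual test; because only metadata is captured, file contents in the restored image are not meaningful, which is fine for this experiment.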