On 12/10/12 3:12 AM, Matthias Schniedermeyer wrote:
> On 10.12.2012 11:58, Dave Chinner wrote:
>> On Sat, Dec 08, 2012 at 08:29:27PM +0100, Matthias Schniedermeyer wrote:
>>> On 06.12.2012 09:51, Lin Li wrote:
>>>> Hi, Guys. I recently suffered a huge data loss on power cut on an XFS
>>>> partition. The problem was that I copied a lot of files (roughly 20GB) to
>>>> an XFS partition, then 10 hours later, I got an unexpected power cut. As a
>>>> result, all these newly copied files disappeared as if they had never been
>>>> copied. I tried to check and repair the partition, but xfs_check reports no
>>>> error at all. So I guess the problem is that the metadata for these files
>>>> was all kept in the cache (64MB) and was never committed to the hard
>>>> disk.
>>>>
>>>> What is the cache flush policy for XFS? Does it always reserve some fixed
>>>> space in cache for metadata? I asked because I thought since I copied such
>>>> a huge amount of data, at least some of these files must be fully committed
>>>> to the hard disk, the cache is only 64MB anyway. But the reality is all of
>>>> them were lost. The only possibility I can think of is some part of the cache
>>>> was reserved for metadata, so even if the cache is fully filled, this part
>>>> will not be written to the disk. Am I right?
>>>
>>> I have had the same problem, several times.
>>>
>>> The latest just an hour ago.
>>> I'm copying a HDD onto another. Plain rsync -a /src/ /tgt/
>>> Both HDDs are 3TB SATA-drives in a USB3-enclosure with a dm-crypt layer
>>> in between.
>>> About 45 minutes into copying the target HDD disconnects for a moment.
>>> 45 minutes means something over 200GB were copied, each file is about
>>> 900MB.
>>> After remounting the filesystems there were exactly 0 files.
>>
>> This sounds like an entirely different problem to what the OP
>> reported.
>
> For me it sounds only like different timing.
> Otherwise i don't see much difference between files vanishing after a few
> hours (of inactivity) and after a few minutes (while still being active).
>
>> Did the filesystem have an error returned?
>
> No.
>
>> i.e. did it shut down (what's in dmesg)?
>
> There's not much XFS could have done after the block-device vanished.

except to shut down...

> A disappearing/reappearing block-device gets a new name because the old
> name is still "in use", the block-device gets cleaned up after 'umount'ing
> and closing the dm-crypt device.
>
> When the USB3-HDD disconnected it reappeared a moment later under a new
> name, it bounced between sdc <-> sdf.
>
> In syslog it's a plain "USB disconnect, device number XX" message.
> Followed by a standard new device found message-bombardment. In between
> there are some error-messages, but as it's practically a yanked out and
> replugged cable, a little complaining by the kernel is to be expected.

Sure, but Dave asked if the filesystem shut down. XFS messages would tell
you that; *were* there messages from XFS in the log from the event?
Sometimes "a little complaining" can be quite informative. :)

>> Did you run repair in between the shutdown and remount?
>
> No.
>
> XFS (dm-3): Mounting Filesystem
> XFS (dm-3): Starting recovery (logdev: internal)
> XFS (dm-3): Ending recovery (logdev: internal)
>
>> How many files in that 200GB of data?
>
> At 0.9GB/file at least 220.
>
>> http://xfs.org/index.php/XFS_FAQ#Q:_What_information_should_I_include_when_reporting_a_problem.3F
>>
>> Basically, you have an IO error situation, and you have dm-crypt
>> in-between buffering an unknown amount of changes. In my experience,
>> data loss events are rarely filesystem problems when USB drives or
>> dm-crypt is involved...
>
> I don't know the inner workings of dm-*, but shouldn't it behave
> transparently and rely on the block-layer for buffering?

I think that's partly why Dave asked you to test it, to check that theory ;)

>>> After that i started a "while true; do sync ; done"-loop in the
>>> background.
>>> And just while i was writing this email the HDD disconnected a second
>>> time. But this time the files up until the last 'sync' were retained.
>>
>> Exactly as I'd expect.
>>
>>> And something like this has happened to me at least a half dozen times in
>>> the last few months. I think the first time was with kernel 3.5.X, when i
>>> was actually booting into 3.6 with a plain "reboot" (filesystem might
>>> not have been umounted cleanly.), after the reboot the changes of about
>>> the last half hour were gone. e.g. i had renamed a directory about 15
>>> minutes before i rebooted and after the reboot the directory had its
>>> old name back.
>>>
>>> Kernel in all but (maybe) one case is between 3.6 and 3.6.2 (currently),
>>> the first time MIGHT have been something around 3.5.8 but i'm not sure.
>>> HDDs were either connected by plain SATA (AHCI) or by USB3 enclosure. All
>>> affected filesystems were/are with a dm-crypt layer in between.
>>
>> Given that dm-crypt is the common factor here, I'd start by ruling
>> that out. i.e. reproduce the problem without dm-crypt being used.
>
> That's a slight problem for me, practically everything i have is
> encrypted.

But this is an external drive; you could run a similar test with
unencrypted data on a different hard drive, to try to get to the bottom
of this problem, right?

Thanks,
-Eric

> Now that i think about it, maybe dm-crypt really is to blame, up until a
> few months ago i was using loop-AES. After dm-crypt got the capability to
> emulate it i have moved over to dm-crypt because the loop-AES support in
> Debian got worse over time. I didn't have any problems until after i
> moved to dm-crypt, but OTOH i'm not the only one using dm-crypt. But
> OTOOH maybe not so many people use the loop-AES compatibility-mode.
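
On the observation above that files up to the last 'sync' were retained: a
per-file alternative to a background sync loop is to fsync each file (and
its parent directory) right after copying it. The following is only a
minimal sketch in Python, not anything from the XFS tools; the name
copy_durably is hypothetical, and it assumes a Linux system where a
directory can be opened read-only and passed to fsync. Whether the data
then really reaches stable storage still depends on every layer below
(dm-crypt, the USB bridge, the drive) honouring cache flushes, which is
exactly what is in question in this thread.

```python
import os
import shutil
import tempfile


def copy_durably(src, dst):
    """Copy src to dst, then ask the kernel to flush data and directory entry."""
    with open(src, "rb") as fin, open(dst, "wb") as fout:
        shutil.copyfileobj(fin, fout)
        fout.flush()             # push Python's userspace buffer to the kernel
        os.fsync(fout.fileno())  # request writeback of the file's data and inode
    # fsync the parent directory so the new name itself is written out
    # (assumption: Linux semantics for fsync on a directory descriptor)
    dir_fd = os.open(os.path.dirname(os.path.abspath(dst)), os.O_RDONLY)
    try:
        os.fsync(dir_fd)
    finally:
        os.close(dir_fd)
    return os.path.getsize(dst)


if __name__ == "__main__":
    # Small self-contained demo using a throwaway directory
    with tempfile.TemporaryDirectory() as tmp:
        src = os.path.join(tmp, "src.bin")
        dst = os.path.join(tmp, "dst.bin")
        with open(src, "wb") as f:
            f.write(b"x" * 4096)
        print(copy_durably(src, dst))
```

This trades throughput for a smaller window of unflushed changes: after
copy_durably returns, at most the file currently being copied should be at
risk, rather than everything written since the last periodic writeback.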

_______________________________________________
xfs mailing list
xfs@xxxxxxxxxxx
http://oss.sgi.com/mailman/listinfo/xfs