> -----Original Message-----
> From: Greg Freemyer [mailto:greg.freemyer@xxxxxxxxx]
> Sent: Tuesday, March 29, 2011 10:34 AM
> To: Rogier Wolff
> Cc: Eric Sandeen; Daniel Taylor; linux-ext4@xxxxxxxxxxxxxxx
> Subject: Re: breaking ext4 to test recovery
>
> On Tue, Mar 29, 2011 at 10:33 AM, Rogier Wolff
> <R.E.Wolff@xxxxxxxxxxxx> wrote:
> > On Tue, Mar 29, 2011 at 08:50:18AM -0500, Eric Sandeen wrote:
> >> Another tool which can be useful for this sort of thing is
> >> fsfuzzer. It writes garbage; using dd to write zeros actually
> >> might be "nice" corruption.
> >
> > Besides writing blocks of "random data", you could write blocks
> > with a small percentage of bits (byte) set to non-zero, or just
> > toggle a configurable number of bits (bytes). This is slightly
> > more devious than just "random data".
>
> I don't know what exactly is being tested, but "hdparm
> --make-bad-sector" can be used to create a media error on a specific
> sector.
>
> Thus allowing you to simulate a sector failing in the middle of the
> journal.
>
> I assume that is a relevant test.
>
> fyi: --repair-sector undoes the damage. You may need to follow that
> with a normal write to put legit data there.
>
> If you try a normal data write without first repairing, the drive
> should mark the sector permanently bad and remap that sector to a
> spare sector.
>
> I have only used these tools with raw drives, no partitions, etc. So
> I've never had to worry about data loss, etc.
>
> Greg
>

Thanks for the suggestions. Tao Ma's got me started, but doing
some of the more "devious" tests is on my list, too.

The original issue was that during component stress testing, we were
seeing instances of the ext4 file system becoming "read-only" (showing
in /proc/mounts, but not "mount"). Looking back through the logs, we
saw that at mount time, there was a complaint about a corrupted
journal. Some writing had occurred before the change to read-only,
however.
The original mount script didn't check the return value from "mount",
so we theorized that ext4 just got to a point where it couldn't
sensibly handle any more changes.

It seemed that the right answer was to check the return value from
mount and, if non-0, umount the file system, fix it, and try again.

To test the return value from mount, I need to be able to corrupt, but
not destroy, the journal, since the component tests were taking days
to show the failure. Running an "fsck -f" every time on a 3TB file
system with an embedded PPC was just taking too much time to impose on
a consumer-level customer.
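
For reference, the check-and-retry idea above could be sketched roughly as follows. This is only an illustration, not the actual mount script: the function names, the "-t ext4" and "-y" options, and the single-repair limit are assumptions made for the example. The one thing it relies on from e2fsprogs is the documented fsck exit status, where a value of 4 or more means errors were left uncorrected or the check itself failed. The forced fsck only runs after a failed mount, so the long check is not paid on every boot.

```python
import subprocess
from typing import Callable, Sequence

# A runner takes an argv list and returns the command's exit status.
Runner = Callable[[Sequence[str]], int]


def run_command(cmd: Sequence[str]) -> int:
    """Run a command and return its exit status (default runner)."""
    return subprocess.run(list(cmd)).returncode


def mount_with_repair(device: str, mountpoint: str,
                      runner: Runner = run_command,
                      max_repairs: int = 1) -> bool:
    """Mount an ext4 file system, repairing it only if mount fails.

    Hypothetical helper: returns True once mount exits 0, False if the
    repair budget is used up or fsck reports uncorrected errors.
    """
    repairs = 0
    while True:
        # Check the return value from mount instead of ignoring it.
        if runner(["mount", "-t", "ext4", device, mountpoint]) == 0:
            return True
        if repairs >= max_repairs:
            return False
        # The file system may be partially mounted; the umount status is
        # deliberately ignored because it may not be mounted at all.
        runner(["umount", mountpoint])
        repairs += 1
        # Forced check; "-y" (answer yes to all prompts) is an assumption
        # for an unattended device. Exit status >= 4 means errors remain
        # or the check could not run.
        if runner(["fsck.ext4", "-f", "-y", device]) >= 4:
            return False
```

A boot script would call something like mount_with_repair("/dev/sdX1", "/mnt/data"), where both names are placeholders. The runner argument exists only so the logic can be exercised with faked exit codes, which avoids needing a genuinely corrupted journal just to test the script's control flow.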