On Tue, Aug 30, 2011 at 09:08:59AM +0800, 黃一軒 (I-Hsuan Huang) wrote: > Dear Darrick, > > Thanks for your correction. > Would you please kindly give me some idea about how to prevent extent > attribute block error? > I would like to try journal_checksum. First you'd have to figure out how the extent attribute got corrupted (bad disk, bad cable, bad memory, bad software....) and then we could put together a plan to fix it. The ext4 metadata checksumming patches (which are nearly ready to be sent out) would at least help with silent corruption. --D > > Thanks, > IH > > 2011/8/30 Darrick J. Wong <djwong@xxxxxxxxxx> > > > On Mon, Aug 29, 2011 at 01:40:11PM +0800, 黃一軒 (I-Hsuan Huang) wrote: > > > Dear Darrick, > > > > > > I still met the following failure symptom with your patch. > > > > > > <2>[ 3.311674] EXT4-fs error (device mmcblk0p23): ext4_iget: inode > > > #22944: (comm init) bad extended attribute block 173879342 > > > > > > The validation code looks like not work. > > > Is changing the flag enough to force validating extent header? > > > > Extended attribute blocks are not extent tree blocks. > > > > The patch in question only made ext4 complain consistently if the extent > > tree > > got corrupt. > > > > --D > > > > > > Thanks, > > > IH Huang > > > > > > 2011/8/25 Darrick J. Wong <djwong@xxxxxxxxxx> > > > > > > > On Wed, Aug 24, 2011 at 05:54:15PM +0800, 黃一軒 (I-Hsuan Huang) wrote: > > > > > Dear Darrick, > > > > > > > > > > I have read your ext4 patch on Patchwork and have a question about > > your > > > > ext4 > > > > > patch to validate the extent header. > > > > > Is you patch related to the Bug 20992: > > > > > https://bugzilla.kernel.org/show_bug.cgi?id=20992 > > > > > > > > Yes, I think it would fix that bug. That said, I'm pretty close to > > posting > > > > a > > > > patch set to implement full metadata checksuming, which includes a > > better > > > > solution than what's in patchwork. Basically it implements a new > > > > bufferhead > > > > flag that tracks whether or not a metadata buffer has been validated, > > and > > > > skips > > > > the checks if it has. > > > > > > > > Obviously that won't help you if you overwrite the filesystem while > > it's > > > > mounted... but oh well, at least it'll fix your case. It was meant to > > fix > > > > the > > > > case where the first time a bad extent block gets read it fails the > > check > > > > and > > > > complains, but subsequent accesses of the same block don't even check. > > > > > > > > --D > > > > > > > > > > Ted mentioned that > > > > > > > > > > *Yep, looks like a bug alright. > > > > > > > > > > From what I can tell, you were in the middle of async I/O, at the > > time > > > > when > > > > > the > > > > > disk was corrupted. The problem seemed to come after the I/O was > > > > completed, > > > > > and ext4_convert_unwritten_extents() was trying to set the > > initialized > > > > bit > > > > > on > > > > > the extent tree. At that point the extent tree must have gotten > > > > corrupted > > > > > on > > > > > disk, and this seriously confused the extent conversion code, which > > ended > > > > up > > > > > passing 0 to ext4_ext_put_in_cache() as the length of the extent, and > > > > that > > > > > tripped the BUG_ON in ext4_ext_put_in_cache(). > > > > > > > > > > How did you corrupt the file system while it was mounted? Was it > > via > > > > some > > > > > dd > > > > > to the disk device directly? > > > > > > > > > > We do have code that checks to make sure the extent tree is sane, but > > we > > > > > skip > > > > > it if the data was already in the buffer cache, to save CPU costs. > > But > > > > if > > > > > you > > > > > wrote to the disk device directly, it would have gone through the > > buffer > > > > > cache, > > > > > since the extent tree was already cached, we would have skipped the > > > > > validation > > > > > step, and that could be the explanation for how the bug got > > triggered. > > > > > > > > > > If so, I'm loathe to turn on the validation check unconditionally, > > since > > > > > that > > > > > would kill performance. I can probably change the BUG_ON in > > > > > ext4_put_in_cache() to rather set the cache state to "invalid", which > > > > would > > > > > at > > > > > least prevent the BUG_ON. The filesystem was probably well and truly > > > > > trashed, > > > > > though, so sooner or later the ext4 fs code would have hit something > > to > > > > > cause > > > > > it to be very unhappy. Hoepfully it would be an ext4_error() call to > > > > mark > > > > > the > > > > > file system as corrupted, as opposed to another BUG_ON. > > > > > * > > > > > Have a nice day. > > > > > > > > > > Thanks, > > > > > Randall > > > > > > -- To unsubscribe from this list: send the line "unsubscribe linux-ext4" in the body of a message to majordomo@xxxxxxxxxxxxxxx More majordomo info at http://vger.kernel.org/majordomo-info.html