NILFS2 and data integrity

Dmitry Smirnov <onlyjob@xxxxxxxxxxxxxx> · Fri, 5 Apr 2013 15:45:16 +1100

Dear NILFS team,

Let me thank you sincerely for a fantastic and very special file system.

Until now I've been using it successfully for years without any issues
except for the minor inconvenience of a slow `nilfs_cleanerd`.

I'd like to share the details of a recent incident in which I experienced
data corruption on a NILFS2 partition after the unfortunate addition of an
unreliable "SAMSUNG HD204UI" HDD to the underlying "mdadm" array.

This notorious HDD [1][2] occasionally corrupts data on write, so a later
read returns wrong data. There is no way to avoid such corruption in the
first place. Detection is also difficult because, as you may know, Linux has
no block-level integrity checking yet.
However, NILFS2 suffers the most from this particular type of corruption
because `nilfs_cleanerd` moves unmodified data around and therefore amplifies
the damage.

First I noticed corruption in some archives that had been OK some weeks
earlier and had not changed since (according to their last modification
date). As time passed, more damage was found in files that were not supposed
to change.
Finally the root cause of the corruption was identified and the bad HDD was
promptly removed from the array. At that point I thought the issue was
resolved, but a few days later NILFS2 re-mounted itself read-only and logged
the following to "/var/log/kern.log":

	Mar 24 11:38:14 deblabr kernel: [191771.927806] NILFS: bad btree node (blocknr=1919583732): level = 193, flags = 0x90, nchildren = 35672
	Mar 24 11:38:14 deblabr kernel: [191771.927812] NILFS error (device dm-0): nilfs_bmap_lookup_contig: broken bmap (inode number=444589)
	Mar 24 11:38:14 deblabr kernel: [191771.927812] 
	Mar 24 11:38:14 deblabr kernel: [191772.126584] Remounting filesystem read-only
	Mar 24 11:38:15 deblabr kernel: [191772.174965] NILFS: bad btree node (blocknr=1919583732): level = 193, flags = 0x90, nchildren = 35672
	Mar 24 11:38:15 deblabr kernel: [191772.174972] NILFS error (device dm-0): nilfs_bmap_lookup_contig: broken bmap (inode number=444589)
	Mar 24 11:38:15 deblabr kernel: [191772.174972] 
	Mar 24 11:38:15 deblabr kernel: [191772.175255] NILFS: bad btree node (blocknr=1919583732): level = 193, flags = 0x90, nchildren = 35672
	Mar 24 11:38:15 deblabr kernel: [191772.175258] NILFS error (device dm-0): nilfs_bmap_lookup_contig: broken bmap (inode number=444589)
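
When many such messages accumulate, it can help to collect the affected
inode numbers in one place. The following is a minimal Python sketch, not
part of nilfs-utils; the function name `broken_bmap_inodes` is hypothetical
and the regular expression only assumes the message format visible in the
excerpt above:

```python
import re

# Pattern derived from the "broken bmap" lines quoted above; this is an
# assumption about the message layout, not a documented log format.
BROKEN_BMAP = re.compile(
    r"NILFS error \(device (?P<dev>[^)]+)\): "
    r"(?P<func>\w+): broken bmap \(inode number=(?P<ino>\d+)\)"
)

def broken_bmap_inodes(lines):
    """Return {device: sorted inode numbers} reported as having a broken bmap."""
    found = {}
    for line in lines:
        match = BROKEN_BMAP.search(line)
        if match:
            inodes = found.setdefault(match.group("dev"), set())
            inodes.add(int(match.group("ino")))
    return {dev: sorted(inodes) for dev, inodes in found.items()}

# Example use (hypothetical): pass an open log file, which iterates by line.
#   with open("/var/log/kern.log") as log:
#       print(broken_bmap_inodes(log))
```

An inode number found this way can then be mapped back to a path with
`find <mountpoint> -xdev -inum <number>`, provided the directory tree itself
is still readable.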

As far as I understand the issue, corruption in data is not detected until
one or more "btree" nodes get corrupted as well. I reproduced the problem
on the isolated "bad" HDD.
In this case I first copied some data to a NILFS2 partition and verified
its integrity. As I was adding more data, `nilfs_cleanerd` activated
and, as expected, corrupted some of the data. Eventually it failed to continue:

	Mar 31 01:17:30 deblabr kernel: [759042.984783] NILFS: bad btree node (blocknr=938583): level = 192, flags = 0x73, nchildren = 49956
	Mar 31 01:17:30 deblabr kernel: [759042.984850] NILFS: GC failed during preparation: cannot read source blocks: err=-5

The file system was also re-mounted read-only:

	Mar 30 19:56:59 deblabr kernel: [739821.894963] NILFS: bad btree node (blocknr=1086570306): level = 239, flags = 0xe2, nchildren = 10392
	Mar 30 19:56:59 deblabr kernel: [739821.894969] NILFS error (device dm-0): nilfs_bmap_last_key: broken bmap (inode number=1225452)
	Mar 30 19:56:59 deblabr kernel: [739821.894969] 
	Mar 30 19:56:59 deblabr kernel: [739821.894971] Remounting filesystem read-only
	Mar 30 19:56:59 deblabr kernel: [739821.894973] NILFS warning (device dm-0): nilfs_truncate_bmap: failed to truncate bmap (ino=1225452, err=-5)

(Please ignore the time stamps, as these logs were taken from two different
 attempts to reproduce.)
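
The "copy, verify, verify again later" step of the reproduction above can be
sketched as follows. This is only an illustration of the approach, not the
exact procedure I used; the helper names `sha256_of`, `build_manifest` and
`changed_files` are hypothetical:

```python
import hashlib
import os

def sha256_of(path, chunk_size=1 << 20):
    """Hash a file in chunks so large archives need not fit in memory."""
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()

def build_manifest(root):
    """Map each regular file under root (by relative path) to its SHA-256."""
    manifest = {}
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            if os.path.isfile(path) and not os.path.islink(path):
                manifest[os.path.relpath(path, root)] = sha256_of(path)
    return manifest

def changed_files(root, manifest):
    """Return relative paths that are unreadable or no longer match."""
    damaged = []
    for rel, expected in sorted(manifest.items()):
        try:
            actual = sha256_of(os.path.join(root, rel))
        except OSError:
            actual = None  # a missing or unreadable file counts as damaged
        if actual != expected:
            damaged.append(rel)
    return damaged
```

The manifest would be built from the source copy (or right after copying) and
checked again once `nilfs_cleanerd` has had a chance to run. Note that a read
shortly after a write may be served from the page cache rather than from the
disk, so a meaningful check probably requires re-mounting the partition first.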

With a read-only NILFS2 and some corrupted btree nodes, I know no way to
recover other than restoring all the data to a freshly formatted partition,
because without an `fsck` tool the damaged file system cannot be repaired.

I think there are some lessons we can learn from this:

 * Data integrity is very important.

 * On unreliable media, `nilfs_cleanerd` can amplify the damage from
   corruption, similar to what may happen on other file systems during
   defragmentation.

 * To avoid unnecessary damage, it would be nice if `nilfs_cleanerd` could
   check data integrity on read and, in case of corruption, stop and log a
   corresponding message.

 * An `fsck` tool could help to repair corrupted btree nodes.

 * Btrfs has a strategic advantage over NILFS2 with regard to data integrity
   checking.
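
To make the third point more concrete, here is a toy Python model of
"verify on read before relocating". It is only a sketch of the idea: the
medium is a plain dictionary, all names are hypothetical, and nothing here
reflects how `nilfs_cleanerd` or the NILFS2 on-disk format actually work:

```python
import zlib

class CorruptBlockError(Exception):
    """Raised when a block read back from the medium fails its checksum."""

def write_block(medium, blocknr, data):
    """Store data together with a CRC32 computed before it hits the medium."""
    medium[blocknr] = (zlib.crc32(data), data)

def read_verified(medium, blocknr):
    """Read a block and refuse to return it if the checksum does not match."""
    crc, data = medium[blocknr]
    if zlib.crc32(data) != crc:
        raise CorruptBlockError(f"checksum mismatch at blocknr={blocknr}")
    return data

def relocate(medium, src_blocks, dst_start):
    """Copy live blocks to a new location, stopping at the first bad one.

    Every source block is verified before anything is written, so damaged
    data is never re-written with a fresh checksum that would hide the damage.
    """
    verified = [read_verified(medium, blocknr) for blocknr in src_blocks]
    for offset, data in enumerate(verified):
        write_block(medium, dst_start + offset, data)
    return list(range(dst_start, dst_start + len(verified)))
```

With such a check the cleaner would stop and report the first damaged block
instead of silently carrying the damage over to a new location.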

Having said that, I'd like to note that in my experience NILFS2 recovers
*perfectly* from an unclean shutdown or unexpected reset. This problem
happened only because NILFS2 puts too much trust in the underlying media.

Thank you.

All the best,
 Dmitry

[1]: http://rctnotes.blogspot.com.au/2011/02/samsung-2-tb-hd204ui-firmware-bug.html
[2]: http://sourceforge.net/apps/trac/smartmontools/wiki/SamsungF4EGBadBlocks
