Re: A lot of NILFS: bad btree node messages (readonly fs)

Vyacheslav Dubeyko <slava@xxxxxxxxxxx> · Tue, 8 Jan 2013 15:52:30 +0300

Hi guys,

I have been trying to reproduce the issue for the last three days, but without success. I tried different workloads and different environments. As far as I know, all of you can still reproduce the issue, so I have some additional questions.

1. All of you see messages like these:

Jan 03 22:36:38 [kernel] [  953.289973] NILFS: bad btree node (blocknr=26229286): level = 67, flags = 0xee, nchildren = 40
Jan 03 22:36:38 [kernel] [  953.289976] NILFS error (device sda2): nilfs_bmap_lookup_contig: broken bmap (inode number=102230)

As I understand it, you keep getting the message for a particular block number (for example, blocknr=26229286) across a remount. After umount and a fresh mount, the message for that block number no longer appears, but you can then get error messages for a different block number. Am I correct?
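
If it helps to answer this, here is a minimal sketch (not a NILFS tool, just a log filter) that collects the block numbers and inode numbers from syslog lines in the format quoted above. The regular expressions are written against those two sample lines only, and the names summarize() and new_block_numbers() are hypothetical. Other kernel versions may format the messages differently.

```python
import re
from collections import Counter

# Assumption: messages look exactly like the two sample lines quoted above.
BAD_NODE_RE = re.compile(
    r"NILFS: bad btree node \(blocknr=(?P<blocknr>\d+)\): "
    r"level = (?P<level>\d+), flags = (?P<flags>0x[0-9a-fA-F]+), "
    r"nchildren = (?P<nchildren>\d+)"
)
BROKEN_BMAP_RE = re.compile(
    r"NILFS error \(device (?P<device>[^)]+)\): \w+: broken bmap "
    r"\(inode number=(?P<ino>\d+)\)"
)


def summarize(lines):
    """Count how often each block number and each (device, inode) is reported."""
    blocks = Counter()
    inodes = Counter()
    for line in lines:
        m = BAD_NODE_RE.search(line)
        if m:
            blocks[int(m.group("blocknr"))] += 1
        m = BROKEN_BMAP_RE.search(line)
        if m:
            inodes[(m.group("device"), int(m.group("ino")))] += 1
    return dict(blocks), dict(inodes)


def new_block_numbers(lines_before, lines_after):
    """Block numbers reported after a fresh mount but not before it."""
    before, _ = summarize(lines_before)
    after, _ = summarize(lines_after)
    return sorted(set(after) - set(before))


SAMPLE = [
    "Jan 03 22:36:38 [kernel] [  953.289973] NILFS: bad btree node "
    "(blocknr=26229286): level = 67, flags = 0xee, nchildren = 40",
    "Jan 03 22:36:38 [kernel] [  953.289976] NILFS error (device sda2): "
    "nilfs_bmap_lookup_contig: broken bmap (inode number=102230)",
]
```

Feeding it the log lines from before the umount and from after the next mount (two separate lists) shows whether the same block number comes back or a different one appears.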

2. As I understand it, after such an error message you have a corrupted file on your volume (for example, the file with inode number=102230).

On 2013-1-6, at 12:46, Elmer Zhang <freeboy6716@xxxxxxxxx> wrote:

> I have found the corrupted file using inode number:
> [root@yf237 data0]# cat mysql6003/app_wyxgrab/weibo_rank.MYI > /dev/null 
> cat: mysql6003/app_wyxgrab/weibo_rank.MYI: Input/output error

Could you share the strace output of the "cat" command for such a corrupted file? The syslog may also contain interesting details from the time the "cat" command runs. Could you check it for error messages during such an attempt?
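
In addition to strace (not instead of it), it could be useful to know at which offsets the reads fail. Below is a minimal sketch that reads a file block by block and records the offsets where read returns an error such as EIO. The function name probe_file() is hypothetical, and the 4096-byte step is only an assumption about the block size; please adjust it to the block size of your volume.

```python
import errno
import os


def probe_file(path, block_size=4096):
    """Read `path` block by block; return [(offset, errno_name), ...] for failed reads.

    block_size=4096 is an assumed default, not a value taken from the volume.
    """
    failures = []
    fd = os.open(path, os.O_RDONLY)
    try:
        size = os.fstat(fd).st_size
        offset = 0
        while offset < size:
            try:
                data = os.pread(fd, block_size, offset)
            except OSError as e:
                # Record the failing offset and skip ahead to the next block.
                name = errno.errorcode.get(e.errno, str(e.errno))
                failures.append((offset, name))
                offset += block_size
                continue
            if not data:
                break  # unexpected end of file
            offset += len(data)
    finally:
        os.close(fd)
    return failures
```

Calling probe_file("mysql6003/app_wyxgrab/weibo_rank.MYI") from the data directory should give an empty list for a healthy file and a list of (offset, "EIO") pairs for a file like the one above.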

3. Could you share your kernel configuration file (.config)? I suspect that your environment has some special configuration that mine does not.

4. Could you share the content of the nilfs_cleanerd.conf file for the NILFS2 partition that has the issue? Sorry if I am asking about this again.

5. Did you have any sudden power-off before you first encountered the issue?

6. I understand that it may not be easy, but could you share the system log from the first occurrence of the issue? I only need details about how the system was behaving before the issue appeared.

7. I analyzed the raw dump of the segment that I received from Elmer Zhang. Currently, my feeling is that the driver tries to use a block that has already been filled by GC. But this needs deeper investigation, and at the moment I don't understand how the issue can be triggered. Successfully reproducing the issue is half of the success.

Thanks,
Vyacheslav Dubeyko.
