RE: Re: Kernel Bug on Linux 4.1.31, possibly nilfs, not sure

Hi Michael,

As far as I can see, the nilfs thread is simply sleeping:

[5187529.914580]  [<ffffffff80d6ab5c>] ? __schedule+0x4a2/0x6ba
[5187529.914580]  [<ffffffff8054415d>] ? radix_tree_lookup_slot+0x10/0x24
[5187529.914580]  [<ffffffff802b9a59>] ? find_get_entry+0x15/0x63
[5187529.914580]  [<ffffffff802b9b6d>] ? pagecache_get_page+0x74/0x15c
[5187529.914580]  [<ffffffff80458a81>] ? nilfs_grab_buffer+0xa2/0xd9

The most probable cause is a crash in another kernel thread. But it's hard to say
whether the bug you shared is the direct cause of this issue with the nilfs thread.
Usually, the radix tree has some memory reservation. But if the memory
subsystem was severely corrupted, that could affect the radix tree too. There is not
enough information to draw any conclusions. But I suspect that the root cause of the
issue lives in the memory subsystem code.
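
To compare several crash logs, it may help to pull the symbols out of each
trace and count how often nilfs functions appear. Below is a minimal sketch in
Python, assuming the frame layout shown in the excerpt above. The names
FRAME_RE, parse_frames, count_symbols and SAMPLE are made up for this
illustration and are not part of any kernel tooling. In x86 traces a leading
"?" marks an address that was found on the stack but that the unwinder could
not confirm as part of the call chain, so the sketch records that flag too.

```python
import re
from collections import Counter

# Matches frames such as:
#   [5187529.914580]  [<ffffffff80458a81>] ? nilfs_grab_buffer+0xa2/0xd9
# The layout is an assumption taken from the excerpt in this thread.
FRAME_RE = re.compile(
    r"\[\s*(?P<ts>\d+\.\d+)\]\s+"
    r"\[<(?P<addr>[0-9a-f]+)>\]\s+"
    r"(?P<unreliable>\?\s+)?"
    r"(?P<sym>[A-Za-z_][\w.]*)\+0x(?P<off>[0-9a-f]+)/0x(?P<size>[0-9a-f]+)"
)


def parse_frames(log_text):
    """Return (symbol, unreliable) pairs for every stack frame line found."""
    frames = []
    for line in log_text.splitlines():
        match = FRAME_RE.search(line)
        if match:
            frames.append((match.group("sym"), bool(match.group("unreliable"))))
    return frames


def count_symbols(log_text, prefix):
    """Count frames whose symbol name starts with the given prefix."""
    return Counter(
        sym for sym, _ in parse_frames(log_text) if sym.startswith(prefix)
    )


# The five frames quoted earlier in this message.
SAMPLE = """\
[5187529.914580]  [<ffffffff80d6ab5c>] ? __schedule+0x4a2/0x6ba
[5187529.914580]  [<ffffffff8054415d>] ? radix_tree_lookup_slot+0x10/0x24
[5187529.914580]  [<ffffffff802b9a59>] ? find_get_entry+0x15/0x63
[5187529.914580]  [<ffffffff802b9b6d>] ? pagecache_get_page+0x74/0x15c
[5187529.914580]  [<ffffffff80458a81>] ? nilfs_grab_buffer+0xa2/0xd9
"""

if __name__ == "__main__":
    for sym, unreliable in parse_frames(SAMPLE):
        print(sym, "(unreliable)" if unreliable else "")
    print(count_symbols(SAMPLE, "nilfs_"))
```

Running count_symbols over each saved log with the prefix "nilfs_" (or
"radix_tree_", "memcg_") gives a rough picture of which code paths keep
showing up. It does not prove which one is at fault.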

Thanks,
Viacheslav Dubeyko.

---- Original Message ----
Subject: Re: Kernel Bug on Linux 4.1.31, possibly nilfs, not sure
Sent: Nov 28, 2016 10:02 AM
From: Michael Conrad <mconrad@xxxxxxxxxxxxxxx>
To: linux-nilfs@xxxxxxxxxxxxxxx
Cc: 

Hello, I found a bug that might be related.

https://patchwork.kernel.org/patch/9339843/

On several of my crash dumps, I saw it die in 
"memcg_drain_all_list_lrus" and found this mentioned in a bug where FUSE 
was calling functions to modify the radix tree directly in a way that 
other kernel developers weren't expecting.  (I don't fully understand 
the details)

Our systems that are crashing are not using the FUSE driver for 
anything, but the crashes still coincide with nilfs2 activity.  Is there 
a chance nilfs is affected by this same situation?

I've attached another example crash log.  We're still on 4.1.31, but 
will move to 4.1.35 soon.

-Mike

On 9/28/2016 4:59 PM, Michael Conrad wrote:
> Hi, I've started getting "kernel bug" messages on a few systems. At 
> first I wasn't sure if it was related to faulty hardware or not.  The 
> kernel stack traces do not always mention nilfs, but they roughly 
> relate to the times when we make nilfs snapshots of our system (rsync 
> from ext4 into nilfs).
>
> We use the same kernel image on several dozen systems, and aside from 
> one server, there have only been two crashes like this in the past two 
> months.  (that "one server" though has crashed about four or five 
> times, which is why I suspected hardware at first)
>
> However I just had one happen on a Linode, which is pretty reliable 
> hardware, so I figured I'd post and see what people think.  The end of 
> the kernel log is attached.
>
> I'm worried that something is corrupting kernel memory, and then 
> causing crashes in unrelated parts of the kernel.  I'm really not 
> sure how to narrow it down other than turning off the nilfs snapshots 
> and seeing if it continues to happen, though then I have to come up with 
> another backup solution in the meantime.
>
> It's worth noting that the server crashing frequently also has the 
> largest tree of files.  Also, the nilfs filesystems on these systems 
> date back to various versions of nilfs from the 3.* kernel line. It's 
> possible that an old bug is lurking on the filesystem structure, but I 
> don't believe nilfs has a check tool yet, correct?
>
> Thanks,
> Michael Conrad