Hi, We are getting a number of 'bad page state' and kernel oopses which (I suspect) are related to fscache or cachefilesd. At the time that these occur, the process that was running will go in to an uniterruptable state Sample bad page state: ----------------------------------------------------8<---------------------------------------------------- Jan 22 19:10:30 rb031 kernel: BUG: Bad page state in process nuke pfn:9c509 Jan 22 19:10:30 rb031 kernel: page:ffffea0002714240 count:0 mapcount:0 mapping: (null) index:0x6e63 Jan 22 19:10:30 rb031 kernel: page flags: 0x4000000000001000(private_2) Jan 22 19:10:30 rb031 kernel: Pid: 26646, comm: nuke Tainted: G B D 3.1.7 #2 Jan 22 19:10:30 rb031 kernel: Call Trace: Jan 22 19:10:30 rb031 kernel: [<ffffffff810a99c5>] bad_page+0xe5/0xfb Jan 22 19:10:30 rb031 kernel: [<ffffffff810aaf3c>] get_page_from_freelist+0x2ff/0x433 Jan 22 19:10:30 rb031 kernel: [<ffffffff810ab745>] __alloc_pages_nodemask+0x32a/0x6ac Jan 22 19:10:30 rb031 kernel: [<ffffffffa009caf3>] cachefiles_read_backing_file+0x12d/0x4b0 [cachefiles] Jan 22 19:10:30 rb031 kernel: [<ffffffffa009d108>] cachefiles_read_or_alloc_pages+0x292/0x2d1 [cachefiles] Jan 22 19:10:30 rb031 kernel: [<ffffffff81054da8>] ? bit_waitqueue+0x17/0x97 Jan 22 19:10:30 rb031 kernel: [<ffffffff81054e80>] ? wake_up_bit+0x25/0x2a Jan 22 19:10:30 rb031 kernel: [<ffffffffa007947e>] ? fscache_run_op+0x2e/0x48 [fscache] Jan 22 19:10:30 rb031 kernel: [<ffffffffa0079e7a>] ? fscache_submit_op+0x202/0x422 [fscache] Jan 22 19:10:30 rb031 kernel: [<ffffffffa007b450>] __fscache_read_or_alloc_pages+0x21f/0x301 [fscache] Jan 22 19:10:30 rb031 kernel: [<ffffffffa0224fc1>] __nfs_readpages_from_fscache+0x80/0x14e [nfs] Jan 22 19:10:30 rb031 kernel: [<ffffffff810ab745>] ? __alloc_pages_nodemask+0x32a/0x6ac Jan 22 19:10:30 rb031 kernel: [<ffffffffa0202593>] nfs_readpages+0xf2/0x16f [nfs] Jan 22 19:10:30 rb031 kernel: [<ffffffff810ad879>] __do_page_cache_readahead+0x142/0x1c1 Jan 22 19:10:30 rb031 kernel: [<ffffffff810ad919>] ra_submit+0x21/0x25 Jan 22 19:10:30 rb031 kernel: [<ffffffff810adc8b>] ondemand_readahead+0x18e/0x1a1 Jan 22 19:10:30 rb031 kernel: [<ffffffff81054e00>] ? bit_waitqueue+0x6f/0x97 Jan 22 19:10:30 rb031 kernel: [<ffffffff810add5d>] page_cache_sync_readahead+0x3d/0x40 Jan 22 19:10:30 rb031 kernel: [<ffffffff810a7574>] generic_file_aio_read+0x299/0x611 Jan 22 19:10:30 rb031 kernel: [<ffffffff8101b582>] ? flat_send_IPI_mask+0x11/0x13 Jan 22 19:10:30 rb031 kernel: [<ffffffffa01f835f>] nfs_file_read+0xae/0xd3 [nfs] Jan 22 19:10:30 rb031 kernel: [<ffffffff810e3bee>] do_sync_read+0xcb/0x108 Jan 22 19:10:30 rb031 kernel: [<ffffffff811b868d>] ? fsnotify_perm+0x66/0x72 Jan 22 19:10:30 rb031 kernel: [<ffffffff811b86fa>] ? security_file_permission+0x2e/0x33 Jan 22 19:10:30 rb031 kernel: [<ffffffff810e45b5>] vfs_read+0xa7/0xe3 Jan 22 19:10:30 rb031 kernel: [<ffffffff810e46b4>] sys_read+0x4a/0x71 Jan 22 19:10:30 rb031 kernel: [<ffffffff813ba7ab>] system_call_fastpath+0x16/0x1b Jan 22 19:10:30 rb031 kernel: [<ffffffff810ad7f8>] __do_page_cache_readahead+0xc1/0x1c1 Jan 22 19:10:30 rb031 kernel: [<ffffffff813b8867>] ? schedule+0x5a/0x5c Jan 22 19:10:30 rb031 kernel: [<ffffffff813b88bf>] ? io_schedule+0x56/0x6b Jan 22 19:10:30 rb031 kernel: [<ffffffff810ad919>] ra_submit+0x21/0x25 Jan 22 19:10:30 rb031 kernel: [<ffffffff810adc8b>] ondemand_readahead+0x18e/0x1a1 Jan 22 19:10:30 rb031 kernel: [<ffffffff810add17>] page_cache_async_readahead+0x79/0x82 Jan 22 19:10:30 rb031 kernel: [<ffffffff810a5c24>] ? find_get_page+0x48/0x6a Jan 22 19:10:30 rb031 kernel: [<ffffffff810a75b4>] generic_file_aio_read+0x2d9/0x611 Jan 22 19:10:30 rb031 kernel: [<ffffffffa01f835f>] nfs_file_read+0xae/0xd3 [nfs] Jan 22 19:10:30 rb031 kernel: [<ffffffff810e3bee>] do_sync_read+0xcb/0x108 Jan 22 19:10:30 rb031 kernel: [<ffffffff811b868d>] ? fsnotify_perm+0x66/0x72 Jan 22 19:10:30 rb031 kernel: [<ffffffff811b86fa>] ? security_file_permission+0x2e/0x33 Jan 22 19:10:30 rb031 kernel: [<ffffffff810e45b5>] vfs_read+0xa7/0xe3 Jan 22 19:10:30 rb031 kernel: [<ffffffff810e46b4>] sys_read+0x4a/0x71 Jan 22 19:10:30 rb031 kernel: [<ffffffff813ba7ab>] system_call_fastpath+0x16/0x1b Jan 22 19:10:52 rb031 abrt: Kerneloops: Reported 12 kernel oopses to Abrt sample Kernel oops: ----------------------------------------------------8<---------------------------------------------------- Jan 24 12:55:12 ra020 kernel: ------------[ cut here ]------------ Jan 24 12:55:12 ra020 kernel: kernel BUG at fs/nfs/fscache.c:362! Jan 24 12:55:12 ra020 kernel: invalid opcode: 0000 [#3] SMP Jan 24 12:55:12 ra020 kernel: last sysfs file: /sys/devices/system/cpu/cpu7/cache/index2/shared_cpu_map Jan 24 12:55:12 ra020 kernel: CPU 4 Jan 24 12:55:12 ra020 kernel: Modules linked in: ip6table_filter ip6_tables ebtable_nat ebtables ipt_MASQUERADE iptable_nat nf_nat bridge stp llc autofs4 nfs lockd nfs_acl auth_rpcgss sunrpc cachefiles fscache ipv6 kvm uinput serio_raw bnx2 iTCO_wdt iTCO_vendor_support dcdbas microcode power_meter [last unloaded: speedstep_lib] Jan 24 12:55:12 ra020 kernel: Jan 24 12:55:12 ra020 kernel: Pid: 81, comm: kswapd0 Tainted: G D 2.6.34.9-69.fc13.x86_64 #1 01V648/PowerEdge R410 Jan 24 12:55:12 ra020 kernel: RIP: 0010:[<ffffffffa01a2185>] [<ffffffffa01a2185>] nfs_fscache_release_page+0x3c/0xa6 [nfs] Jan 24 12:55:12 ra020 kernel: RSP: 0018:ffff88042607f990 EFLAGS: 00010246 Jan 24 12:55:12 ra020 kernel: RAX: ffff88041e8aea88 RBX: ffffea000df278f8 RCX: ffff88041e8ae8d0 Jan 24 12:55:12 ra020 kernel: RDX: 0040000000001009 RSI: 00000000000000d0 RDI: ffffea000df278f8 Jan 24 12:55:12 ra020 kernel: RBP: ffff88042607f9b0 R08: ffffea000df27920 R09: 00000000ffffffc8 Jan 24 12:55:12 ra020 kernel: R10: ffff88042607faf0 R11: ffffea000a23b540 R12: 0000000000000000 Jan 24 12:55:12 ra020 kernel: R13: 00000000000000d0 R14: ffff88042607fc30 R15: 0000000000000001 Jan 24 12:55:12 ra020 kernel: FS: 0000000000000000(0000) GS:ffff880237440000(0000) knlGS:0000000000000000 Jan 24 12:55:12 ra020 kernel: CS: 0010 DS: 0000 ES: 0000 CR0: 000000008005003b Jan 24 12:55:12 ra020 kernel: CR2: 00007fe1d3b2e000 CR3: 0000000001a42000 CR4: 00000000000006e0 Jan 24 12:55:12 ra020 kernel: DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000 Jan 24 12:55:12 ra020 kernel: DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400 Jan 24 12:55:12 ra020 kernel: Process kswapd0 (pid: 81, threadinfo ffff88042607e000, task ffff880426928000) Jan 24 12:55:12 ra020 kernel: Stack: Jan 24 12:55:12 ra020 kernel: 00000000000000d0 ffffea000df278f8 00000000000000d0 ffff88041e8aeba0 Jan 24 12:55:12 ra020 kernel: <0> ffff88042607f9e0 ffffffffa017a1dc 000000000a23b540 ffffea000df27920 Jan 24 12:55:12 ra020 kernel: <0> ffffea000df278f8 ffff88042607fda0 ffff88042607f9f0 ffffffff810ca39b Jan 24 12:55:12 ra020 kernel: Call Trace: Jan 24 12:55:12 ra020 kernel: [<ffffffffa017a1dc>] nfs_release_page+0x7d/0x85 [nfs] Jan 24 12:55:12 ra020 kernel: [<ffffffff810ca39b>] try_to_release_page+0x34/0x3d Jan 24 12:55:12 ra020 kernel: [<ffffffff810d6287>] shrink_page_list+0x300/0x477 Jan 24 12:55:12 ra020 kernel: [<ffffffff81107e93>] ? mem_cgroup_del_lru+0x39/0x3b Jan 24 12:55:12 ra020 kernel: [<ffffffff810d531c>] ? isolate_pages_global+0xb7/0x1f7 Jan 24 12:55:12 ra020 kernel: [<ffffffff810d6790>] shrink_inactive_list+0x392/0x68e Jan 24 12:55:12 ra020 kernel: [<ffffffff810d6de9>] shrink_zone+0x35d/0x411 Jan 24 12:55:12 ra020 kernel: [<ffffffff810d7b6b>] balance_pgdat+0x337/0x5b6 Jan 24 12:55:12 ra020 kernel: [<ffffffff810d5265>] ? isolate_pages_global+0x0/0x1f7 Jan 24 12:55:12 ra020 kernel: [<ffffffff810d8025>] kswapd+0x23b/0x266 Jan 24 12:55:12 ra020 kernel: [<ffffffff810662ff>] ? autoremove_wake_function+0x0/0x39 Jan 24 12:55:12 ra020 kernel: [<ffffffff810d7dea>] ? kswapd+0x0/0x266 Jan 24 12:55:12 ra020 kernel: [<ffffffff81065e85>] kthread+0x7f/0x87 Jan 24 12:55:12 ra020 kernel: [<ffffffff8100aa64>] kernel_thread_helper+0x4/0x10 Jan 24 12:55:12 ra020 kernel: [<ffffffff81065e06>] ? kthread+0x0/0x87 Jan 24 12:55:12 ra020 kernel: [<ffffffff8100aa60>] ? kernel_thread_helper+0x0/0x10 Jan 24 12:55:12 ra020 kernel: Code: 00 48 8b 17 b8 01 00 00 00 48 89 fb 41 89 f5 80 e6 10 74 79 48 8b 47 18 48 8b 00 4c 8b 60 f8 48 8d 88 48 fe ff ff 4d 85 e4 75 04 <0f> 0b eb fe f6 05 3d 29 f8 ff 08 74 14 48 89 fa 4c 89 e6 48 c7 Jan 24 12:55:12 ra020 kernel: RIP [<ffffffffa01a2185>] nfs_fscache_release_page+0x3c/0xa6 [nfs] Jan 24 12:55:12 ra020 kernel: RSP <ffff88042607f990> Jan 24 12:55:12 ra020 kernel: ---[ end trace 37b1b21bd4ae7177 ]--- Jan 24 12:56:18 ra020 abrt: Kerneloops: Reported 1 kernel oopses to Abrt ----------------------------------------------------8<---------------------------------------------------- We also get unreadable PIDs for processes which appear to be related to these bad page states, i.e., trying to run a 'ps', 'w' or anything else that needs to read the PID's directory of the process that has a bad page state in /proc hangs (is this a result of page state corruption?) A bit about our environment: The NFS server is an Isilon running OneFS. We have two clients accessing the NFS server, both running FS-Cache. One set of clients submits graphics rendering jobs on an NFS filesystem that are rendered by the other clients (dedicated render boxes). All clients are running Fedora 13 x86_64 FS-cache + cachefiles (cachefilesd-0.10.1-1.fc13.x86_64) over NFS on Fedora 13 x86_64 (most run kernel 2.6.34.9-69.fc13.x86_64, although quite a few run other variants of 2.6.34 such as 2.6.34.7-56.fc13.x86_64, we have also tried a backported 2.6.35 fedora 14 kernel, and 2.6.39 kernel, as well as a vanilla 3.1.7 kernel, with no decrease in errors). We have also tried switching off cachefilesd user space process, but that does not stop the bad page states or oopses. After a while, maybe due to page state problems, the load begins to grow due to uninterruptible processes and unreadable PIDs and the boxes need to be rebooted. Any idea what is going wrong here? Any help greatly appreciated! Kind regards, Campbell --------------------------------------------------------------------- PS: probably unrelated, but we also get a lot of these errors (sometimes these appear, sometimes not): Jan 22 03:51:41 ra042 kernel: CacheFiles: Error: Overlong wait for old active object to go away Jan 22 03:51:41 ra042 kernel: object: OBJ640b Jan 22 03:51:41 ra042 kernel: objstate=OBJECT_LOOKING_UP fl=0 swfl=a ev=0[7b] Jan 22 03:51:41 ra042 kernel: ops=0 inp=0 exc=0 Jan 22 03:51:41 ra042 kernel: parent=ffff880420814900 Jan 22 03:51:41 ra042 kernel: cookie=ffff8802252110a0 [pr=ffff88042508b230 nd=ffff88014d7e2850 fl=7] Jan 22 03:51:41 ra042 kernel: key=[32] '60a6fe830c0000001862a15901000000ffffffff000000000c00420001000000' Jan 22 03:51:41 ra042 kernel: xobject: OBJ640a Jan 22 03:51:41 ra042 kernel: xobjstate=OBJECT_DYING fl=0 swfl=8 ev=20[5] Jan 22 03:51:41 ra042 kernel: xops=1 inp=1 exc=0 Jan 22 03:51:41 ra042 kernel: xparent=ffff880420814900 Jan 22 03:51:41 ra042 kernel: xcookie=NULL Jan 22 03:52:41 ra042 kernel: -- Linux-cachefs mailing list Linux-cachefs@xxxxxxxxxx https://www.redhat.com/mailman/listinfo/linux-cachefs