Problem: Race in fscache operation enqueuing for reading and copying multiple pages from cachefiles to netfs causes kernel oops or at worst memory scribbling. Symptoms: May 10 13:49:36 sc-sdgx-208 kernel: [510915.776755] BUG: unable to handle kernel NULL pointer dereference at 0000000000000008 May 10 13:49:36 sc-sdgx-208 kernel: [510915.776793] IP: [<ffffffff81095951>] process_one_work+0x31/0x3f0 May 10 13:49:36 sc-sdgx-208 kernel: [510915.776821] PGD 5da5094067 PUD 6caa59d067 PMD0 May 10 13:49:36 sc-sdgx-208 kernel: [510915.776843] Oops: 0000 [#1] SMP May 10 13:49:36 sc-sdgx-208 kernel: [510915.776858] Modules linked in: nfsv3 xt_recent veth xt_multiport ipt_MASQUERADE nf_nat_masquerade_ipv4 xfrm_user xfrm_algo iptable_nat nf_nat_ipv4 br_netfilter bridge stp llc aufs ipmi_devintf cachefiles msr nvidia_drm(POE) nvidia_uvm(POE) ip6t_REJECT nf_reject_ipv6 nf_log_ipv6 xt_hl ip6t_rt nf_conntrack_ipv6 nf_defrag_ipv6 bonding ipmi_ssif ipt_REJECT nf_reject_ipv4 nf_log_ipv4 nf_log_common xt_LOG xt_limit xt_tcpudp xt_addrtype mxm_wmi nf_conntrack_ipv4 nf_defrag_ipv4 xt_conntrack ip6table_filter ip6_tables nf_conntrack_netbios_ns nf_conntrack_broadcast nf_nat_ftp intel_rapl nf_nat x86_pkg_temp_thermal intel_powerclamp nf_conntrack_ftp coretemp nf_conntrack kvm_intel iptable_filter ip_tables kvm x_tables irqbypass crct10dif_pclmul crc32_pclmul aesni_intel aes_x86_64 lrw gf128mul glue_helper ablk_helper cryptd ast ttm input_leds drm_kms_helper sb_edac joydev drm edac_core i2c_algo_bit fb_sys_fops syscopyarea sysfillrect sysimgblt mei_me mei lpc_ich shpchp ipmi_si ipmi_msghandler 8250_fintek wmi acpi_power_meter nvidia_modeset(POE) mac_hid knem(OE) lp parport nfsd nvidia(POE) auth_rpcgss nfs_acl nfs lockd grace sunrpc fscache rdma_ucm(OE) ib_ucm(OE) rdma_cm(OE) iw_cm(OE) configfs ib_ipoib(OE) ib_cm(OE) ib_uverbs(OE) ib_umad(OE) mlx5_ib(OE) mlx4_ib(OE) ib_sa(OE) ib_mad(OE) ib_core(OE) ib_addr(OE) ib_netlink(OE) mlx4_en(OE) mlx4_core(OE) nls_iso8859_1 ses enclosure hid_generic mlx5_core(OE) usbhid hid ixgbe dca mlx_compat(OE) megaraid_sas vxlan ip6_udp_tunnel udp_tunnel ahci ptp libahci pps_core mdio fjes May 10 13:49:36 sc-sdgx-208 kernel: [510915.777450] CPU: 78 PID: 24681 Comm: kworker/u162:7 Tainted: P OE 4.4.0-45-generic #66~14.04.1-Ubuntu May 10 13:49:36 sc-sdgx-208 kernel: [510915.777492] Hardware name: XXXXXXXXXXXX , BIOS S2W_3A01.NVD02 11/03/2016 May 10 13:49:36 sc-sdgx-208 kernel: [510915.777529] task: ffff88192acc0000 ti: ffff882804c44000 task.ti: ffff882804c44000 May 10 13:49:36 sc-sdgx-208 kernel: [510915.777572] RIP: 0010:[<ffffffff81095951>] [<ffffffff81095951>] process_one_work+0x31/0x3f0 May 10 13:49:36 sc-sdgx-208 kernel: [510915.777606] RSP: 0000:ffff882804c47e28 EFLAGS: 00010046 May 10 13:49:36 sc-sdgx-208 kernel: [510915.777624] RAX: ffff887dff0b1e00 RBX: ffff887ec7b81a40 RCX: 0000000000000001 May 10 13:49:36 sc-sdgx-208 kernel: [510915.777646] RDX: ffff887dff0b1e00 RSI: ffff887dff0b0240 RDI: ffff887ec7b81a40 May 10 13:49:36 sc-sdgx-208 kernel: [510915.777668] RBP: ffff882804c47e60 R08: 0000000000000101 R09: 00000001802a001b May 10 13:49:36 sc-sdgx-208 kernel: [510915.777690] R10: 00000000ff0b1e01 R11: ffffea01f7fc2c00 R12: ffff887f657a0000 May 10 13:49:36 sc-sdgx-208 kernel: [510915.777712] R13: 0000000000000000 R14: ffff88192acc0000 R15: ffff887dff0b0240 May 10 13:49:36 sc-sdgx-208 kernel: [510915.777734] FS: 0000000000000000(0000) GS:ffff887f7f580000(0000) knlGS:0000000000000000 May 10 13:49:36 sc-sdgx-208 kernel: [510915.777758] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 May 10 13:49:36 sc-sdgx-208 kernel: [510915.777777] CR2: 00000000000000b8 CR3: 0000005ef67a2000 CR4: 00000000003406e0 May 10 13:49:36 sc-sdgx-208 kernel: [510915.777799] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000 May 10 13:49:36 sc-sdgx-208 kernel: [510915.777820] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400 May 10 13:49:36 sc-sdgx-208 kernel: [510915.777842] Stack: May 10 13:49:36 sc-sdgx-208 kernel: [510915.777851] 00000000657a0018 0000000000000000 ffff887f657a0000 ffff887f657a0018 May 10 13:49:36 sc-sdgx-208 kernel: [510915.777877] ffff887ec7b81a70 ffff88192acc0000 ffff887ec7b81a40 ffff882804c47ec8 May 10 13:49:36 sc-sdgx-208 kernel: [510915.777902] ffffffff810961ea ffffffff817f6659 ffff887f60bf0000 ffff88192acc0000 May 10 13:49:36 sc-sdgx-208 kernel: [510915.777928] Call Trace: May 10 13:49:36 sc-sdgx-208 kernel: [510915.777942] [<ffffffff810961ea>] worker_thread+0x11a/0x470 May 10 13:49:36 sc-sdgx-208 kernel: [510915.777963] [<ffffffff817f6659>] ? __schedule+0x359/0x980 May 10 13:49:36 sc-sdgx-208 kernel: [510915.777982] [<ffffffff810960d0>] ? rescuer_thread+0x310/0x310 May 10 13:49:36 sc-sdgx-208 kernel: [510915.778003] [<ffffffff8109b959>] kthread+0xc9/0xe0 May 10 13:49:36 sc-sdgx-208 kernel: [510915.778020] [<ffffffff8109b890>] ? kthread_park+0x60/0x60 May 10 13:49:36 sc-sdgx-208 kernel: [510915.778039] [<ffffffff817fa88f>] ret_from_fork+0x3f/0x70 May 10 13:49:36 sc-sdgx-208 kernel: [510915.778056] [<ffffffff8109b890>] ? kthread_park+0x60/0x60 May 10 13:49:36 sc-sdgx-208 kernel: [510915.778074] Code: 48 89 e5 41 57 49 89 f7 41 56 41 55 45 31 ed 41 54 53 48 89 fb 48 83 ec 10 48 8b 06 4c 8b 67 48 48 89 c2 30 d2 a8 04 4c 0f 45 ea <49> 8b 45 08 44 8b b0 00 01 00 00 41 c1 ee 05 41 83 e6 01 41 f6 May 10 13:49:36 sc-sdgx-208 kernel: [510915.779021] RIP [<ffffffff81095951>] process_one_work+0x31/0x3f0 May 10 13:49:36 sc-sdgx-208 kernel: [510915.779881] RSP <ffff882804c47e28> May 10 13:49:36 sc-sdgx-208 kernel: [510915.780787] CR2: 0000000000000008 May 10 13:49:36 sc-sdgx-208 kernel: [510915.784428] ---[ end trace 164b63cc6ad868c8 ]--- Root cause: Lets Take an example of reading two pages. Thread1: nfs_readpages nfs_readpages_from_fscache __nfs_readpages_from_fscache __fscache_read_or_alloc_pages(cookie,mapping,pages, nr_pages, end_io_func, context , gfp) op = fscache_alloc_retrieval(cookie, mapping, end_io_func, context) (op->usage =1) |cachefiles_read_or_alloc_pages |cachefiles_read_backing_file |op->op.processor = cache_files_read_copier for each netpage in list of netpages | allocate a monitor(cachefiles_one_read) one for each page ( tracker for netfs page , cachefs_page) | fscache_get_retreival(op) (op->usage ++) | find backing page. allocate if not found | read_backing_page: bmapping->a_ops->readpage(NULL,backpage) - read the page from disk | add_page_wait_queue(monitor->monitor, cachefiles_read_waiter) | fscache_put_retrieval(op) (op->usage --) So for a simple two page read , op->usage is at 2 (one for each page) Thread 2 cachefiles_read_waiter (monitor) fscache_enqueue_retrieval add the monitor to op->to_do list ##### slight DELAY for second page, so the op->to_do list has two pages, #### ##### op->usage is at 3 (2 from above + 1 for the first page) ######### fscache_enqueue_operation ## so by the time the second page is being queued, the op is freed and usage is at 0. ### fscache_get_retreival(op) (op->usage ++) queue_work(fscache_op_wq, &op->work) op->processor is set to cache_files_read_copier So for a simple two page read , op->usage is at 4(GOOD case) or 1 (BAD case) Thread 3 fscache_op_work_func(op) cachefiles_read_copier() ### loops two times for two pages as the op->to_do has two pages while (op->to_do) --> monitor { copy_highpage(monitor->netfs_page, monitor->back_page); fscache_mark_page_cached(monitor->op, monitor->netfs_page); put_page(monitor->back_page); fscache_end_io(op, monitor->netfs_page, error); --> notify page copy done to Netfs put_page(monitor->netfs_page); fscache_retrieval_complete(op, 1); fscache_put_retrieval(op); (op->usage--) kfree(monitor); } fscache_put_retrieval(op); (op->usage --) # if the first callback finishes before the delay, op->usage is at 0 and the memory for OP is freed and potentially reused # if the first callback pushes into the cache->op_gc_list , we may end up with ASSERTION FAILED (5 == 6) # ASSERTIFCMP(op->state != FSCACHE_OP_ST_INITIALISED && op->state != FSCACHE_OP_ST_COMPLETE, op->state, ==, FSCACHE_OP_ST_CANCELLED); #possible fix: #clearly adding the page to to_do list in cachefiles_read_copier prior to bumping the reference is a problem. bump up the op->usage before queuing under the object. This ensures the op is not freed. (op->usage is atleast 1) # enqueue the operation for second page. op->usage is 2 # decrement the op->usage so that the second callback frees properly. static int cachefiles_read_waiter(wait_queue_t *wait, unsigned mode, int sync, void *_key) { ... ... spin_lock(&object->work_lock); + fscache_get_retrieval(monitor->op); list_add_tail(&monitor->op_link, &monitor->op->to_do); spin_unlock(&object->work_lock); fscache_enqueue_retrieval(monitor->op); + fscache_put_retrieval(monitor->op); return 0; } Thanks Kiran -- Linux-cachefs mailing list Linux-cachefs@xxxxxxxxxx https://www.redhat.com/mailman/listinfo/linux-cachefs