Hi,

I have put a patch for this issue in our local environment and it seems to be working okay. Could someone review the patch and let me know whether I am headed in the right direction?

Thanks,
Kiran

On Wed, May 17, 2017 at 1:12 PM, KiranKumar Modukuri <kiran.modukuri@xxxxxxxxx> wrote:

> Problem: A race in fscache operation enqueuing while reading and copying
> multiple pages from cachefiles to the netfs causes a kernel oops or, at
> worst, memory scribbling.
>
> Symptoms:
>
> May 10 13:49:36 sc-sdgx-208 kernel: [510915.776755] BUG: unable to handle kernel NULL pointer dereference at 0000000000000008
> May 10 13:49:36 sc-sdgx-208 kernel: [510915.776793] IP: [<ffffffff81095951>] process_one_work+0x31/0x3f0
> May 10 13:49:36 sc-sdgx-208 kernel: [510915.776821] PGD 5da5094067 PUD 6caa59d067 PMD 0
> May 10 13:49:36 sc-sdgx-208 kernel: [510915.776843] Oops: 0000 [#1] SMP
> May 10 13:49:36 sc-sdgx-208 kernel: [510915.776858] Modules linked in: nfsv3 xt_recent veth xt_multiport ipt_MASQUERADE nf_nat_masquerade_ipv4
> xfrm_user xfrm_algo iptable_nat nf_nat_ipv4 br_netfilter bridge stp llc aufs ipmi_devintf cachefiles msr nvidia_drm(POE) nvidia_uvm(POE)
> ip6t_REJECT nf_reject_ipv6 nf_log_ipv6 xt_hl ip6t_rt nf_conntrack_ipv6 nf_defrag_ipv6 bonding ipmi_ssif ipt_REJECT nf_reject_ipv4 nf_log_ipv4
> nf_log_common xt_LOG xt_limit xt_tcpudp xt_addrtype mxm_wmi nf_conntrack_ipv4 nf_defrag_ipv4 xt_conntrack ip6table_filter ip6_tables
> nf_conntrack_netbios_ns nf_conntrack_broadcast nf_nat_ftp intel_rapl nf_nat x86_pkg_temp_thermal intel_powerclamp nf_conntrack_ftp coretemp
> nf_conntrack kvm_intel iptable_filter ip_tables kvm x_tables irqbypass crct10dif_pclmul crc32_pclmul aesni_intel aes_x86_64 lrw gf128mul
> glue_helper ablk_helper cryptd ast ttm input_leds drm_kms_helper sb_edac joydev drm edac_core i2c_algo_bit fb_sys_fops syscopyarea sysfillrect
> sysimgblt mei_me mei lpc_ich shpchp ipmi_si ipmi_msghandler 8250_fintek wmi acpi_power_meter nvidia_modeset(POE)
> mac_hid knem(OE) lp parport nfsd nvidia(POE) auth_rpcgss nfs_acl nfs lockd grace sunrpc fscache rdma_ucm(OE) ib_ucm(OE) rdma_cm(OE) iw_cm(OE)
> configfs ib_ipoib(OE) ib_cm(OE) ib_uverbs(OE) ib_umad(OE) mlx5_ib(OE) mlx4_ib(OE) ib_sa(OE) ib_mad(OE) ib_core(OE) ib_addr(OE) ib_netlink(OE)
> mlx4_en(OE) mlx4_core(OE) nls_iso8859_1 ses enclosure hid_generic mlx5_core(OE) usbhid hid ixgbe dca mlx_compat(OE) megaraid_sas vxlan
> ip6_udp_tunnel udp_tunnel ahci ptp libahci pps_core mdio fjes
> May 10 13:49:36 sc-sdgx-208 kernel: [510915.777450] CPU: 78 PID: 24681 Comm: kworker/u162:7 Tainted: P OE 4.4.0-45-generic #66~14.04.1-Ubuntu
> May 10 13:49:36 sc-sdgx-208 kernel: [510915.777492] Hardware name: XXXXXXXXXXXX, BIOS S2W_3A01.NVD02 11/03/2016
> May 10 13:49:36 sc-sdgx-208 kernel: [510915.777529] task: ffff88192acc0000 ti: ffff882804c44000 task.ti: ffff882804c44000
> May 10 13:49:36 sc-sdgx-208 kernel: [510915.777572] RIP: 0010:[<ffffffff81095951>] [<ffffffff81095951>] process_one_work+0x31/0x3f0
> May 10 13:49:36 sc-sdgx-208 kernel: [510915.777606] RSP: 0000:ffff882804c47e28 EFLAGS: 00010046
> May 10 13:49:36 sc-sdgx-208 kernel: [510915.777624] RAX: ffff887dff0b1e00 RBX: ffff887ec7b81a40 RCX: 0000000000000001
> May 10 13:49:36 sc-sdgx-208 kernel: [510915.777646] RDX: ffff887dff0b1e00 RSI: ffff887dff0b0240 RDI: ffff887ec7b81a40
> May 10 13:49:36 sc-sdgx-208 kernel: [510915.777668] RBP: ffff882804c47e60 R08: 0000000000000101 R09: 00000001802a001b
> May 10 13:49:36 sc-sdgx-208 kernel: [510915.777690] R10: 00000000ff0b1e01 R11: ffffea01f7fc2c00 R12: ffff887f657a0000
> May 10 13:49:36 sc-sdgx-208 kernel: [510915.777712] R13: 0000000000000000 R14: ffff88192acc0000 R15: ffff887dff0b0240
> May 10 13:49:36 sc-sdgx-208 kernel: [510915.777734] FS: 0000000000000000(0000) GS:ffff887f7f580000(0000) knlGS:0000000000000000
> May 10 13:49:36 sc-sdgx-208 kernel: [510915.777758] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> May 10 13:49:36 sc-sdgx-208 kernel: [510915.777777] CR2: 00000000000000b8 CR3: 0000005ef67a2000 CR4: 00000000003406e0
> May 10 13:49:36 sc-sdgx-208 kernel: [510915.777799] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
> May 10 13:49:36 sc-sdgx-208 kernel: [510915.777820] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
> May 10 13:49:36 sc-sdgx-208 kernel: [510915.777842] Stack:
> May 10 13:49:36 sc-sdgx-208 kernel: [510915.777851] 00000000657a0018 0000000000000000 ffff887f657a0000 ffff887f657a0018
> May 10 13:49:36 sc-sdgx-208 kernel: [510915.777877] ffff887ec7b81a70 ffff88192acc0000 ffff887ec7b81a40 ffff882804c47ec8
> May 10 13:49:36 sc-sdgx-208 kernel: [510915.777902] ffffffff810961ea ffffffff817f6659 ffff887f60bf0000 ffff88192acc0000
> May 10 13:49:36 sc-sdgx-208 kernel: [510915.777928] Call Trace:
> May 10 13:49:36 sc-sdgx-208 kernel: [510915.777942] [<ffffffff810961ea>] worker_thread+0x11a/0x470
> May 10 13:49:36 sc-sdgx-208 kernel: [510915.777963] [<ffffffff817f6659>] ? __schedule+0x359/0x980
> May 10 13:49:36 sc-sdgx-208 kernel: [510915.777982] [<ffffffff810960d0>] ? rescuer_thread+0x310/0x310
> May 10 13:49:36 sc-sdgx-208 kernel: [510915.778003] [<ffffffff8109b959>] kthread+0xc9/0xe0
> May 10 13:49:36 sc-sdgx-208 kernel: [510915.778020] [<ffffffff8109b890>] ? kthread_park+0x60/0x60
> May 10 13:49:36 sc-sdgx-208 kernel: [510915.778039] [<ffffffff817fa88f>] ret_from_fork+0x3f/0x70
> May 10 13:49:36 sc-sdgx-208 kernel: [510915.778056] [<ffffffff8109b890>] ? kthread_park+0x60/0x60
> May 10 13:49:36 sc-sdgx-208 kernel: [510915.778074] Code: 48 89 e5 41 57 49 89 f7 41 56 41 55 45 31 ed 41 54 53 48 89 fb 48 83 ec 10 48 8b 06 4c 8b 67 48 48 89 c2 30 d2 a8 04 4c 0f 45 ea <49> 8b 45 08 44 8b b0 00 01 00 00 41 c1 ee 05 41 83 e6 01 41 f6
> May 10 13:49:36 sc-sdgx-208 kernel: [510915.779021] RIP [<ffffffff81095951>] process_one_work+0x31/0x3f0
> May 10 13:49:36 sc-sdgx-208 kernel: [510915.779881] RSP <ffff882804c47e28>
> May 10 13:49:36 sc-sdgx-208 kernel: [510915.780787] CR2: 0000000000000008
> May 10 13:49:36 sc-sdgx-208 kernel: [510915.784428]
> ---[ end trace 164b63cc6ad868c8 ]---
>
> Root cause:
>
> Let's take the example of reading two pages.
>
> Thread 1:
>
> nfs_readpages
>   nfs_readpages_from_fscache
>     __nfs_readpages_from_fscache
>       __fscache_read_or_alloc_pages(cookie, mapping, pages, nr_pages, end_io_func, context, gfp)
>         op = fscache_alloc_retrieval(cookie, mapping, end_io_func, context)   (op->usage = 1)
>         cachefiles_read_or_alloc_pages
>           cachefiles_read_backing_file
>             op->op.processor = cachefiles_read_copier
>             for each netpage in the list of netpages:
>               allocate a monitor (cachefiles_one_read), one for each page (tracks the netfs page and the cachefs page)
>               fscache_get_retrieval(op)   (op->usage++)
>               find the backing page; allocate one if not found
>               read_backing_page: bmapping->a_ops->readpage(NULL, backpage) - read the page from disk
>               add_page_wait_queue(monitor->monitor, cachefiles_read_waiter)
>               fscache_put_retrieval(op)   (op->usage--)
>
> So for a simple two-page read, op->usage is at 2 (one for each page).
>
> Thread 2:
>
> cachefiles_read_waiter (monitor)
>   fscache_enqueue_retrieval
>     add the monitor to the op->to_do list
>     ##### slight DELAY for the second page, so the op->to_do list has two pages ####
>     ##### op->usage is at 3 (2 from above + 1 for the first page) #####
>     fscache_enqueue_operation
>       ## so by the time the second page is being enqueued, the op may already be freed and usage is at 0 ##
>       fscache_get_retrieval(op)   (op->usage++)
>       queue_work(fscache_op_wq, &op->work)   (op->processor is set to cachefiles_read_copier)
>
> So for a simple two-page read, op->usage is at 4 (GOOD case) or 1 (BAD case).
>
> Thread 3:
>
> fscache_op_work_func(op)
>   cachefiles_read_copier()
>     ### loops twice, as op->to_do has two pages ###
>     while (op->to_do) --> monitor {
>         copy_highpage(monitor->netfs_page, monitor->back_page);
>         fscache_mark_page_cached(monitor->op, monitor->netfs_page);
>         put_page(monitor->back_page);
>         fscache_end_io(op, monitor->netfs_page, error);   /* notify the netfs that the page copy is done */
>         put_page(monitor->netfs_page);
>         fscache_retrieval_complete(op, 1);
>         fscache_put_retrieval(op);   (op->usage--)
>         kfree(monitor);
>     }
>     fscache_put_retrieval(op);   (op->usage--)
>
> # If the first callback finishes before the delay, op->usage drops to 0 and the memory for the op is freed and potentially reused.
> # If the first callback pushes the op onto cache->op_gc_list, we may end up with ASSERTION FAILED (5 == 6):
> #   ASSERTIFCMP(op->state != FSCACHE_OP_ST_INITIALISED &&
> #               op->state != FSCACHE_OP_ST_COMPLETE,
> #               op->state, ==, FSCACHE_OP_ST_CANCELLED);
>
> Possible fix:
>
> Clearly, adding the page to the to_do list in cachefiles_read_waiter() prior to bumping the reference is the problem. Bump up op->usage before queuing under the object; this ensures the op is not freed (op->usage is at least 1). Then enqueue the operation for the second page (op->usage is 2), and finally decrement op->usage so that the second callback frees properly:
>
> static int cachefiles_read_waiter(wait_queue_t *wait, unsigned mode,
>                                   int sync, void *_key)
> {
>         ...
>
>         spin_lock(&object->work_lock);
> +       fscache_get_retrieval(monitor->op);
>         list_add_tail(&monitor->op_link, &monitor->op->to_do);
>         spin_unlock(&object->work_lock);
>
>         fscache_enqueue_retrieval(monitor->op);
> +       fscache_put_retrieval(monitor->op);
>         return 0;
> }
>
> Thanks,
> Kiran
>
> --
> Kiran
Attachment: cachefiles_fix.patch (binary data)
Attachment: fscache_fix.patch (binary data)
--
Linux-cachefs mailing list
Linux-cachefs@xxxxxxxxxx
https://www.redhat.com/mailman/listinfo/linux-cachefs