On 22 February 2018 at 08:33, <carmark.dlut@xxxxxxxxx> wrote: > From: Lei Xue <carmark.dlut@xxxxxxxxx> > > There is a potential race in fscache operation enqueuing for reading and > copying multiple pages from cachefiles to netfs. > Under some heavy load system, it will happen very often. > > If this race occurs, an oops similar to the following is seen: > > kernel BUG at fs/fscache/operation.c:69! > invalid opcode: 0000 [#1] SMP > … > #0 [ffff883fff0838d8] machine_kexec at ffffffff81051beb > #1 [ffff883fff083938] crash_kexec at ffffffff810f2542 > #2 [ffff883fff083a08] oops_end at ffffffff8163e1a8 > #3 [ffff883fff083a30] die at ffffffff8101859b > #4 [ffff883fff083a60] do_trap at ffffffff8163d860 > #5 [ffff883fff083ab0] do_invalid_op at ffffffff81015204 > #6 [ffff883fff083b60] invalid_op at ffffffff8164701e > [exception RIP: fscache_enqueue_operation+246] > RIP: ffffffffa0b793c6 RSP: ffff883fff083c18 RFLAGS: 00010046 > RAX: 0000000000000019 RBX: ffff8832ed1a9ec0 RCX: 0000000000000006 > RDX: 0000000000000000 RSI: 0000000000000046 RDI: 0000000000000046 > RBP: ffff883fff083c20 R8: 0000000000000086 R9: 000000000000178f > R10: ffffffff816aeb00 R11: ffff883fff08392e R12: ffff8802f0525620 > R13: ffff88407ffc01d8 R14: 0000000000000000 R15: 0000000000000003 > ORIG_RAX: ffffffffffffffff CS: 0010 SS: 0000 > #7 [ffff883fff083c10] fscache_enqueue_operation at ffffffffa0b793c6 > #8 [ffff883fff083c28] cachefiles_read_waiter at ffffffffa0b15a48 > #9 [ffff883fff083c48] __wake_up_common at ffffffff810af028 > > Signed-off-by: Lei Xue <carmark.dlut@xxxxxxxxx> > --- > fs/cachefiles/rdwr.c | 2 +- > 1 file changed, 1 insertion(+), 1 deletion(-) > > diff --git a/fs/cachefiles/rdwr.c b/fs/cachefiles/rdwr.c > index 883bc7bb12c5..9d5d13e150fb 100644 > --- a/fs/cachefiles/rdwr.c > +++ b/fs/cachefiles/rdwr.c > @@ -58,9 +58,9 @@ static int cachefiles_read_waiter(wait_queue_entry_t *wait, unsigned mode, > > spin_lock(&object->work_lock); > list_add_tail(&monitor->op_link, &monitor->op->to_do); > + fscache_enqueue_retrieval(monitor->op); > spin_unlock(&object->work_lock); > > - fscache_enqueue_retrieval(monitor->op); > return 0; > } Hi, Just wondering what the status of this patch is? We've been hitting a similar problem and arrived at the same patch as a potential fix for it. Our crashes look like this: WARNING: CPU: 0 PID: 120693 at kernel/workqueue.c:618 insert_work+0x5f/0x70 Modules linked in: nbd CPU: 0 PID: 120693 Comm: sh Not tainted 4.16.2-0 #1 Hardware name: Oracle Corporation Sun Fire X4800/20434, BIOS 11080200 08/12/2016 RIP: 0010:insert_work+0x5f/0x70 RSP: 0018:ffff88103fa039b8 EFLAGS: 00010046 RAX: ffff88103f443f00 RBX: ffff880187c37c00 RCX: 0000000000000005 RDX: ffff880187c37c20 RSI: ffff8807c04dec00 RDI: 0000000000000000 RBP: ffff88103fa039c8 R08: 0000000000000101 R09: 0000000000000001 R10: ffff887eee68fd40 R11: 0000000000000001 R12: ffff88503fafc600 R13: 000000000001cf60 R14: ffff880187c37c00 R15: ffff88103f443f00 FS: 0000000000000000(0000) GS:ffff88103fa00000(0000) knlGS:0000000000000000 CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 CR2: 00007f394d2780a0 CR3: 0000000bcc936000 CR4: 00000000000006f0 Call Trace: <IRQ> __queue_work+0x11f/0x320 queue_work_on+0x19/0x30 fscache_enqueue_operation+0x83/0x160 cachefiles_read_waiter+0xd2/0x130 __wake_up_common+0x81/0x120 __wake_up_locked_key_bookmark+0x16/0x20 wake_up_page_bit+0x97/0xe0 unlock_page+0x20/0x30 page_endio+0x21/0xa0 mpage_end_io+0x41/0x60 bio_endio+0x78/0x90 dec_pending+0x140/0x250 ? linear_status+0x40/0x40 clone_endio+0x86/0x100 bio_endio+0x78/0x90 blk_update_request+0x8d/0x2b0 scsi_end_request+0x36/0x200 scsi_io_completion+0x12a/0x5e0 scsi_finish_command+0xf2/0x150 scsi_softirq_done+0x13e/0x160 __blk_mq_complete_request+0xb8/0x180 blk_mq_complete_request+0x57/0x70 scsi_mq_done+0x10/0x20 megasas_complete_cmd+0xdf/0x620 megasas_complete_cmd_dpc+0x8f/0x100 tasklet_action+0x9a/0xb0 __do_softirq+0xbf/0x1c8 irq_exit+0x9c/0xb0 do_IRQ+0x5b/0xe0 common_interrupt+0xf/0xf </IRQ> RIP: 0010:_raw_spin_unlock_irqrestore+0x9/0x10 RSP: 0018:ffffc900309e3cf8 EFLAGS: 00000296 ORIG_RAX: ffffffffffffffde RAX: 0000000000000002 RBX: 0000000000000002 RCX: 0000000000000001 RDX: ffffea0006793fe0 RSI: 0000000000000296 RDI: ffff88107ffff800 RBP: ffffc900309e3cf8 R08: 0000000000000002 R09: 000000000011b912 R10: 00000000000000e7 R11: 0000000000000000 R12: ffffea0014baa000 R13: ffff88103fa1d120 R14: ffff88107fff6000 R15: ffff88107fff6000 pagevec_lru_move_fn+0xb7/0xe0 ? pagevec_move_tail_fn+0x350/0x350 __pagevec_lru_add+0x12/0x20 lru_add_drain_cpu+0xc4/0xe0 lru_add_drain+0x10/0x20 exit_mmap+0x58/0x190 ? __handle_mm_fault+0x9a4/0x1540 ? hrtimer_try_to_cancel+0x1b/0xa0 mmput+0x4e/0x100 do_exit+0x22f/0xa10 do_group_exit+0x3a/0xa0 SyS_exit_group+0x12/0x20 do_syscall_64+0x61/0x110 entry_SYSCALL_64_after_hwframe+0x3d/0xa2 RIP: 0033:0x7f394d325fa8 RSP: 002b:00007ffda407e668 EFLAGS: 00000246 ORIG_RAX: 00000000000000e7 RAX: ffffffffffffffda RBX: 00007f394d603830 RCX: 00007f394d325fa8 RDX: 0000000000000000 RSI: 000000000000003c RDI: 0000000000000000 RBP: 0000000000000000 R08: 00000000000000e7 R09: ffffffffffffffa8 R10: 00007f394d6097e0 R11: 0000000000000246 R12: 00007f394d603830 R13: 00007ffda407f8b2 R14: 0000000000000000 R15: 0000000000000000 Code: 48 89 4e 10 4c 89 01 e8 30 f8 ff ff f0 83 44 24 fc 00 8b 83 40 03 00 00 85 c0 75 08 f0 ff ff 48 83 c4 08 5b 5d c3 <0f> 0b eb b7 66 66 66 66 2e 0f 1f 84 00 00 00 00 00 55 48 89 e5 ---[ end trace e5ddade8fd0233c4 ]--- Some inconclusive notes, FWIW: First we get the workqueue warning where it's hitting the warning in set_work_data() where it checks that the work is not pending: static inline void set_work_data(struct work_struct *work, unsigned long data, unsigned long flags) { WARN_ON_ONCE(!work_pending(work)); atomic_long_set(&work->data, data | flags | work_static(work)); } This is odd because the flag should have been set just a few moments before, in queue_work_on(): bool queue_work_on(int cpu, struct workqueue_struct *wq, struct work_struct *work) { bool ret = false; unsigned long flags; local_irq_save(flags); if (!test_and_set_bit(WORK_STRUCT_PENDING_BIT, work_data_bits(work))) { __queue_work(cpu, wq, work); ret = true; } local_irq_restore(flags); return ret; } Then we have the next part giving a few more clues: FS-Cache: FS-Cache: Assertion failed FS-Cache: 6 == 5 is false ------------[ cut here ]------------ kernel BUG at fs/fscache/operation.c:494! invalid opcode: 0000 [#1] SMP Modules linked in: nbd CPU: 1 PID: 122494 Comm: kworker/u257:3 Tainted: G W 4.16.2-0 #1 Hardware name: Oracle Corporation Sun Fire X4800/20434, BIOS 11080200 08/12/2016 Workqueue: fscache_operation fscache_op_work_func RIP: 0010:fscache_put_operation+0x1e3/0x1f0 RSP: 0018:ffffc9002ff8be08 EFLAGS: 00010296 RAX: 0000000000000019 RBX: ffff8807c04dec00 RCX: 0000000000000000 RDX: 0000000000000001 RSI: ffff88103fa56398 RDI: ffff88103fa56398 RBP: ffffc9002ff8be18 R08: 0000000000000000 R09: 0000000000000771 R10: ffffffff820afc80 R11: 0000000000000770 R12: ffff8807c04dec00 R13: ffff88103f443f00 R14: 0000000000000000 R15: ffff880187c37c00 FS: 0000000000000000(0000) GS:ffff88103fa40000(0000) knlGS:0000000000000000 CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 CR2: 0000000002343008 CR3: 000000000240a000 CR4: 00000000000006e0 Call Trace: fscache_op_work_func+0x26/0x50 process_one_work+0x131/0x290 worker_thread+0x45/0x360 kthread+0xf8/0x130 ? create_worker+0x190/0x190 ? kthread_cancel_work_sync+0x10/0x10 ret_from_fork+0x1f/0x30 Code: 31 c0 e8 26 11 ef ff 48 c7 c7 92 ee 2c 82 31 c0 e8 18 11 ef ff 8b 73 40 ba 05 00 00 ad 2a 82 31 c0 e8 02 11 ef ff <0f> 0b 66 66 2e 0f 1f 84 00 00 00 00 00 55 48 89 e5 53 48 8 RIP: fscache_put_operation+0x1e3/0x1f0 RSP: ffffc9002ff8be08 ---[ end trace e5ddade8fd0233c5 ]--- The assertion comes from fscache_put_operation(): ASSERTIFCMP(op->state != FSCACHE_OP_ST_INITIALISED && op->state != FSCACHE_OP_ST_COMPLETE, op->state, ==, FSCACHE_OP_ST_CANCELLED); It expects op->state to be either INITIALISED, COMPLETE, or CANCELLED, but it is in fact DEAD.The only place that sets it to DEAD is fscache_put_operation() itself (a few lines after this assert). So we somehow have an operation that manages to have its refcount drop to 0 twice in a row -- without hitting the assertion that checks for a positive refcount. We can tell from the first stack trace that we were just in fscache_enqueue_operation(), which has the following code: atomic_inc(&op->usage); if (!queue_work(fscache_op_wq, &op->work)) fscache_put_operation(op); The reason we don't hit the positive refcount assertion is that the atomic_inc() above probably brings it from 0 to 1. It does smell a lot like a race causing premature free + use-after-free explaining the weird diagnostics. It takes about 15-20 minutes to reproduce the crash, with the patch above it's gone, but we can test alternative patches. Vegard -- Linux-cachefs mailing list Linux-cachefs@xxxxxxxxxx https://www.redhat.com/mailman/listinfo/linux-cachefs