Re: [PATCH 1/4] cachefiles: Fix assertion "6 == 5 is false" at fs/fscache/operation.c:494

Vegard Nossum <vegard.nossum@xxxxxxxxx> · Fri, 6 Jul 2018 10:31:24 +0200

On 6 July 2018 at 01:45, NeilBrown <neilb@xxxxxxxx> wrote:
> On Thu, Jul 05 2018, David Howells wrote:
>
>> From: kiran modukuri <kiran.modukuri@xxxxxxxxx>
>>
>> There is a potential race in fscache operation enqueuing for reading and
>> copying multiple pages from cachefiles to netfs.
>> On a heavily loaded system, it happens very often.
>>
>> If this race occurs, an oops similar to the following is seen:
>>
>>  kernel BUG at fs/fscache/operation.c:69!
>>  invalid opcode: 0000 [#1] SMP
>>  ...
>>  #0 [ffff883fff0838d8] machine_kexec at ffffffff81051beb
>>  #1 [ffff883fff083938] crash_kexec at ffffffff810f2542
>>  #2 [ffff883fff083a08] oops_end at ffffffff8163e1a8
>>  #3 [ffff883fff083a30] die at ffffffff8101859b
>>  #4 [ffff883fff083a60] do_trap at ffffffff8163d860
>>  #5 [ffff883fff083ab0] do_invalid_op at ffffffff81015204
>>  #6 [ffff883fff083b60] invalid_op at ffffffff8164701e
>>     [exception RIP: fscache_enqueue_operation+246]
>>     RIP: ffffffffa0b793c6  RSP: ffff883fff083c18  RFLAGS: 00010046
>>     RAX: 0000000000000019  RBX: ffff8832ed1a9ec0  RCX: 0000000000000006
>>     RDX: 0000000000000000  RSI: 0000000000000046  RDI: 0000000000000046
>>     RBP: ffff883fff083c20   R8: 0000000000000086   R9: 000000000000178f
>>     R10: ffffffff816aeb00  R11: ffff883fff08392e  R12: ffff8802f0525620
>>     R13: ffff88407ffc01d8  R14: 0000000000000000  R15: 0000000000000003
>>     ORIG_RAX: ffffffffffffffff  CS: 0010  SS: 0000
>>  #7 [ffff883fff083c10] fscache_enqueue_operation at ffffffffa0b793c6
>>  #8 [ffff883fff083c28] cachefiles_read_waiter at ffffffffa0b15a48
>>  #9 [ffff883fff083c48] __wake_up_common at ffffffff810af028
>>
>> Reported-by: Lei Xue <carmark.dlut@xxxxxxxxx>
>> Reported-by: Vegard Nossum <vegard.nossum@xxxxxxxxx>
>> Reported-by: Anthony DeRobertis <aderobertis@xxxxxxxxxxx>
>> Reported-by: NeilBrown <neilb@xxxxxxxx>
>> Reported-by: Daniel Axtens <dja@xxxxxxxxxx>
>> Reported-by: KiranKumar Modukuri <kiran.modukuri@xxxxxxxxx>
>> Signed-off-by: David Howells <dhowells@xxxxxxxxxx>
>> ---

[...]

> Thanks - I like this approach.  Taking the extra reference makes it a
> lot more clear what is happening and why.

The changelog is a bit sparse, no? We have more info here:

https://lkml.org/lkml/2018/5/8/520
https://lkml.org/lkml/2018/7/3/1184

Why not crib some of that and explain the issue properly (or at
minimum link the previous threads)?

Thanks,

Vegard

--
Linux-cachefs mailing list
Linux-cachefs@xxxxxxxxxx
https://www.redhat.com/mailman/listinfo/linux-cachefs