On 01.03.2018 23:20, Andreas Dilger wrote: > > On Mar 1, 2018, at 9:04 AM, Theodore Ts'o <tytso@xxxxxxx> wrote: >> This doesn't seem to make sense; the PC is where we are currently >> executing, and LR is the "Link Register" where the flow of control >> will be returning after the current function returns, right? Well, >> dx_probe should *not* be returning to __wait_on_bit(). So this just >> seems.... weird. >> >> Ignoring the LR register, this stack trace looks sane... I can't see >> which pointer could be NULL and getting dereferenced, though. How >> easily can you reproduce the problem? Can you either (a) translate >> the PC into a line number, or better yet, if you can reproduce, add a >> series of BUG_ON's so we can see what's going on? Ted, thank you for the suggestion. I don't have a bug-reproducer, it happens only under some IO load and quite randomly. I've applied the BUG_ON()'s, but it may take some time to catch the bug again. >> + BUG_ON(frame); > > I think you mean: > BUG_ON(frame == NULL); > or > BUG_ON(!frame); > > >> memset(frame_in, 0, EXT4_HTREE_LEVEL * sizeof(frame_in[0])); >> frame->bh = ext4_read_dirblock(dir, 0, INDEX); >> if (IS_ERR(frame->bh)) >> return (struct dx_frame *) frame->bh; >> >> + BUG_ON(frame->bh); >> + BUG_ON(frame->bh->b_data); > > Same here. > > BUG_ON(frame->bh == NULL); > BUG_ON(frame->bh->b_data == NULL); > > This is why I don't like implicit "is NULL" or "is non-zero" usage. Lustre > used to require "== NULL" or "!= NULL" to avoid bugs like this, but had to > abandon that because of upstream code style. Well spotted, thanks Andreas. >> root = (struct dx_root *) frame->bh->b_data; >> if (root->info.hash_version != DX_HASH_TEA && >> root->info.hash_version != DX_HASH_HALF_MD4 && >> root->info.hash_version != DX_HASH_LEGACY) { >> >> These are "could never" happen scenarios from looking at the code, but >> that will help explain what is going on. >> >> If this is reliably only happening with mq, the only way I could see >> that if is something is returning an error when it previously wasn't. >> This isn't a problem we're seeing with any of our testing, though.