Re: inconsistent lock state on v4.14.20-rt17

Roosen Henri <Henri.Roosen@xxxxxxxxxxxxx> · Fri, 9 Mar 2018 09:47:16 +0000

On Thu, 2018-03-08 at 19:00 +0100, bigeasy@xxxxxxxxxxxxx wrote:
> On 2018-03-08 17:38:59 [+0000], Roosen Henri wrote:
> > > Is the backtrace, that you receive from lockdep, always the same
> > > or is it different sometimes?
> > 
> > It is different each time. So my gut feeling tells me it might be a
> > memory corruption of some kind.. maybe caused by a use after free or
> > so..
> 
> CONFIG_SLUB_DEBUG_ON should (or could) catch this.

Thanks for pointing that out! I'll enable this for the next test run.
If there are more debug options which are of interest to switch on,
then please let me know.
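
For the next run I'm thinking of a config fragment along these lines.
This is only a sketch: apart from CONFIG_SLUB_DEBUG_ON, the selection is
my own assumption and not something verified on this board, and the
extra checks will cost performance and may change the timing of the
failure.

```
# Suggested above: turn on SLUB debugging for all caches by default
CONFIG_SLUB_DEBUG=y
CONFIG_SLUB_DEBUG_ON=y
# Assumed additions: sanity checks on list handling and tracked objects
CONFIG_DEBUG_LIST=y
CONFIG_DEBUG_OBJECTS=y
CONFIG_DEBUG_OBJECTS_FREE=y
```

With only CONFIG_SLUB_DEBUG=y set, the same checks can instead be
enabled at boot time via the "slub_debug" kernel command-line
parameter, e.g. "slub_debug=FZPU".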

> 
> > I restarted the target yesterday evening and this morning it was
> > frozen without any trace on the terminal. Attaching a JTAG showed
> > different call-stacks than yesterday; Core #2 (trying to print the
> > info to the terminal) and #3 were spinning on a spin-lock, don't
> > understand what core #0 and #1 were doing.
> 
> maybe #0 and #1 are idle but #2 and #3 should make progress. #2 looks
> like a warning, do you know where it is from or is this everything you
> get? Unless the warning comes from an atomic context you should see
> something on the UART.

#2 and #3 were not making progress; they kept spinning in
arch_spin_lock().

> 
> > Most of the times the call-stacks start at SyS_write() or SyS_read()
> > from hackbench.
> 
> but what you posted was lockdep complaining about RQ lock.

Well, actually what I reported was "since 4.9 we've been chasing random
kernel crashes", and v4.14 has now caught an inconsistent lock state.
The hope was that the trace for the inconsistent lock state would point
to the root cause of the random kernel crashes.

> 
> > Some things I found out by testing on v4.9:
> > - minimum test to reproduce problem:
> >   "while true; do hackbench -g 100 -l 1000; done &"
> > - reproducible with "hackbench -T" (threads)
> > - reproducible only on iMX6Q, not (yet) on iMX6S, iMX6D
> > - NOT reproducible with "hackbench -p" (pipes)
> 
> interesting.
> 
> > As that might be pointing towards the streaming unix socketpair
> > hackbench is using from multiple forked processes, I had a look at
> > net/unix/af_unix.c and wondered why unix_stream_sendmsg() doesn't
> > increase the reference count on the "other" socket the same as
> > unix_dgram_sendmsg() does. I don't see a reason why "other" is
> > handled differently in both functions, so it smells fishy to me. But
> > I'm not familiar with the net-code, so maybe you could review if the
> > diff below would make sense:
> 
> Commit 830a1e5c212f ("[AF_UNIX]: Remove superfluous reference counting
> in unix_stream_sendmsg") claims that this is not required. But if your
> patch makes a difference then…

Okay, I didn't know the refcounting could be safely removed. The
overnight test with the change reproduced the inconsistent lock state
again, which indeed shows that it makes no difference.

> 
> Sebastian

Thanks,
Henri