On Thu, 2018-03-08 at 16:57 +0100, bigeasy@xxxxxxxxxxxxx wrote:
> On 2018-03-07 08:26:40 [+0000], Roosen Henri wrote:
> > > The backtrace suggests that the rq lock is taken with interrupts
> > > disabled and then with interrupts enabled. But based on the
> > > call-trace it should be with interrupts disabled in both cases.
> > > I do have an imx6q running hackbench on a regular basis and I
> > > haven't seen this. Do you see this backtrace on every hackbench
> > > invocation or just after some time? The uptime suggests after
> > > ~5 hours.
> >
> > The crashes are not seen at every invocation of hackbench, but
> > unfortunately only after anywhere from a few hours to a few weeks.
> > The test is based on a "dohell.sh" script from the Xenomai project.
>
> I kicked off a hackbench on an imx6 with your config.

Thanks Sebastian for looking into this!

> Is the backtrace, that you receive from lockdep, always the same or
> is it different sometimes?

It is different each time, so my gut feeling tells me it might be a
memory corruption of some kind, maybe caused by a use-after-free or
similar.

I restarted the target yesterday evening and this morning it was
frozen without any trace on the terminal. Attaching a JTAG debugger
showed different call-stacks than yesterday: core #2 (trying to print
the info to the terminal) and core #3 were spinning on a spin-lock; I
don't understand what cores #0 and #1 were doing.

Core#0:
__vectors_start() at entry-armv.S:1,234 0xffff000c
__pabt_svc() at entry-armv.S:319 0xc010cd84

Core#1:
__vectors_start() at entry-armv.S:1,234 0xffff000c
__pabt_svc() at entry-armv.S:319 0xc010cd84

Core#2:
arch_spin_lock() at spinlock.h:77 0xc017bf38
do_raw_spin_lock() at spinlock_debug.c:115 0xc017bf38
__raw_spin_lock() at spinlock_api_smp.h:143 0xc0996158
_raw_spin_lock() at spinlock.c:155 0xc0996158
vprintk_emit() at printk.c:1,804 0xc0183a8c
vprintk_default() at printk.c:1,906 0xc0183e8c
vprintk_func() at printk_safe.c:382 0xc0185764
printk() at printk.c:1,939 0xc0184fc4
__warn() at panic.c:523 0xc0123754
warn_slowpath_fmt() at panic.c:567 0xc0123870

Core#3:
arch_spin_lock() at spinlock.h:77 0xc017bf38
do_raw_spin_lock() at spinlock_debug.c:115 0xc017bf38
__raw_spin_lock() at spinlock_api_smp.h:143 0xc0996158
_raw_spin_lock() at spinlock.c:155 0xc0996158
__task_rq_lock() at core.c:105 0xc015347c
rt_mutex_setprio() at core.c:3,792 0xc0157148
rt_mutex_adjust_prio() at rtmutex.c:380 0xc017a100
task_blocks_on_rt_mutex() at rtmutex.c:1,374 0xc017a780
rt_spin_lock_slowlock_locked() at rtmutex.c:1,061 0xc09941ec
rt_spin_lock_slowlock() at rtmutex.c:1,116 0xc099445c
rt_spin_lock_fastlock() at rtmutex.c:979 0xc09966c4
rt_spin_lock() at rtmutex.c:1,146 0xc09966c4
skb_queue_tail() at skbuff.c:2,915 0xc076eb08
unix_stream_sendmsg() at af_unix.c:1,908 0xc081dcf8
sock_sendmsg_nosec() at socket.c:639 0xc0760214
sock_sendmsg() at socket.c:643 0xc0760214
sock_write_iter() at socket.c:912 0xc07602b8
call_write_iter() at fs.h:1,771 0xc0255c68
new_sync_write() at read_write.c:469 0xc0255c68
__vfs_write() at read_write.c:482 0xc0255c68
vfs_write() at read_write.c:544 0xc0255e2c
SYSC_write() at read_write.c:589 0xc0255fd4
SyS_write() at read_write.c:581 0xc0255fd4
0xc01074e0

Most of the time the call-stacks start at SyS_write() or SyS_read()
from hackbench.
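For context on the Core#3 trace: the lock it blocks on is the one
embedded in the peer socket's receive queue, and on PREEMPT_RT that
spinlock_t is a sleeping rt_mutex, so the rt_mutex_setprio() /
__task_rq_lock() part above is just priority inheritance on that
queue lock. If the "other" pointer unix_stream_sendmsg() passes in
were stale, this path would be locking already-freed memory, which
would fit the random corruption I'm seeing. Paraphrased from
net/core/skbuff.c (so not guaranteed to match the v4.9 source line
for line):

/* unix_stream_sendmsg() calls this roughly as
 *	skb_queue_tail(&other->sk_receive_queue, skb);
 * i.e. the lock being taken lives inside the peer socket "other".
 */
void skb_queue_tail(struct sk_buff_head *list, struct sk_buff *newsk)
{
	unsigned long flags;

	/* spinlock_t, hence a sleeping rt_mutex on PREEMPT_RT */
	spin_lock_irqsave(&list->lock, flags);
	__skb_queue_tail(list, newsk);
	spin_unlock_irqrestore(&list->lock, flags);
}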
Some things I found out by testing on v4.9:
- minimum test to reproduce the problem:
  "while true; do hackbench -g 100 -l 1000; done &"
- reproducible with "hackbench -T" (threads)
- reproducible only on iMX6Q, not (yet) on iMX6S, iMX6D
- NOT reproducible with "hackbench -p" (pipes)

Since that might point towards the streaming unix socketpair that
hackbench uses from multiple forked processes, I had a look at
net/unix/af_unix.c and wondered why unix_stream_sendmsg() doesn't
increase the reference count on the "other" socket the same way
unix_dgram_sendmsg() does. I don't see a reason why "other" is
handled differently in the two functions, so it smells fishy to me.
But I'm not familiar with the net-code, so maybe you could review
whether the diff below makes sense (the P.S. at the end shows what
the two accessors look like, for reference):

diff --git a/net/unix/af_unix.c b/net/unix/af_unix.c
index 7f46bab4ce5c..5f2ca91bc54d 100644
--- a/net/unix/af_unix.c
+++ b/net/unix/af_unix.c
@@ -1854,7 +1854,7 @@ static int unix_stream_sendmsg(struct socket *sock, struct msghdr *msg,
 		goto out_err;
 	} else {
 		err = -ENOTCONN;
-		other = unix_peer(sk);
+		other = unix_peer_get(sk);
 		if (!other)
 			goto out_err;
 	}
@@ -1910,7 +1910,7 @@ static int unix_stream_sendmsg(struct socket *sock, struct msghdr *msg,
 		other->sk_data_ready(other);
 		sent += size;
 	}
-
+	sock_put(other);
 	scm_destroy(&scm);
 
 	return sent;
@@ -1923,6 +1923,8 @@ static int unix_stream_sendmsg(struct socket *sock, struct msghdr *msg,
 		send_sig(SIGPIPE, current, 0);
 	err = -EPIPE;
 out_err:
+	if (other)
+		sock_put(other);
 	scm_destroy(&scm);
 	return sent ? : err;
 }
@@ -1934,13 +1936,13 @@ static ssize_t unix_stream_sendpage(struct socket *socket, struct page *page,
 	bool send_sigpipe = false;
 	bool init_scm = true;
 	struct scm_cookie scm;
-	struct sock *other, *sk = socket->sk;
+	struct sock *other = NULL, *sk = socket->sk;
 	struct sk_buff *skb, *newskb = NULL, *tail = NULL;
 
 	if (flags & MSG_OOB)
 		return -EOPNOTSUPP;
 
-	other = unix_peer(sk);
+	other = unix_peer_get(sk);
 	if (!other || sk->sk_state != TCP_ESTABLISHED)
 		return -ENOTCONN;
 
@@ -2027,6 +2029,7 @@ static ssize_t unix_stream_sendpage(struct socket *socket, struct page *page,
 	mutex_unlock(&unix_sk(other)->iolock);
 
 	other->sk_data_ready(other);
+	sock_put(other);
 	scm_destroy(&scm);
 	return size;
 
@@ -2038,6 +2041,8 @@ static ssize_t unix_stream_sendpage(struct socket *socket, struct page *page,
 	kfree_skb(newskb);
 	if (send_sigpipe && !(flags & MSG_NOSIGNAL))
 		send_sig(SIGPIPE, current, 0);
+	if (other)
+		sock_put(other);
 	if (!init_scm)
 		scm_destroy(&scm);
 	return err;

> Sebastian

Thanks,
Henri
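P.S. For reference, the two peer accessors the diff switches between
look roughly like this (paraphrased from net/unix/af_unix.c, the
exact v4.9 text may differ slightly). unix_peer() just reads the
cached peer pointer, while unix_peer_get(), which unix_dgram_sendmsg()
uses, takes a reference under unix_state_lock(); the added sock_put()
calls in the diff drop that reference again on every return path:

#define unix_peer(sk) (unix_sk(sk)->peer)

/* Grab the peer under unix_state_lock() and take a reference on it,
 * so the caller can keep using "other" without it being freed
 * underneath; must be balanced with sock_put(). */
static struct sock *unix_peer_get(struct sock *s)
{
	struct sock *peer;

	unix_state_lock(s);
	peer = unix_peer(s);
	if (peer)
		sock_hold(peer);
	unix_state_unlock(s);

	return peer;
}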