> -----Original Message----- > From: Vincent MAILHOL <mailhol.vincent@xxxxxxxxxx> > Sent: 2021年1月22日 17:52 > To: Oleksij Rempel <o.rempel@xxxxxxxxxxxxxx> > Cc: Marc Kleine-Budde <mkl@xxxxxxxxxxxxxx>; linux-can@xxxxxxxxxxxxxxx; > Joakim Zhang <qiangqing.zhang@xxxxxxx>; kernel@xxxxxxxxxxxxxx; Oliver > Hartkopp <socketcan@xxxxxxxxxxxx> > Subject: Re: [5.10 CAN BUG report] kernel dump about echo skb > > On Fri 22 Jan 2021 at 18:40, Oleksij Rempel <o.rempel@xxxxxxxxxxxxxx> > wrote: > > On Fri, Jan 22, 2021 at 10:23:24AM +0100, Marc Kleine-Budde wrote: > > > On Fri, Jan 22, 2021 at 06:08:18PM +0900, Vincent MAILHOL wrote: > > > > > > 1. Kernel dump log: > > > > > > [ 101.688327] ------------[ cut here ]------------ [ 101.692968] > refcount_t: > > > > > > addition on 0; use-after-free. > > > > > > > > The skb already had its refcount at zero when reaching > > > > can_put_echo_skb(). It is as if it was already freed/consumed! > > > > > > ACK > > > > > > > If you remove Oleksij’s patch, can_put_echo_skb() will probably > > > > not clone the skb and thus not check the refcount: this means that > > > > you will not see the warning, however, it does not necessarily > > > > mean that the bug did not occur. > > > > > > ACK > > > > > > > So far, it seems to me to be another bug which was invisible until > > > > now and which Oleksij’s patch just uncovered. But I do not yet > > > > fully understand what the root cause could be. > > > > > > Or it's the same bug, hitting earlier. Oleksij's backtrace was in > > > the TX-complete path and the problem was fixes by cloning the skb in before > TX. > > > This means the refcount of the original skb was decremented between > > > TX and TX-complete. Here the refcount is decremented even before TX. > > > > > > Does this make sense? > > Makes absolute sense. > > That would mean that Olesij’s patch addressed the issue but too late and that > somewhere earlier in the TX path, the refcount is either: > - decremented where it shouldn’t be > - not incremented where it should be > > Assuming that the skb reaches the xmit() funcion, what should be its refcount? > 2 if the socket is still opened and 1 if the socket is closed? > Something else? > > > Is socket was closed just before TX? > > Good question! > In addition, I would like to know which protocol is used (raw_can, j1939, > other?) Hi Marc, Oleksij, Vincent, Thanks a lot for you looking into this issue. I still can't reproduce it manually, only can be reproduced when run auto scripts. Below is test scripts for this: 1) Server side: # Function: test_can_02 # Description CAN bitrate support test # auto-manual Partially auto # Run time 5 min test_can_02() { RC=2 TCID="test_can_02" TST_COUNT=2 tcid_now=$TCID tcid_next="test_can_03" echo "==============================" echo "= start $TCID =" server_init || exit $RC echo $TCID >> $Log_file #STAND_BIT_RATES="50000 20000 10000" #BITRATE_DATA=("00 05 00 00" "00 02 00 00" "00 01 00 00") STAND_BIT_RATES="1000000 500000 250000 125000 50000 20000 10000" BITRATE_DATA=("01 00 00 00" "00 50 00 00" "00 25 00 00" "00 12 50 00" "00 05 00 00" "00 02 00 00" "00 01 00 00") local i=0 bitrate_index=0 for i in $STAND_BIT_RATES do echo "Now test bitrate $i" ifconfig $CANID down ip link set $CANID type can bitrate $i || exit $RC ifconfig $CANID up || exit $RC candump $CANID > $Log_file & bgpid=$! ltime=0 while [ $ltime -lt 10 ]; do over_status=$(cat $Log_file | grep "${BITRATE_DATA[$bitrate_index]}" | cut -d ']' -f '2') if [ -n "$over_status" ]; then echo "support bitrate $i" bitrate_index=`expr $bitrate_index + 1` cansend $CANID 777#1122334455667788 kill -9 $bgpid > /dev/null 2>&1 break fi let ltime=ltime+1 sleep 5 done if [ $ltime -eq 10 ]; then tst_resm TFAIL "Don't support $i bitrate !" kill -9 $bgpid > /dev/null 2>&1 cat $Log_file exit 1 fi done echo $tcid_next >> $Log_file tst_resm TPASS "test PASS" RC=0 return $RC } 2) Client side: # Function: test_can_02 # Description CAN bitrate support test # auto-manual Partially auto # Run time 5 min test_can_02() { RC=2 TCID="test_can_02" TST_COUNT=2 echo "==============================" echo "= start $TCID =" client_init || exit $RC kill -9 $bgpid > /dev/null 2>&1 #STAND_BIT_RATES="50000 20000 10000" #BITRATE_DATA=(00050000 00020000 00010000) STAND_BIT_RATES="1000000 500000 250000 125000 50000 20000 10000" BITRATE_DATA=(01000000 00500000 00250000 00125000 00050000 00020000 00010000) local i=0 bitrate_index=0 for i in $STAND_BIT_RATES do echo "Now test bitrate $i" ifconfig $CANID down ip link set $CANID type can bitrate $i || exit $RC ifconfig $CANID up candump $CANID,777:C00007FF > $Log_file & cpid=$! candata="123#${BITRATE_DATA[$bitrate_index]}" #add more sleep time to make sure server side is ready before sending cansend $CANID $candata || exit $RC cansend $CANID $candata || exit $RC cansend $CANID $candata || exit $RC bitrate_index=`expr $bitrate_index + 1` local timeout=10 over_status=$(cat $Log_file | grep 777) while [ -z "$over_status" ] do if [ $timeout -le 0 ]; then rm -rf $Log_file kill -9 $cpid > /dev/null 2>&1 tst_resm TFAIL "test FAIL" return $RC else echo "wait server side to finish the calculate" sleep 5 timeout=$(($timeout - 1)) fi over_status=$(cat $Log_file | grep 777) done echo "test bitrate $i finish" kill -9 $cpid > /dev/null 2>&1 rm -rf $Log_file done tst_resm TPASS "test PASS" RC=0 return $RC } I also suspect it is the same issue as Oleksij reported. During this test case, CAN device up/down much frequently, which could trigger this problem. What could I do to provide more info? Best Regards, Joakim Zhang > > Yours sincerely, > Vincent > > > Regards, > > Oleksij > > -- > > Pengutronix e.K. | > | > > Steuerwalder Str. 21 | > https://eur01.safelinks.protection.outlook.com/?url=http%3A%2F%2Fwww.pe > ngutronix.de%2F&data=04%7C01%7Cqiangqing.zhang%40nxp.com%7C6d > 93aec25d224120a07908d8bebb5cf5%7C686ea1d3bc2b4c6fa92cd99c5c301635 > %7C0%7C0%7C637469059210927498%7CUnknown%7CTWFpbGZsb3d8eyJWIj > oiMC4wLjAwMDAiLCJQIjoiV2luMzIiLCJBTiI6Ik1haWwiLCJXVCI6Mn0%3D%7C30 > 00&sdata=6JQbTYQYPswNnDQlYIa%2FpRlh9tfeZktrqFInOTCVYX0%3D&a > mp;reserved=0 | > > 31137 Hildesheim, Germany | Phone: > +49-5121-206917-0 | > > Amtsgericht Hildesheim, HRA 2686 | Fax: > +49-5121-206917-5555 |