RE: [5.10 CAN BUG report] kernel dump about echo skb

[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

 



> -----Original Message-----
> From: Vincent MAILHOL <mailhol.vincent@xxxxxxxxxx>
> Sent: 2021年1月22日 17:52
> To: Oleksij Rempel <o.rempel@xxxxxxxxxxxxxx>
> Cc: Marc Kleine-Budde <mkl@xxxxxxxxxxxxxx>; linux-can@xxxxxxxxxxxxxxx;
> Joakim Zhang <qiangqing.zhang@xxxxxxx>; kernel@xxxxxxxxxxxxxx; Oliver
> Hartkopp <socketcan@xxxxxxxxxxxx>
> Subject: Re: [5.10 CAN BUG report] kernel dump about echo skb
> 
> On Fri 22 Jan 2021 at 18:40, Oleksij Rempel <o.rempel@xxxxxxxxxxxxxx>
> wrote:
> > On Fri, Jan 22, 2021 at 10:23:24AM +0100, Marc Kleine-Budde wrote:
> > > On Fri, Jan 22, 2021 at 06:08:18PM +0900, Vincent MAILHOL wrote:
> > > > > > 1. Kernel dump log:
> > > > > > [  101.688327] ------------[ cut here ]------------ [  101.692968]
> refcount_t:
> > > > > > addition on 0; use-after-free.
> > > >
> > > > The skb already had its refcount at zero when reaching
> > > > can_put_echo_skb(). It is as if it was already freed/consumed!
> > >
> > > ACK
> > >
> > > > If you remove Oleksij’s patch, can_put_echo_skb() will probably
> > > > not clone the skb and thus not check the refcount: this means that
> > > > you will not see the warning, however, it does not necessarily
> > > > mean that the bug did not occur.
> > >
> > > ACK
> > >
> > > > So far, it seems to me to be another bug which was invisible until
> > > > now and which Oleksij’s patch just uncovered. But I do not yet
> > > > fully understand what the root cause could be.
> > >
> > > Or it's the same bug, hitting earlier. Oleksij's backtrace was in
> > > the TX-complete path and the problem was fixes by cloning the skb in before
> TX.
> > > This means the refcount of the original skb was decremented between
> > > TX and TX-complete. Here the refcount is decremented even before TX.
> > >
> > > Does this make sense?
> 
> Makes absolute sense.
> 
> That would mean that Olesij’s patch addressed the issue but too late and that
> somewhere earlier in the TX path, the refcount is either:
>   - decremented where it shouldn’t be
>   - not incremented where it should be
> 
> Assuming that the skb reaches the xmit() funcion, what should be its refcount?
> 2 if the socket is still opened and 1 if the socket is closed?
> Something else?
> 
> > Is socket was closed just before TX?
> 
> Good question!
> In addition, I would like to know which protocol is used (raw_can, j1939,
> other?)



Hi Marc, Oleksij, Vincent, 

Thanks a lot for you looking into this issue. I still can't reproduce it manually, only can be reproduced when run auto scripts.

Below is test scripts for this:

1) Server side:
# Function:     test_can_02
# Description   CAN bitrate support test
# auto-manual   Partially auto
# Run time      5 min
test_can_02()
{
	RC=2
	TCID="test_can_02"
    TST_COUNT=2
    tcid_now=$TCID
	tcid_next="test_can_03"

    echo "=============================="
    echo "=     start $TCID      ="
	server_init || exit $RC
	echo $TCID >> $Log_file
    #STAND_BIT_RATES="50000 20000 10000"
    #BITRATE_DATA=("00 05 00 00" "00 02 00 00" "00 01 00 00")
    STAND_BIT_RATES="1000000 500000 250000 125000 50000 20000 10000"
    BITRATE_DATA=("01 00 00 00" "00 50 00 00" "00 25 00 00" "00 12 50 00" "00 05 00 00" "00 02 00 00" "00 01 00 00")
	local i=0
    bitrate_index=0
    for i in $STAND_BIT_RATES
    do
        echo "Now test bitrate $i"
        ifconfig $CANID down
        ip link set $CANID  type can bitrate $i || exit $RC
        ifconfig $CANID up || exit $RC
        candump $CANID > $Log_file &
        bgpid=$!
        ltime=0
        while [ $ltime -lt 10 ]; do
            over_status=$(cat $Log_file | grep "${BITRATE_DATA[$bitrate_index]}" | cut -d ']' -f '2')
            if [ -n "$over_status" ]; then
                echo "support bitrate $i"
                bitrate_index=`expr $bitrate_index + 1`
                cansend $CANID 777#1122334455667788
                kill -9 $bgpid > /dev/null 2>&1
                break
            fi
            let ltime=ltime+1
            sleep 5
        done
        if [ $ltime -eq 10 ]; then
            tst_resm TFAIL "Don't support $i bitrate !"
            kill -9 $bgpid > /dev/null 2>&1
            cat $Log_file
            exit 1
        fi
    done
    echo $tcid_next >> $Log_file

    tst_resm TPASS "test PASS"

    RC=0
    return $RC
}

2) Client side:
# Function:     test_can_02
# Description   CAN bitrate support test
# auto-manual   Partially auto
# Run time      5 min
test_can_02()
{
	RC=2
	TCID="test_can_02"
    TST_COUNT=2

    echo "=============================="
    echo "=     start $TCID      ="
	client_init || exit $RC
    kill -9 $bgpid > /dev/null 2>&1
    #STAND_BIT_RATES="50000 20000 10000"
    #BITRATE_DATA=(00050000 00020000 00010000)
    STAND_BIT_RATES="1000000 500000 250000 125000 50000 20000 10000"
    BITRATE_DATA=(01000000 00500000 00250000 00125000 00050000 00020000 00010000)
	local i=0
    bitrate_index=0
    for i in $STAND_BIT_RATES
    do
        echo "Now test bitrate $i"
        ifconfig $CANID down
        ip link set $CANID type can bitrate $i || exit $RC
        ifconfig $CANID up
        candump $CANID,777:C00007FF > $Log_file &
        cpid=$!

        candata="123#${BITRATE_DATA[$bitrate_index]}"
        #add more sleep time to make sure server side is ready before sending
        cansend $CANID $candata || exit $RC
        cansend $CANID $candata || exit $RC
        cansend $CANID $candata || exit $RC
        bitrate_index=`expr $bitrate_index + 1`

	    local timeout=10
	    over_status=$(cat $Log_file | grep 777)
	    while [ -z "$over_status" ]
	    do
            if [ $timeout -le 0 ]; then
                rm -rf $Log_file
                kill -9 $cpid > /dev/null 2>&1
                tst_resm TFAIL "test FAIL"
                return $RC
            else
                echo "wait server side to finish the calculate"
                sleep 5
                timeout=$(($timeout - 1))
            fi
	        over_status=$(cat $Log_file | grep 777)
	    done
        echo "test bitrate $i finish"
        kill -9 $cpid > /dev/null 2>&1
        rm -rf $Log_file
    done

    tst_resm TPASS "test PASS"

    RC=0
    return $RC
}

I also suspect it is the same issue as Oleksij reported. During this test case, CAN device up/down much frequently, which could trigger this problem.
What could I do to provide more info?

Best Regards,
Joakim Zhang
> 
> Yours sincerely,
> Vincent
> 
> > Regards,
> > Oleksij
> > --
> > Pengutronix e.K.                           |
> |
> > Steuerwalder Str. 21                       |
> https://eur01.safelinks.protection.outlook.com/?url=http%3A%2F%2Fwww.pe
> ngutronix.de%2F&amp;data=04%7C01%7Cqiangqing.zhang%40nxp.com%7C6d
> 93aec25d224120a07908d8bebb5cf5%7C686ea1d3bc2b4c6fa92cd99c5c301635
> %7C0%7C0%7C637469059210927498%7CUnknown%7CTWFpbGZsb3d8eyJWIj
> oiMC4wLjAwMDAiLCJQIjoiV2luMzIiLCJBTiI6Ik1haWwiLCJXVCI6Mn0%3D%7C30
> 00&amp;sdata=6JQbTYQYPswNnDQlYIa%2FpRlh9tfeZktrqFInOTCVYX0%3D&a
> mp;reserved=0  |
> > 31137 Hildesheim, Germany                  | Phone:
> +49-5121-206917-0    |
> > Amtsgericht Hildesheim, HRA 2686           | Fax:
> +49-5121-206917-5555 |




[Index of Archives]     [Automotive Discussions]     [Linux ARM Kernel]     [Linux ARM]     [Linux Omap]     [Fedora ARM]     [IETF Annouce]     [Security]     [Bugtraq]     [Linux]     [Linux OMAP]     [Linux MIPS]     [eCos]     [Asterisk Internet PBX]     [Linux API]     [CAN Bus]

  Powered by Linux