RE: BUG: Reordering of L2CAP connection pending/accepted replies

"Ilia, Kolominsky" <iliak@xxxxxx> · Tue, 27 Dec 2011 11:58:54 +0000

Hi Marcel

> Hi Ilia,
> 
> > > > I have encountered an incorrect behavior of l2cap connection
> > > > establishment mechanism when handling an incoming connection
> > > > request:
> > > >
> > > > > ACL data: handle 1 flags 0x02 dlen 12
> > > >     L2CAP(s): Connect req: psm 23 scid 0x0083
> > > > < ACL data: handle 1 flags 0x00 dlen 16
> > > >     L2CAP(s): Connect rsp: dcid 0x0040 scid 0x0083 result 0 status 0
> > > >       Connection successful
> > > > < HCI Command: Exit Sniff Mode (0x02|0x0004) plen 2
> > > >     handle 1
> > > > < ACL data: handle 1 flags 0x00 dlen 12
> > > >     L2CAP(s): Config req: dcid 0x0083 flags 0x00 clen 0
> > > > > HCI Event: Mode Change (0x14) plen 6
> > > >     status 0x00 handle 1 mode 0x00 interval 0
> > > >     Mode: Active
> > > > < ACL data: handle 1 flags 0x00 dlen 16
> > > >     L2CAP(s): Connect rsp: dcid 0x0040 scid 0x0083 result 1 status 2
> > > >       Connection pending - Authorization pending
> > > >
> > > > After analyzing the code, it seems to me that there is indeed a
> > > > clear possibility that replies will egress out of order on
> > > > multicore systems:
> > > >
> > > > CPU0 (Tasklet: hci_rx_task)          CPU1 (user process)
> > >
> > > Can you check if this also happens after the move to workqueue
> > > processing?
> > > The workqueue handling is quite different, then this problem might
> > > not be there anymore.
> >
> > Firstly, I think a workqueue should only make matters worse: since
> > it can be preempted (unlike tasklets), this can happen even on a
> > single CPU (e.g. a resched just before the send_resp label).
> > Secondly, as with any race condition, this bug is difficult to
> > reproduce. I saw it only a couple of times, thus I call for
> > theoretical analysis.
> 
> we are actually using a CPU unbound workqueue where the kernel ensures
> that only one will be active across the set of CPUs. Both RX and TX are
> executed from that same workqueue. So the only way this can happen is
> if one work is scheduled from the other. However since the event
> processing is now also run from that same workqueue, I fail to see how
> that could happen.

I am putting back the original diagram because I feel that it is
quite relevant to the discussion:

CPU0 (Tasklet: hci_rx_task)          CPU1 (user process)
...                                  sk = sys_accept()
...                                    l2cap_sock_accept()
...                                    add_wait_queue_exclusive()
l2cap_connect_req()                  ...
  result = L2CAP_CR_PEND;            ... 
  status = L2CAP_CS_AUTHOR_PEND;     ...
  parent->sk_data_ready(parent, 0)   ...
  ...                                sys_recvmsg(sk,...)
  ...                                  l2cap_sock_recvmsg()
  ...                                    __l2cap_connect_rsp_defer()
  ...                                      <Send L2CAP_CR_SUCCESS>
  ...
  <Send L2CAP_CR_PEND>               ...

The fact that both RX and TX are executed from the same workqueue
does not help here, because the issue is the order of the
skb_queue_tail calls (l2cap_send_cmd->hci_send_acl->skb_queue_tail).
One call can be made from the workqueue (previously the tasklet), the
other while serving a system call (CPU1), and there seems to be no
synchronization mechanism between them.
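
To make the interleaving concrete, here is a minimal user-space sketch
of the ordering problem. It is not kernel code: struct txq, queue_tail()
and the step names are made up for illustration, and queue_tail() only
stands in for skb_queue_tail() in the sense that whoever calls it first
goes out first. The sketch replays the steps of the diagram in a fixed
order rather than using real threads, so it shows the consequence of the
interleaving, not how likely it is.

```c
#include <assert.h>
#include <stddef.h>

/* Result codes as seen in the trace: 0 = success, 1 = pending. */
enum rsp { RSP_SUCCESS = 0, RSP_PEND = 1 };

/* Toy stand-in for the ACL TX queue (hypothetical, not an sk_buff list). */
#define QCAP 4
struct txq { enum rsp item[QCAP]; size_t len; };

/* Hypothetical analogue of skb_queue_tail(): append in call order. */
void queue_tail(struct txq *q, enum rsp r)
{
    assert(q->len < QCAP);
    q->item[q->len++] = r;
}

/* The three steps of the diagram that matter for ordering. */
enum step {
    RX_DECIDE_PEND,     /* RX path sets result = PEND and wakes the user */
    RX_SEND,            /* RX path finally queues its response           */
    USER_SEND_SUCCESS   /* recvmsg() path queues the deferred SUCCESS    */
};

/* Replay the steps in the given order; return the first queued response. */
enum rsp first_queued(const enum step *steps, size_t n)
{
    struct txq q = { .len = 0 };
    enum rsp rx_result = RSP_SUCCESS;  /* overwritten by RX_DECIDE_PEND */

    for (size_t i = 0; i < n; i++) {
        switch (steps[i]) {
        case RX_DECIDE_PEND:    rx_result = RSP_PEND;        break;
        case RX_SEND:           queue_tail(&q, rx_result);   break;
        case USER_SEND_SUCCESS: queue_tail(&q, RSP_SUCCESS); break;
        }
    }
    assert(q.len > 0);
    return q.item[0];
}

/* Expected order: the RX path queues PEND before user space runs. */
enum rsp first_queued_in_order(void)
{
    const enum step s[] = { RX_DECIDE_PEND, RX_SEND, USER_SEND_SUCCESS };
    return first_queued(s, sizeof s / sizeof s[0]);
}

/* Order from the diagram: user space slips in between decide and send. */
enum rsp first_queued_racy(void)
{
    const enum step s[] = { RX_DECIDE_PEND, USER_SEND_SUCCESS, RX_SEND };
    return first_queued(s, sizeof s / sizeof s[0]);
}
```

In this model first_queued_in_order() gives RSP_PEND, while
first_queued_racy() gives RSP_SUCCESS, which matches the trace where the
Connect rsp with result 0 goes out before the one with result 1.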

> 
> Regards
> 
> Marcel
> 
Regards,
Ilia.
