Re: Possible bug in the communications layer ?

[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

 



On Thu, Apr 28, 2016 at 02:43:01PM +0200, Xavier Hernandez wrote:
> Hi,
> 
> I've seen what seems a bug in the communications layer. The first sign is an
> "XDR decoding failed" error in the logs.
> 
> This happens with Gluster 3.7.11 accessed through Ganesha and gfapi. The
> volume is a distributed-disperse 4*(4+2).
> 
> I'm able to reproduce the problem easily doing the following test:
> 
> iozone -t2 -s10g -r1024k -i0 -w -F <nfs mount>/iozone{1..2}.dat
> echo 3 >/proc/sys/vm/drop_caches
> iozone -t2 -s10g -r1024k -i1 -w -F <nfs mount>/iozone{1..2}.dat
> 
> The error happens soon after starting the read test.

Do you know if this only happens on disperse volumes, or also with
others? If you have captured a network trace, could you provide it to
me? You can use 'editcap -s0 ...' to copy only the relevant packets.
But, I dont have an issue to download a few GB either if that is easier
for you.

> As can be seen in the data below, client3_3_readv_cbk() is processing an
> iovec of 116 bytes, however it should be of 154 bytes (the buffer in memory
> really seems to contain 154 bytes). The data on the network seems ok (at
> least I haven't been able to identify any problem), so this must be a
> processing error on the client side.
> 
> The last field in cut buffer of the sequentialized data corresponds to the
> length of the xdata field: 0x26. So at least 38 more byte should be present.
> 
> My guess is that some corner case is hit reading fragmented network packets
> due to a high load.
> 
> Debug information:
> 
> Breakpoint 1, client3_3_readv_cbk (req=0x7f540e64106c, iov=0x7f540e6410ac,
> count=<optimized out>, myframe=0x7f54259a4d54) at client-rpc-fops.c:3021
> 3021                    gf_msg (this->name, GF_LOG_ERROR, EINVAL,
> (gdb) print *iov
> $1 = {iov_base = 0x7f53e994e018, iov_len = 116}
> (gdb) x/116xb 0x7f53e994e018
> 0x7f53e994e018: 0x00    0x00    0x80    0x00    0x00    0x00    0x00    0x00
> 0x7f53e994e020: 0xa8    0xbf    0xa3    0xe0    0x5f    0x48    0x4c    0x1e

Hmm, I'm not sure how this is layed out in memory. 0x80 would be one of
first bytes in RPC payload, it signals 'last record' for the RPC
procedure, and we only send one record anyway. The four bytes combined
like (0x80 0x.. 0x.. 0x..) should be (0x80 | rpc-record-size). Reading
this in Wireshark from a .pcap.gz is much easier :)

Don't hesitate to file a bug for this, and include the exact error
messages you get. If possible with network capture, but if you can not
share that publicly, please get it directly to me.

Thanks,
Niels


> 0x7f53e994e028: 0x80    0xa3    0x8a    0xd8    0x9d    0xa1    0x1c    0x75
> 0x7f53e994e030: 0x80    0xa3    0x8a    0xd8    0x9d    0xa1    0x1c    0x75
> 0x7f53e994e038: 0x00    0x00    0x00    0x00    0x00    0x00    0x08    0x00
> 0x7f53e994e040: 0x00    0x00    0x81    0xa0    0x00    0x00    0x00    0x01
> 0x7f53e994e048: 0x00    0x00    0x00    0x00    0x00    0x00    0x00    0x00
> 0x7f53e994e050: 0x00    0x00    0x00    0x00    0x00    0x00    0x00    0x00
> 0x7f53e994e058: 0x00    0x00    0x00    0x00    0xa0    0x00    0x00    0x00
> 0x7f53e994e060: 0x00    0x00    0x10    0x00    0x00    0x00    0x00    0x00
> 0x7f53e994e068: 0x00    0x50    0x00    0x00    0x57    0x22    0x04    0x1f
> 0x7f53e994e070: 0x25    0x38    0x92    0x91    0x57    0x22    0x04    0xb3
> 0x7f53e994e078: 0x03    0x53    0x1b    0x13    0x57    0x22    0x04    0xb3
> 0x7f53e994e080: 0x06    0xf5    0xe1    0x99    0x00    0x00    0x80    0x00
> 0x7f53e994e088: 0x00    0x00    0x00    0x26
> (gdb) x/154xb 0x7f53e994e018
> 0x7f53e994e018: 0x00    0x00    0x80    0x00    0x00    0x00    0x00    0x00
> 0x7f53e994e020: 0xa8    0xbf    0xa3    0xe0    0x5f    0x48    0x4c    0x1e
> 0x7f53e994e028: 0x80    0xa3    0x8a    0xd8    0x9d    0xa1    0x1c    0x75
> 0x7f53e994e030: 0x80    0xa3    0x8a    0xd8    0x9d    0xa1    0x1c    0x75
> 0x7f53e994e038: 0x00    0x00    0x00    0x00    0x00    0x00    0x08    0x00
> 0x7f53e994e040: 0x00    0x00    0x81    0xa0    0x00    0x00    0x00    0x01
> 0x7f53e994e048: 0x00    0x00    0x00    0x00    0x00    0x00    0x00    0x00
> 0x7f53e994e050: 0x00    0x00    0x00    0x00    0x00    0x00    0x00    0x00
> 0x7f53e994e058: 0x00    0x00    0x00    0x00    0xa0    0x00    0x00    0x00
> 0x7f53e994e060: 0x00    0x00    0x10    0x00    0x00    0x00    0x00    0x00
> 0x7f53e994e068: 0x00    0x50    0x00    0x00    0x57    0x22    0x04    0x1f
> 0x7f53e994e070: 0x25    0x38    0x92    0x91    0x57    0x22    0x04    0xb3
> 0x7f53e994e078: 0x03    0x53    0x1b    0x13    0x57    0x22    0x04    0xb3
> 0x7f53e994e080: 0x06    0xf5    0xe1    0x99    0x00    0x00    0x80    0x00
> 0x7f53e994e088: 0x00    0x00    0x00    0x26    0x00    0x00    0x00    0x01
> 0x7f53e994e090: 0x00    0x00    0x00    0x17    0x00    0x00    0x00    0x02
> 0x7f53e994e098: 0x67    0x6c    0x75    0x73    0x74    0x65    0x72    0x66
> 0x7f53e994e0a0: 0x73    0x2e    0x69    0x6e    0x6f    0x64    0x65    0x6c
> 0x7f53e994e0a8: 0x6b    0x2d    0x63    0x6f    0x75    0x6e    0x74    0x00
> 0x7f53e994e0b0: 0x31    0x00
> (gdb) bt
> #0  client3_3_readv_cbk (req=0x7f540e64106c, iov=0x7f540e6410ac,
> count=<optimized out>, myframe=0x7f54259a4d54) at client-rpc-fops.c:3021
> #1  0x00007f542a677ab0 in rpc_clnt_handle_reply
> (clnt=clnt@entry=0x7f54101cdef0, pollin=pollin@entry=0x7f54000091f0) at
> rpc-clnt.c:764
> #2  0x00007f542a677d6f in rpc_clnt_notify (trans=<optimized out>,
> mydata=0x7f54101cdf20, event=<optimized out>, data=0x7f54000091f0) at
> rpc-clnt.c:925
> #3  0x00007f542a673853 in rpc_transport_notify
> (this=this@entry=0x7f54101ddb70,
> event=event@entry=RPC_TRANSPORT_MSG_RECEIVED,
> data=data@entry=0x7f54000091f0) at rpc-transport.c:546
> #4  0x00007f541d881666 in socket_event_poll_in
> (this=this@entry=0x7f54101ddb70) at socket.c:2237
> #5  0x00007f541d8842c4 in socket_event_handler (fd=fd@entry=30,
> idx=idx@entry=20, data=0x7f54101ddb70, poll_in=1, poll_out=0, poll_err=0) at
> socket.c:2350
> #6  0x00007f542a90aa4a in event_dispatch_epoll_handler
> (event=0x7f540effc540, event_pool=0xd3d9a0) at event-epoll.c:575
> #7  event_dispatch_epoll_worker (data=0x7f54100d0890) at event-epoll.c:678
> #8  0x00007f542d0fbdc5 in start_thread () from /lib64/libpthread.so.0
> #9  0x00007f542c7bc28d in clone () from /lib64/libc.so.6
> 
> Xavi
> _______________________________________________
> Gluster-devel mailing list
> Gluster-devel@xxxxxxxxxxx
> http://www.gluster.org/mailman/listinfo/gluster-devel

Attachment: signature.asc
Description: PGP signature

_______________________________________________
Gluster-devel mailing list
Gluster-devel@xxxxxxxxxxx
http://www.gluster.org/mailman/listinfo/gluster-devel

[Index of Archives]     [Gluster Users]     [Ceph Users]     [Linux ARM Kernel]     [Linux ARM]     [Linux Omap]     [Fedora ARM]     [IETF Annouce]     [Security]     [Bugtraq]     [Linux]     [Linux OMAP]     [Linux MIPS]     [eCos]     [Asterisk Internet PBX]     [Linux API]

  Powered by Linux