----- Original Message ----- > From: "Xavier Hernandez" <xhernandez@xxxxxxxxxx> > To: "Raghavendra Gowdappa" <rgowdapp@xxxxxxxxxx> > Cc: "Gluster Devel" <gluster-devel@xxxxxxxxxxx> > Sent: Wednesday, May 4, 2016 5:37:56 PM > Subject: Re: Possible bug in the communications layer ? > > I think I've found the problem. > > 1567 case SP_STATE_READING_PROC_HEADER: > 1568 __socket_proto_read (priv, ret); > 1569 > 1570 /* there can be 'xdata' in read response, figure > it out */ > 1571 xdrmem_create (&xdr, proghdr_buf, default_read_size, > 1572 XDR_DECODE); > 1573 > 1574 /* This will fail if there is xdata sent from > server, if not, > 1575 well and good, we don't need to worry about */ > 1576 xdr_gfs3_read_rsp (&xdr, &read_rsp); > 1577 > 1578 free (read_rsp.xdata.xdata_val); > 1579 > 1580 /* need to round off to proper roof (%4), as XDR > packing pads > 1581 the end of opaque object with '0' */ > 1582 size = roof (read_rsp.xdata.xdata_len, 4); > 1583 > 1584 if (!size) { > 1585 frag->call_body.reply.accepted_success_state > 1586 = SP_STATE_READ_PROC_OPAQUE; > 1587 goto read_proc_opaque; > 1588 } > 1589 > 1590 __socket_proto_init_pending (priv, size); > 1591 > 1592 frag->call_body.reply.accepted_success_state > 1593 = SP_STATE_READING_PROC_OPAQUE; > 1594 > 1595 case SP_STATE_READING_PROC_OPAQUE: > 1596 __socket_proto_read (priv, ret); > 1597 > 1598 frag->call_body.reply.accepted_success_state > 1599 = SP_STATE_READ_PROC_OPAQUE; > > On line 1568 we read, at most, 116 bytes because we calculate the size > of a read response without xdata. Then we detect that we really need > more data for xdata (BTW, read_rsp.xdata.xdata_val will be always > allocated even if xdr_gfs3_read_rsp() fails ?) No. It need not be. Its guaranteed that only on a successful completion it is allocated. However, _if_ decoding fails only because xdr stream doesn't include xdata bits, but xdata_len is zero (by initializing it to default_read_size), then xdr library would've filled read_rsp.xdata.xdata_len (read_rsp.xdata.xdata_val can still be NULL). > > So we get into line 1596 with the pending info initialized to read the > remaining data. This is the __socket_proto_read macro: > > 166 /* This will be used in a switch case and breaks from the switch > case if all > 167 * the pending data is not read. > 168 */ > 169 #define __socket_proto_read(priv, ret) > \ > 170 { > \ > 171 size_t bytes_read = 0; > \ > 172 struct gf_sock_incoming *in; > \ > 173 in = &priv->incoming; > \ > 174 > \ > 175 __socket_proto_update_pending (priv); > \ > 176 > \ > 177 ret = __socket_readv (this, > \ > 178 in->pending_vector, 1, > \ > 179 &in->pending_vector, > \ > 180 &in->pending_count, > \ > 181 &bytes_read); > \ > 182 if (ret == -1) > \ > 183 break; > \ > 184 __socket_proto_update_priv_after_read (priv, ret, > bytes_read); \ > 185 } > > We read from the socket using __socket_readv(). It it fails, we quit. > However if the socket doesn't have more data to read, this function does > not return -1: > > 555 ret = __socket_cached_read (this, > opvector, opcount); > 556 > 557 if (ret == 0) { > 558 > gf_log(this->name,GF_LOG_DEBUG,"EOF on socket"); > 559 errno = ENODATA; > 560 ret = -1; > 561 } > 562 if (ret == -1 && errno == EAGAIN) { > 563 /* done for now */ > 564 break; > 565 } > 566 this->total_bytes_read += ret; > > If __socket_cached_read() fails with errno == EAGAIN, we break and > return opcount, which is >= 0. Causing the process to continue instead > of waiting for more data. No. If you observe, there is a call to another macro __socket_proto_update_priv_after_read at line 184. As can be seen in the definition below, if (ret > 0) - which is the case in case of short-read - we do break. I hope I am not missing anything here :). #define __socket_proto_update_priv_after_read(priv, ret, bytes_read) \ { \ struct gf_sock_incoming_frag *frag; \ frag = &priv->incoming.frag; \ \ frag->fragcurrent += bytes_read; \ frag->bytes_read += bytes_read; \ \ if ((ret > 0) || (frag->remaining_size != 0)) { \ if (frag->remaining_size != 0 && ret == 0) { \ __socket_proto_reset_pending (priv); \ } \ \ gf_log (this->name, GF_LOG_TRACE, \ "partial read on non-blocking socket"); \ \ break; \ } \ } > > As a side note, there's another problem here: if errno is not EAGAIN, > we'll update this->total_bytes_read substracting one. This shouldn't be > done when ret < 0. > > There are other places where ret is set to -1, but opcount is returned. > I guess that we should also set opcount = -1 on these places, but I > don't have a deep knowledge about this implementation. > > I've done a quick test checking for (ret != 0) instead of (ret == -1) in > __socket_proto_read() and it seemed to work. > > Could anyone with more knowledge about the communications layer verify > this and explain what would be the best solution ? > > Xavi > > On 29/04/16 14:52, Xavier Hernandez wrote: > > With your patch applied, it seems that the bug is not hit. > > > > I guess it's a timing issue that the new logging hides. Maybe no more > > data available after reading the partial readv header ? (it will arrive > > later). > > > > I'll continue testing... > > > > Xavi > > > > On 29/04/16 13:48, Raghavendra Gowdappa wrote: > >> Attaching the patch. > >> > >> ----- Original Message ----- > >>> From: "Raghavendra Gowdappa" <rgowdapp@xxxxxxxxxx> > >>> To: "Xavier Hernandez" <xhernandez@xxxxxxxxxx> > >>> Cc: "Gluster Devel" <gluster-devel@xxxxxxxxxxx> > >>> Sent: Friday, April 29, 2016 5:14:02 PM > >>> Subject: Re: Possible bug in the communications layer ? > >>> > >>> > >>> > >>> ----- Original Message ----- > >>>> From: "Xavier Hernandez" <xhernandez@xxxxxxxxxx> > >>>> To: "Raghavendra Gowdappa" <rgowdapp@xxxxxxxxxx> > >>>> Cc: "Gluster Devel" <gluster-devel@xxxxxxxxxxx> > >>>> Sent: Friday, April 29, 2016 1:21:57 PM > >>>> Subject: Re: Possible bug in the communications layer ? > >>>> > >>>> Hi Raghavendra, > >>>> > >>>> yes, the readv response contains xdata. The dict length is 38 (0x26) > >>>> and, at the moment of failure, rsp.xdata.xdata_len already contains > >>>> 0x26. > >>> > >>> rsp.xdata.xdata_len having 0x26 even when decoding failed indicates > >>> that the > >>> approach used in socket.c to get the length of xdata is correct. > >>> However, I > >>> cannot find any other way of xdata going into payload vector other than > >>> xdata_len being zero. Just to be double sure, I've a patch containing > >>> debug > >>> message printing xdata_len when decoding fails in socket.c. Can you > >>> please > >>> apply the patch, run the tests and revert back with results? > >>> > >>>> > >>>> Xavi > >>>> > >>>> On 29/04/16 09:10, Raghavendra Gowdappa wrote: > >>>>> > >>>>> > >>>>> ----- Original Message ----- > >>>>>> From: "Raghavendra Gowdappa" <rgowdapp@xxxxxxxxxx> > >>>>>> To: "Xavier Hernandez" <xhernandez@xxxxxxxxxx> > >>>>>> Cc: "Gluster Devel" <gluster-devel@xxxxxxxxxxx> > >>>>>> Sent: Friday, April 29, 2016 12:36:43 PM > >>>>>> Subject: Re: Possible bug in the communications > >>>>>> layer ? > >>>>>> > >>>>>> > >>>>>> > >>>>>> ----- Original Message ----- > >>>>>>> From: "Raghavendra Gowdappa" <rgowdapp@xxxxxxxxxx> > >>>>>>> To: "Xavier Hernandez" <xhernandez@xxxxxxxxxx> > >>>>>>> Cc: "Jeff Darcy" <jdarcy@xxxxxxxxxx>, "Gluster Devel" > >>>>>>> <gluster-devel@xxxxxxxxxxx> > >>>>>>> Sent: Friday, April 29, 2016 12:07:59 PM > >>>>>>> Subject: Re: Possible bug in the communications > >>>>>>> layer ? > >>>>>>> > >>>>>>> > >>>>>>> > >>>>>>> ----- Original Message ----- > >>>>>>>> From: "Xavier Hernandez" <xhernandez@xxxxxxxxxx> > >>>>>>>> To: "Jeff Darcy" <jdarcy@xxxxxxxxxx> > >>>>>>>> Cc: "Gluster Devel" <gluster-devel@xxxxxxxxxxx> > >>>>>>>> Sent: Thursday, April 28, 2016 8:15:36 PM > >>>>>>>> Subject: Re: Possible bug in the communications > >>>>>>>> layer > >>>>>>>> ? > >>>>>>>> > >>>>>>>> > >>>>>>>> > >>>>>>>> Hi Jeff, > >>>>>>>> > >>>>>>>> On 28.04.2016 15:20, Jeff Darcy wrote: > >>>>>>>> > >>>>>>>> > >>>>>>>> > >>>>>>>> This happens with Gluster 3.7.11 accessed through Ganesha and > >>>>>>>> gfapi. > >>>>>>>> The > >>>>>>>> volume is a distributed-disperse 4*(4+2). I'm able to reproduce the > >>>>>>>> problem > >>>>>>>> easily doing the following test: iozone -t2 -s10g -r1024k -i0 -w > >>>>>>>> -F/iozone{1..2}.dat echo 3 >/proc/sys/vm/drop_caches iozone -t2 > >>>>>>>> -s10g > >>>>>>>> -r1024k -i1 -w -F/iozone{1..2}.dat The error happens soon after > >>>>>>>> starting > >>>>>>>> the > >>>>>>>> read test. As can be seen in the data below, > >>>>>>>> client3_3_readv_cbk() is > >>>>>>>> processing an iovec of 116 bytes, however it should be of 154 bytes > >>>>>>>> (the > >>>>>>>> buffer in memory really seems to contain 154 bytes). The data on > >>>>>>>> the > >>>>>>>> network > >>>>>>>> seems ok (at least I haven't been able to identify any problem), so > >>>>>>>> this > >>>>>>>> must be a processing error on the client side. The last field in > >>>>>>>> cut > >>>>>>>> buffer > >>>>>>>> of the sequentialized data corresponds to the length of the xdata > >>>>>>>> field: > >>>>>>>> 0x26. So at least 38 more byte should be present. > >>>>>>>> Nice detective work, Xavi. It would be *very* interesting to > >>>>>>>> see what > >>>>>>>> the value of the "count" parameter is (it's unfortunately optimized > >>>>>>>> out). > >>>>>>>> I'll bet it's two, and iov[1].iov_len is 38. I have a weak > >>>>>>>> memory of > >>>>>>>> some problems with how this iov is put together, a couple of years > >>>>>>>> ago, > >>>>>>>> and it looks like you might have tripped over one more. > >>>>>>>> It seems you are right. The count is 2 and the first 38 bytes of > >>>>>>>> the > >>>>>>>> second > >>>>>>>> vector contains the remaining data of xdata field. > >>>>>>> > >>>>>>> This is the bug. client3_3_readv_cbk (and for that matter all the > >>>>>>> actors/cbks) expects response in utmost two vectors: > >>>>>>> 1. Program header containing request or response. This is > >>>>>>> subjected to > >>>>>>> decoding/encoding. This vector should point to a buffer that > >>>>>>> contains > >>>>>>> the > >>>>>>> entire program header/response contiguously. > >>>>>>> 2. If the procedure returns payload (like readv response or a write > >>>>>>> request), > >>>>>>> second vector contains the buffer pointing to the entire > >>>>>>> (contiguous) > >>>>>>> payload. Note that this payload is raw and is not subjected to > >>>>>>> encoding/decoding. > >>>>>>> > >>>>>>> In your case, this _clean_ separation is broken with part of program > >>>>>>> header > >>>>>>> slipping into 2nd vector supposed to contain read data (may be > >>>>>>> because > >>>>>>> of > >>>>>>> rpc fragmentation). I think this is a bug in socket layer. I'll > >>>>>>> update > >>>>>>> more > >>>>>>> on this. > >>>>>> > >>>>>> Does your read response include xdata too? I think the code > >>>>>> related to > >>>>>> reading xdata in readv response is a bit murky. > >>>>>> > >>>>>> <socket.c/__socket_read_accepted_successful_reply> > >>>>>> > >>>>>> case SP_STATE_ACCEPTED_SUCCESS_REPLY_INIT: > >>>>>> default_read_size = xdr_sizeof ((xdrproc_t) > >>>>>> xdr_gfs3_read_rsp, > >>>>>> &read_rsp); > >>>>>> > >>>>>> proghdr_buf = frag->fragcurrent; > >>>>>> > >>>>>> __socket_proto_init_pending (priv, > >>>>>> default_read_size); > >>>>>> > >>>>>> frag->call_body.reply.accepted_success_state > >>>>>> = SP_STATE_READING_PROC_HEADER; > >>>>>> > >>>>>> /* fall through */ > >>>>>> > >>>>>> case SP_STATE_READING_PROC_HEADER: > >>>>>> __socket_proto_read (priv, ret); > >>>>>> > >>>>>>>>>>> By this time we've read read response _minus_ the xdata > >>>>> > >>>>> I meant we have read "readv response header" > >>>>> > >>>>>> > >>>>>> /* there can be 'xdata' in read response, figure > >>>>>> it out > >>>>>> */ > >>>>>> xdrmem_create (&xdr, proghdr_buf, default_read_size, > >>>>>> XDR_DECODE); > >>>>>> > >>>>>>>>>>>> We created xdr stream above with "default_read_size" (this > >>>>>>>>>>>> doesn't > >>>>>>>>>>>> include xdata) > >>>>>> > >>>>>> /* This will fail if there is xdata sent from > >>>>>> server, if > >>>>>> not, > >>>>>> well and good, we don't need to worry about */ > >>>>>> > >>>>>>>>>>>> what if xdata is present and decoding failed (as length of xdr > >>>>>>>>>>>> stream > >>>>>>>>>>>> above - default_read_size - doesn't include xdata)? would we > >>>>>>>>>>>> have > >>>>>>>>>>>> a > >>>>>>>>>>>> valid value in read_rsp.xdata.xdata_len? This is the part I am > >>>>>>>>>>>> confused about. If read_rsp.xdata.xdata_len is not correct then > >>>>>>>>>>>> there > >>>>>>>>>>>> is a possibility that xdata might not be entirely present in > >>>>>>>>>>>> the > >>>>>>>>>>>> vector socket passes to higher layers as progheader (with > >>>>>>>>>>>> part or > >>>>>>>>>>>> entire xdata spilling over to payload vector). > >>>>>> > >>>>>> xdr_gfs3_read_rsp (&xdr, &read_rsp); > >>>>>> > >>>>>> free (read_rsp.xdata.xdata_val); > >>>>>> > >>>>>> /* need to round off to proper roof (%4), as XDR > >>>>>> packing > >>>>>> pads > >>>>>> the end of opaque object with '0' */ > >>>>>> size = roof (read_rsp.xdata.xdata_len, 4); > >>>>>> > >>>>>> if (!size) { > >>>>>> frag->call_body.reply.accepted_success_state > >>>>>> = SP_STATE_READ_PROC_OPAQUE; > >>>>>> goto read_proc_opaque; > >>>>>> } > >>>>>> > >>>>>> </socket.c> > >>>>>> > >>>>>> Can you please confirm whether there was an xdata in the readv > >>>>>> response > >>>>>> (may > >>>>>> be by looking in bricks) whose decoding failed? > >>>>>> > >>>>>> regards, > >>>>>> Raghavendra > >>>>>> _______________________________________________ > >>>>>> Gluster-devel mailing list > >>>>>> Gluster-devel@xxxxxxxxxxx > >>>>>> http://www.gluster.org/mailman/listinfo/gluster-devel > >>>>>> > >>>> > >>> > > _______________________________________________ > > Gluster-devel mailing list > > Gluster-devel@xxxxxxxxxxx > > http://www.gluster.org/mailman/listinfo/gluster-devel > _______________________________________________ Gluster-devel mailing list Gluster-devel@xxxxxxxxxxx http://www.gluster.org/mailman/listinfo/gluster-devel