Hi Raghavendra,
I've finally found the bug. It was obvious but I didn't see it.
1561         case SP_STATE_ACCEPTED_SUCCESS_REPLY_INIT:
1562                 default_read_size = xdr_sizeof ((xdrproc_t) xdr_gfs3_read_rsp,
1563                                                 &read_rsp);
1564
1565                 proghdr_buf = frag->fragcurrent;
1566
1567                 __socket_proto_init_pending (priv, default_read_size);
1568
1569                 frag->call_body.reply.accepted_success_state
1570                         = SP_STATE_READING_PROC_HEADER;
1571
1572                 /* fall through */
1573
1574         case SP_STATE_READING_PROC_HEADER:
1575                 __socket_proto_read (priv, ret);
1576
1577                 gf_trace_add ("xdrmem_create", default_read_size, (uintptr_t) proghdr_buf);
1578                 /* there can be 'xdata' in read response, figure it out */
1579                 xdrmem_create (&xdr, proghdr_buf, default_read_size,
1580                                XDR_DECODE);
1581
1582                 /* This will fail if there is xdata sent from server, if not,
1583                    well and good, we don't need to worry about */
1584                 xdr_gfs3_read_rsp (&xdr, &read_rsp);
1585
1586                 free (read_rsp.xdata.xdata_val);
1587
1588                 /* need to round off to proper roof (%4), as XDR packing pads
1589                    the end of opaque object with '0' */
1590                 size = roof (read_rsp.xdata.xdata_len, 4);
1591
1592                 if (!size) {
1593                         frag->call_body.reply.accepted_success_state
1594                                 = SP_STATE_READ_PROC_OPAQUE;
1595                         goto read_proc_opaque;
1596                 }
1597
1598                 __socket_proto_init_pending (priv, size);
1599
1600                 frag->call_body.reply.accepted_success_state
1601                         = SP_STATE_READING_PROC_OPAQUE;
The main problem here is that we are using two local variables (proghdr_buf
and default_read_size) in two distinct states that may be executed in
different calls to the function.
The particular case that is failing is the following:
1. In state SP_STATE_ACCEPTED_SUCCESS_REPLY_INIT, everything is prepared
to read 116 bytes. default_read_size is set to 116 and proghdr_buf
points to the buffer where data will be written.
2. In state SP_STATE_READING_PROC_HEADER, a partial read of 88 bytes is
done. At this point the function returns and proghdr_buf and
default_read_size are lost.
3. When more data is available, this function is called again and it
starts executing at state SP_STATE_READING_PROC_HEADER.
4. The remaining 28 bytes are read.
5. When it checks the buffer and tries to decode it to see if there's
xdata present, it uses the default values of proghdr_buf and
default_read_size, which are 0. This causes the decode to leave
read_rsp.xdata.xdata_len set to 0.
6. The program interprets xdata_len being 0 as meaning there's no xdata,
so it continues reading the remainder of the RPC packet into the payload
buffer (one possible way to fix this is sketched below).
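
One way to fix it could be to persist those two values across calls instead
of keeping them in stack locals. A rough, untested sketch of the idea (the
proghdr_size/proghdr_base fields are hypothetical additions to the reply
state, not existing fields in socket.h):

        case SP_STATE_ACCEPTED_SUCCESS_REPLY_INIT:
                /* hypothetical fields: keep the size and the start of the
                 * header in the frag so a later re-entry still sees them */
                frag->call_body.reply.proghdr_size =
                        xdr_sizeof ((xdrproc_t) xdr_gfs3_read_rsp, &read_rsp);
                frag->call_body.reply.proghdr_base = frag->fragcurrent;

                __socket_proto_init_pending (priv,
                                             frag->call_body.reply.proghdr_size);

                frag->call_body.reply.accepted_success_state
                        = SP_STATE_READING_PROC_HEADER;

                /* fall through */

        case SP_STATE_READING_PROC_HEADER:
                __socket_proto_read (priv, ret);

                /* use the persisted values: they stay valid even if the
                 * previous call returned after a partial read */
                xdrmem_create (&xdr, frag->call_body.reply.proghdr_base,
                               frag->call_body.reply.proghdr_size, XDR_DECODE);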
If you want, I can send a patch for this.
Xavi
On 05/05/16 10:21, Xavier Hernandez wrote:
I've undone all changes and now I'm unable to reproduce the problem, so
the modification I did is probably incorrect and not the root cause, as
you described.
I'll continue investigating...
Xavi
On 04/05/16 15:01, Xavier Hernandez wrote:
On 04/05/16 14:47, Raghavendra Gowdappa wrote:
----- Original Message -----
From: "Xavier Hernandez" <xhernandez@xxxxxxxxxx>
To: "Raghavendra Gowdappa" <rgowdapp@xxxxxxxxxx>
Cc: "Gluster Devel" <gluster-devel@xxxxxxxxxxx>
Sent: Wednesday, May 4, 2016 5:37:56 PM
Subject: Re: Possible bug in the communications layer ?
I think I've found the problem.
1567         case SP_STATE_READING_PROC_HEADER:
1568                 __socket_proto_read (priv, ret);
1569
1570                 /* there can be 'xdata' in read response, figure it out */
1571                 xdrmem_create (&xdr, proghdr_buf, default_read_size,
1572                                XDR_DECODE);
1573
1574                 /* This will fail if there is xdata sent from server, if not,
1575                    well and good, we don't need to worry about */
1576                 xdr_gfs3_read_rsp (&xdr, &read_rsp);
1577
1578                 free (read_rsp.xdata.xdata_val);
1579
1580                 /* need to round off to proper roof (%4), as XDR packing pads
1581                    the end of opaque object with '0' */
1582                 size = roof (read_rsp.xdata.xdata_len, 4);
1583
1584                 if (!size) {
1585                         frag->call_body.reply.accepted_success_state
1586                                 = SP_STATE_READ_PROC_OPAQUE;
1587                         goto read_proc_opaque;
1588                 }
1589
1590                 __socket_proto_init_pending (priv, size);
1591
1592                 frag->call_body.reply.accepted_success_state
1593                         = SP_STATE_READING_PROC_OPAQUE;
1594
1595         case SP_STATE_READING_PROC_OPAQUE:
1596                 __socket_proto_read (priv, ret);
1597
1598                 frag->call_body.reply.accepted_success_state
1599                         = SP_STATE_READ_PROC_OPAQUE;
On line 1568 we read, at most, 116 bytes because we calculate the size
of a read response without xdata. Then we detect that we really need
more data for the xdata (BTW, will read_rsp.xdata.xdata_val always be
allocated even if xdr_gfs3_read_rsp() fails?)
No. It need not be. It's guaranteed to be allocated only on a successful
completion. However, _if_ decoding fails only because the xdr stream
(created with default_read_size) doesn't include the xdata bits, then the
xdr library would've filled read_rsp.xdata.xdata_len
(read_rsp.xdata.xdata_val can still be NULL).
The question is: is it guaranteed that after an unsuccessful completion
xdata_val will be NULL (i.e. not touched by the function, even if
xdata_len is != 0)? Otherwise the free() could corrupt memory.
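
Just to illustrate what I mean, a minimal defensive sketch (not the current
code; it assumes the response type is gfs3_read_rsp, as the xdr routine name
suggests, and that the struct is zeroed before decoding so xdata_val stays
NULL unless the library actually allocated it):

        gfs3_read_rsp read_rsp;
        XDR           xdr;

        /* zero the response so a failed decode cannot leave garbage pointers */
        memset (&read_rsp, 0, sizeof (read_rsp));

        xdrmem_create (&xdr, proghdr_buf, default_read_size, XDR_DECODE);

        if (!xdr_gfs3_read_rsp (&xdr, &read_rsp)) {
                /* decoding failed (possibly just because the xdata bytes are
                 * not covered by default_read_size); xdata_len may still have
                 * been filled in by the library */
        }

        /* only free what the xdr library actually allocated */
        if (read_rsp.xdata.xdata_val)
                free (read_rsp.xdata.xdata_val);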
So we get into line 1596 with the pending info initialized to read the
remaining data. This is the __socket_proto_read macro:
166 /* This will be used in a switch case and breaks from the switch case if all
167  * the pending data is not read.
168  */
169 #define __socket_proto_read(priv, ret)                                 \
170         {                                                               \
171                 size_t bytes_read = 0;                                  \
172                 struct gf_sock_incoming *in;                            \
173                 in = &priv->incoming;                                   \
174                                                                         \
175                 __socket_proto_update_pending (priv);                   \
176                                                                         \
177                 ret = __socket_readv (this,                             \
178                                       in->pending_vector, 1,            \
179                                       &in->pending_vector,              \
180                                       &in->pending_count,               \
181                                       &bytes_read);                     \
182                 if (ret == -1)                                          \
183                         break;                                          \
184                 __socket_proto_update_priv_after_read (priv, ret, bytes_read); \
185         }
We read from the socket using __socket_readv(). If it fails, we quit.
However, if the socket doesn't have more data to read, this function does
not return -1:
555                 ret = __socket_cached_read (this, opvector, opcount);
556
557                 if (ret == 0) {
558                         gf_log(this->name,GF_LOG_DEBUG,"EOF on socket");
559                         errno = ENODATA;
560                         ret = -1;
561                 }
562                 if (ret == -1 && errno == EAGAIN) {
563                         /* done for now */
564                         break;
565                 }
566                 this->total_bytes_read += ret;
If __socket_cached_read() fails with errno == EAGAIN, we break and return
opcount, which is >= 0, causing the process to continue instead of waiting
for more data.
No. If you observe, there is a call to another macro,
__socket_proto_update_priv_after_read, at line 184. As can be seen in the
definition below, if (ret > 0) - which is the case for a short read - we do
break. I hope I am not missing anything here :).
#define __socket_proto_update_priv_after_read(priv, ret, bytes_read)   \
        {                                                               \
                struct gf_sock_incoming_frag *frag;                     \
                frag = &priv->incoming.frag;                            \
                                                                        \
                frag->fragcurrent += bytes_read;                        \
                frag->bytes_read += bytes_read;                         \
                                                                        \
                if ((ret > 0) || (frag->remaining_size != 0)) {         \
                        if (frag->remaining_size != 0 && ret == 0) {    \
                                __socket_proto_reset_pending (priv);    \
                        }                                               \
                                                                        \
                        gf_log (this->name, GF_LOG_TRACE,               \
                                "partial read on non-blocking socket"); \
                                                                        \
                        break;                                          \
                }                                                       \
        }
I didn't see this. However, the fact is that with the change I've described,
it seems to work.
I'll analyze it a bit more.
Xavi
As a side note, there's another problem here: if errno is not EAGAIN, we'll
still update this->total_bytes_read, subtracting one. This shouldn't be done
when ret < 0.
There are other places where ret is set to -1 but opcount is returned. I
guess we should also set opcount = -1 in these places, but I don't have deep
knowledge about this implementation.
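
For the total_bytes_read issue, something as simple as this (just a sketch of
the idea, not a tested patch) would avoid adding the error value:

        if (ret > 0)
                this->total_bytes_read += ret;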
I've done a quick test checking for (ret != 0) instead of (ret == -1) in
__socket_proto_read() and it seemed to work.
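
For reference, the change I tested is roughly this inside the macro body
(line-continuation backslashes omitted; only the condition differs from the
current code):

                ret = __socket_readv (this,
                                      in->pending_vector, 1,
                                      &in->pending_vector,
                                      &in->pending_count,
                                      &bytes_read);
                /* was: if (ret == -1) */
                if (ret != 0)
                        break;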
Could anyone with more knowledge about the communications layer verify this
and explain what the best solution would be?
Xavi
On 29/04/16 14:52, Xavier Hernandez wrote:
With your patch applied, it seems that the bug is not hit.
I guess it's a timing issue that the new logging hides. Maybe there's no more
data available after reading the partial readv header? (It will arrive
later.)
I'll continue testing...
Xavi
On 29/04/16 13:48, Raghavendra Gowdappa wrote:
Attaching the patch.
----- Original Message -----
From: "Raghavendra Gowdappa" <rgowdapp@xxxxxxxxxx>
To: "Xavier Hernandez" <xhernandez@xxxxxxxxxx>
Cc: "Gluster Devel" <gluster-devel@xxxxxxxxxxx>
Sent: Friday, April 29, 2016 5:14:02 PM
Subject: Re: Possible bug in the communications layer ?
----- Original Message -----
From: "Xavier Hernandez" <xhernandez@xxxxxxxxxx>
To: "Raghavendra Gowdappa" <rgowdapp@xxxxxxxxxx>
Cc: "Gluster Devel" <gluster-devel@xxxxxxxxxxx>
Sent: Friday, April 29, 2016 1:21:57 PM
Subject: Re: Possible bug in the communications layer ?
Hi Raghavendra,
yes, the readv response contains xdata. The dict length is 38
(0x26)
and, at the moment of failure, rsp.xdata.xdata_len already contains
0x26.
rsp.xdata.xdata_len having 0x26 even when decoding failed indicates that the
approach used in socket.c to get the length of xdata is correct. However, I
cannot find any other way of xdata going into the payload vector other than
xdata_len being zero. Just to be double sure, I have a patch containing a
debug message that prints xdata_len when decoding fails in socket.c. Can you
please apply the patch, run the tests and revert back with the results?
Xavi
On 29/04/16 09:10, Raghavendra Gowdappa wrote:
----- Original Message -----
From: "Raghavendra Gowdappa" <rgowdapp@xxxxxxxxxx>
To: "Xavier Hernandez" <xhernandez@xxxxxxxxxx>
Cc: "Gluster Devel" <gluster-devel@xxxxxxxxxxx>
Sent: Friday, April 29, 2016 12:36:43 PM
Subject: Re: Possible bug in the communications layer ?
----- Original Message -----
From: "Raghavendra Gowdappa" <rgowdapp@xxxxxxxxxx>
To: "Xavier Hernandez" <xhernandez@xxxxxxxxxx>
Cc: "Jeff Darcy" <jdarcy@xxxxxxxxxx>, "Gluster Devel"
<gluster-devel@xxxxxxxxxxx>
Sent: Friday, April 29, 2016 12:07:59 PM
Subject: Re: Possible bug in the communications layer ?
----- Original Message -----
From: "Xavier Hernandez" <xhernandez@xxxxxxxxxx>
To: "Jeff Darcy" <jdarcy@xxxxxxxxxx>
Cc: "Gluster Devel" <gluster-devel@xxxxxxxxxxx>
Sent: Thursday, April 28, 2016 8:15:36 PM
Subject: Re: Possible bug in the communications layer ?
Hi Jeff,
On 28.04.2016 15:20, Jeff Darcy wrote:
This happens with Gluster 3.7.11 accessed through Ganesha and gfapi. The
volume is a distributed-disperse 4*(4+2). I'm able to reproduce the problem
easily doing the following test:

iozone -t2 -s10g -r1024k -i0 -w -F/iozone{1..2}.dat
echo 3 >/proc/sys/vm/drop_caches
iozone -t2 -s10g -r1024k -i1 -w -F/iozone{1..2}.dat

The error happens soon after starting the read test. As can be seen in the
data below, client3_3_readv_cbk() is processing an iovec of 116 bytes,
however it should be of 154 bytes (the buffer in memory really seems to
contain 154 bytes). The data on the network seems ok (at least I haven't
been able to identify any problem), so this must be a processing error on
the client side. The last field in the cut buffer of the serialized data
corresponds to the length of the xdata field: 0x26. So at least 38 more
bytes (154 - 116 = 38 = 0x26) should be present.
Nice detective work, Xavi. It would be *very* interesting to see what the
value of the "count" parameter is (it's unfortunately optimized out). I'll
bet it's two, and iov[1].iov_len is 38. I have a weak memory of some
problems with how this iov is put together, a couple of years ago, and it
looks like you might have tripped over one more.
It seems you are right. The count is 2 and the first 38 bytes of the second
vector contain the remaining data of the xdata field.
This is the bug. client3_3_readv_cbk (and for that matter all the
actors/cbks) expects the response in at most two vectors:
1. Program header containing the request or response. This is subjected to
decoding/encoding. This vector should point to a buffer that contains the
entire program header/response contiguously.
2. If the procedure returns a payload (like a readv response or a write
request), the second vector contains the buffer pointing to the entire
(contiguous) payload. Note that this payload is raw and is not subjected to
encoding/decoding.
In your case, this _clean_ separation is broken, with part of the program
header slipping into the 2nd vector that is supposed to contain the read
data (maybe because of rpc fragmentation). I think this is a bug in the
socket layer. I'll update more on this.
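
To make the expected layout concrete, something like this is what the cbks
assume (the variable names here are only for illustration, not actual
GlusterFS identifiers):

        /* iov[0]: entire program header/response, contiguous, XDR-encoded;
         *         this is what gets decoded (e.g. gfs3_read_rsp, including
         *         its xdata).
         * iov[1]: entire payload, contiguous, raw bytes (e.g. readv data);
         *         never passed through the XDR routines. */
        struct iovec rsp[2];

        rsp[0].iov_base = proghdr_buf;   /* header starts here            */
        rsp[0].iov_len  = proghdr_len;   /* covers header + xdata         */
        rsp[1].iov_base = payload_buf;   /* read data starts here         */
        rsp[1].iov_len  = payload_len;   /* exactly the readv byte count  */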
Does your read response include xdata too? I think the code related to
reading xdata in the readv response is a bit murky.
<socket.c/__socket_read_accepted_successful_reply>
case SP_STATE_ACCEPTED_SUCCESS_REPLY_INIT:
default_read_size = xdr_sizeof ((xdrproc_t)
xdr_gfs3_read_rsp,
&read_rsp);
proghdr_buf = frag->fragcurrent;
__socket_proto_init_pending (priv,
default_read_size);
frag->call_body.reply.accepted_success_state
= SP_STATE_READING_PROC_HEADER;
/* fall through */
case SP_STATE_READING_PROC_HEADER:
__socket_proto_read (priv, ret);
By this time we've read read response _minus_ the xdata
I meant we have read "readv response header"
/* there can be 'xdata' in read response, figure
it out
*/
xdrmem_create (&xdr, proghdr_buf,
default_read_size,
XDR_DECODE);
We created xdr stream above with "default_read_size" (this
doesn't
include xdata)
/* This will fail if there is xdata sent from
server, if
not,
well and good, we don't need to worry
about */
what if xdata is present and decoding failed (as length
of xdr
stream
above - default_read_size - doesn't include xdata)?
would we
have
a
valid value in read_rsp.xdata.xdata_len? This is the part
I am
confused about. If read_rsp.xdata.xdata_len is not
correct then
there
is a possibility that xdata might not be entirely
present in
the
vector socket passes to higher layers as progheader (with
part or
entire xdata spilling over to payload vector).
xdr_gfs3_read_rsp (&xdr, &read_rsp);
free (read_rsp.xdata.xdata_val);
/* need to round off to proper roof (%4), as XDR
packing
pads
the end of opaque object with '0' */
size = roof (read_rsp.xdata.xdata_len, 4);
if (!size) {
frag->call_body.reply.accepted_success_state
= SP_STATE_READ_PROC_OPAQUE;
goto read_proc_opaque;
}
</socket.c>
Can you please confirm whether there was xdata in the readv response (maybe
by looking at the bricks) whose decoding failed?
regards,
Raghavendra
_______________________________________________
Gluster-devel mailing list
Gluster-devel@xxxxxxxxxxx
http://www.gluster.org/mailman/listinfo/gluster-devel