I've undone all changes and now I'm unable to reproduce the problem, so
the modification I made is probably incorrect and not the root cause, as
you described.
I'll continue investigating...
Xavi
On 04/05/16 15:01, Xavier Hernandez wrote:
On 04/05/16 14:47, Raghavendra Gowdappa wrote:
----- Original Message -----
From: "Xavier Hernandez" <xhernandez@xxxxxxxxxx>
To: "Raghavendra Gowdappa" <rgowdapp@xxxxxxxxxx>
Cc: "Gluster Devel" <gluster-devel@xxxxxxxxxxx>
Sent: Wednesday, May 4, 2016 5:37:56 PM
Subject: Re: Possible bug in the communications layer ?
I think I've found the problem.
1567                 case SP_STATE_READING_PROC_HEADER:
1568                         __socket_proto_read (priv, ret);
1569
1570                         /* there can be 'xdata' in read response, figure it out */
1571                         xdrmem_create (&xdr, proghdr_buf, default_read_size,
1572                                        XDR_DECODE);
1573
1574                         /* This will fail if there is xdata sent from server, if not,
1575                            well and good, we don't need to worry about */
1576                         xdr_gfs3_read_rsp (&xdr, &read_rsp);
1577
1578                         free (read_rsp.xdata.xdata_val);
1579
1580                         /* need to round off to proper roof (%4), as XDR packing pads
1581                            the end of opaque object with '0' */
1582                         size = roof (read_rsp.xdata.xdata_len, 4);
1583
1584                         if (!size) {
1585                                 frag->call_body.reply.accepted_success_state
1586                                         = SP_STATE_READ_PROC_OPAQUE;
1587                                 goto read_proc_opaque;
1588                         }
1589
1590                         __socket_proto_init_pending (priv, size);
1591
1592                         frag->call_body.reply.accepted_success_state
1593                                 = SP_STATE_READING_PROC_OPAQUE;
1594
1595                 case SP_STATE_READING_PROC_OPAQUE:
1596                         __socket_proto_read (priv, ret);
1597
1598                         frag->call_body.reply.accepted_success_state
1599                                 = SP_STATE_READ_PROC_OPAQUE;
On line 1568 we read, at most, 116 bytes because we calculate the size
of a read response without xdata. Then we detect that we really need
more data for the xdata. (BTW, will read_rsp.xdata.xdata_val always be
allocated, even if xdr_gfs3_read_rsp() fails?)
No, it need not be. It's guaranteed to be allocated only on a
successful completion. However, _if_ decoding fails only because the
xdr stream doesn't include the xdata bits (the stream having been
sized to default_read_size, which assumes xdata_len is zero), then the
xdr library would've filled read_rsp.xdata.xdata_len
(read_rsp.xdata.xdata_val can still be NULL).
The question is: is it guaranteed that after an unsuccessful completion
xdata_val will be NULL (i.e. not touched by the function, even if
xdata_len is != 0)? Otherwise the free() could corrupt memory.
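For illustration only, a minimal sketch of the defensive pattern that
would make the free() safe regardless of how far the decoder got
(mini_read_rsp and safe_partial_decode are hypothetical stand-ins, not
GlusterFS code):

```c
#include <stdlib.h>
#include <string.h>

/* Hypothetical miniature of the generated response struct; the real
 * one is gfs3_read_rsp from the GlusterFS protocol definitions. */
struct mini_read_rsp {
        unsigned int xdata_len;
        char        *xdata_val;
};

/* Zero-initializing before decoding guarantees xdata_val is NULL until
 * the decoder actually allocates it, so the cleanup free() can never
 * see a garbage pointer even after a partial decode. */
static int
safe_partial_decode (void)
{
        struct mini_read_rsp read_rsp;

        memset (&read_rsp, 0, sizeof (read_rsp));

        /* ... xdr_gfs3_read_rsp (&xdr, &read_rsp) would run here and
         * may fail part-way through, filling xdata_len but leaving
         * xdata_val untouched ... */

        free (read_rsp.xdata_val);    /* free(NULL) is a no-op */
        return 0;
}
```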
So we get to line 1596 with the pending info initialized to read the
remaining data. This is the __socket_proto_read macro:
166 /* This will be used in a switch case and breaks from the switch case if all
167  * the pending data is not read.
168  */
169 #define __socket_proto_read(priv, ret)                                  \
170         {                                                               \
171                 size_t bytes_read = 0;                                  \
172                 struct gf_sock_incoming *in;                            \
173                 in = &priv->incoming;                                   \
174                                                                         \
175                 __socket_proto_update_pending (priv);                   \
176                                                                         \
177                 ret = __socket_readv (this,                             \
178                                       in->pending_vector, 1,            \
179                                       &in->pending_vector,              \
180                                       &in->pending_count,               \
181                                       &bytes_read);                     \
182                 if (ret == -1)                                          \
183                         break;                                          \
184                 __socket_proto_update_priv_after_read (priv, ret, bytes_read); \
185         }
We read from the socket using __socket_readv(). If it fails, we quit.
However, if the socket doesn't have more data to read, this function
does not return -1:
555                 ret = __socket_cached_read (this, opvector, opcount);
556
557                 if (ret == 0) {
558                         gf_log (this->name, GF_LOG_DEBUG, "EOF on socket");
559                         errno = ENODATA;
560                         ret = -1;
561                 }
562                 if (ret == -1 && errno == EAGAIN) {
563                         /* done for now */
564                         break;
565                 }
566                 this->total_bytes_read += ret;
If __socket_cached_read() fails with errno == EAGAIN, we break and
return opcount, which is >= 0, causing the process to continue instead
of waiting for more data.
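The behaviour described above can be modeled with a small sketch
(readv_like is a hypothetical simplification, not the real
__socket_readv): breaking on EAGAIN returns the remaining opcount,
which is non-negative, so the sign of the return value alone cannot
signal "wait for more data".

```c
#include <errno.h>

/* Hypothetical simplification of __socket_readv's loop, for
 * illustration only: on EAGAIN it breaks out and returns the
 * remaining opcount, which is always >= 0. */
static int
readv_like (int would_block, int opcount)
{
        int ret;

        while (opcount > 0) {
                if (would_block) {
                        errno = EAGAIN;
                        ret = -1;     /* the underlying read failed */
                } else {
                        ret = 1;      /* pretend one vector was consumed */
                }

                if (ret == -1 && errno == EAGAIN)
                        break;        /* done for now */

                opcount--;
        }

        return opcount;               /* >= 0 even after EAGAIN */
}
```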
No. If you observe, there is a call to another macro,
__socket_proto_update_priv_after_read, at line 184. As can be seen in
the definition below, if (ret > 0) - which is the case on a short read -
we do break. I hope I am not missing anything here :).
#define __socket_proto_update_priv_after_read(priv, ret, bytes_read) \
{ \
struct gf_sock_incoming_frag *frag; \
frag = &priv->incoming.frag; \
\
frag->fragcurrent += bytes_read; \
frag->bytes_read += bytes_read; \
\
if ((ret > 0) || (frag->remaining_size != 0)) { \
if (frag->remaining_size != 0 && ret == 0) { \
__socket_proto_reset_pending (priv); \
} \
\
gf_log (this->name, GF_LOG_TRACE, \
"partial read on non-blocking socket"); \
\
break; \
} \
}
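Condensing the break condition of that macro into a small helper
(should_break is hypothetical, for illustration only) makes the point
above easy to check:

```c
/* Hypothetical condensation of the condition in
 * __socket_proto_update_priv_after_read: a positive ret (short read)
 * or leftover fragment bytes means "break and wait for more data". */
static int
should_break (int ret, unsigned int remaining_size)
{
        return (ret > 0) || (remaining_size != 0);
}
```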
I didn't see this. However, the fact is that with the change I've
described, it seems to work.
I'll analyze it a bit more.
Xavi
As a side note, there's another problem here: if errno is not EAGAIN,
we'll update this->total_bytes_read by subtracting one. This shouldn't
be done when ret < 0.
There are other places where ret is set to -1 but opcount is returned.
I guess we should also set opcount = -1 in these places, but I don't
have deep knowledge of this implementation.
I've done a quick test checking for (ret != 0) instead of (ret == -1)
in __socket_proto_read() and it seemed to work.
Could anyone with more knowledge of the communications layer verify
this and explain what the best solution would be?
Xavi
On 29/04/16 14:52, Xavier Hernandez wrote:
With your patch applied, it seems that the bug is not hit.
I guess it's a timing issue that the new logging hides. Maybe there is
no more data available right after reading the partial readv header?
(It will arrive later.)
I'll continue testing...
Xavi
On 29/04/16 13:48, Raghavendra Gowdappa wrote:
Attaching the patch.
----- Original Message -----
From: "Raghavendra Gowdappa" <rgowdapp@xxxxxxxxxx>
To: "Xavier Hernandez" <xhernandez@xxxxxxxxxx>
Cc: "Gluster Devel" <gluster-devel@xxxxxxxxxxx>
Sent: Friday, April 29, 2016 5:14:02 PM
Subject: Re: Possible bug in the communications
layer ?
----- Original Message -----
From: "Xavier Hernandez" <xhernandez@xxxxxxxxxx>
To: "Raghavendra Gowdappa" <rgowdapp@xxxxxxxxxx>
Cc: "Gluster Devel" <gluster-devel@xxxxxxxxxxx>
Sent: Friday, April 29, 2016 1:21:57 PM
Subject: Re: Possible bug in the communications
layer ?
Hi Raghavendra,
yes, the readv response contains xdata. The dict length is 38 (0x26)
and, at the moment of failure, rsp.xdata.xdata_len already contains
0x26.
rsp.xdata.xdata_len having 0x26 even when decoding failed indicates
that the approach used in socket.c to get the length of xdata is
correct. However, I cannot find any way for xdata to end up in the
payload vector other than xdata_len being zero. Just to be doubly
sure, I have a patch that adds a debug message printing xdata_len when
decoding fails in socket.c. Can you please apply the patch, run the
tests and revert back with the results?
Xavi
On 29/04/16 09:10, Raghavendra Gowdappa wrote:
----- Original Message -----
From: "Raghavendra Gowdappa" <rgowdapp@xxxxxxxxxx>
To: "Xavier Hernandez" <xhernandez@xxxxxxxxxx>
Cc: "Gluster Devel" <gluster-devel@xxxxxxxxxxx>
Sent: Friday, April 29, 2016 12:36:43 PM
Subject: Re: Possible bug in the communications
layer ?
----- Original Message -----
From: "Raghavendra Gowdappa" <rgowdapp@xxxxxxxxxx>
To: "Xavier Hernandez" <xhernandez@xxxxxxxxxx>
Cc: "Jeff Darcy" <jdarcy@xxxxxxxxxx>, "Gluster Devel"
<gluster-devel@xxxxxxxxxxx>
Sent: Friday, April 29, 2016 12:07:59 PM
Subject: Re: Possible bug in the communications
layer ?
----- Original Message -----
From: "Xavier Hernandez" <xhernandez@xxxxxxxxxx>
To: "Jeff Darcy" <jdarcy@xxxxxxxxxx>
Cc: "Gluster Devel" <gluster-devel@xxxxxxxxxxx>
Sent: Thursday, April 28, 2016 8:15:36 PM
Subject: Re: Possible bug in the communications
layer
?
Hi Jeff,
On 28.04.2016 15:20, Jeff Darcy wrote:
This happens with Gluster 3.7.11 accessed through Ganesha and gfapi.
The volume is a distributed-disperse 4*(4+2). I'm able to reproduce
the problem easily doing the following test:

iozone -t2 -s10g -r1024k -i0 -w -F/iozone{1..2}.dat
echo 3 >/proc/sys/vm/drop_caches
iozone -t2 -s10g -r1024k -i1 -w -F/iozone{1..2}.dat

The error happens soon after starting the read test. As can be seen in
the data below, client3_3_readv_cbk() is processing an iovec of 116
bytes, but it should be 154 bytes (the buffer in memory really does
seem to contain 154 bytes). The data on the network seems ok (at least
I haven't been able to identify any problem), so this must be a
processing error on the client side. The last field in the cut buffer
of the serialized data corresponds to the length of the xdata field:
0x26. So at least 38 more bytes should be present.
Nice detective work, Xavi. It would be *very* interesting to see what
the value of the "count" parameter is (it's unfortunately optimized
out). I'll bet it's two, and iov[1].iov_len is 38. I have a weak
memory of some problems with how this iov is put together, a couple of
years ago, and it looks like you might have tripped over one more.
It seems you are right. The count is 2 and the first 38 bytes of the
second vector contain the remaining data of the xdata field.
This is the bug. client3_3_readv_cbk (and, for that matter, all the
actors/cbks) expects the response in at most two vectors:

1. The program header containing the request or response. This is
subject to decoding/encoding. This vector should point to a buffer
that contains the entire program header/response contiguously.

2. If the procedure returns a payload (like a readv response or a
write request), the second vector contains the buffer pointing to the
entire (contiguous) payload. Note that this payload is raw and is not
subject to encoding/decoding.
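A small sketch of the failure mode described here (spilled_bytes is
hypothetical; the sizes are the ones from this report): if the header
vector is sized from a response without xdata, any xdata bytes beyond
that size land at the start of the payload vector.

```c
/* Hypothetical model of the spill: the header vector is sized from
 * xdr_sizeof() of a response with no xdata (default_read_size), so a
 * larger actual header leaves its tail at the start of the payload
 * vector. Numbers from this report: a 154-byte header read into a
 * 116-byte header vector spills 38 bytes. */
static unsigned int
spilled_bytes (unsigned int actual_hdr_size, unsigned int default_read_size)
{
        if (actual_hdr_size <= default_read_size)
                return 0;
        return actual_hdr_size - default_read_size;
}
```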
In your case, this _clean_ separation is broken, with part of the
program header slipping into the 2nd vector that is supposed to
contain the read data (maybe because of rpc fragmentation). I think
this is a bug in the socket layer. I'll update more on this.

Does your read response include xdata too? I think the code related to
reading xdata in the readv response is a bit murky.
<socket.c/__socket_read_accepted_successful_reply>

        case SP_STATE_ACCEPTED_SUCCESS_REPLY_INIT:
                default_read_size = xdr_sizeof ((xdrproc_t) xdr_gfs3_read_rsp,
                                                &read_rsp);

                proghdr_buf = frag->fragcurrent;

                __socket_proto_init_pending (priv, default_read_size);

                frag->call_body.reply.accepted_success_state
                        = SP_STATE_READING_PROC_HEADER;

                /* fall through */

        case SP_STATE_READING_PROC_HEADER:
                __socket_proto_read (priv, ret);

By this time we've read the readv response _minus_ the xdata.

I meant we have read the "readv response header".

                /* there can be 'xdata' in read response, figure it out */
                xdrmem_create (&xdr, proghdr_buf, default_read_size,
                               XDR_DECODE);

We created the xdr stream above with "default_read_size" (this doesn't
include xdata).

                /* This will fail if there is xdata sent from server, if not,
                   well and good, we don't need to worry about */

What if xdata is present and decoding failed (as the length of the xdr
stream above - default_read_size - doesn't include xdata)? Would we
have a valid value in read_rsp.xdata.xdata_len? This is the part I am
confused about. If read_rsp.xdata.xdata_len is not correct, then there
is a possibility that xdata might not be entirely present in the
vector socket passes to higher layers as the progheader (with part or
all of the xdata spilling over into the payload vector).

                xdr_gfs3_read_rsp (&xdr, &read_rsp);

                free (read_rsp.xdata.xdata_val);

                /* need to round off to proper roof (%4), as XDR packing pads
                   the end of opaque object with '0' */
                size = roof (read_rsp.xdata.xdata_len, 4);

                if (!size) {
                        frag->call_body.reply.accepted_success_state
                                = SP_STATE_READ_PROC_OPAQUE;
                        goto read_proc_opaque;
                }

</socket.c>
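As a side note on the roof() rounding used above, a hypothetical
reimplementation (roof_of; the real macro lives in the GlusterFS
headers) shows the arithmetic:

```c
/* Hypothetical reimplementation of the roof() rounding: round n up to
 * the next multiple of align, because XDR pads opaque data with '0'
 * bytes to a 4-byte boundary. */
static unsigned int
roof_of (unsigned int n, unsigned int align)
{
        return ((n + align - 1) / align) * align;
}
```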
Can you please confirm whether there was xdata in the readv response
(maybe by looking at the bricks) whose decoding failed?
regards,
Raghavendra
_______________________________________________
Gluster-devel mailing list
Gluster-devel@xxxxxxxxxxx
http://www.gluster.org/mailman/listinfo/gluster-devel