I've undone all changes and now I'm unable to reproduce the problem, so
the modification I made is probably incorrect and not the root cause, as
you described.
I'll continue investigating...
Xavi
On 04/05/16 15:01, Xavier Hernandez wrote:
On 04/05/16 14:47, Raghavendra Gowdappa wrote:
----- Original Message -----
From: "Xavier Hernandez" <xhernandez@xxxxxxxxxx>
To: "Raghavendra Gowdappa" <rgowdapp@xxxxxxxxxx>
Cc: "Gluster Devel" <gluster-devel@xxxxxxxxxxx>
Sent: Wednesday, May 4, 2016 5:37:56 PM
Subject: Re: Possible bug in the communications layer ?
I think I've found the problem.
1567                 case SP_STATE_READING_PROC_HEADER:
1568                         __socket_proto_read (priv, ret);
1569
1570                         /* there can be 'xdata' in read response, figure it out */
1571                         xdrmem_create (&xdr, proghdr_buf, default_read_size,
1572                                        XDR_DECODE);
1573
1574                         /* This will fail if there is xdata sent from server, if not,
1575                            well and good, we don't need to worry about */
1576                         xdr_gfs3_read_rsp (&xdr, &read_rsp);
1577
1578                         free (read_rsp.xdata.xdata_val);
1579
1580                         /* need to round off to proper roof (%4), as XDR packing pads
1581                            the end of opaque object with '0' */
1582                         size = roof (read_rsp.xdata.xdata_len, 4);
1583
1584                         if (!size) {
1585                                 frag->call_body.reply.accepted_success_state
1586                                         = SP_STATE_READ_PROC_OPAQUE;
1587                                 goto read_proc_opaque;
1588                         }
1589
1590                         __socket_proto_init_pending (priv, size);
1591
1592                         frag->call_body.reply.accepted_success_state
1593                                 = SP_STATE_READING_PROC_OPAQUE;
1594
1595                 case SP_STATE_READING_PROC_OPAQUE:
1596                         __socket_proto_read (priv, ret);
1597
1598                         frag->call_body.reply.accepted_success_state
1599                                 = SP_STATE_READ_PROC_OPAQUE;
On line 1568 we read, at most, 116 bytes because we calculate the size
of a read response without xdata. Then we detect that we really need
more data for the xdata. (BTW, will read_rsp.xdata.xdata_val always be
allocated, even if xdr_gfs3_read_rsp() fails?)
No, it need not be. It's guaranteed to be allocated only on a
successful completion. However, _if_ decoding fails only because the
xdr stream doesn't include the xdata bits (the stream having been
sized to default_read_size, which assumes xdata_len is zero), then the
xdr library would've filled read_rsp.xdata.xdata_len
(read_rsp.xdata.xdata_val can still be NULL).
The question is: is it guaranteed that after an unsuccessful completion
xdata_val will be NULL (i.e. not touched by the function, even if
xdata_len is != 0)? Otherwise the free() could corrupt memory.
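For illustration only, a minimal sketch of the defensive pattern that
would make the free() safe regardless of how far the decoder got
(mini_read_rsp and safe_partial_decode are hypothetical stand-ins, not
GlusterFS code):

```c
#include <stdlib.h>
#include <string.h>

/* Hypothetical miniature of the generated response struct; the real
 * one is gfs3_read_rsp from the GlusterFS protocol definitions. */
struct mini_read_rsp {
        unsigned int xdata_len;
        char        *xdata_val;
};

/* Zero-initializing before decoding guarantees xdata_val is NULL until
 * the decoder actually allocates it, so the cleanup free() can never
 * see a garbage pointer even after a partial decode. */
static int
safe_partial_decode (void)
{
        struct mini_read_rsp read_rsp;

        memset (&read_rsp, 0, sizeof (read_rsp));

        /* ... xdr_gfs3_read_rsp (&xdr, &read_rsp) would run here and
         * may fail part-way through, filling xdata_len but leaving
         * xdata_val untouched ... */

        free (read_rsp.xdata_val);    /* free(NULL) is a no-op */
        return 0;
}
```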
So we get to line 1596 with the pending info initialized to read the
remaining data. This is the __socket_proto_read macro:
166 /* This will be used in a switch case and breaks from the switch case if all
167  * the pending data is not read.
168  */
169 #define __socket_proto_read(priv, ret)                                  \
170         {                                                               \
171                 size_t bytes_read = 0;                                  \
172                 struct gf_sock_incoming *in;                            \
173                 in = &priv->incoming;                                   \
174                                                                         \
175                 __socket_proto_update_pending (priv);                   \
176                                                                         \
177                 ret = __socket_readv (this,                             \
178                                       in->pending_vector, 1,            \
179                                       &in->pending_vector,              \
180                                       &in->pending_count,               \
181                                       &bytes_read);                     \
182                 if (ret == -1)                                          \
183                         break;                                          \
184                 __socket_proto_update_priv_after_read (priv, ret, bytes_read); \
185         }
We read from the socket using __socket_readv(). If it fails, we quit.
However, if the socket doesn't have more data to read, this function
does not return -1:
555                 ret = __socket_cached_read (this, opvector, opcount);
556
557                 if (ret == 0) {
558                         gf_log (this->name, GF_LOG_DEBUG, "EOF on socket");
559                         errno = ENODATA;
560                         ret = -1;
561                 }
562                 if (ret == -1 && errno == EAGAIN) {
563                         /* done for now */
564                         break;
565                 }
566                 this->total_bytes_read += ret;
If __socket_cached_read() fails with errno == EAGAIN, we break and
return opcount, which is >= 0, causing the process to continue instead
of waiting for more data.
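The behaviour described above can be modeled with a small sketch
(readv_like is a hypothetical simplification, not the real
__socket_readv): breaking on EAGAIN returns the remaining opcount,
which is non-negative, so the sign of the return value alone cannot
signal "wait for more data".

```c
#include <errno.h>

/* Hypothetical simplification of __socket_readv's loop, for
 * illustration only: on EAGAIN it breaks out and returns the
 * remaining opcount, which is always >= 0. */
static int
readv_like (int would_block, int opcount)
{
        int ret;

        while (opcount > 0) {
                if (would_block) {
                        errno = EAGAIN;
                        ret = -1;     /* the underlying read failed */
                } else {
                        ret = 1;      /* pretend one vector was consumed */
                }

                if (ret == -1 && errno == EAGAIN)
                        break;        /* done for now */

                opcount--;
        }

        return opcount;               /* >= 0 even after EAGAIN */
}
```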
No. If you observe, there is a call to another macro,
__socket_proto_update_priv_after_read, at line 184. As can be seen in
the definition below, if (ret > 0) - which is the case on a short read -
we do break. I hope I am not missing anything here :).
#define __socket_proto_update_priv_after_read(priv, ret, bytes_read) \
{ \
struct gf_sock_incoming_frag *frag; \
frag = &priv->incoming.frag; \
\
frag->fragcurrent += bytes_read; \
frag->bytes_read += bytes_read; \
\
if ((ret > 0) || (frag->remaining_size != 0)) { \
if (frag->remaining_size != 0 && ret == 0) { \
__socket_proto_reset_pending (priv); \
} \
\
gf_log (this->name, GF_LOG_TRACE, \
"partial read on non-blocking socket"); \
\
break; \
} \
}
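Condensing the break condition of that macro into a small helper
(should_break is hypothetical, for illustration only) makes the point
above easy to check:

```c
/* Hypothetical condensation of the condition in
 * __socket_proto_update_priv_after_read: a positive ret (short read)
 * or leftover fragment bytes means "break and wait for more data". */
static int
should_break (int ret, unsigned int remaining_size)
{
        return (ret > 0) || (remaining_size != 0);
}
```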
I didn't see this. However, the fact is that with the change I've
described, it seems to work.
I'll analyze it a bit more.
Xavi
As a side note, there's another problem here: if errno is not EAGAIN,
we'll update this->total_bytes_read by subtracting one. This shouldn't
be done when ret < 0.
There are other places where ret is set to -1 but opcount is returned.
I guess we should also set opcount = -1 in these places, but I don't
have deep knowledge of this implementation.
I've done a quick test checking for (ret != 0) instead of (ret == -1)
in __socket_proto_read() and it seemed to work.
Could anyone with more knowledge of the communications layer verify
this and explain what the best solution would be?
Xavi
On 29/04/16 14:52, Xavier Hernandez wrote:
With your patch applied, it seems that the bug is not hit.
I guess it's a timing issue that the new logging hides. Maybe there is
no more data available right after reading the partial readv header?
(It will arrive later.)
I'll continue testing...
Xavi
On 29/04/16 13:48, Raghavendra Gowdappa wrote:
Attaching the patch.
----- Original Message -----
From: "Raghavendra Gowdappa" <rgowdapp@xxxxxxxxxx>
To: "Xavier Hernandez" <xhernandez@xxxxxxxxxx>
Cc: "Gluster Devel" <gluster-devel@xxxxxxxxxxx>
Sent: Friday, April 29, 2016 5:14:02 PM
Subject: Re: Possible bug in the communications
layer ?
----- Original Message -----
From: "Xavier Hernandez" <xhernandez@xxxxxxxxxx>
To: "Raghavendra Gowdappa" <rgowdapp@xxxxxxxxxx>
Cc: "Gluster Devel" <gluster-devel@xxxxxxxxxxx>
Sent: Friday, April 29, 2016 1:21:57 PM
Subject: Re: Possible bug in the communications
layer ?
Hi Raghavendra,
yes, the readv response contains xdata. The dict length is 38 (0x26)
and, at the moment of failure, rsp.xdata.xdata_len already contains
0x26.
rsp.xdata.xdata_len having 0x26 even when decoding failed indicates
that the approach used in socket.c to get the length of xdata is
correct. However, I cannot find any way for xdata to end up in the
payload vector other than xdata_len being zero. Just to be doubly
sure, I have a patch that adds a debug message printing xdata_len when
decoding fails in socket.c. Can you please apply the patch, run the
tests and revert back with the results?
Xavi
On 29/04/16 09:10, Raghavendra Gowdappa wrote:
----- Original Message -----
From: "Raghavendra Gowdappa" <rgowdapp@xxxxxxxxxx>
To: "Xavier Hernandez" <xhernandez@xxxxxxxxxx>
Cc: "Gluster Devel" <gluster-devel@xxxxxxxxxxx>
Sent: Friday, April 29, 2016 12:36:43 PM
Subject: Re: Possible bug in the communications
layer ?
----- Original Message -----
From: "Raghavendra Gowdappa" <rgowdapp@xxxxxxxxxx>
To: "Xavier Hernandez" <xhernandez@xxxxxxxxxx>
Cc: "Jeff Darcy" <jdarcy@xxxxxxxxxx>, "Gluster Devel"
<gluster-devel@xxxxxxxxxxx>
Sent: Friday, April 29, 2016 12:07:59 PM
Subject: Re: Possible bug in the communications
layer ?
----- Original Message -----
From: "Xavier Hernandez" <xhernandez@xxxxxxxxxx>
To: "Jeff Darcy" <jdarcy@xxxxxxxxxx>
Cc: "Gluster Devel" <gluster-devel@xxxxxxxxxxx>
Sent: Thursday, April 28, 2016 8:15:36 PM
Subject: Re: Possible bug in the communications
layer
?
Hi Jeff,
On 28.04.2016 15:20, Jeff Darcy wrote:
This happens with Gluster 3.7.11 accessed through Ganesha and gfapi.
The volume is a distributed-disperse 4*(4+2). I'm able to reproduce
the problem easily doing the following test:

iozone -t2 -s10g -r1024k -i0 -w -F/iozone{1..2}.dat
echo 3 >/proc/sys/vm/drop_caches
iozone -t2 -s10g -r1024k -i1 -w -F/iozone{1..2}.dat

The error happens soon after starting the read test. As can be seen in
the data below, client3_3_readv_cbk() is processing an iovec of 116
bytes, but it should be 154 bytes (the buffer in memory really does
seem to contain 154 bytes). The data on the network seems ok (at least
I haven't been able to identify any problem), so this must be a
processing error on the client side. The last field in the cut buffer
of the serialized data corresponds to the length of the xdata field:
0x26. So at least 38 more bytes should be present.
Nice detective work, Xavi. It would be *very* interesting to see what
the value of the "count" parameter is (it's unfortunately optimized
out). I'll bet it's two, and iov[1].iov_len is 38. I have a weak
memory of some problems with how this iov is put together, a couple of
years ago, and it looks like you might have tripped over one more.
It seems you are right. The count is 2 and the first 38 bytes of the
second vector contain the remaining data of the xdata field.
This is the bug. client3_3_readv_cbk (and, for that matter, all the
actors/cbks) expects the response in at most two vectors:

1. The program header containing the request or response. This is
subject to decoding/encoding. This vector should point to a buffer
that contains the entire program header/response contiguously.

2. If the procedure returns a payload (like a readv response or a
write request), the second vector contains the buffer pointing to the
entire (contiguous) payload. Note that this payload is raw and is not
subject to encoding/decoding.
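A small sketch of the failure mode described here (spilled_bytes is
hypothetical; the sizes are the ones from this report): if the header
vector is sized from a response without xdata, any xdata bytes beyond
that size land at the start of the payload vector.

```c
/* Hypothetical model of the spill: the header vector is sized from
 * xdr_sizeof() of a response with no xdata (default_read_size), so a
 * larger actual header leaves its tail at the start of the payload
 * vector. Numbers from this report: a 154-byte header read into a
 * 116-byte header vector spills 38 bytes. */
static unsigned int
spilled_bytes (unsigned int actual_hdr_size, unsigned int default_read_size)
{
        if (actual_hdr_size <= default_read_size)
                return 0;
        return actual_hdr_size - default_read_size;
}
```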
In your case, this _clean_ separation is broken, with part of the
program header slipping into the 2nd vector that is supposed to
contain the read data (maybe because of rpc fragmentation). I think
this is a bug in the socket layer. I'll update more on this.

Does your read response include xdata too? I think the code related to
reading xdata in the readv response is a bit murky.
<socket.c/__socket_read_accepted_successful_reply>

        case SP_STATE_ACCEPTED_SUCCESS_REPLY_INIT:
                default_read_size = xdr_sizeof ((xdrproc_t) xdr_gfs3_read_rsp,
                                                &read_rsp);

                proghdr_buf = frag->fragcurrent;

                __socket_proto_init_pending (priv, default_read_size);

                frag->call_body.reply.accepted_success_state
                        = SP_STATE_READING_PROC_HEADER;

                /* fall through */

        case SP_STATE_READING_PROC_HEADER:
                __socket_proto_read (priv, ret);

By this time we've read the readv response _minus_ the xdata.

I meant we have read the "readv response header".

                /* there can be 'xdata' in read response, figure it out */
                xdrmem_create (&xdr, proghdr_buf, default_read_size,
                               XDR_DECODE);

We created the xdr stream above with "default_read_size" (this doesn't
include xdata).

                /* This will fail if there is xdata sent from server, if not,
                   well and good, we don't need to worry about */

What if xdata is present and decoding failed (as the length of the xdr
stream above - default_read_size - doesn't include xdata)? Would we
have a valid value in read_rsp.xdata.xdata_len? This is the part I am
confused about. If read_rsp.xdata.xdata_len is not correct, then there
is a possibility that xdata might not be entirely present in the
vector socket passes to higher layers as the progheader (with part or
all of the xdata spilling over into the payload vector).

                xdr_gfs3_read_rsp (&xdr, &read_rsp);

                free (read_rsp.xdata.xdata_val);

                /* need to round off to proper roof (%4), as XDR packing pads
                   the end of opaque object with '0' */
                size = roof (read_rsp.xdata.xdata_len, 4);

                if (!size) {
                        frag->call_body.reply.accepted_success_state
                                = SP_STATE_READ_PROC_OPAQUE;
                        goto read_proc_opaque;
                }

</socket.c>
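As a side note on the roof() rounding used above, a hypothetical
reimplementation (roof_of; the real macro lives in the GlusterFS
headers) shows the arithmetic:

```c
/* Hypothetical reimplementation of the roof() rounding: round n up to
 * the next multiple of align, because XDR pads opaque data with '0'
 * bytes to a 4-byte boundary. */
static unsigned int
roof_of (unsigned int n, unsigned int align)
{
        return ((n + align - 1) / align) * align;
}
```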
Can you please confirm whether there was xdata in the readv response
(maybe by looking at the bricks) whose decoding failed?
regards,
Raghavendra
_______________________________________________
Gluster-devel mailing list
Gluster-devel@xxxxxxxxxxx
http://www.gluster.org/mailman/listinfo/gluster-devel