Re: [PATCH v4 3/3] libceph: just wait for more data to be available on the socket

[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

 




On 1/19/24 19:09, Jeff Layton wrote:
On Fri, 2024-01-19 at 12:35 +0800, Xiubo Li wrote:
On 1/19/24 02:24, Jeff Layton wrote:
On Thu, 2024-01-18 at 18:50 +0800, xiubli@xxxxxxxxxx wrote:
From: Xiubo Li <xiubli@xxxxxxxxxx>

The messages from ceph maybe split into multiple socket packages
and we just need to wait for all the data to be availiable on the
sokcet.

This will add 'sr_total_resid' to record the total length for all
data items for sparse-read message and 'sr_resid_elen' to record
the current extent total length.

URL: https://tracker.ceph.com/issues/63586
Signed-off-by: Xiubo Li <xiubli@xxxxxxxxxx>
---
   include/linux/ceph/messenger.h |  1 +
   net/ceph/messenger_v1.c        | 32 +++++++++++++++++++++-----------
   2 files changed, 22 insertions(+), 11 deletions(-)

diff --git a/include/linux/ceph/messenger.h b/include/linux/ceph/messenger.h
index 2eaaabbe98cb..ca6f82abed62 100644
--- a/include/linux/ceph/messenger.h
+++ b/include/linux/ceph/messenger.h
@@ -231,6 +231,7 @@ struct ceph_msg_data {
struct ceph_msg_data_cursor {
   	size_t			total_resid;	/* across all data items */
+	size_t			sr_total_resid;	/* across all data items for sparse-read */
struct ceph_msg_data *data; /* current data item */
   	size_t			resid;		/* bytes not yet consumed */
diff --git a/net/ceph/messenger_v1.c b/net/ceph/messenger_v1.c
index 4cb60bacf5f5..2733da891688 100644
--- a/net/ceph/messenger_v1.c
+++ b/net/ceph/messenger_v1.c
@@ -160,7 +160,9 @@ static size_t sizeof_footer(struct ceph_connection *con)
   static void prepare_message_data(struct ceph_msg *msg, u32 data_len)
   {
   	/* Initialize data cursor if it's not a sparse read */
-	if (!msg->sparse_read)
+	if (msg->sparse_read)
+		msg->cursor.sr_total_resid = data_len;
+	else
   		ceph_msg_data_cursor_init(&msg->cursor, msg, data_len);
   }
@@ -1032,35 +1034,43 @@ static int read_partial_sparse_msg_data(struct ceph_connection *con)
   	bool do_datacrc = !ceph_test_opt(from_msgr(con->msgr), NOCRC);
   	u32 crc = 0;
   	int ret = 1;
+	int len;
if (do_datacrc)
   		crc = con->in_data_crc;
- do {
-		if (con->v1.in_sr_kvec.iov_base)
+	while (cursor->sr_total_resid) {
+		len = 0;
+		if (con->v1.in_sr_kvec.iov_base) {
+			len = con->v1.in_sr_kvec.iov_len;
   			ret = read_partial_message_chunk(con,
   							 &con->v1.in_sr_kvec,
   							 con->v1.in_sr_len,
   							 &crc);
-		else if (cursor->sr_resid > 0)
+			len = con->v1.in_sr_kvec.iov_len - len;
+		} else if (cursor->sr_resid > 0) {
+			len = cursor->sr_resid;
   			ret = read_partial_sparse_msg_extent(con, &crc);
-
-		if (ret <= 0) {
-			if (do_datacrc)
-				con->in_data_crc = crc;
-			return ret;
+			len -= cursor->sr_resid;
   		}
+		cursor->sr_total_resid -= len;
+		if (ret <= 0)
+			break;
memset(&con->v1.in_sr_kvec, 0, sizeof(con->v1.in_sr_kvec));
   		ret = con->ops->sparse_read(con, cursor,
   				(char **)&con->v1.in_sr_kvec.iov_base);
+		if (ret <= 0) {
+			ret = ret ? : 1; /* must return > 0 to indicate success */
+			break;
+		}
   		con->v1.in_sr_len = ret;
-	} while (ret > 0);
+	}
if (do_datacrc)
   		con->in_data_crc = crc;
- return ret < 0 ? ret : 1; /* must return > 0 to indicate success */
+	return ret;
   }
static int read_partial_msg_data(struct ceph_connection *con)
Looking back over this code...

The way it works today, once we determine it's a sparse read, we call
read_sparse_msg_data. At that point we call either
read_partial_message_chunk (to read into the kvec) or
read_sparse_msg_extent if sr_resid is already set (indicating that we're
receiving an extent).

read_sparse_msg_extent calls ceph_tcp_recvpage in a loop until
cursor->sr_resid have been received. The exception there when
ceph_tcp_recvpage returns <= 0.

ceph_tcp_recvpage returns 0 if sock_recvmsg returns -EAGAIN (maybe also
in other cases). So it sounds like the client just timed out on a read
from the socket or caught a signal or something?

If that's correct, then do we know what ceph_tcp_recvpage returned when
the problem happened?
It should just return parital data, and we should continue from here in
the next loop when the reset data comes.

Tracking this extra length seems like the wrong fix. We're already
looping in read_sparse_msg_extent until the sr_resid goes to 0.
Yeah, it is and it works well.
  ISTM
that it's just that read_sparse_msg_extent is returning inappropriately
in the face of timeouts.

IOW, it does this:

                 ret = ceph_tcp_recvpage(con->sock, rpage, (int)off, len);
                 if (ret <= 0)
                         return ret;

...should it just not be returning there when ret == 0? Maybe it should
be retrying the recvpage instead?

Currently the the ceph_tcp_recvpage() will read data without blocking. If so we will change the logic here then all the other non-sparse-read cases will be changed to.

Please note this won't fix anything here in this bug.

Because currently the sparse-read code works well if in step 4 it partially read the sparse-read data or extents.

But in case of partially reading 'footer' in step 5. What we need to guarantee is that in the second loop we could skip triggering a new sparse-read in step 4:

1, /* header */         ===> will skip and do nothing if it has already read the 'header' data in last loop

2, /* front */             ===> will skip and do nothing if it has already read the 'front' data in last loop

3, /* middle */         ===> will skip and do nothing if it has already read the 'middle' data in last loop

4, /* (page) data */   ===> sparse-read here, it also should skip and do nothing if it has already read the whole 'sparse-read' data in last loop, but it won't. This is the ROOT CAUSE of this bug.

5, /* footer */            ===> the 'read_partial()' will only read partial 'footer' data then need to loop start from /* header */ when the data comes

My patch could guarantee that the sparse-read code will do nothing. While currently the code will trigger a new sparse-read from beginning again, which is incorrect.

Jeff, please let me know if you have better approaches.  The last one Ilya mentioned didn't work.

Thanks

- Xiubo





[Index of Archives]     [CEPH Users]     [Ceph Large]     [Ceph Dev]     [Information on CEPH]     [Linux BTRFS]     [Linux USB Devel]     [Video for Linux]     [Linux Audio Users]     [Yosemite News]     [Linux Kernel]     [Linux SCSI]

  Powered by Linux