On 1/19/24 19:09, Jeff Layton wrote:
On Fri, 2024-01-19 at 12:35 +0800, Xiubo Li wrote:
On 1/19/24 02:24, Jeff Layton wrote:
On Thu, 2024-01-18 at 18:50 +0800, xiubli@xxxxxxxxxx wrote:
From: Xiubo Li <xiubli@xxxxxxxxxx>
The messages from ceph maybe split into multiple socket packages
and we just need to wait for all the data to be availiable on the
sokcet.
This will add 'sr_total_resid' to record the total length for all
data items for sparse-read message and 'sr_resid_elen' to record
the current extent total length.
URL: https://tracker.ceph.com/issues/63586
Signed-off-by: Xiubo Li <xiubli@xxxxxxxxxx>
---
include/linux/ceph/messenger.h | 1 +
net/ceph/messenger_v1.c | 32 +++++++++++++++++++++-----------
2 files changed, 22 insertions(+), 11 deletions(-)
diff --git a/include/linux/ceph/messenger.h b/include/linux/ceph/messenger.h
index 2eaaabbe98cb..ca6f82abed62 100644
--- a/include/linux/ceph/messenger.h
+++ b/include/linux/ceph/messenger.h
@@ -231,6 +231,7 @@ struct ceph_msg_data {
struct ceph_msg_data_cursor {
size_t total_resid; /* across all data items */
+ size_t sr_total_resid; /* across all data items for sparse-read */
struct ceph_msg_data *data; /* current data item */
size_t resid; /* bytes not yet consumed */
diff --git a/net/ceph/messenger_v1.c b/net/ceph/messenger_v1.c
index 4cb60bacf5f5..2733da891688 100644
--- a/net/ceph/messenger_v1.c
+++ b/net/ceph/messenger_v1.c
@@ -160,7 +160,9 @@ static size_t sizeof_footer(struct ceph_connection *con)
static void prepare_message_data(struct ceph_msg *msg, u32 data_len)
{
/* Initialize data cursor if it's not a sparse read */
- if (!msg->sparse_read)
+ if (msg->sparse_read)
+ msg->cursor.sr_total_resid = data_len;
+ else
ceph_msg_data_cursor_init(&msg->cursor, msg, data_len);
}
@@ -1032,35 +1034,43 @@ static int read_partial_sparse_msg_data(struct ceph_connection *con)
bool do_datacrc = !ceph_test_opt(from_msgr(con->msgr), NOCRC);
u32 crc = 0;
int ret = 1;
+ int len;
if (do_datacrc)
crc = con->in_data_crc;
- do {
- if (con->v1.in_sr_kvec.iov_base)
+ while (cursor->sr_total_resid) {
+ len = 0;
+ if (con->v1.in_sr_kvec.iov_base) {
+ len = con->v1.in_sr_kvec.iov_len;
ret = read_partial_message_chunk(con,
&con->v1.in_sr_kvec,
con->v1.in_sr_len,
&crc);
- else if (cursor->sr_resid > 0)
+ len = con->v1.in_sr_kvec.iov_len - len;
+ } else if (cursor->sr_resid > 0) {
+ len = cursor->sr_resid;
ret = read_partial_sparse_msg_extent(con, &crc);
-
- if (ret <= 0) {
- if (do_datacrc)
- con->in_data_crc = crc;
- return ret;
+ len -= cursor->sr_resid;
}
+ cursor->sr_total_resid -= len;
+ if (ret <= 0)
+ break;
memset(&con->v1.in_sr_kvec, 0, sizeof(con->v1.in_sr_kvec));
ret = con->ops->sparse_read(con, cursor,
(char **)&con->v1.in_sr_kvec.iov_base);
+ if (ret <= 0) {
+ ret = ret ? : 1; /* must return > 0 to indicate success */
+ break;
+ }
con->v1.in_sr_len = ret;
- } while (ret > 0);
+ }
if (do_datacrc)
con->in_data_crc = crc;
- return ret < 0 ? ret : 1; /* must return > 0 to indicate success */
+ return ret;
}
static int read_partial_msg_data(struct ceph_connection *con)
Looking back over this code...
The way it works today, once we determine it's a sparse read, we call
read_sparse_msg_data. At that point we call either
read_partial_message_chunk (to read into the kvec) or
read_sparse_msg_extent if sr_resid is already set (indicating that we're
receiving an extent).
read_sparse_msg_extent calls ceph_tcp_recvpage in a loop until
cursor->sr_resid have been received. The exception there when
ceph_tcp_recvpage returns <= 0.
ceph_tcp_recvpage returns 0 if sock_recvmsg returns -EAGAIN (maybe also
in other cases). So it sounds like the client just timed out on a read
from the socket or caught a signal or something?
If that's correct, then do we know what ceph_tcp_recvpage returned when
the problem happened?
It should just return parital data, and we should continue from here in
the next loop when the reset data comes.
Tracking this extra length seems like the wrong fix. We're already
looping in read_sparse_msg_extent until the sr_resid goes to 0.
Yeah, it is and it works well.
ISTM
that it's just that read_sparse_msg_extent is returning inappropriately
in the face of timeouts.
IOW, it does this:
ret = ceph_tcp_recvpage(con->sock, rpage, (int)off, len);
if (ret <= 0)
return ret;
...should it just not be returning there when ret == 0? Maybe it should
be retrying the recvpage instead?
Currently the the ceph_tcp_recvpage() will read data without blocking.
If so we will change the logic here then all the other non-sparse-read
cases will be changed to.
Please note this won't fix anything here in this bug.
Because currently the sparse-read code works well if in step 4 it
partially read the sparse-read data or extents.
But in case of partially reading 'footer' in step 5. What we need to
guarantee is that in the second loop we could skip triggering a new
sparse-read in step 4:
1, /* header */ ===> will skip and do nothing if it has already
read the 'header' data in last loop
2, /* front */ ===> will skip and do nothing if it has
already read the 'front' data in last loop
3, /* middle */ ===> will skip and do nothing if it has already
read the 'middle' data in last loop
4, /* (page) data */ ===> sparse-read here, it also should skip and do
nothing if it has already read the whole 'sparse-read' data in last
loop, but it won't. This is the ROOT CAUSE of this bug.
5, /* footer */ ===> the 'read_partial()' will only read
partial 'footer' data then need to loop start from /* header */ when the
data comes
My patch could guarantee that the sparse-read code will do nothing.
While currently the code will trigger a new sparse-read from beginning
again, which is incorrect.
Jeff, please let me know if you have better approaches. The last one
Ilya mentioned didn't work.
Thanks
- Xiubo