On Wed, 2023-12-13 at 14:07 +0100, Ilya Dryomov wrote:
> On Wed, Dec 13, 2023 at 1:05 PM Xiubo Li <xiubli@xxxxxxxxxx> wrote:
> >
> > On 12/13/23 19:54, Ilya Dryomov wrote:
> > On Wed, Dec 13, 2023 at 12:03 PM Xiubo Li <xiubli@xxxxxxxxxx> wrote:
> > On 12/13/23 18:31, Ilya Dryomov wrote:
> > On Wed, Dec 13, 2023 at 2:02 AM Xiubo Li <xiubli@xxxxxxxxxx> wrote:
> > On 12/13/23 00:31, Ilya Dryomov wrote:
> > On Fri, Dec 8, 2023 at 5:08 PM <xiubli@xxxxxxxxxx> wrote:
> >
> > From: Xiubo Li <xiubli@xxxxxxxxxx>
> >
> > The messages from ceph may be split into multiple socket packages
> > and we just need to wait for all the data to be available on the
> > socket.
> >
> > This will add a new _FINISH state for the sparse-read to mark the
> > current sparse-read as succeeded. Otherwise it will be treated as a
> > new sparse-read when the socket receives the last piece of the osd
> > request reply message, and the later cancel_request() will help
> > init the sparse-read context.
> >
> > URL: https://tracker.ceph.com/issues/63586
> > Signed-off-by: Xiubo Li <xiubli@xxxxxxxxxx>
> > ---
> >  include/linux/ceph/osd_client.h | 1 +
> >  net/ceph/osd_client.c           | 6 +++---
> >  2 files changed, 4 insertions(+), 3 deletions(-)
> >
> > diff --git a/include/linux/ceph/osd_client.h b/include/linux/ceph/osd_client.h
> > index 493de3496cd3..00d98e13100f 100644
> > --- a/include/linux/ceph/osd_client.h
> > +++ b/include/linux/ceph/osd_client.h
> > @@ -47,6 +47,7 @@ enum ceph_sparse_read_state {
> >  	CEPH_SPARSE_READ_DATA_LEN,
> >  	CEPH_SPARSE_READ_DATA_PRE,
> >  	CEPH_SPARSE_READ_DATA,
> > +	CEPH_SPARSE_READ_FINISH,
> >  };
> >
> >  /*
> > diff --git a/net/ceph/osd_client.c b/net/ceph/osd_client.c
> > index 848ef19055a0..f1705b4f19eb 100644
> > --- a/net/ceph/osd_client.c
> > +++ b/net/ceph/osd_client.c
> > @@ -5802,8 +5802,6 @@ static int prep_next_sparse_read(struct ceph_connection *con,
> >  			advance_cursor(cursor, sr->sr_req_len - end, false);
> >  	}
> >
> > -	ceph_init_sparse_read(sr);
> > -
> >  	/* find next op in this request (if any) */
> >  	while (++o->o_sparse_op_idx < req->r_num_ops) {
> >  		op = &req->r_ops[o->o_sparse_op_idx];
> > @@ -5919,7 +5917,7 @@ static int osd_sparse_read(struct ceph_connection *con,
> >  			return -EREMOTEIO;
> >  		}
> >
> > -		sr->sr_state = CEPH_SPARSE_READ_HDR;
> > +		sr->sr_state = CEPH_SPARSE_READ_FINISH;
> >  		goto next_op;
> >
> > Hi Xiubo,
> >
> > This code appears to be set up to handle multiple (sparse-read) ops in
> > a single OSD request. Wouldn't this break that case?
> >
> > Yeah, it will break it. I just missed it.
> >
> > I think the bug is in read_sparse_msg_data(). It shouldn't be calling
> > con->ops->sparse_read() after the message has progressed to the footer.
> > osd_sparse_read() is most likely fine as is.
> >
> > Yes it is. We cannot tell exactly whether it has progressed to the
> > footer IMO, such as when 'con->v1.in_base_pos ==
> >
> > Hi Xiubo,
> >
> > I don't buy this. If the messenger is trying to read the footer, it
> > _has_ progressed to the footer. This is definitive and irreversible,
> > not a "maybe".
> >
> > sizeof(con->v1.in_hdr)' the socket data may break off just after the
> > sparse-read finishes and before the footer is read, or somewhere in
> > the middle of the sparse-read. In the latter case we should continue
> > the sparse read.
> >
> > The size of the data section of the message is always known. The
> > contract is that read_partial_msg_data*() returns 1 and does nothing
> > else after the data section is consumed. This is how the messenger is
> > told to move on to the footer.
> >
> > read_partial_sparse_msg_data() doesn't adhere to this contract and
> > should be fixed.
> >
> > Notice how read_partial_msg_data() and read_partial_msg_data_bounce()
> > behave: if called at that point (i.e. after the data section has been
> > processed, meaning that cursor->total_resid == 0), they do nothing.
> > read_sparse_msg_data() should have a similar condition and basically
> > no-op itself.
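
To make that contract concrete, here is an illustrative sketch (not the
verbatim net/ceph/messenger_v1.c code) of the shape the sparse-read data
reader would need: like read_partial_msg_data() and
read_partial_msg_data_bounce(), it has to no-op and return 1 once
cursor->total_resid hits zero, so the messenger moves on to the footer
instead of calling back into the OSD client:

	/* illustrative sketch only -- not the verbatim messenger_v1.c code */
	static int read_partial_sparse_msg_data(struct ceph_connection *con)
	{
		struct ceph_msg_data_cursor *cursor = &con->in_msg->cursor;

		/*
		 * Data section already consumed: do nothing and return 1
		 * so the caller (read_partial_message()) proceeds to the
		 * footer.
		 */
		if (!cursor->total_resid)
			return 1;

		/*
		 * ... otherwise keep driving con->ops->sparse_read() to
		 * consume the extent map and data from the socket ...
		 */
		return 0;	/* (elided) more data still expected on the socket */
	}
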
> >
> > Correct, this is what I want to do in the sparse-read code.
> >
> > No, this needs to be done on the messenger side. sparse-read code
> > should not be invoked after the messenger has moved on to the footer.
> >
> > From my reading, even after the messenger has moved on to the 'footer',
> > it will always try to parse the 'header':
> >
> >     read_partial(con, end, size, &con->v1.in_hdr);
> >
> > But it will do nothing and just return.
> >
> > And the same for 'front', 'middle' and '(page) data', and last for the
> > 'footer'.
> >
> > This is correct.
> >
> > Did I miss something?
> >
> > No, it's how the messenger is set up to work. The problem is that
> > read_sparse_msg_data() doesn't fit this model, so it should be fixed
> > and renamed to read_partial_sparse_msg_data().
> >
> > Okay, let me try.
> >
> > Did you see my new patch in the last mail? Will that work?
> >
> > If not, I will try to fix it in read_partial_sparse_msg_data().
>
> It might work around the problem, but it's not the right fix. Think
> about it: what business does code in the OSD client have being called
> when the messenger is 14 bytes into reading the footer (number taken
> from the log in the cover letter)? All data is read at that point and
> the last op in a multi-op OSD request may not even be sparse-read...
>

The assumption in ceph_osdc_new_request is that if you have a multi-op
request, the ops are either all reads or all writes. The assumption
about writes has been there a long time. I simply added the ability to
do the same for reads:

    4e8c4c235578 libceph: allow ceph_osdc_new_request to accept a multi-op read

Note that we do this in ceph_sync_write in the rmw case, so that does
need to keep working.

Technically we could have a normal read in the same request as a sparse
read, but I figured that would be a little silly. That's why it tries
to prep a second sparse read once the first is done. If there isn't
one, then that should fall through to the "return 0" just before the
"found:" label in prep_next_sparse_read.

So, I tend to agree with Ilya that the problem isn't down in the OSD
state machine, but more likely at the receive handling layer. It's
certainly plausible that I didn't get the handling of short receives
right (particularly in the messenger_v1 part).
-- 
Jeff Layton <jlayton@xxxxxxxxxx>
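
For reference, the fall-through mentioned above corresponds to the
op-scanning loop visible in the quoted prep_next_sparse_read() hunk; a
simplified sketch of that logic follows (the _sketch name and reduced
parameter list are illustrative only, not the real osd_client.c
prototype):

	/* illustrative sketch of the fall-through, not the real prototype */
	static int prep_next_sparse_read_sketch(struct ceph_osd *o,
						struct ceph_osd_request *req)
	{
		struct ceph_osd_req_op *op;

		/* find next op in this request (if any) */
		while (++o->o_sparse_op_idx < req->r_num_ops) {
			op = &req->r_ops[o->o_sparse_op_idx];
			if (op->op == CEPH_OSD_OP_SPARSE_READ)
				goto found;
		}

		return 0;	/* no further sparse-read ops: nothing left to prep */

	found:
		/* ... reset the per-op sparse-read context and keep reading ... */
		return 1;
	}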