On Mon, Mar 4, 2024 at 1:43 AM Xiubo Li <xiubli@xxxxxxxxxx> wrote: > > > On 3/2/24 01:15, Ilya Dryomov wrote: > > On Fri, Mar 1, 2024 at 2:53 AM Xiubo Li <xiubli@xxxxxxxxxx> wrote: > >> > >> On 2/29/24 18:48, Ilya Dryomov wrote: > >>> On Thu, Feb 29, 2024 at 5:22 AM <xiubli@xxxxxxxxxx> wrote: > >>>> From: Xiubo Li <xiubli@xxxxxxxxxx> > >>>> > >>>> The osd code has remove cursor initilizing code and this will make > >>>> the sparse read state into a infinite loop. We should initialize > >>>> the cursor just before each sparse-read in messnger v2. > >>>> > >>>> Cc: stable@xxxxxxxxxxxxxxx > >>>> URL: https://tracker.ceph.com/issues/64607 > >>>> Fixes: 8e46a2d068c9 ("libceph: just wait for more data to be available on the socket") > >>>> Reported-by: Luis Henriques <lhenriques@xxxxxxx> > >>>> Signed-off-by: Xiubo Li <xiubli@xxxxxxxxxx> > >>>> --- > >>>> net/ceph/messenger_v2.c | 3 +++ > >>>> 1 file changed, 3 insertions(+) > >>>> > >>>> diff --git a/net/ceph/messenger_v2.c b/net/ceph/messenger_v2.c > >>>> index a0ca5414b333..7ae0f80100f4 100644 > >>>> --- a/net/ceph/messenger_v2.c > >>>> +++ b/net/ceph/messenger_v2.c > >>>> @@ -2025,6 +2025,7 @@ static int prepare_sparse_read_cont(struct ceph_connection *con) > >>>> static int prepare_sparse_read_data(struct ceph_connection *con) > >>>> { > >>>> struct ceph_msg *msg = con->in_msg; > >>>> + u64 len = con->in_msg->sparse_read_total ? : data_len(con->in_msg); > >>>> > >>>> dout("%s: starting sparse read\n", __func__); > >>>> > >>>> @@ -2034,6 +2035,8 @@ static int prepare_sparse_read_data(struct ceph_connection *con) > >>>> if (!con_secure(con)) > >>>> con->in_data_crc = -1; > >>>> > >>>> + ceph_msg_data_cursor_init(&con->v2.in_cursor, con->in_msg, len); > >>>> + > >>>> reset_in_kvecs(con); > >>>> con->v2.in_state = IN_S_PREPARE_SPARSE_DATA_CONT; > >>>> con->v2.data_len_remain = data_len(msg); > >>>> -- > >>>> 2.43.0 > >>>> > >>> Hi Xiubo, > >>> > >>> How did this get missed? Was generic/580 not paired with msgr2 in crc > >>> mode or are we not running generic/580 at all? > >>> > >>> Multiple runs have happened since the patch was staged so if the matrix > >>> is set up correctly ms_mode=crc should have been in effect for xfstests > >>> at least a couple of times. > >> I just found that my test script is incorrect and missed this case. > >> > >> The test locally is covered the msgr1 mostly and I think the qa test > >> suite also doesn't cover it too. I will try to improve the qa tests later. > > Could you please provide some details on the fixes needed to address > > the coverage gap in the fs suite? > > Mainly to support the msgr v2 for fscrypt, before we only tested the > fscrypt based on the msgr v1 for kclient. In ceph upstream we have > support this while not backport it to reef yet. I'm even more confused now... If the fs suite in main covers msgr2 + fscrypt (I'm taking your "in ceph upstream we have support" to mean that), how did this bug get missed by runs on main? At least a dozen of them must have gone through in the form of Venky's integration branches. Thanks, Ilya