Re: [PATCH] libceph: remove the max extents check for sparse read

On Mon, Nov 6, 2023 at 7:43 AM Xiubo Li <xiubli@xxxxxxxxxx> wrote:
>
>
> On 11/3/23 18:14, Jeff Layton wrote:
> > On Fri, 2023-11-03 at 11:07 +0100, Ilya Dryomov wrote:
> >> On Fri, Nov 3, 2023 at 4:41 AM <xiubli@xxxxxxxxxx> wrote:
> >>> From: Xiubo Li <xiubli@xxxxxxxxxx>
> >>>
> >>> There is no limit on the size of the extent array, so reading
> >>> a large amount of content can exceed the hard cap. When that
> >>> happens the messenger fails by resetting the connection and
> >>> keeps resending the in-flight IOs.
> >>>
> >>> URL: https://tracker.ceph.com/issues/62081
> >>> Signed-off-by: Xiubo Li <xiubli@xxxxxxxxxx>
> >>> ---
> >>>   net/ceph/osd_client.c | 12 ------------
> >>>   1 file changed, 12 deletions(-)
> >>>
> >>> diff --git a/net/ceph/osd_client.c b/net/ceph/osd_client.c
> >>> index 7af35106acaf..177a1d92c517 100644
> >>> --- a/net/ceph/osd_client.c
> >>> +++ b/net/ceph/osd_client.c
> >>> @@ -5850,8 +5850,6 @@ static inline void convert_extent_map(struct ceph_sparse_read *sr)
> >>>   }
> >>>   #endif
> >>>
> >>> -#define MAX_EXTENTS 4096
> >>> -
> >>>   static int osd_sparse_read(struct ceph_connection *con,
> >>>                             struct ceph_msg_data_cursor *cursor,
> >>>                             char **pbuf)
> >>> @@ -5882,16 +5880,6 @@ static int osd_sparse_read(struct ceph_connection *con,
> >>>
> >>>                  if (count > 0) {
> >>>                          if (!sr->sr_extent || count > sr->sr_ext_len) {
> >>> -                               /*
> >>> -                                * Apply a hard cap to the number of extents.
> >>> -                                * If we have more, assume something is wrong.
> >>> -                                */
> >>> -                               if (count > MAX_EXTENTS) {
> >>> -                                       dout("%s: OSD returned 0x%x extents in a single reply!\n",
> >>> -                                            __func__, count);
> >>> -                                       return -EREMOTEIO;
> >>> -                               }
> >>> -
> >>>                                  /* no extent array provided, or too short */
> >>>                                  kfree(sr->sr_extent);
> >>>                                  sr->sr_extent = kmalloc_array(count,
> >>> --
> >>> 2.39.1
> >>>
> >> Hi Xiubo,
> >>
> >> As noted in the tracker ticket, there are many "sanity" limits like
> >> that in the messenger and other parts of the kernel client.  First,
> >> let's change that dout to pr_warn_ratelimited so that it's immediately
> >> clear what is going on.  Then, if the limit actually gets hit, let's
> >> dig into why and see if it can be increased rather than just removed.
> >>
> > Yeah, agreed. I think when I wrote this, I couldn't figure out if there
> > was an actual hard cap on the number of extents, so I figured 4k ought
> > to be enough for anybody. Clearly that was wrong though.
> >
> > I'd still favor raising the cap instead of eliminating it altogether. Is
> > there a hard cap on the number of extents that the OSD will send in a
> > single reply? That's really what this limit should be.
>
> I went through the messenger code again carefully and found that even
> when the errno for a request is '-ENOMEM', the messenger will trigger
> a connection fault, which reconnects the connection and retries all
> the OSD requests. This looks incorrect.

In theory, ENOMEM can be transient.  If memory is too fragmented (e.g.
kmalloc is used and there is no physically contiguous chunk of the
required size available), it makes sense to retry the allocation after
some time passes.
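
For the allocation in the hunk quoted above, one fragmentation-tolerant
option is kvmalloc_array(), which tries kmalloc first and falls back to
vmalloc.  A minimal sketch, not part of the patch under discussion (the
GFP_NOIO flag is an assumption here):

        /*
         * Sketch only: kvmalloc_array() drops the hard requirement for
         * a physically contiguous chunk, making a fragmentation-induced
         * ENOMEM less likely.  Buffers allocated this way must be freed
         * with kvfree() rather than kfree().
         */
        kvfree(sr->sr_extent);
        sr->sr_extent = kvmalloc_array(count, sizeof(*sr->sr_extent),
                                       GFP_NOIO);
        if (!sr->sr_extent)
                return -ENOMEM;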

>
> IMO we should only retry the OSD requests when the errno is
> '-EBADMSG', '-EREMOTEIO', etc. For errors caused by the client
> side we should fail the OSD requests instead.

The messenger never fails higher level requests, no matter what the
error is.  Whether it's a good idea is debatable (personally I'm not
a fan), but this is how it behaves in userspace, so there isn't much
implementation freedom here.
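
To make the first step concrete, the dout -> pr_warn_ratelimited change
suggested above might look roughly like this against the quoted hunk (a
sketch only; the exact message wording is illustrative):

        /*
         * Keep the sanity cap, but warn loudly (and rate-limited) so
         * that hitting it is immediately visible instead of hidden
         * behind dynamic debugging.
         */
        if (count > MAX_EXTENTS) {
                pr_warn_ratelimited("%s: OSD returned 0x%x extents in a single reply!\n",
                                    __func__, count);
                return -EREMOTEIO;
        }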

Thanks,

                Ilya




