Re: sparse read extent limits (was: libceph: remove the max extents check for sparse read)

Ilya Dryomov <idryomov@xxxxxxxxx> · Fri, 10 Nov 2023 11:00:17 +0100

On Fri, Nov 10, 2023 at 3:40 AM Gregory Farnum <gfarnum@xxxxxxxxxx> wrote:
>
> [ Moving from ceph-devel@vger to dev@ceph ]
>
> Do we need to change the protocol so you can provide a limit on the
> number of extents when you send a sparse read op?

Hi Greg,

A raw "max_extents" parameter, similar to the limit on the number of
key/value pairs returned in a single OMAP op, doesn't make much sense
to me -- it would be pretty hard to arrange sensible iteration there.

If we wanted to change the protocol, I would introduce a "granularity"
parameter.  If the client is only interested in 4k extents, they would
set it to 4k, the OSD would round up if needed when composing a reply,
and the limit on the number of extents returned would naturally fall
out.

>
> I rather suspect the workloads generating this are the worst case
> scenario and that we may want server-side limits in any case

Yeah, and we could get away without the protocol change then!  If we
introduced a config option for fiemap/sparse-read reported granularity
in the OSD with a minimum of 512 or 4k, we could also shave off some
overhead in BlueStore by storing less extent_map data -- currently
BlueStore tracks allocated extents at byte level irrespective of the
actual block size (bluestore_min_alloc_size, etc).  Sam previously
mentioned that SeaStore would probably not track sparseness at more
than 4k, so it might be reasonable to cap the minimum at that for all
ObjectStore implementations.

Thanks,

                Ilya

>
> On Mon, Nov 6, 2023 at 5:02 AM Xiubo Li <xiubli@xxxxxxxxxx> wrote:
> >
> >
> > On 11/6/23 20:32, Ilya Dryomov wrote:
> > > On Mon, Nov 6, 2023 at 1:15 PM Xiubo Li <xiubli@xxxxxxxxxx> wrote:
> > >>
> > >> On 11/6/23 19:38, Ilya Dryomov wrote:
> > >>> On Mon, Nov 6, 2023 at 1:14 AM Xiubo Li <xiubli@xxxxxxxxxx> wrote:
> > >>>> On 11/3/23 18:07, Ilya Dryomov wrote:
> > >>>>
> > >>>> On Fri, Nov 3, 2023 at 4:41 AM <xiubli@xxxxxxxxxx> wrote:
> > >>>>
> > >>>> From: Xiubo Li <xiubli@xxxxxxxxxx>
> > >>>>
> > >>>> There is no any limit for the extent array size and it's possible
> > >>>> that when reading with a large size contents. Else the messager
> > >>>> will fail by reseting the connection and keeps resending the inflight
> > >>>> IOs.
> > >>>>
> > >>>> URL: https://tracker.ceph.com/issues/62081
> > >>>> Signed-off-by: Xiubo Li <xiubli@xxxxxxxxxx>
> > >>>> ---
> > >>>>    net/ceph/osd_client.c | 12 ------------
> > >>>>    1 file changed, 12 deletions(-)
> > >>>>
> > >>>> diff --git a/net/ceph/osd_client.c b/net/ceph/osd_client.c
> > >>>> index 7af35106acaf..177a1d92c517 100644
> > >>>> --- a/net/ceph/osd_client.c
> > >>>> +++ b/net/ceph/osd_client.c
> > >>>> @@ -5850,8 +5850,6 @@ static inline void convert_extent_map(struct ceph_sparse_read *sr)
> > >>>>    }
> > >>>>    #endif
> > >>>>
> > >>>> -#define MAX_EXTENTS 4096
> > >>>> -
> > >>>>    static int osd_sparse_read(struct ceph_connection *con,
> > >>>>                              struct ceph_msg_data_cursor *cursor,
> > >>>>                              char **pbuf)
> > >>>> @@ -5882,16 +5880,6 @@ static int osd_sparse_read(struct ceph_connection *con,
> > >>>>
> > >>>>                   if (count > 0) {
> > >>>>                           if (!sr->sr_extent || count > sr->sr_ext_len) {
> > >>>> -                               /*
> > >>>> -                                * Apply a hard cap to the number of extents.
> > >>>> -                                * If we have more, assume something is wrong.
> > >>>> -                                */
> > >>>> -                               if (count > MAX_EXTENTS) {
> > >>>> -                                       dout("%s: OSD returned 0x%x extents in a single reply!\n",
> > >>>> -                                            __func__, count);
> > >>>> -                                       return -EREMOTEIO;
> > >>>> -                               }
> > >>>> -
> > >>>>                                   /* no extent array provided, or too short */
> > >>>>                                   kfree(sr->sr_extent);
> > >>>>                                   sr->sr_extent = kmalloc_array(count,
> > >>>> --
> > >>>> 2.39.1
> > >>>>
> > >>>> Hi Xiubo,
> > >>>>
> > >>>> As noted in the tracker ticket, there are many "sanity" limits like
> > >>>> that in the messenger and other parts of the kernel client.  First,
> > >>>> let's change that dout to pr_warn_ratelimited so that it's immediately
> > >>>> clear what is going on.
> > >>>>
> > >>>> Sounds good to me.
> > >>>>
> > >>>>    Then, if the limit actually gets hit, let's
> > >>>> dig into why and see if it can be increased rather than just removed.
> > >>>>
> > >>>> The test result in https://tracker.ceph.com/issues/62081#note-16 is that I just changed the 'len' to 5000 in ceph PR https://github.com/ceph/ceph/pull/54301 to emulate a random writes to a file and then tries to read with a large size:
> > >>>>
> > >>>> [ RUN      ] LibRadosIoPP.SparseReadExtentArrayOpPP
> > >>>> extents array size : 4297
> > >>>>
> > >>>> For the 'extents array size' it could reach up to a very large number, then what it should be ? Any idea ?
> > >>> Hi Xiubo,
> > >>>
> > >>> I don't think it can be a very large number in practice.
> > >>>
> > >>> CephFS uses sparse reads only in the fscrypt case, right?
> > >> Yeah, it is.
> > >>
> > >>
> > >>>     With
> > >>> fscrypt, sub-4K writes to the object store aren't possible, right?
> > >> Yeah.
> > >>
> > >>
> > >>> If the answer to both is yes, then the maximum number of extents
> > >>> would be
> > >>>
> > >>>       64M (CEPH_MSG_MAX_DATA_LEN) / 4K = 16384
> > >>>
> > >>> even if the object store does tracking at byte granularity.
> > >> So yeah, just for the fscrypt case if we set the max extent number to
> > >> 16384 it should be fine. But the sparse read could also be enabled in
> > >> non-fscrypt case.
> > > Ah, I see that it's also exposed as a mount option.  If we expect
> > > people to use it, then dropping MAX_EXTENTS altogether might be the
> > > best approach after all -- it doesn't make sense to warn about
> > > something that we don't really control.
> > >
> > > How about printing the number of extents only if the allocation fails?
> > > Something like:
> > >
> > >      if (!sr->sr_extent) {
> > >              pr_err("failed to allocate %u sparse read extents\n", count);
> > >              return -ENOMEM;
> > >      }
> >
> > Yeah, this also looks good to me.
> >
> > I will fix it.
> >
> > Thanks
> >
> > - Xiubo
> >
> >
> > > Thanks,
> > >
> > >                  Ilya
> > >
> >
> >
>
_______________________________________________
Dev mailing list -- dev@xxxxxxx
To unsubscribe send an email to dev-leave@xxxxxxx