On Wed, 2019-09-11 at 17:08 +0200, Ilya Dryomov wrote:
> On Wed, Sep 11, 2019 at 4:54 PM Jeff Layton <jlayton@xxxxxxxxxx> wrote:
> > On Tue, 2019-09-10 at 21:41 +0200, Ilya Dryomov wrote:
> > > osdmap has a bunch of arrays that grow linearly with the number of
> > > OSDs.  osd_state, osd_weight and osd_primary_affinity take 4 bytes
> > > per OSD.  osd_addr takes 136 bytes per OSD because of
> > > sockaddr_storage.  The CRUSH workspace area also grows linearly
> > > with the number of OSDs.
> > >
> > > Normally these arrays are allocated at client startup.  The osdmap
> > > is usually updated in small incrementals, but once in a while a
> > > full map may need to be processed.  For a cluster with 10000 OSDs,
> > > this means a bunch of 40K allocations followed by a 1.3M
> > > allocation, all of which are currently required to be physically
> > > contiguous.  This results in sporadic ENOMEM errors, hanging the
> > > client.
> > >
> > > Go back to manually (re)allocating arrays and use ceph_kvmalloc()
> > > to fall back to non-contiguous allocation when necessary.
> > >
> > > Link: https://tracker.ceph.com/issues/40481
> > > Signed-off-by: Ilya Dryomov <idryomov@xxxxxxxxx>
> > > ---
> > >  net/ceph/osdmap.c | 69 +++++++++++++++++++++++++++++------------------
> > >  1 file changed, 43 insertions(+), 26 deletions(-)
> > >
> > > diff --git a/net/ceph/osdmap.c b/net/ceph/osdmap.c
> > > index 90437906b7bc..4e0de14f80bb 100644
> > > --- a/net/ceph/osdmap.c
> > > +++ b/net/ceph/osdmap.c
> > > @@ -973,11 +973,11 @@ void ceph_osdmap_destroy(struct ceph_osdmap *map)
> > >  				 struct ceph_pg_pool_info, node);
> > >  		__remove_pg_pool(&map->pg_pools, pi);
> > >  	}
> > > -	kfree(map->osd_state);
> > > -	kfree(map->osd_weight);
> > > -	kfree(map->osd_addr);
> > > -	kfree(map->osd_primary_affinity);
> > > -	kfree(map->crush_workspace);
> > > +	kvfree(map->osd_state);
> > > +	kvfree(map->osd_weight);
> > > +	kvfree(map->osd_addr);
> > > +	kvfree(map->osd_primary_affinity);
> > > +	kvfree(map->crush_workspace);
> > >  	kfree(map);
> > >  }
> > >
> > > @@ -986,28 +986,41 @@ void ceph_osdmap_destroy(struct ceph_osdmap *map)
> > >   *
> > >   * The new elements are properly initialized.
> > >   */
> > > -static int osdmap_set_max_osd(struct ceph_osdmap *map, int max)
> > > +static int osdmap_set_max_osd(struct ceph_osdmap *map, u32 max)
> > >  {
> > >  	u32 *state;
> > >  	u32 *weight;
> > >  	struct ceph_entity_addr *addr;
> > > +	u32 to_copy;
> > >  	int i;
> > >
> > > -	state = krealloc(map->osd_state, max*sizeof(*state), GFP_NOFS);
> > > -	if (!state)
> > > -		return -ENOMEM;
> > > -	map->osd_state = state;
> > > +	dout("%s old %u new %u\n", __func__, map->max_osd, max);
> > > +	if (max == map->max_osd)
> > > +		return 0;
> > >
> > > -	weight = krealloc(map->osd_weight, max*sizeof(*weight), GFP_NOFS);
> > > -	if (!weight)
> > > +	state = ceph_kvmalloc(array_size(max, sizeof(*state)), GFP_NOFS);
> > > +	weight = ceph_kvmalloc(array_size(max, sizeof(*weight)), GFP_NOFS);
> > > +	addr = ceph_kvmalloc(array_size(max, sizeof(*addr)), GFP_NOFS);
> >
> > Is GFP_NOFS sufficient here, given that this may be called from rbd?
> > Should we be using NOIO instead (or maybe the PF_MEMALLOC_*
> > equivalent)?
>
> It should be NOIO, but it has been this way forever, so I kept it
> (keeping the future conversion to scopes that I mentioned in another
> email in mind).
>

Fair enough then. You can add my Reviewed-by:

--
Jeff Layton <jlayton@xxxxxxxxxx>
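
For context on the fix under review: ceph_kvmalloc() follows the common
try-kmalloc-first, fall-back-to-vmalloc pattern, which is why the
matching frees in the first hunk become kvfree().  Below is a minimal
sketch of that pattern, not necessarily the exact net/ceph
implementation; the name kvmalloc_sketch is invented for illustration,
and the three-argument __vmalloc() matches the 5.3-era kernels this
thread targets.

	#include <linux/mm.h>
	#include <linux/slab.h>
	#include <linux/vmalloc.h>

	/*
	 * Sketch of the fallback pattern: try a physically contiguous
	 * allocation first, then settle for virtually contiguous.
	 */
	static void *kvmalloc_sketch(size_t size, gfp_t flags)
	{
		void *ptr;

		/*
		 * Only ask kmalloc() for sizes the page allocator can
		 * satisfy without high-order heroics.
		 */
		if (size <= (PAGE_SIZE << PAGE_ALLOC_COSTLY_ORDER)) {
			ptr = kmalloc(size, flags | __GFP_NOWARN);
			if (ptr)
				return ptr;
		}

		/*
		 * Fall back to vmalloc space.  Because the result may
		 * come from either allocator, it must be freed with
		 * kvfree(), never plain kfree().
		 */
		return __vmalloc(size, flags, PAGE_KERNEL);
	}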
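
The "conversion to scopes" Ilya mentions refers to the memory
allocation scope API in <linux/sched/mm.h>: rather than threading
GFP_NOIO through every call site, the task is marked NOIO for a region
and all nested allocations are restricted automatically.  A hedged
sketch of how that could look around the osdmap path; the wrapper name
and its placement are hypothetical and not part of this patch.

	#include <linux/sched/mm.h>

	/*
	 * Hypothetical wrapper showing the scoped-NOIO approach; a
	 * real conversion would pick appropriate entry points in
	 * net/ceph.
	 */
	static int osdmap_resize_noio(struct ceph_osdmap *map, u32 max)
	{
		unsigned int noio_flag;
		int ret;

		/*
		 * Allocations between save and restore implicitly
		 * behave as GFP_NOIO, so callees can use GFP_KERNEL
		 * and still be safe in the rbd (block I/O) path.
		 */
		noio_flag = memalloc_noio_save();
		ret = osdmap_set_max_osd(map, max);
		memalloc_noio_restore(noio_flag);

		return ret;
	}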