On Wed, 2019-09-11 at 17:08 +0200, Ilya Dryomov wrote:
> On Wed, Sep 11, 2019 at 4:54 PM Jeff Layton <jlayton@xxxxxxxxxx> wrote:
> > On Tue, 2019-09-10 at 21:41 +0200, Ilya Dryomov wrote:
> > > osdmap has a bunch of arrays that grow linearly with the number of
> > > OSDs.  osd_state, osd_weight and osd_primary_affinity take 4 bytes
> > > per OSD.  osd_addr takes 136 bytes per OSD because of
> > > sockaddr_storage.  The CRUSH workspace area also grows linearly
> > > with the number of OSDs.
> > >
> > > Normally these arrays are allocated at client startup.  The osdmap
> > > is usually updated in small incrementals, but once in a while a
> > > full map may need to be processed.  For a cluster with 10000 OSDs,
> > > this means a bunch of 40K allocations followed by a 1.3M
> > > allocation, all of which are currently required to be physically
> > > contiguous.  This results in sporadic ENOMEM errors, hanging the
> > > client.
> > >
> > > Go back to manually (re)allocating arrays and use ceph_kvmalloc()
> > > to fall back to non-contiguous allocation when necessary.
> > >
> > > Link: https://tracker.ceph.com/issues/40481
> > > Signed-off-by: Ilya Dryomov <idryomov@xxxxxxxxx>
> > > ---
> > >  net/ceph/osdmap.c | 69 +++++++++++++++++++++++++++++------------------
> > >  1 file changed, 43 insertions(+), 26 deletions(-)
> > >
> > > diff --git a/net/ceph/osdmap.c b/net/ceph/osdmap.c
> > > index 90437906b7bc..4e0de14f80bb 100644
> > > --- a/net/ceph/osdmap.c
> > > +++ b/net/ceph/osdmap.c
> > > @@ -973,11 +973,11 @@ void ceph_osdmap_destroy(struct ceph_osdmap *map)
> > >  				 struct ceph_pg_pool_info, node);
> > >  		__remove_pg_pool(&map->pg_pools, pi);
> > >  	}
> > > -	kfree(map->osd_state);
> > > -	kfree(map->osd_weight);
> > > -	kfree(map->osd_addr);
> > > -	kfree(map->osd_primary_affinity);
> > > -	kfree(map->crush_workspace);
> > > +	kvfree(map->osd_state);
> > > +	kvfree(map->osd_weight);
> > > +	kvfree(map->osd_addr);
> > > +	kvfree(map->osd_primary_affinity);
> > > +	kvfree(map->crush_workspace);
> > >  	kfree(map);
> > >  }
> > >
> > > @@ -986,28 +986,41 @@ void ceph_osdmap_destroy(struct ceph_osdmap *map)
> > >   *
> > >   * The new elements are properly initialized.
> > >   */
> > > -static int osdmap_set_max_osd(struct ceph_osdmap *map, int max)
> > > +static int osdmap_set_max_osd(struct ceph_osdmap *map, u32 max)
> > >  {
> > >  	u32 *state;
> > >  	u32 *weight;
> > >  	struct ceph_entity_addr *addr;
> > > +	u32 to_copy;
> > >  	int i;
> > >
> > > -	state = krealloc(map->osd_state, max*sizeof(*state), GFP_NOFS);
> > > -	if (!state)
> > > -		return -ENOMEM;
> > > -	map->osd_state = state;
> > > +	dout("%s old %u new %u\n", __func__, map->max_osd, max);
> > > +	if (max == map->max_osd)
> > > +		return 0;
> > >
> > > -	weight = krealloc(map->osd_weight, max*sizeof(*weight), GFP_NOFS);
> > > -	if (!weight)
> > > +	state = ceph_kvmalloc(array_size(max, sizeof(*state)), GFP_NOFS);
> > > +	weight = ceph_kvmalloc(array_size(max, sizeof(*weight)), GFP_NOFS);
> > > +	addr = ceph_kvmalloc(array_size(max, sizeof(*addr)), GFP_NOFS);
> >
> > Is GFP_NOFS sufficient here, given that this may be called from rbd?
> > Should we be using NOIO instead (or maybe the PF_MEMALLOC_*
> > equivalent)?
>
> It should be NOIO, but it has been this way forever, so I kept it
> (keeping the future conversion to scopes that I mentioned in another
> email in mind).
>

Fair enough then. You can add my Reviewed-by:

--
Jeff Layton <jlayton@xxxxxxxxxx>
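
For context on the fix under review: ceph_kvmalloc() follows the common
try-kmalloc-first, fall-back-to-vmalloc pattern, which is why the
matching frees in the first hunk become kvfree().  Below is a minimal
sketch of that pattern, not necessarily the exact net/ceph
implementation; the name kvmalloc_sketch is invented for illustration,
and the three-argument __vmalloc() matches the 5.3-era kernels this
thread targets.

	#include <linux/mm.h>
	#include <linux/slab.h>
	#include <linux/vmalloc.h>

	/*
	 * Sketch of the fallback pattern: try a physically contiguous
	 * allocation first, then settle for virtually contiguous.
	 */
	static void *kvmalloc_sketch(size_t size, gfp_t flags)
	{
		void *ptr;

		/*
		 * Only ask kmalloc() for sizes the page allocator can
		 * satisfy without high-order heroics.
		 */
		if (size <= (PAGE_SIZE << PAGE_ALLOC_COSTLY_ORDER)) {
			ptr = kmalloc(size, flags | __GFP_NOWARN);
			if (ptr)
				return ptr;
		}

		/*
		 * Fall back to vmalloc space.  Because the result may
		 * come from either allocator, it must be freed with
		 * kvfree(), never plain kfree().
		 */
		return __vmalloc(size, flags, PAGE_KERNEL);
	}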
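
The "conversion to scopes" Ilya mentions refers to the memory
allocation scope API in <linux/sched/mm.h>: rather than threading
GFP_NOIO through every call site, the task is marked NOIO for a region
and all nested allocations are restricted automatically.  A hedged
sketch of how that could look around the osdmap path; the wrapper name
and its placement are hypothetical and not part of this patch.

	#include <linux/sched/mm.h>

	/*
	 * Hypothetical wrapper showing the scoped-NOIO approach; a
	 * real conversion would pick appropriate entry points in
	 * net/ceph.
	 */
	static int osdmap_resize_noio(struct ceph_osdmap *map, u32 max)
	{
		unsigned int noio_flag;
		int ret;

		/*
		 * Allocations between save and restore implicitly
		 * behave as GFP_NOIO, so callees can use GFP_KERNEL
		 * and still be safe in the rbd (block I/O) path.
		 */
		noio_flag = memalloc_noio_save();
		ret = osdmap_set_max_osd(map, max);
		memalloc_noio_restore(noio_flag);

		return ret;
	}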