On Wed, Sep 11, 2019 at 4:54 PM Jeff Layton <jlayton@xxxxxxxxxx> wrote:
>
> On Tue, 2019-09-10 at 21:41 +0200, Ilya Dryomov wrote:
> > osdmap has a bunch of arrays that grow linearly with the number of
> > OSDs. osd_state, osd_weight and osd_primary_affinity take 4 bytes per
> > OSD. osd_addr takes 136 bytes per OSD because of sockaddr_storage.
> > The CRUSH workspace area also grows linearly with the number of OSDs.
> >
> > Normally these arrays are allocated at client startup. The osdmap is
> > usually updated in small incrementals, but once in a while a full map
> > may need to be processed. For a cluster with 10000 OSDs, this means
> > a bunch of 40K allocations followed by a 1.3M allocation, all of which
> > are currently required to be physically contiguous. This results in
> > sporadic ENOMEM errors, hanging the client.
> >
> > Go back to manually (re)allocating arrays and use ceph_kvmalloc() to
> > fall back to non-contiguous allocation when necessary.
> >
> > Link: https://tracker.ceph.com/issues/40481
> > Signed-off-by: Ilya Dryomov <idryomov@xxxxxxxxx>
> > ---
> >  net/ceph/osdmap.c | 69 +++++++++++++++++++++++++++++------------------
> >  1 file changed, 43 insertions(+), 26 deletions(-)
> >
> > diff --git a/net/ceph/osdmap.c b/net/ceph/osdmap.c
> > index 90437906b7bc..4e0de14f80bb 100644
> > --- a/net/ceph/osdmap.c
> > +++ b/net/ceph/osdmap.c
> > @@ -973,11 +973,11 @@ void ceph_osdmap_destroy(struct ceph_osdmap *map)
> >                                  struct ceph_pg_pool_info, node);
> >                 __remove_pg_pool(&map->pg_pools, pi);
> >         }
> > -       kfree(map->osd_state);
> > -       kfree(map->osd_weight);
> > -       kfree(map->osd_addr);
> > -       kfree(map->osd_primary_affinity);
> > -       kfree(map->crush_workspace);
> > +       kvfree(map->osd_state);
> > +       kvfree(map->osd_weight);
> > +       kvfree(map->osd_addr);
> > +       kvfree(map->osd_primary_affinity);
> > +       kvfree(map->crush_workspace);
> >         kfree(map);
> >  }
> >
> > @@ -986,28 +986,41 @@ void ceph_osdmap_destroy(struct ceph_osdmap *map)
> >   *
> >   * The new elements are properly initialized.
> >   */
> > -static int osdmap_set_max_osd(struct ceph_osdmap *map, int max)
> > +static int osdmap_set_max_osd(struct ceph_osdmap *map, u32 max)
> >  {
> >         u32 *state;
> >         u32 *weight;
> >         struct ceph_entity_addr *addr;
> > +       u32 to_copy;
> >         int i;
> >
> > -       state = krealloc(map->osd_state, max*sizeof(*state), GFP_NOFS);
> > -       if (!state)
> > -               return -ENOMEM;
> > -       map->osd_state = state;
> > +       dout("%s old %u new %u\n", __func__, map->max_osd, max);
> > +       if (max == map->max_osd)
> > +               return 0;
> >
> > -       weight = krealloc(map->osd_weight, max*sizeof(*weight), GFP_NOFS);
> > -       if (!weight)
> > +       state = ceph_kvmalloc(array_size(max, sizeof(*state)), GFP_NOFS);
> > +       weight = ceph_kvmalloc(array_size(max, sizeof(*weight)), GFP_NOFS);
> > +       addr = ceph_kvmalloc(array_size(max, sizeof(*addr)), GFP_NOFS);
>
> Is GFP_NOFS sufficient here, given that this may be called from rbd?
> Should we be using NOIO instead (or maybe the PF_MEMALLOC_* equivalent)?

It should be NOIO, but it has been this way forever, so I kept it
(keeping the future conversion to scopes that I mentioned in another
email in mind).

Thanks,

                Ilya
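
For context, the "conversion to scopes" referred to above is the
memalloc_noio_save()/memalloc_noio_restore() API. Below is a rough,
hypothetical sketch of what that could look like for one of these
allocations; the helper name is made up for illustration and is not
part of the posted patch, which keeps the explicit GFP_NOFS flag and
the ceph_kvmalloc() wrapper.

/*
 * Illustrative only (not from the posted patch): with a memalloc_noio
 * scope in effect, every allocation inside the scope is implicitly
 * treated as GFP_NOIO, even when it asks for GFP_KERNEL, so the GFP
 * flag no longer has to be threaded through each call site.
 */
#include <linux/sched/mm.h>     /* memalloc_noio_save/restore */
#include <linux/mm.h>           /* kvmalloc, kvfree */
#include <linux/overflow.h>     /* array_size */

static u32 *alloc_osd_state_array(u32 max)
{
        unsigned int noio_flag;
        u32 *state;

        noio_flag = memalloc_noio_save();
        state = kvmalloc(array_size(max, sizeof(*state)), GFP_KERNEL);
        memalloc_noio_restore(noio_flag);

        return state;           /* caller frees with kvfree() */
}

With the scope-based form, an rbd caller could enforce the stricter
NOIO behaviour once at the top of the call chain rather than relying
on each allocation site picking the right GFP flag.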