Re: k8s kernel clients: reasonable number of mounts per host, and limiting num client sessions

On Tue, 2021-04-06 at 12:32 +0200, Dan van der Ster wrote:
> On Mon, Apr 5, 2021 at 8:33 PM Jeff Layton <jlayton@xxxxxxxxxx> wrote:
> > 
> > On Thu, 2021-04-01 at 11:04 +0200, Dan van der Ster wrote:
> > > Hi,
> > > 
> > > Context: one of our users is mounting 350 ceph kernel PVCs per 30GB VM
> > > and they notice "memory pressure".
> > > 
> > 
> > Manifested how?
> 
> Our users lost their monitoring, so we are going to try to reproduce
> the issue to get more details.
> Do you know of any way to see how much memory is used by the kernel
> clients? (Aside from the ceph_inode_info and ceph_dentry_info slabs,
> which I can see in slabtop.)

Nothing simple, I'm afraid, and even those don't tell you the full
picture. ceph_dentry_info is a separate allocation from the actual
dentry.
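
If it helps as a rough first pass, you could total the ceph_* slab
caches out of /proc/slabinfo. A minimal userspace sketch (assumes the
usual slabinfo 2.1 column layout, needs root on most distros, and note
the caveat above: it only covers the dedicated slab caches, not the
dentries, osdmaps, message buffers, etc. that are allocated elsewhere):

/* slab_ceph.c: rough accounting of the ceph_* slab caches.
 * Build: cc -o slab_ceph slab_ceph.c
 * With slab merging enabled, some caches may not show up under their
 * own names at all.
 */
#include <stdio.h>
#include <string.h>

int main(void)
{
	FILE *f = fopen("/proc/slabinfo", "r");
	char line[512];
	unsigned long total = 0;

	if (!f) {
		perror("/proc/slabinfo");
		return 1;
	}
	while (fgets(line, sizeof(line), f)) {
		char name[64];
		unsigned long active, num, objsize;

		/* data lines: <name> <active_objs> <num_objs> <objsize> ... */
		if (sscanf(line, "%63s %lu %lu %lu",
			   name, &active, &num, &objsize) != 4)
			continue;
		if (strncmp(name, "ceph_", 5))
			continue;
		printf("%-24s %8lu objs x %4lu B = %8lu KiB\n",
		       name, num, objsize, num * objsize / 1024);
		total += num * objsize;
	}
	fclose(f);
	printf("total (ceph_* slabs only): %lu KiB\n", total / 1024);
	return 0;
}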

> I see that the osd_client keeps just one copy of the osdmap, so that's
> going to be only ~256kB * num_clients on this particular cluster.
> Do we also need to kmalloc something the size of the pg map? That
> would be ~4MB * num_clients here.
> Are there any other large data structures, even for idle mounts?
> 

Almost certainly, but it's not trivial to measure them. You might start
by looking at net/ceph/osdmap.c in the kernel sources and consider
instrumenting it to report how large its allocations are. We simply
don't keep that sort of detailed accounting of the allocations the
client makes.
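
To make that concrete, here's a sketch of what such instrumentation
might look like -- untested, field names taken from
include/linux/ceph/osdmap.h in current kernels (check your tree), and
it deliberately ignores the CRUSH map, the pg_temp/upmap/pool rbtrees
and the CRUSH workspace, which can easily dominate:

/* Illustrative debug helper for net/ceph/osdmap.c: log the fixed
 * struct size plus the per-OSD array cost for a decoded map.
 */
static void ceph_osdmap_report_size(const struct ceph_osdmap *map)
{
	size_t per_osd = sizeof(*map->osd_state) +
			 sizeof(*map->osd_weight) +
			 sizeof(*map->osd_addr) +
			 sizeof(*map->osd_primary_affinity);

	pr_info("osdmap e%u: struct %zu bytes + %u osds x %zu bytes/osd\n",
		map->epoch, sizeof(*map), map->max_osd, per_osd);
}

Calling that from the end of ceph_osdmap_decode() (or wherever a new
map is installed) would at least give you the per-map baseline; the
rbtree and CRUSH allocations would still need separate accounting.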

> > > When planning for k8s hosts, what would be a reasonable limit on the
> > > number of ceph kernel PVCs to mount per host?
> > > 
> > 
> > This seems like a really difficult thing to gauge. It depends on a
> > number of different factors including amount of RAM and CPUs on the box,
> > mount options, workload and applications, etc...
> > 
> > > If one kernel mounts the
> > > same cephfs several times (with different prefixes), we observed that
> > > this is a unique client session. But does the ceph module globally
> > > share a single copy of cluster metadata, e.g. osdmaps, or is that all
> > > duplicated per session?
> > > 
> > 
> > One copy per-cluster client, which should generally be shared between
> > mounts to the same cluster, provided that you're using similar-enough
> > mount options for the kernel to do that.
> 
> As Sage suspected, we have a unique cephx user per PVC mounted.
> We're using the manila csi, which indeed invokes mgr/volumes to create
> the shares. They look like this, for reference:
> 
>         "client_metadata": {
>             "features": "0x0000000000007bff",
>             "entity_id": "pvc-691d1f23-da81-4a08-a6e7-d16f44e5f2a0",
>             "hostname": "paas-standard-avz-b-6qvn6",
>             "kernel_version": "5.10.19-200.fc33.x86_64",
>             "root": "/volumes/_nogroup/dbe3dbbf-e8d6-4f13-aac4-7a116d9a6772"
>         }
> 
> It's good to know that by using the same cephx users, we could
> optimize the clients on a given host.
> 
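
FWIW, an easy way to verify how many distinct client instances a host
is actually carrying is to look under /sys/kernel/debug/ceph: the
kernel creates one directory per instance (named <fsid>.client<id>),
and mounts that share a client share a directory. A trivial sketch
(needs debugfs mounted and root); plain "ls" does the same job:

/* count_ceph_clients.c: list the per-instance ceph debugfs dirs. */
#include <stdio.h>
#include <dirent.h>

int main(void)
{
	DIR *d = opendir("/sys/kernel/debug/ceph");
	struct dirent *de;
	int n = 0;

	if (!d) {
		perror("/sys/kernel/debug/ceph");
		return 1;
	}
	while ((de = readdir(d)) != NULL) {
		if (de->d_name[0] == '.')
			continue;	/* skip "." and ".." */
		printf("%s\n", de->d_name);
		n++;
	}
	closedir(d);
	printf("%d client instance(s)\n", n);
	return 0;
}
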
> > > Also, k8s makes it trivial for a user to mount a single PVC from
> > > hundreds or thousands of clients. Suppose we wanted to be able to
> > > limit the number of clients per PVC -- Do you think a new
> > > `max_sessions=N` cephx cap would be the best approach for this?
> > > 
> > 
> > Why do you want to limit the number of clients per PVC? I'm not sure
> > that would really solve anything.
> 
> Mounting from a huge number of clients can easily overload the MDSs.
> But Manila only lets us hand out CephFS quotas by rbytes or # shares.
> So if we could similarly limit the number of sessions per cephx user
> (i.e. per share), then we could prevent these overloads.
> 

The problem there is that you'll end up with clients that suddenly
start failing to mount because you've hit an arbitrary capacity limit,
and it'll almost certainly be first-come/first-served. This is a
different matter from applying quotas, because it can hit you right at
mount time.

> 
> 
> > 
> > FWIW, I'm not a fan of solutions that end up with clients pooping
> > themselves because they get back some esoteric error due to exceeding a
> > limit when trying to mount or something.
> > 
> > --
> > Jeff Layton <jlayton@xxxxxxxxxx>
> > 
> 

-- 
Jeff Layton <jlayton@xxxxxxxxxx>
_______________________________________________
Dev mailing list -- dev@xxxxxxx
To unsubscribe send an email to dev-leave@xxxxxxx


