Re: Create one million empty files with cephfs

Gregory Farnum <gfarnum@xxxxxxxxxx> · Mon, 4 Jan 2016 06:57:54 -0800

On Tue, Dec 29, 2015 at 4:55 AM, Fengguang Gong <fengguanggong@xxxxxxxxx> wrote:
> hi,
> We created one million empty files with filebench. Here is the test environment:
> MDS: one MDS
> MON: one MON
> OSD: two OSDs, each with one Intel P3700; data stored on the OSDs with 2x replication
> Network: all nodes are connected through 10 gigabit network
>
> We used more than one client to create files, to test the scalability of
> the MDS. Here are the results:
> IOPS with one client: 850
> IOPS with two clients: 1150
> IOPS with four clients: 1180
>
> As we can see, the IOPS stay almost unchanged when the number of
> clients increases from 2 to 4.
>
> CephFS may scale poorly with a single MDS. We think the big lock taken in
> MDSDaemon::ms_dispatch() (a Mutex::Locker that every request acquires) is
> what limits the scalability of the MDS.
>
> We think this big lock could be removed through the following steps:
> 1. separate the processing of ClientRequests from other requests, so that
> ClientRequests can be processed in parallel
> 2. replace the big lock with several finer-grained locks to ensure
> consistency
>
> Is this idea reasonable?
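
For reference, here is a quick back-of-the-envelope check of the quoted
numbers. It only restates the three reported data points as speedup and
scaling efficiency relative to one client; the function name is made up for
this sketch.

```python
# Aggregate create IOPS from the quoted test, keyed by client count.
results = {1: 850, 2: 1150, 4: 1180}


def scaling_table(results):
    """Return (clients, iops, per_client, speedup, efficiency) rows."""
    baseline = results[1]
    rows = []
    for clients, iops in sorted(results.items()):
        speedup = iops / baseline
        # efficiency == 1.0 would mean perfect linear scaling
        rows.append((clients, iops, iops / clients, speedup, speedup / clients))
    return rows


for clients, iops, per_client, speedup, efficiency in scaling_table(results):
    print(f"{clients} client(s): {iops} IOPS total, {per_client:.1f} per client, "
          f"speedup {speedup:.2f}x, efficiency {efficiency:.0%}")
```

Going from one to two clients gives about 1.35x, and four clients only
about 1.39x, so the aggregate rate has already flattened out at two clients.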

Parallelizing the MDS is probably a very big job; it's on our radar
but not for a while yet.

If one were to do it, yes, breaking down the big MDS lock would be the
way forward. I'm not entirely sure what that involves, but you'd need to
significantly chunk up the locking on our more critical data
structures, most especially the MDCache. Luckily there is *some* help
there in terms of the file cap locking structures we already have in
place, but it's a *huge* project and not one to be undertaken lightly.
A special processing mechanism for ClientRequests versus other
requests is not an assumption I'd start with.
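
To give a rough idea of what "chunking up the locking" means in general,
here is a toy sketch in Python. It is not modeled on the real MDCache or any
Ceph code: the class, its methods and the sharding-by-inode-number scheme
are all assumptions made up for illustration. It replaces one global lock
with one lock per shard, and shows the part that makes this hard: an
operation that spans two entries has to take both shard locks in a fixed
order to avoid deadlock.

```python
import threading
from contextlib import ExitStack


class ShardedCounts:
    """Toy per-inode counters guarded by N shard locks instead of one big lock."""

    def __init__(self, num_shards=4):
        self._locks = [threading.Lock() for _ in range(num_shards)]
        self._shards = [{} for _ in range(num_shards)]

    def _index(self, ino):
        # Hypothetical sharding rule: pick the shard by inode number.
        return ino % len(self._locks)

    def touch(self, ino):
        i = self._index(ino)
        with self._locks[i]:  # only this shard is blocked, not the whole cache
            self._shards[i][ino] = self._shards[i].get(ino, 0) + 1

    def get(self, ino):
        i = self._index(ino)
        with self._locks[i]:
            return self._shards[i].get(ino, 0)

    def move(self, src, dst):
        """Cross-shard operation: take the locks in ascending index order."""
        a, b = self._index(src), self._index(dst)
        with ExitStack() as stack:
            for i in sorted({a, b}):  # the set handles the same-shard case
                stack.enter_context(self._locks[i])
            n = self._shards[a].pop(src, 0)
            self._shards[b][dst] = self._shards[b].get(dst, 0) + n


def worker(cache, repeat):
    for _ in range(repeat):
        for ino in range(100):
            cache.touch(ino)


cache = ShardedCounts(num_shards=4)
threads = [threading.Thread(target=worker, args=(cache, 50)) for _ in range(4)]
for t in threads:
    t.start()
for t in threads:
    t.join()
cache.move(1, 2)
print(cache.get(1), cache.get(2))
```

Note that CPython's global interpreter lock means this won't actually run
faster than a single lock would; it only shows the locking structure. In a
real MDS nearly every operation touches several structures at once, which is
where most of the work would be.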

I think you'll find that file creates are just about the least
scalable thing you can do on CephFS right now, though, so there is
some easier ground. One obvious approach is to extend the current
inode preallocation — it already allocates inodes per-client and has a
fast path inside of the MDS for handing them back. It'd be great if
clients were aware of that preallocation and could create files
without waiting for the MDS to talk back to them! The issue with this
is two-fold:
1) need to update the cap flushing protocol to deal with files newly
created by the client
2) need to handle all the backtrace stuff normally performed by the
MDS on file create (which still needs to happen, on either the client
or the server)
There's also cleanup in case of a client failure, but we've already
got a model for that in how we figure out real file sizes and things
based on max size.
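
As a very rough sketch of the idea (not the actual protocol): the toy model
below, in Python, has an MDS hand each client a disjoint range of inode
numbers, and the client creates files locally out of that range and reports
them back in batches. Every name here (ToyMDS, ToyClient, flush_creates, the
chunk and batch sizes, the starting inode number) is hypothetical. It does
not model cap flushing, backtraces or failure cleanup, which are exactly the
hard parts listed above; it only counts round trips.

```python
class ToyMDS:
    """Hands out disjoint inode ranges and accepts batched create reports."""

    def __init__(self, first_ino=1000, chunk=1000):
        self.next_free = first_ino  # arbitrary starting inode number
        self.chunk = chunk
        self.granted = {}           # client_id -> [(start, end)], end exclusive
        self.files = {}             # ino -> (parent_ino, name)
        self.round_trips = 0

    def preallocate(self, client_id):
        self.round_trips += 1
        start, end = self.next_free, self.next_free + self.chunk
        self.next_free = end
        self.granted.setdefault(client_id, []).append((start, end))
        return start, end

    def flush_creates(self, client_id, creates):
        self.round_trips += 1
        ranges = self.granted.get(client_id, [])
        for ino, parent, name in creates:
            if not any(s <= ino < e for s, e in ranges):
                raise ValueError(f"inode {ino} was not granted to {client_id}")
            if ino in self.files:
                raise ValueError(f"inode {ino} already exists")
            self.files[ino] = (parent, name)


class ToyClient:
    """Creates files from its preallocated range without waiting on the MDS."""

    def __init__(self, client_id, mds, flush_every=100):
        self.client_id, self.mds, self.flush_every = client_id, mds, flush_every
        self.next_ino = self.end = 0  # empty range: first create triggers a refill
        self.pending = []

    def create(self, parent, name):
        if self.next_ino >= self.end:
            self.next_ino, self.end = self.mds.preallocate(self.client_id)
        ino = self.next_ino
        self.next_ino += 1
        self.pending.append((ino, parent, name))
        if len(self.pending) >= self.flush_every:
            self.flush()
        return ino

    def flush(self):
        if self.pending:
            self.mds.flush_creates(self.client_id, self.pending)
            self.pending = []


mds = ToyMDS()
a = ToyClient("a", mds)
a_inos = [a.create(1, f"file{i}") for i in range(1000)]
print(len(mds.files), "files created in", mds.round_trips, "round trips")
```

With these made-up sizes, 1000 creates cost one preallocation plus ten
batched flushes, instead of one synchronous request per file.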

I think there's a ticket about this somewhere, but I can't find it off-hand...
-Greg