Re: MDS has inconsistent performance

On Tue, Jan 13, 2015 at 11:13 AM, Gregory Farnum <greg@xxxxxxxxxxx> wrote:
> On Mon, Jan 12, 2015 at 10:17 PM, Michael Sevilla
> <mikesevilla3@xxxxxxxxx> wrote:
>> I can't get consistent performance with 1 MDS. I have 2 clients create
>> 100,000 files (separate directories) in a CephFS mount. I ran the
>> experiment 5 times (deleting the pools/fs and restarting the MDS in
>> between each run). I graphed the metadata throughput (requests per
>> second): https://github.com/michaelsevilla/mds/blob/master/graphs/thruput.png
>
> So that top line is ~20,000 processed requests/second, as measured at
> the MDS? (Looking at perfcounters?) And the fast run is doing 10k
> create requests/second? (This number is much higher than I expected!)

Yes - the top line was ~20K req/s from the perf counter dump, and the
fast run does about 13K creates/s. We were surprised, too... In fact,
with 1 client per MDS we get throughput similar to IndexFS - a system
presented in a paper at Supercomputing this year. Here is a throughput
graph, normalized to the # of clients, that shows how powerful one MDS
can actually be:
https://github.com/michaelsevilla/mds/blob/master/graphs/thruput-norm.png

Keep in mind that the runs with more than 1 client are measured in
ops/s, not creates/s. ;)
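
(For reference, the req/s numbers above come from sampling the MDS
admin socket and diffing successive dumps -- roughly like this,
assuming the daemon is named "a" and the counter keeps its usual name:)

$ ceph daemon mds.a perf dump | grep handle_client_request

Sampling that once a second and taking the delta gives the request rate.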

>
>> Sometimes (run0, run3), both clients issue 2 lookups per create to the
>> MDS - this makes throughput high but the runtime long since the MDS
>> processes many more requests.
>> Sometimes (run2, run4), 1 client does 2 lookups per create and the
>> other doesn't do any lookups.
>> Sometimes (run1), neither client does any lookups - this has the
>> fastest runtime.
>>
>> Does anyone know why the client behaves differently for the same exact
>> experiment? Reading the client logs, it looks like sometimes the
>> client enters add_update_cap() and clears the inode->flags in
>> check_cap_issue(), then when a lookup occurs (in _lookup()), the
>> client can't return ENOENT locally -- forcing it to ask the MDS to do the
>> lookup. But this only happens sometimes (e.g., run0 and run3).
>
> If you provide the logs I can check more carefully, but my guess is
> that you've got another client mounting it, or are looking at both
> directories from one of the clients, and this is inadvertently causing
> them to go into shared rather than exclusive mode.

I think you are right! Here is a subset of the client log:
https://github.com/michaelsevilla/mds/blob/master/scratch/client0.log

These snippets zoom in on the point where the client stops sending
"create, create, create, create..." and starts sending "lookup, lookup,
create, lookup, lookup, create..."

$ cat client0.log | grep "send_request client"
create ...file.2098
create ...file.2099
create ...file.2100
create ...file.2101
lookup ...file.2102
lookup ...file.2102
create ...file.2102
lookup ...file.2103
lookup ...file.2103
create ...file.2103
lookup ...file.2104
lookup ...file.2104
create ...file.2104

I think what you are looking for is on line 687:
... clearing (I_COMPLETE|I_DIR_ORDERED)
... add_update_cap issued pAsLsXs -> pAsLsXsFsx

It looks like we lose the exclusive mode on the file... but I don't
understand why the MDS revokes it for 1 client but not the other. The
MDS log is here:
https://raw.githubusercontent.com/michaelsevilla/mds/master/scratch/mds.log
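
(Something like this should land you on those lines -- just a grep
sketch, the exact log strings may vary between releases:)

$ grep -nE "check_cap_issue|add_update_cap|I_COMPLETE" client0.log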


>
> How are you trying to keep the directories private during the
> workload? Some of the more naive solutions won't stand up to
> repetitive testing given how various components of the system
> currently behave.
Is there a way to keep the directories private (i.e., keep them always
in exclusive mode)? That'd be perfect... In my runs, one client does
mkdir /mnt/cephfs/dir0 and the other does mkdir /mnt/cephfs/dir1...
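
(For concreteness, each client is doing the equivalent of the following
-- a sketch only, with CLIENT_ID set to 0 on one client and 1 on the
other; the actual create loop in my harness differs:)

$ mkdir /mnt/cephfs/dir$CLIENT_ID
$ for i in $(seq 0 99999); do touch /mnt/cephfs/dir$CLIENT_ID/file.$i; done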

>
>>
>> Details of the experiment:
>> Workload: 2 clients, 100,000 creates in separate directories, using
>> the FUSE client
>> MDS config: client_cache_size = 100000000, mds_cache_size = 16384000
>
> That client_cache_size only has any effect if it's applied to the
> client-side config. ;)
Yes - I copy the ceph.conf to the client, too. I think it works
because the 1 client, 1 MDS test caches all the inodes, according to
the perf counters.
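
(Concretely, both the client and the MDS node get a ceph.conf along
these lines -- the section placement is my assumption, the options also
work under [global]:)

[client]
    client cache size = 100000000

[mds]
    mds cache size = 16384000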

Thanks so much, Greg!

Mike

> -Greg