Re: cephfs luminous 12.2.4 - multi-active MDSes with manual pinning

Dan van der Ster <dan@xxxxxxxxxxxxxx> · Tue, 24 Apr 2018 10:20:18 +0200

That "nicely exporting" thing is a logging issue that was apparently
fixed in https://github.com/ceph/ceph/pull/19220. I'm not sure if that
will be backported to luminous.
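
To check that the pins are actually being honoured despite those messages,
the `get subtrees` output from the quoted message below can be checked
programmatically. This is only a minimal Python sketch. It assumes the JSON is
a list of objects carrying the `dir.path`, `auth_first` and `export_pin`
fields used in the jq filter. The function name and the sample data are
made up for illustration.

```python
import json


def find_mispinned(subtrees_json):
    """Return (path, export_pin, auth_first) for every subtree whose
    authoritative rank differs from its explicit pin."""
    mismatches = []
    for subtree in json.loads(subtrees_json):
        pin = subtree.get("export_pin", -1)
        auth = subtree.get("auth_first")
        # export_pin == -1 means "no explicit pin", so the default
        # balancer decides and there is nothing to compare against.
        if pin != -1 and pin != auth:
            mismatches.append((subtree["dir"]["path"], pin, auth))
    return mismatches


# Hypothetical sample, shaped like the fields used in the jq filter.
sample = json.dumps([
    {"dir": {"path": "/projects"}, "auth_first": 0, "export_pin": 0},
    {"dir": {"path": "/home"}, "auth_first": 0, "export_pin": 1},
    {"dir": {"path": "/scratch"}, "auth_first": 1, "export_pin": 1},
    {"dir": {"path": ""}, "auth_first": 0, "export_pin": -1},
])
print(find_mispinned(sample))  # [('/home', 1, 0)]
```

Feeding it the real output of `ceph daemon mds.$mds_hostname get subtrees`
on each active MDS and getting an empty list back would suggest the
messages are only log noise.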

Otherwise the slow requests could be due to either slow trimming (see
previous discussions on this list about the mds log max expiring and mds
log max segments options) or old clients failing to release caps correctly
(in which case you would see the corresponding warnings).
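
As a first step in narrowing this down, it can help to tally the slow
requests by the state they are stuck in. This is a rough Python sketch. It
assumes each warning sits on a single log line in the format quoted below.
The regex and the function name are illustrative and not part of Ceph.

```python
import re
from collections import Counter

# Matches e.g. "slow request 7681.127406 seconds old, ... currently <state>"
SLOW_RE = re.compile(
    r"slow request (?P<age>\d+(?:\.\d+)?) seconds old,.*\bcurrently (?P<state>.+)$"
)


def summarize_slow_requests(lines):
    """Return (Counter of stuck states, age in seconds of the oldest request)."""
    states = Counter()
    oldest = 0.0
    for line in lines:
        m = SLOW_RE.search(line.strip())
        if not m:
            continue  # not a slow-request warning
        states[m.group("state")] += 1
        oldest = max(oldest, float(m.group("age")))
    return states, oldest


# The warning from the quoted message, joined back onto one line.
sample = [
    "7fd401126700  0 log_channel(cluster) log [WRN] : slow request 7681.127406 "
    "seconds old, received at 2018-04-20 08:17:35.970498: "
    "client_request(client.875554:238655 lookup #0x10038ff1eab/punim0116 "
    "2018-04-20 08:17:35.970319 caller_uid=10171, "
    "caller_gid=10000{10000,10123,}) currently failed to authpin local pins",
    "some unrelated log line",
]
states, oldest = summarize_slow_requests(sample)
print(states.most_common(), oldest)
```

Running it over the log of the MDS that reports the warnings shows whether
everything is stuck in the same state and how old the oldest request is.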

-- Dan

On Tue, Apr 24, 2018 at 9:34 AM, Linh Vu <vul@xxxxxxxxxxxxxx> wrote:
> Hi all,
>
>
> I have a cluster running cephfs on Luminous 12.2.4, using 2 active MDSes + 1
> standby. I have 3 shares: /projects, /home and /scratch, and I've decided to
> try manual pinning as described here:
> http://docs.ceph.com/docs/master/cephfs/multimds/
>
>
> /projects is pinned to mds.0 (rank 0)
>
> /home and /scratch are pinned to mds.1 (rank 1)
>
> Pinning is verified by `ceph daemon mds.$mds_hostname get subtrees | jq '.[]
> | [.dir.path, .auth_first, .export_pin]'`
>
>
> Clients mount either via ceph-fuse 12.2.4, or kernel client 4.15.13.
>
>
> On our test cluster (same version and setup), it works as I think it should.
> I simulate metadata load via mdtest (up to around 2000 req/s on each mds,
> each of which is a VM with 4 cores, 16GB RAM), and loads on /projects go to
> mds.0, loads on the other shares go to mds.1. Nothing pops up in the logs.
> I can also successfully reset to no pinning (i.e. using the default load
> balancing) by setting the ceph.dir.pin value to -1, and vice versa. All
> that happens is that this shows up in the logs:
>
> ....  mds.mds1-test-ceph2 asok_command: get subtrees (starting...)
>
> ....  mds.mds1-test-ceph2 asok_command: get subtrees (complete)
>
> However, on our production cluster, with more powerful MDSes (10 cores
> 3.4GHz, 256GB RAM, much faster networking), I get this in the logs
> constantly:
>
> 2018-04-24 16:29:21.998261 7f02d1af9700  0 mds.1.migrator nicely exporting
> to mds.0 [dir 0x1000010cd91.1110* /home/ [2,head] auth{0=1017} v=5632699
> cv=5632651/5632651 dir_auth=1 state=1611923458|complete|auxsubtree f(v84
> 55=0+55) n(v245771 rc2018-04-24 16:28:32.830971 b233439385711
> 423085=383063+40022) hs=55+0,ss=0+0 dirty=1 | child=1 frozen=0 subtree=1
> replicated=1 dirty=1 authpin=0 0x55691ccf1c00]
>
> To clarify, /home is pinned to mds.1, so there is no reason it should export
> this to mds.0, and the loads on both MDSes (req/s, network load, CPU load)
> are fairly low, lower than those on the test MDS VMs.
>
> Sometimes (depending on which mds starts first), I get the same message
> but the other way around, i.e. "mds.0.migrator nicely exporting to mds.1",
> for the workload that mds.0 should be handling. This only appears on one
> mds, never the other, until one is restarted.
>
> And we've had a couple of occasions where we get slow requests like this:
>
> 7fd401126700  0 log_channel(cluster) log [WRN] : slow request 7681.127406
> seconds old, received at 2018-04-20 08:17:35.970498:
> client_request(client.875554:238655 lookup #0x10038ff1eab/punim0116
> 2018-04-20 08:17:35.970319 caller_uid=10171, caller_gid=10000{10000,10123,})
> currently failed to authpin local pins
>
> Which then seems to snowball into thousands of slow requests, until mds.0 is
> restarted. When these slow requests happen, loads are fairly low on the
> active MDSes, although it is possible that the users could be doing
> something funky with metadata on production that I can't reproduce with
> mdtest.
>
> I thought the manual pinning likely isn't working as intended due to the
> "mds.1.migrator nicely exporting to mds.0" messages in the logs (to me it
> seems to indicate that we have a bad load balancing situation) but I can't
> seem to replicate this issue in test. Test cluster seems to be working as
> intended.
>
> Am I doing manual pinning right? Should I even be using it?
>
> Cheers,
> Linh
>
>
> _______________________________________________
> ceph-users mailing list
> ceph-users@xxxxxxxxxxxxxx
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>