cephfs luminous 12.2.4 - multi-active MDSes with manual pinning

Linh Vu <vul@xxxxxxxxxxxxxx> · Tue, 24 Apr 2018 07:34:46 +0000

Hi all,

I have a cluster running cephfs on Luminous 12.2.4, using 2 active MDSes + 1 standby. I have 3 shares: /projects, /home and /scratch, and I've decided to try manual pinning as described here: http://docs.ceph.com/docs/master/cephfs/multimds/

/projects is pinned to mds.0 (rank 0)
/home and /scratch are pinned to mds.1 (rank 1)
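
For reference, the pins above can be set through the ceph.dir.pin extended attribute, as described in the linked multimds documentation. The following is only a sketch: the mount point /mnt/cephfs, the `pin_dir` helper and the `DRY_RUN` switch are assumptions made for illustration, not part of our actual setup.

```shell
#!/bin/sh
# Assumption: CephFS is mounted at /mnt/cephfs (hypothetical mount point).
MNT="${MNT:-/mnt/cephfs}"
# By default only print the commands; set DRY_RUN=0 to really apply the pins.
DRY_RUN="${DRY_RUN:-1}"

pin_dir() {
    # $1 = directory relative to the mount, $2 = MDS rank (-1 removes the pin)
    if [ "$DRY_RUN" = "1" ]; then
        echo "setfattr -n ceph.dir.pin -v $2 $MNT/$1"
    else
        setfattr -n ceph.dir.pin -v "$2" "$MNT/$1"
    fi
}

pin_dir projects 0   # /projects -> rank 0
pin_dir home 1       # /home     -> rank 1
pin_dir scratch 1    # /scratch  -> rank 1
```

Calling `pin_dir home -1` would hand /home back to the default balancer.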

Pinning is verified by `ceph daemon mds.$mds_hostname get subtrees | jq '.[] | [.dir.path, .auth_first, .export_pin]'`
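
To make that check less manual, the jq output can be compared against the intended layout. This is a rough sketch, assuming the jq filter is changed to emit tab-separated lines (`jq -r ... | @tsv`) and that .dir.path appears as /projects, /home and /scratch; the `check_pins` helper is hypothetical.

```shell
#!/bin/sh
# Reads lines of "path<TAB>auth_first<TAB>export_pin" on stdin and reports
# any of the three shares whose authority or pin differs from the plan.
check_pins() {
    rc=0
    while IFS="$(printf '\t')" read -r path auth pin; do
        path="${path%/}"                  # tolerate a trailing slash
        case "$path" in
            /projects)      want=0 ;;
            /home|/scratch) want=1 ;;
            *)              continue ;;   # ignore other subtrees
        esac
        if [ "$pin" != "$want" ] || [ "$auth" != "$want" ]; then
            echo "MISMATCH $path: auth=$auth pin=$pin want=$want"
            rc=1
        fi
    done
    return $rc
}

# Usage on an MDS host:
# ceph daemon mds.$mds_hostname get subtrees \
#   | jq -r '.[] | [.dir.path, .auth_first, .export_pin] | @tsv' | check_pins
```

It prints nothing and returns 0 when every listed share is both pinned to and owned by the expected rank.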

Clients mount either via ceph-fuse 12.2.4, or kernel client 4.15.13. 

On our test cluster (same version and setup), it works as I think it should. I simulate metadata load via mdtest (up to around 2000 req/s on each MDS, each of which is a VM with 4 cores and 16GB RAM); loads on /projects go to mds.0, and loads on the other shares go to mds.1. Nothing pops up in the logs. I can also successfully reset to no pinning (i.e. using the default load balancing) by setting the ceph.dir.pin value to -1, and vice versa. All that shows up in the logs is this:

....  mds.mds1-test-ceph2 asok_command: get subtrees (starting...)

....  mds.mds1-test-ceph2 asok_command: get subtrees (complete)

However, on our production cluster, with more powerful MDSes (10 cores 3.4GHz, 256GB RAM, much faster networking), I get this in the logs constantly:

2018-04-24 16:29:21.998261 7f02d1af9700  0 mds.1.migrator nicely exporting to mds.0 [dir 0x1000010cd91.1110* /home/ [2,head] auth{0=1017} v=5632699 cv=5632651/5632651 dir_auth=1 state=1611923458|complete|auxsubtree f(v84 55=0+55) n(v245771 rc2018-04-24 16:28:32.830971 b233439385711 423085=383063+40022) hs=55+0,ss=0+0 dirty=1 | child=1 frozen=0 subtree=1 replicated=1 dirty=1 authpin=0 0x55691ccf1c00]

To clarify, /home is pinned to mds.1, so there is no reason it should export this to mds.0, and the loads on both MDSes (req/s, network load, CPU load) are fairly low, lower than those on the test MDS VMs. 
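
To see how often each directory is being exported and in which direction, the migrator messages can be tallied from the MDS log. A minimal sketch, assuming the log lines have the same shape as the one above; the `count_exports` helper is hypothetical, and the log path in the usage comment is the usual default rather than something verified on our hosts.

```shell
#!/bin/sh
# Reads MDS log lines on stdin and prints "count source -> target path"
# for every "nicely exporting" message.
count_exports() {
    sed -n 's/.* \(mds\.[0-9]*\)\.migrator nicely exporting to \(mds\.[0-9]*\) \[dir [^ ]* \([^ ]*\) .*/\1 -> \2 \3/p' \
        | sort | uniq -c | awk '{$1=$1; print}'   # awk normalises uniq's padding
}

# Usage (log path is an assumption):
# count_exports < /var/log/ceph/ceph-mds.$mds_hostname.log
```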

Sometimes (depending on which MDS starts first), I get the same message the other way around, i.e. "mds.0.migrator nicely exporting to mds.1", for the workload that mds.0 should be handling. The message only ever appears on one MDS, never the other, until one of them is restarted.

We've also had a couple of occasions where we get slow requests like this:

7fd401126700  0 log_channel(cluster) log [WRN] : slow request 7681.127406 seconds old, received at 2018-04-20 08:17:35.970498: client_request(client.875554:238655 lookup #0x10038ff1eab/punim0116 2018-04-20 08:17:35.970319 caller_uid=10171, caller_gid=10000{10000,10123,}) currently failed to authpin local pins

This then seems to snowball into thousands of slow requests, until mds.0 is restarted. When these slow requests happen, loads are fairly low on the active MDSes, although it is possible that users are doing something funky with metadata on production that I can't reproduce with mdtest.
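
When the slow requests pile up, grouping them by the state they are stuck in gives a quick overview. Again only a sketch, assuming each warning is a single log line of the form shown above; `slow_by_state` is a hypothetical helper.

```shell
#!/bin/sh
# Reads log lines on stdin and prints "count state" for every slow request,
# where state is the text following "currently".
slow_by_state() {
    sed -n 's/.*slow request [0-9.]* seconds old.* currently \(.*\)$/\1/p' \
        | sort | uniq -c | awk '{$1=$1; print}'   # awk normalises uniq's padding
}

# Usage (log path is an assumption):
# slow_by_state < /var/log/ceph/ceph-mds.$mds_hostname.log
```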

I suspect the manual pinning isn't working as intended because of the "mds.1.migrator nicely exporting to mds.0" messages in the logs (to me they seem to indicate a bad load balancing situation), but I can't replicate this issue in test. The test cluster seems to be working as intended.

Am I doing manual pinning right? Should I even be using it? 

Cheers,
Linh 

_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com