Re: migrating cephfs metadata pool from spinning disk to SSD.

@John,  

Can you clarify which values would suggest that my metadata pool is too slow?  I have added a link below with values for "op_active" and "handle_client_request", gathered in a crude fashion but hopefully with enough data to paint a picture of what is happening.

http://pastebin.com/5zAG8VXT

thanks in advance,
Bob

On Thu, Aug 6, 2015 at 1:24 AM, Bob Ababurko <bob@xxxxxxxxxxxx> wrote:
I should probably have condensed my findings over the course of the day into one post, but I guess that's just not how I'm built...

Another data point.  I ran `ceph daemon mds.cephmds02 perf dump` in a while loop with a 1-second sleep, grepping out the stats John mentioned (a sketch of the loop is below the output), and at times (roughly every 10-15 seconds) I see some large objecter.op_active values.  After the high values hit, there are 5-10 seconds of zero values.

    "handle_client_request": 5785438,
        "op_active": 2375,
        "handle_client_request": 5785438,
        "op_active": 2444,
        "handle_client_request": 5785438,
        "op_active": 2239,
        "handle_client_request": 5785438,
        "op_active": 1648,
        "handle_client_request": 5785438,
        "op_active": 1121,
        "handle_client_request": 5785438,
        "op_active": 709,
        "handle_client_request": 5785438,
        "op_active": 235,
        "handle_client_request": 5785572,
        "op_active": 0,
   ...............
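
In case it's useful, the loop was essentially the following (approximate, and the grep pattern here is from memory rather than the exact one I used):

    while true; do
        sudo ceph daemon mds.cephmds02 perf dump | \
            grep -E '"handle_client_request"|"op_active"'
        sleep 1
    done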

Should I be concerned about these "op_active" values?  I see that in my narrow slice of output, "handle_client_request" does not increment.  What is happening there?

thanks,
Bob

On Wed, Aug 5, 2015 at 11:43 PM, Bob Ababurko <bob@xxxxxxxxxxxx> wrote:
I found a way to get the stats you mentioned: mds_server.handle_client_request & objecter.op_active.  I can see these values when I run:

ceph daemon mds.<id> perf dump

I recently restarted the MDS server, so my stats have reset, but I still have something to share:

"mds_server.handle_client_request": 4406055
"objecter.op_active": 0

Should I assume that op_active represents read or write operations that are queued?  I haven't been able to find anything describing what these stats actually mean, so if anyone knows where they are documented, please advise.

On Wed, Aug 5, 2015 at 4:59 PM, Bob Ababurko <bob@xxxxxxxxxxxx> wrote:
I have installed diamond (built by ksingh, found at https://github.com/ksingh7/ceph-calamari-packages) on the MDS node, and I am not seeing the mds_server.handle_client_request or objecter.op_active metrics being sent to graphite.  Mind you, this is not the graphite that is part of the calamari install but our own internal graphite cluster.  Perhaps that is the reason?  I could not get calamari working correctly on Hammer/CentOS 7.1, so I have put it on pause for now to concentrate on the cluster itself.

Ultimately, I need to find a way to get hold of these metrics to determine the health of my MDS, so I can justify moving forward on an SSD-based cephfs metadata pool.
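
Worst case, if the diamond collector never cooperates, I figure the counters could be pushed to graphite by hand through carbon's plaintext protocol.  A rough sketch only; the graphite host, port, and metric names below are placeholders, and it assumes jq and nc are installed:

    # grab one perf dump and push the two counters with the current timestamp
    DUMP=$(sudo ceph daemon mds.<id> perf dump); TS=$(date +%s)
    printf 'ceph.mds.objecter.op_active %s %s\n' \
        "$(echo "$DUMP" | jq '.objecter.op_active')" "$TS" | nc -w1 graphite.example.com 2003
    printf 'ceph.mds.mds_server.handle_client_request %s %s\n' \
        "$(echo "$DUMP" | jq '.mds_server.handle_client_request')" "$TS" | nc -w1 graphite.example.com 2003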

On Wed, Aug 5, 2015 at 4:05 PM, Bob Ababurko <bob@xxxxxxxxxxxx> wrote:
Hi John,

You are correct in that my expectations may be incongruent with what is possible with ceph(fs).  I'm currently copying many small files (images) from a NetApp to the cluster, files of around 35 KB each to be exact, and the number of objects/files copied thus far is fairly significant (see below):

[bababurko@cephmon01 ceph]$ sudo rados df
pool name                 KB      objects       clones     degraded      unfound           rd        rd KB           wr        wr KB
cephfs_data       3289284749    163993660            0            0           0            0            0    328097038   3369847354
cephfs_metadata       133364       524363            0            0           0      3600023   5264453980     95600004   1361554516
rbd                        0            0            0            0           0            0            0            0            0
  total used      9297615196    164518023
  total avail    19990923044
  total space    29288538240

Yes, that looks like roughly 164 million objects copied to the cluster.  I would assume this will potentially be a burden on the MDS, but I have yet to confirm that with `ceph daemonperf mds.<id>`.  I cannot seem to run it on the MDS host, as the CLI doesn't seem to know about that command:

[bababurko@cephmds01]$ sudo ceph daemonperf mds.cephmds01
no valid command found; 10 closest matches:
osd lost <int[0-]> {--yes-i-really-mean-it}
osd create {<uuid>}
osd primary-temp <pgid> <id>
osd primary-affinity <osdname (id|osd.id)> <float[0.0-1.0]>
osd reweight <int[0-]> <float[0.0-1.0]>
osd pg-temp <pgid> {<id> [<id>...]}
osd in <ids> [<ids>...]
osd rm <ids> [<ids>...]
osd down <ids> [<ids>...]
osd out <ids> [<ids>...]
Error EINVAL: invalid command

This fails in the same manner on all the hosts in the cluster.  I'm very green with ceph and I'm probably missing something obvious.  Is there something I need to install to get access to the 'ceph daemonperf' command on Hammer?
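
For whatever it's worth, a quick sanity check of whether the subcommand even exists in this version of the CLI:

    ceph --version                            # which release the CLI itself is from
    ceph --help 2>&1 | grep -i daemonperf     # does this CLI advertise daemonperf at all?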

thanks,
Bob

On Wed, Aug 5, 2015 at 2:43 AM, John Spray <jspray@xxxxxxxxxx> wrote:
On Tue, Aug 4, 2015 at 10:36 PM, Bob Ababurko <bob@xxxxxxxxxxxx> wrote:
> My writes are not going as I would expect with respect to IOPS (50-1000 IOPS)
> and write throughput (~25 MB/s max).  I'm interested in understanding what it
> takes to create an SSD pool that I can then migrate the current
> cephfs_metadata pool to.  I suspect that the spinning-disk metadata pool is a
> bottleneck, and I want to try to get the maximum performance out of this
> cluster to prove that we should build out a larger version.  One caveat is
> that I have copied about 4 TB of data to the cluster via cephfs and don't
> want to lose it, so I obviously need to keep the metadata intact.

I'm a bit suspicious of this: your IOPS expectations sort of imply
doing big files, but you're then suggesting that metadata is the
bottleneck (i.e. small file workload).

There are lots of statistics that come out of the MDS; you may be
particularly interested in mds_server.handle_client_request and
objecter.op_active, to work out whether there really are lots of RADOS
operations getting backed up on the MDS (which would be the symptom of
a too-slow metadata pool).  "ceph daemonperf mds.<id>" may be of some
help if you don't already have graphite or something similar set up.

> If anyone has done this OR understands how this can be done, I would
> appreciate the advice.

You could potentially do this in a two-phase process where you
initially set a crush rule that includes both SSDs and spinners, and
then finally set a crush rule that just points to SSDs.  Obviously
that'll do lots of data movement, but your metadata is probably a fair
bit smaller than your data so that might be acceptable.
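
Concretely, the final (SSD-only) phase would look something like the sketch below.  This is untested, the bucket/rule/host names are invented, and note that on Hammer the pool setting is still called crush_ruleset:

    # create a root containing only the SSD OSDs
    ceph osd crush add-bucket ssd-root root
    # place the SSD OSDs (or whole SSD hosts) under it, e.g.:
    #   ceph osd crush set osd.12 1.0 root=ssd-root host=cephssd01
    # rule that spreads replicas across hosts under ssd-root
    ceph osd crush rule create-simple ssd-rule ssd-root host
    ceph osd crush rule dump ssd-rule       # note the rule_id
    # repoint the metadata pool; this is what triggers the data movement
    ceph osd pool set cephfs_metadata crush_ruleset <rule_id>

The intermediate phase (a rule spanning both SSDs and spinners) would be the
same commands, just with a root that contains both kinds of devices.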

John





_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
