A question about performance degradation with multiple MDS daemons under burst metadata operations

Hi, I'm Jongyul Kim, and I'm interested in the performance of Ceph.
I tried to figure out the advantage of using two MDS daemons instead of a single MDS under massive metadata operations (renames), but two MDS daemons performed worse than a single MDS daemon. I'd like to ask for your advice on why this happens.

Here is what I did.

I wrote a micro benchmark in which each process 1) creates a file, 2) writes 4 KB to the file, and 3) renames it to another directory. I measured the throughput (operations/sec) of Ceph while increasing the number of benchmark processes. The experimental setup is as follows.
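
A minimal sketch of the per-process workload (the mount point, directory names, and file names below are placeholders I chose for illustration, not the exact benchmark code):

# Sketch of one benchmark process's workload (hypothetical paths and names).
# Assumes CephFS is mounted at /mnt/cephfs and that the per-process source
# and target directories already exist.
import os
import time

MOUNT = "/mnt/cephfs"        # assumed CephFS mount point
PAYLOAD = b"\0" * 4096       # 4 KB written to each file
NUM_OPS = 1000               # operations per process

def run_worker(proc_id: int) -> float:
    src_dir = os.path.join(MOUNT, f"src_{proc_id}")   # per-process source dir
    dst_dir = os.path.join(MOUNT, f"dst_{proc_id}")   # per-process target dir
    start = time.monotonic()
    for i in range(NUM_OPS):
        path = os.path.join(src_dir, f"file_{i}")
        # 1) create the file and 2) write 4 KB to it
        with open(path, "wb") as f:
            f.write(PAYLOAD)
        # 3) rename it into the target directory
        os.rename(path, os.path.join(dst_dir, f"file_{i}"))
    elapsed = time.monotonic() - start
    return NUM_OPS / elapsed   # this process's throughput in ops/sec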

[Ceph and HW configuration]
  • Ceph version: 14.2.1
  • OSDs configured with FileStore
  • NVM with ext4 (DAX) used as the storage device
  • IPoIB over a 40 Gbps InfiniBand NIC
  • Sufficient cores and memory (96 cores with hyperthreading, about 300 GB DRAM)
[Base setup]
  • There are two nodes: Node A and Node B
    • Node A runs 1 MON, 1 MGR, 1 OSD, 1 MDS, and the micro benchmark processes.
    • Node B runs 1 OSD and the micro benchmark processes.
  • Each process performs 1,000 operations (ops).
  • Each process has its own source directory and target directory for renaming, so there is no contention between rename requests of different processes; each process renames 1,000 files from its own source directory to its own target directory (a driver sketch follows this list).
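The driver referenced above is roughly the following (reusing MOUNT and run_worker from the earlier sketch; the directory naming is again just my placeholder convention):

# Hypothetical driver: one source/target directory pair per process, so
# rename requests of different processes never touch the same directories.
import multiprocessing as mp
import os

def run_benchmark(num_procs: int) -> float:
    for p in range(num_procs):
        os.makedirs(os.path.join(MOUNT, f"src_{p}"), exist_ok=True)
        os.makedirs(os.path.join(MOUNT, f"dst_{p}"), exist_ok=True)
    with mp.Pool(num_procs) as pool:
        per_proc_ops = pool.map(run_worker, range(num_procs))
    return sum(per_proc_ops)   # rough aggregate throughput in ops/sec

if __name__ == "__main__":
    for n in (1, 2, 4, 8, 16):
        print(n, "processes:", round(run_benchmark(n)), "ops/sec")
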
As I increased the number of benchmark processes, Ceph stopped scaling (in terms of throughput) at around 8 processes per node. I suspect the bottleneck is the MDS daemon, because rename requests took more time as the number of processes increased (i.e., the rename portion of total execution time grew from 10% with 1 process per node to 24% with 8 processes per node).

To achieve higher throughput, I added one more active MDS daemon on Node B (so there are now two MDS daemons, one on Node A and one on Node B). Additionally, directories were pinned to one of the two MDS daemons to shard the metadata operations: directories accessed by processes on Node A were pinned to the MDS daemon running on Node A, and likewise for Node B. The result, as mentioned at the beginning, was that two MDS daemons achieved lower throughput, about 50-60% of the single-MDS case. The rename portion of the total execution time also increased (50% with 1 process per node and 88% with 8 processes per node).
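
Pinning a directory subtree to an MDS rank is done with the ceph.dir.pin virtual extended attribute, and multiple active MDS daemons were enabled beforehand with "ceph fs set <fs_name> max_mds 2". Roughly what I did looks like the following sketch (paths and the rank mapping are placeholders, not the exact script):

# Sketch: pin the per-process directories to an MDS rank via ceph.dir.pin
# (paths and the rank assignment are illustrative).
import os

MOUNT = "/mnt/cephfs"   # assumed CephFS mount point

def pin_dirs(proc_ids, rank: int) -> None:
    """Pin the given processes' source and target directories to an MDS rank."""
    for p in proc_ids:
        for prefix in ("src", "dst"):
            path = os.path.join(MOUNT, f"{prefix}_{p}")
            # equivalent to: setfattr -n ceph.dir.pin -v <rank> <path>
            os.setxattr(path, "ceph.dir.pin", str(rank).encode())

# e.g. directories used by Node A's processes pinned to rank 0 (MDS on Node A),
# and directories used by Node B's processes pinned to rank 1 (MDS on Node B):
# pin_dirs(node_a_proc_ids, 0); pin_dirs(node_b_proc_ids, 1)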

I found that a rename request to the MDS on Node B takes much longer than a request to the MDS on Node A. (The MDS on Node A has authority over the '/' directory, so it appears to act as the master MDS.) I checked the logs and confirmed that a rename MDS request on Node B was re-dispatched three more times in the MDS Server to acquire permissions for renaming (once for "pin inode" and twice for "scatter locks"; I'm not sure why the scatter locks had to be requested twice), whereas a rename MDS request in the MDS Server on Node A was never re-dispatched.

Although the directories were pinned to the MDS on Node B, this MDS continually requested permissions from the MDS on Node A on every rename request. As a result, directory pinning for sharding metadata operations was useless, and the two-MDS setup performed worse.

Why does this happen? Why does the second MDS need to re-acquire permissions ("pin inode" and the scatter locks) on every rename request, even though it has authority over the directories (via directory pinning)? Performance would improve if the authoritative MDS (the MDS on Node B in this case) kept the locks or the "pin inode" permission for its authorized directories until a revocation was actually required; this would eliminate the ping-ponging of locks and permissions between MDS daemons. However, the current Ceph MDS implementation does not work this way. I would appreciate your advice on the rationale behind this design choice.

Any comments and advice will be appreciated. Thanks.

Sincerely,
Jongyul Kim
