We're using cephfs here as well for HPC scratch, but we're on Luminous 12.2.1. This issue seems to have been fixed between Jewel and Luminous; we don't see such problems. :) Any reason you guys aren't evaluating the latest LTS?
From: ceph-users <ceph-users-bounces@xxxxxxxxxxxxxx> on behalf of Tyanko Aleksiev <tyanko.alexiev@xxxxxxxxx>
Sent: Tuesday, 17 October 2017 4:07:26 AM
To: ceph-users
Subject: cephfs: some metadata operations take seconds to complete
Hi,
At UZH we are currently evaluating cephfs as a distributed file system
for the scratch space of an HPC installation. Under certain circumstances
we observe a slowdown of metadata operations: in particular, commands
issued right after deleting a big file can take several seconds to
complete.
Example:
dd bs=$((1024*1024*128)) count=2048 if=/dev/zero of=./dd-test
274877906944 bytes (275 GB, 256 GiB) copied, 224.798 s, 1.2 GB/s
dd bs=$((1024*1024*128)) count=2048 if=./dd-test of=./dd-test2
274877906944 bytes (275 GB, 256 GiB) copied, 1228.87 s, 224 MB/s
ls; time rm dd-test2 ; time ls
dd-test dd-test2
real 0m0.004s
user 0m0.000s
sys 0m0.000s
dd-test
real 0m8.795s
user 0m0.000s
sys 0m0.000s
Additionally, the time it takes to complete the "ls" command appears to
be proportional to the size of the deleted file. The issue described
above is not limited to "ls" but extends to other commands:
ls ; time rm dd-test2 ; time du -hs ./*
dd-test dd-test2
real 0m0.003s
user 0m0.000s
sys 0m0.000s
128G ./dd-test
real 0m9.974s
user 0m0.000s
sys 0m0.000s
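For what it's worth, the scaling with file size can be quantified with a
small loop like the one below (a rough, untested sketch; it assumes the
working directory is on the cephfs mount and has enough free space):
# time "ls" right after removing files of increasing size
for gb in 8 16 32 64 128; do
    dd bs=$((1024*1024)) count=$((gb*1024)) if=/dev/zero of=./size-test 2>/dev/null
    sync
    echo "after removing a ${gb}G file:"
    rm ./size-test
    time ls > /dev/null
done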
What might be causing this behavior, and how could we improve it?
Setup:
- ceph version: 10.2.9, OS: Ubuntu 16.04, kernel: 4.8.0-58-generic,
- 3 monitors,
- 1 mds,
- 3 storage nodes with 24 x 4TB disks per node: 1 OSD per disk (72 OSDs
in total). The 4TB disks are used for the cephfs_data pool; journals are
on SSDs,
- we installed a 400GB NVMe disk on each storage node and aggregated the
three disks under a dedicated CRUSH rule (see the sketch below). The
cephfs_metadata pool was then created with that rule and is therefore
hosted on the NVMes. Journal and data share the same partition here.
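Roughly, such a setup can be done along these lines on Jewel (a
simplified sketch, not our exact commands; bucket, host and OSD names,
the weight and the ruleset id are placeholders, and on Jewel the pool
setting is crush_ruleset rather than the newer crush_rule):
# separate CRUSH root for the NVMe OSDs
ceph osd crush add-bucket nvme root
# one host bucket per storage node under that root (repeat per node)
ceph osd crush add-bucket node1-nvme host
ceph osd crush move node1-nvme root=nvme
# place the node's NVMe OSD under its host bucket (id/weight are placeholders)
ceph osd crush set osd.72 0.4 root=nvme host=node1-nvme
# replicated rule that chooses hosts under the "nvme" root
ceph osd crush rule create-simple nvme_rule nvme host
# point the metadata pool at the new rule (id from "ceph osd crush rule dump")
ceph osd pool set cephfs_metadata crush_ruleset <ruleset-id>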
So far we are using the default ceph configuration settings.
Clients are mounting the file system with the kernel driver using the
following options (again default):
"rw,noatime,name=admin,secret=<hidden>,acl,_netdev".
Thank you in advance for the help.
Cheers,
Tyanko
_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com