Re: cephfs: some metadata operations take seconds to complete

Thanks for the replies.
I'll move our whole testbed installation to Luminous and redo the tests.

Cheers,
Tyanko

On 17 October 2017 at 10:14, Yan, Zheng <ukernel@xxxxxxxxx> wrote:
On Tue, Oct 17, 2017 at 1:07 AM, Tyanko Aleksiev
<tyanko.alexiev@xxxxxxxxx> wrote:
> Hi,
>
> At UZH we are currently evaluating cephfs as a distributed file system for
> the scratch space of an HPC installation. Under certain circumstances we see
> a slowdown of metadata operations: in particular, commands issued right
> after the deletion of a big file can take several seconds to complete.
>
> Example:
>
> dd bs=$((1024*1024*128)) count=2048 if=/dev/zero of=./dd-test
> 274877906944 bytes (275 GB, 256 GiB) copied, 224.798 s, 1.2 GB/s
>
> dd bs=$((1024*1024*128)) count=2048 if=./dd-test of=./dd-test2
> 274877906944 bytes (275 GB, 256 GiB) copied, 1228.87 s, 224 MB/s
>
> ls; time rm dd-test2 ; time ls
> dd-test  dd-test2
>
> real    0m0.004s
> user    0m0.000s
> sys     0m0.000s
> dd-test
>
> real    0m8.795s
> user    0m0.000s
> sys     0m0.000s
>
> Additionally, the time it takes to complete the "ls" command appears to be
> proportional to the size of the deleted file. The issue described above is
> not limited to "ls" but extends to other commands:
>
> ls ; time rm dd-test2 ; time du -hs ./*
> dd-test  dd-test2
>
> real    0m0.003s
> user    0m0.000s
> sys     0m0.000s
> 128G    ./dd-test
>
> real    0m9.974s
> user    0m0.000s
> sys     0m0.000s
>
> What might be causing this behavior, and how could we improve it?
>

Seems like the mds was waiting for a journal flush; it can wait up to
'mds_tick_interval'. This issue should be fixed in the Luminous release.
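
If you want to check the value on your active mds, or experiment with
lowering it while still on Jewel, something along these lines should work
(the mds name is a placeholder):

ceph daemon mds.<name> config get mds_tick_interval
ceph tell mds.<name> injectargs '--mds_tick_interval 2'

(injectargs may report that the option requires a restart to take effect,
in which case it would have to go into ceph.conf instead.)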

Regards
Yan, Zheng

> Setup:
>
> - ceph version: 10.2.9, OS: Ubuntu 16.04, kernel: 4.8.0-58-generic,
> - 3 monitors,
> - 1 mds,
> - 3 storage nodes with 24 X 4TB disks on each node: 1 OSD/disk (72 OSDs in
> total). 4TB disks are used for the cephfs_data pool. Journaling is on SSDs,
> - we installed a 400GB NVMe disk on each storage node and aggregated the
> three disks in a CRUSH rule. The cephfs_metadata pool was then created using
> that rule and is therefore hosted on the NVMes. Journal and data share the
> same partition here.
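>
> For reference, a setup of this kind can be done on Jewel roughly as follows
> (bucket, rule, OSD ids and weights below are only illustrative, not our
> exact ones):
>
> ceph osd crush add-bucket nvme root
> ceph osd crush add-bucket node1-nvme host
> ceph osd crush move node1-nvme root=nvme
> ceph osd crush create-or-move osd.72 0.36 root=nvme host=node1-nvme
> ceph osd crush rule create-simple nvme-rule nvme host
> ceph osd pool create cephfs_metadata 128 128 replicated nvme-rule
>
> (the add-bucket and create-or-move steps are repeated for each storage
> node and its NVMe OSD)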
>
> So far we are using the default ceph configuration settings.
>
> Clients are mounting the file system with the kernel driver using the
> following options (again default):
> "rw,noatime,name=admin,secret=<hidden>,acl,_netdev".
>
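> In other words, the mount boils down to something like the following
> (monitor addresses and the mount point are placeholders):
>
> mount -t ceph mon1:6789,mon2:6789,mon3:6789:/ /scratch -o rw,noatime,name=admin,secret=<hidden>,acl,_netdev
>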
> Thank you in advance for the help.
>
> Cheers,
> Tyanko
>

_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
