Have you compared performance against mounting cephfs with ceph-fuse instead of the kernel client? ceph-fuse is a package, so it matches the version of ceph you are running, whereas with the kernel client you need to update your kernel to pick up the features of the current ceph version. I switched to ceph-fuse for my cluster (drastically smaller and less utilized than yours) and it has been running more smoothly than it did with the kernel client. A very interesting thing ceph-fuse does is that an ls -lhd of a directory shows the total size of everything under that directory. It's a drastically faster way to get the size of a folder than running du.
david@kaylee:/mnt/cephfs$ ls -lh
total 2.5K
drwxr-xr-x 1 david david 89G Dec 12 2016 fix/
drwxr-xr-x 1 david david 1.2T Dec 5 2016 active/
drwxr-xr-x 1 david david 7.0T Jan 20 18:40 archive/
drwxr-xr-x 1 david david 0 Jun 15 13:24 sort/
david@kaylee:/mnt/cephfs$ ls -lh archive/
total 2.0K
drwxr-xr-x 1 david david 6.5T Jun 11 13:47 book/
drwxr-xr-x 1 david david 587G Jun 7 10:51 zoe/
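The directory sizes in the listing above come from the recursive statistics cephfs keeps per directory, which are also exposed as virtual extended attributes. Here is a minimal sketch for reading one from a script, assuming a Linux client with cephfs mounted and the ceph.dir.rbytes attribute readable on the directory. The helper names dir_rbytes and human_size are hypothetical, not part of any ceph tooling.

```python
import os


def dir_rbytes(path, getxattr=None):
    """Return the recursive byte count cephfs reports for a directory.

    Assumes the mount exposes the virtual xattr "ceph.dir.rbytes".
    The getxattr parameter exists only so the lookup can be faked in tests.
    """
    if getxattr is None:
        getxattr = os.getxattr  # Linux-only in the standard library
    raw = getxattr(path, "ceph.dir.rbytes")
    return int(raw.decode().strip())


def human_size(num_bytes):
    """Format a byte count with binary units, roughly like ls -h."""
    size = float(num_bytes)
    for unit in ("B", "K", "M", "G", "T", "P"):
        if size < 1024 or unit == "P":
            return f"{size:.1f}{unit}"
        size /= 1024


if __name__ == "__main__":
    # /mnt/cephfs/archive is the example path from the listing above.
    print(human_size(dir_rbytes("/mnt/cephfs/archive")))
```

Like the ls output, this is answered from metadata the MDS already tracks rather than by walking the tree, so the value can lag slightly behind very recent writes.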
Another thing that strikes me as odd is that you seem to be doing one of the no-nos of distributed file systems. Judging by the multithreaded solution for placing files into cephfs, it looks like you have some devs working on this project. It's always best to query a database for information rather than the file system. If I'm using a large distributed filesystem for something at work, I make sure that nothing is placed into that filesystem without the database knowing everything it needs to about the file: its location, its size, who it belongs to, whether it has an expiration for when it should be deleted, etc. You can always reach a scale where querying the filesystem for such information takes hours, while a query to a properly structured database returns in seconds.
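To make the database suggestion concrete, here is a rough sketch using Python's built-in sqlite3 module. The table layout, column names and helper functions are all hypothetical, and a production setup would more likely use a shared database server than a local SQLite file. The writer records each file when it places it into cephfs, and size or expiry questions go to the catalog instead of a du over the filesystem.

```python
import sqlite3


def open_catalog(db_path=":memory:"):
    """Open (or create) the catalog; one row per file placed in cephfs."""
    conn = sqlite3.connect(db_path)
    conn.execute(
        """CREATE TABLE IF NOT EXISTS files (
               path       TEXT PRIMARY KEY,
               size_bytes INTEGER NOT NULL,
               owner      TEXT NOT NULL,
               expires_at TEXT            -- ISO 8601 timestamp, NULL = keep
           )"""
    )
    return conn


def record_file(conn, path, size_bytes, owner, expires_at=None):
    """Call this from the writer right after a file lands in cephfs."""
    with conn:  # commits on success, rolls back on error
        conn.execute(
            "INSERT OR REPLACE INTO files (path, size_bytes, owner, expires_at) "
            "VALUES (?, ?, ?, ?)",
            (path, size_bytes, owner, expires_at),
        )


def usage_by_owner(conn):
    """Total bytes per owner, answered without touching the filesystem."""
    rows = conn.execute(
        "SELECT owner, SUM(size_bytes) FROM files GROUP BY owner ORDER BY owner"
    )
    return dict(rows.fetchall())


def expired_paths(conn, now_iso):
    """Paths whose expiration is at or before the given ISO timestamp."""
    rows = conn.execute(
        "SELECT path FROM files "
        "WHERE expires_at IS NOT NULL AND expires_at <= ? ORDER BY path",
        (now_iso,),
    )
    return [row[0] for row in rows]
```

Storing the timestamps as ISO 8601 strings keeps the expiry comparison a plain string comparison, which is enough for a sketch like this.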
On the topic of running hourly snapshots of cephfs, are you monitoring how large your snap trim queue is? I've found that deleting snapshots can cause a lot of slowdowns in the cluster, so deletions should be scheduled for a time when the cluster will be mostly idle and can get through as many of them as possible. If you're also deleting snapshots each hour, that might be a place to look for odd cluster happenings.
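One way to watch the snap trim queue is to look at the snap_trimq field in the JSON that ceph pg <pgid> query prints. The sketch below only parses that output. It assumes the field sits at the top level of the JSON and is an interval-set string such as "[1~3,a~2]" (start~length pairs in hexadecimal); please verify that against your own cluster's output before relying on it. The function names are hypothetical.

```python
import json


def snap_trimq_len(interval_set):
    """Count the snapshots in an interval-set string such as '[1~3,a~2]'.

    Assumption: each entry is start~length and both values are hexadecimal.
    """
    body = interval_set.strip().strip("[]")
    if not body:
        return 0
    total = 0
    for part in body.split(","):
        _start, length = part.split("~")
        total += int(length, 16)
    return total


def queue_len_from_pg_query(pg_query_json):
    """Extract the queue length from the JSON text of one pg query.

    Assumption: "snap_trimq" is a top-level key; a missing key counts as empty.
    """
    doc = json.loads(pg_query_json)
    return snap_trimq_len(doc.get("snap_trimq", "[]"))
```

Summing this over all PGs in the cephfs data pool before and after a snapshot deletion gives a rough idea of how much trimming work is still queued.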
On Thu, Jun 15, 2017 at 12:39 PM Eric Eastman <eric.eastman@xxxxxxxxxxxxxx> wrote:
We are running Ceph 10.2.7 and after adding a new multi-threaded
writer application we are seeing hangs accessing metadata from ceph
file system kernel mounted clients. I have a "du -ah /cephfs" process
that has been stuck for over 12 hours on one cephfs client system. We
started seeing hung "du -ah" processes two days ago, so yesterday we
upgraded the whole cluster from v10.2.5 to v10.2.7, but the problem
occurred again last night. Rebooting the client fixes the problem.
The ceph -s command is showing HEALTH_OK
We have four ceph file system clients, each kernel mounting our 1 ceph
file system to /cephfs. The "du -ah /cephfs" runs hourly within a test
script that is cron controlled. If the du -ah /cephfs does not
complete within an hour, emails are sent to the admin group as part of
our monitoring process. This command normally takes less than a minute
to run and we have just over 3.6M files in this file system. The du
-ah is hanging while accessing sub-directories where the new
multi-threaded writer application is writing.
About the application: On one ceph client we are downloading external
data via the network and writing data as files with a python program
into the ceph file system. The python script can write up to 100 files
in parallel. The metadata hangs we are seeing can occur on one or more
client systems, but right now it is only hung on one system, which is
not the node writing the data.
System info:
ceph -s
cluster ba0c94fc-1168-11e6-aaea-000c290cc2d4
health HEALTH_OK
monmap e1: 3 mons at
{mon01=10.16.51.21:6789/0,mon02=10.16.51.22:6789/0,mon03=10.16.51.23:6789/0}
election epoch 138, quorum 0,1,2 mon01,mon02,mon03
fsmap e3210: 1/1/1 up {0=mds02=up:active}, 2 up:standby
osdmap e33046: 85 osds: 85 up, 85 in
flags sortbitwise,require_jewel_osds
pgmap v27679236: 16192 pgs, 12 pools, 7655 GB data, 6591 kobjects
24345 GB used, 217 TB / 241 TB avail
16188 active+clean
3 active+clean+scrubbing
1 active+clean+scrubbing+deep
client io 0 B/s rd, 15341 kB/s wr, 0 op/s rd, 21 op/s wr
On the hung client node, we are seeing an entry in mdsc
cat /sys/kernel/debug/ceph/*/mdsc
163925513 mds0 readdir #100003be2b1 kplr009658474_dr25_window.fits
I am not seeing this on the other 3 client nodes.
On the active metadata server, I ran:
ceph daemon mds.mds02 dump_ops_in_flight
every 2 seconds, as it kept changing. Part of the output is at:
https://paste.fedoraproject.org/paste/OizCowo3oGzZo-cJWV5R~Q
Info about the system
OS: Ubuntu Trusty
Cephfs snapshots are turned on and being created hourly
Ceph Version
ceph -v
ceph version 10.2.7 (50e863e0f4bc8f4b9e31156de690d765af245185)
Kernel: Ceph Servers:
uname -a
Linux mon01 4.2.0-27-generic #32~14.04.1-Ubuntu SMP Fri Jan 22
15:32:26 UTC 2016 x86_64 x86_64 x86_64 GNU/Linux
Kernel Cephfs clients:
uname -a
Linux dfgw02 4.9.21-040921-generic #201704080434 SMP Sat Apr 8
08:35:57 UTC 2017 x86_64 x86_64 x86_64 GNU/Linux
Let me know if I should write up a ticket on this.
Thanks
Eric
_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com