Re: Ceph file system hang

Have you compared performance when mounting cephfs with ceph-fuse instead of the kernel client?  ceph-fuse is a package that will match your current version of Ceph, whereas with the kernel client you need to update your kernel to match the current version/features of Ceph.  I switched to ceph-fuse for my cluster (drastically smaller and less utilized than yours) and it has been running more smoothly than it did with the kernel client.  A very interesting thing ceph-fuse does is that an ls -lhd of a directory shows the directory structure's recursive size, which is a drastically faster way to get a folder's size than running du.

david@kaylee:/mnt/cephfs$ ls -lh
total 2.5K
drwxr-xr-x 1 david david  89G Dec 12  2016 fix/
drwxr-xr-x 1 david david 1.2T Dec  5  2016 active/
drwxr-xr-x 1 david david 7.0T Jan 20 18:40 archive/
drwxr-xr-x 1 david david    0 Jun 15 13:24 sort/
david@kaylee:/mnt/cephfs$ ls -lh archive/
total 2.0K
drwxr-xr-x 1 david david 6.5T Jun 11 13:47 book/
drwxr-xr-x 1 david david 587G Jun  7 10:51 zoe/
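
If you want to try it, switching a client is mostly a matter of installing the ceph-fuse package and mounting through it instead of the kernel.  The mount point and monitor host below are placeholders for illustration, not taken from your cluster:

# Unmount the existing kernel-client mount first (placeholder path).
umount /mnt/cephfs

# Kernel client mount, for comparison:
# mount -t ceph mon01:6789:/ /mnt/cephfs -o name=admin,secretfile=/etc/ceph/admin.secret

# Mount the same file system with ceph-fuse instead:
ceph-fuse -m mon01:6789 /mnt/cephfs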

Another thing that strikes me as odd is that you seem to be doing one of the no-nos of distributed file systems.  It looks like you have some devs working on this project, given the multithreaded solution for placing files into cephfs.  It's always best to query a database for information rather than the file system.  If I'm using a large distributed filesystem for something at work, I make sure that nothing is placed into that filesystem without the database knowing everything it needs to about the file: its location, size, who the file belongs to, whether the file has an expiration for when it should be deleted, etc.  You can always reach a scale where querying the filesystem for such information could take hours, where a query against a properly structured database would return in seconds.
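
As a minimal sketch of what I mean (the table layout, paths, and values here are hypothetical examples, not anything from your setup), even sqlite3 driven from the shell is enough to keep a catalog alongside the writer:

# Hypothetical file catalog; adjust the columns to whatever your app needs to know.
sqlite3 /var/lib/filecatalog.db "
  CREATE TABLE IF NOT EXISTS files (
    path       TEXT PRIMARY KEY,   -- location inside /cephfs
    bytes      INTEGER NOT NULL,   -- size at write time
    owner      TEXT NOT NULL,      -- who the file belongs to
    expires_at TEXT                -- NULL if the file never expires
  );"

# Record each file right after the writer creates it (example values only).
sqlite3 /var/lib/filecatalog.db "INSERT OR REPLACE INTO files VALUES
  ('/cephfs/archive/book/example.fits', 1048576, 'eric', '2017-12-31');"

# Per-owner totals come back in seconds, with no filesystem walk at all.
sqlite3 /var/lib/filecatalog.db "SELECT owner, SUM(bytes) FROM files GROUP BY owner;"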

On the topic of running hourly snapshots of cephfs, are you monitoring how large your snap trim queue is?  I've found that deleting snapshots can cause a lot of slowdown in the cluster, so deletions should be scheduled for a time when the cluster will be mostly idle to get through as many of them as possible.  If you're also deleting snapshots each hour, that might be another place to look for odd cluster behavior.
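
A rough way to keep an eye on that from the shell is below; the exact field names in the query output can differ between Ceph releases, so treat it as a sketch rather than something verified on Jewel:

# Show pools along with any snapshots that are removed but not yet fully trimmed.
ceph osd pool ls detail | grep -E 'pool |removed_snaps'

# Inspect a single placement group and look for a non-empty snap trim queue.
# Replace 1.0 with a real pgid from 'ceph pg dump' on your cluster.
ceph pg 1.0 query | grep -A 2 snap_trimq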

On Thu, Jun 15, 2017 at 12:39 PM Eric Eastman <eric.eastman@xxxxxxxxxxxxxx> wrote:
We are running Ceph 10.2.7 and after adding a new multi-threaded
writer application we are seeing hangs accessing metadata from ceph
file system kernel mounted clients.  I have a "du -ah /cephfs" process
that has been stuck for over 12 hours on one cephfs client system.  We
started seeing hung "du -ah" processes two days ago, so yesterday we
upgraded the whole cluster from v10.2.5 to v10.2.7, but the problem
occurred again last night.  Rebooting the client fixes the problem.
The ceph -s command is showing HEALTH_OK.

We have four ceph file system clients, each kernel mounting our one ceph
file system to /cephfs. The "du -ah /cephfs" runs hourly within a test
script that is cron controlled.  If the du -ah /cephfs does not
complete within an hour, emails are sent to the admin group as part of
our monitoring process. This command normally takes less than a minute
to run and we have just over 3.6M files in this file system.  The du
-ah is hanging while accessing sub-directories where the new
multi-threaded writer application is writing.
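
A stripped-down version of that hourly check, with placeholder paths and
addresses rather than our real script, looks roughly like this:

#!/bin/sh
# Hourly cron job: walk cephfs with du and alert if it has not finished
# within an hour.  Log path and alert address are placeholders only.
du -ah /cephfs > /var/log/cephfs-du.log 2>&1 &
DU_PID=$!
sleep 3600
if kill -0 "$DU_PID" 2>/dev/null; then
    echo "du -ah /cephfs did not complete within an hour" \
        | mail -s "cephfs du hang" admin@example.com
fi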

About the application: On one ceph client we are downloading external
data over the network and writing it out as files into the ceph file
system with a python program. The python script can write up to 100 files
in parallel. The metadata hangs we are seeing can occur on one or more
client systems, but right now it is only hung on one system, which is
not the node writing the data.

System info:

ceph -s
    cluster ba0c94fc-1168-11e6-aaea-000c290cc2d4
     health HEALTH_OK
     monmap e1: 3 mons at
{mon01=10.16.51.21:6789/0,mon02=10.16.51.22:6789/0,mon03=10.16.51.23:6789/0}
            election epoch 138, quorum 0,1,2 mon01,mon02,mon03
      fsmap e3210: 1/1/1 up {0=mds02=up:active}, 2 up:standby
     osdmap e33046: 85 osds: 85 up, 85 in
            flags sortbitwise,require_jewel_osds
      pgmap v27679236: 16192 pgs, 12 pools, 7655 GB data, 6591 kobjects
            24345 GB used, 217 TB / 241 TB avail
               16188 active+clean
                   3 active+clean+scrubbing
                   1 active+clean+scrubbing+deep
  client io 0 B/s rd, 15341 kB/s wr, 0 op/s rd, 21 op/s wr


On the hung client node, we are seeing an entry in mdsc:
cat /sys/kernel/debug/ceph/*/mdsc
163925513 mds0 readdir #100003be2b1 kplr009658474_dr25_window.fits

I am not seeing this on the other 3 client nodes.

On the active metadata server, I ran:

ceph daemon mds.mds02 dump_ops_in_flight

every 2 seconds, as it kept changing.  Part of the output is at:
https://paste.fedoraproject.org/paste/OizCowo3oGzZo-cJWV5R~Q
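
In practice that was just watching the command in a loop, along the
lines of:

# Re-run the in-flight ops dump every 2 seconds on the active MDS host.
watch -n 2 'ceph daemon mds.mds02 dump_ops_in_flight'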

Info about the system

OS: Ubuntu Trusty

Cephfs snapshots are turned on and being created hourly

Ceph Version
ceph -v
ceph version 10.2.7 (50e863e0f4bc8f4b9e31156de690d765af245185)

Kernel: Ceph Servers:
uname -a
Linux mon01 4.2.0-27-generic #32~14.04.1-Ubuntu SMP Fri Jan 22
15:32:26 UTC 2016 x86_64 x86_64 x86_64 GNU/Linux

Kernel Cephfs clients:
uname -a
Linux dfgw02 4.9.21-040921-generic #201704080434 SMP Sat Apr 8
08:35:57 UTC 2017 x86_64 x86_64 x86_64 GNU/Linux

Let me know if I should write up a ticket on this.

Thanks

Eric
_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com