Re: MPI applications on ceph fs

Hi Sage and ceph-developers,
I reproduced the error again and captured file descriptor and log information, which I've quoted below. Here is a brief summary:
I have ceph 0.48 running on a single multiprocessor node. Two clients have mounted the cephfs over InfiniBand (the same error happens over gigabit Ethernet, though). I start a parallel program with MPI on the two clients, which starts 7 processes on each client. One process (pid 1728 below) opens 43 files at startup, and the remaining 13 processes open two files each. I've quoted the stat output for the processes' file descriptors below. The processes run fine, with I/O going to files, for about 25 s, at which point the process with 43 files locks up. At 33 s the remaining processes are locked up, and accesses to the mounted ceph fs also hang.
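
For anyone trying to reproduce this, here is a minimal Python sketch for spotting which of the MPI processes have stopped making progress. It assumes that "locked up" shows up as the uninterruptible-sleep state (`D`) in `/proc/<pid>/stat`, which I have not confirmed for this hang; the function names are illustrative only.

```python
import os


def parse_proc_stat(line):
    """Return (pid, comm, state) from the contents of /proc/<pid>/stat."""
    # comm is wrapped in parentheses and may itself contain spaces or ')',
    # so split on the LAST ')' rather than on whitespace.
    head, _, tail = line.rpartition(")")
    pid_text, _, comm = head.partition("(")
    return int(pid_text), comm, tail.split()[0]


def blocked_pids(pids, proc_root="/proc"):
    """Return [(pid, comm)] for the given pids that are in state 'D'.

    Assumption: a process hung on cephfs I/O sits in uninterruptible sleep.
    """
    blocked = []
    for pid in pids:
        try:
            with open(os.path.join(proc_root, str(pid), "stat")) as fh:
                _, comm, state = parse_proc_stat(fh.read())
        except OSError:
            continue  # process has exited, or /proc is unavailable
        if state == "D":
            blocked.append((pid, comm))
    return blocked


if __name__ == "__main__":
    # 1728 and 1729 are the pids quoted in this message.
    print(blocked_pids([1728, 1729]))
```

Polling this once a second during a run would show whether the process with 43 files really blocks first, around the 25 s mark.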

I hope these details along with the mds log quotes in my last message help isolate this problem.
Is it recommended to upgrade from 0.48, or are there backports to 0.48 I should incorporate? 
thanks,
-john

Here's the debug feedback you asked for:

  # cat /sys/kernel/debug/ceph/*/mdsc
     1028	mds0	getattr	 #100000003ee
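
If it helps to capture this repeatedly while the hang develops, here is a small Python sketch that parses the pending MDS requests. It assumes every line has the same layout as the one quoted above (request id, MDS name, operation, then a target), and that the trailing `#...` token is a hexadecimal inode number; both are guesses from this single line, and the names `PendingRequest` and `parse_mdsc` are made up for the example.

```python
from typing import NamedTuple


class PendingRequest(NamedTuple):
    tid: int      # first column, e.g. 1028
    mds: str      # e.g. "mds0"
    op: str       # e.g. "getattr"
    target: str   # rest of the line, e.g. "#100000003ee"


def parse_mdsc(text):
    """Parse the text of /sys/kernel/debug/ceph/*/mdsc into PendingRequest rows."""
    requests = []
    for line in text.splitlines():
        fields = line.split(None, 3)
        # Skip blank lines and anything that does not start with a numeric id.
        if len(fields) < 3 or not fields[0].isdigit():
            continue
        target = fields[3].strip() if len(fields) > 3 else ""
        requests.append(PendingRequest(int(fields[0]), fields[1], fields[2], target))
    return requests


if __name__ == "__main__":
    sample = "     1028\tmds0\tgetattr\t #100000003ee\n"
    for req in parse_mdsc(sample):
        print(req)
```

A request that stays in this list across several samples is a candidate for the one the clients are stuck on.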

The main process with 43 files:
 #+begin_quote stat /proc/1728/fd/*
[root@eaps-80-35e ~]# stat /proc/1728/fd/*
  File: `/proc/1728/fd/0' -> `pipe:[29056]'
  Size: 64        	Blocks: 0          IO Block: 1024   symbolic link
Device: 3h/3d	Inode: 25862       Links: 1
Access: (0500/lr-x------)  Uid: (50329/jcwright)   Gid: (  101/     mit)
Access: 2012-09-04 12:32:40.468849356 -0400
Modify: 2012-09-04 12:32:40.468849356 -0400
Change: 2012-09-04 12:32:40.468849356 -0400
 Birth: -
  File: `/proc/1728/fd/1' -> `pipe:[29057]'
  Size: 64        	Blocks: 0          IO Block: 1024   symbolic link
Device: 3h/3d	Inode: 25863       Links: 1
Access: (0300/l-wx------)  Uid: (50329/jcwright)   Gid: (  101/     mit)
Access: 2012-09-04 12:32:40.468849356 -0400
Modify: 2012-09-04 12:32:40.468849356 -0400
Change: 2012-09-04 12:32:40.468849356 -0400
 Birth: -
  File: `/proc/1728/fd/10' -> `anon_inode:[infinibandevent]'
  Size: 64        	Blocks: 0          IO Block: 1024   symbolic link
Device: 3h/3d	Inode: 25872       Links: 1
Access: (0500/lr-x------)  Uid: (50329/jcwright)   Gid: (  101/     mit)
Access: 2012-09-04 12:32:40.468849356 -0400
Modify: 2012-09-04 12:32:40.468849356 -0400
Change: 2012-09-04 12:32:40.468849356 -0400
 Birth: -
  File: `/proc/1728/fd/11' -> `pipe:[29058]'
  Size: 64        	Blocks: 0          IO Block: 1024   symbolic link
Device: 3h/3d	Inode: 25873       Links: 1
Access: (0300/l-wx------)  Uid: (50329/jcwright)   Gid: (  101/     mit)
Access: 2012-09-04 12:32:40.468849356 -0400
Modify: 2012-09-04 12:32:40.468849356 -0400
Change: 2012-09-04 12:32:40.468849356 -0400
 Birth: -
  File: `/proc/1728/fd/12' -> `/dev/infiniband/uverbs0'
  Size: 64        	Blocks: 0          IO Block: 1024   symbolic link
Device: 3h/3d	Inode: 25874       Links: 1
Access: (0700/lrwx------)  Uid: (50329/jcwright)   Gid: (  101/     mit)
Access: 2012-09-04 12:32:40.468849356 -0400
Modify: 2012-09-04 12:32:40.468849356 -0400
Change: 2012-09-04 12:32:40.468849356 -0400
 Birth: -
  File: `/proc/1728/fd/13' -> `anon_inode:[infinibandevent]'
  Size: 64        	Blocks: 0          IO Block: 1024   symbolic link
Device: 3h/3d	Inode: 25875       Links: 1
Access: (0500/lr-x------)  Uid: (50329/jcwright)   Gid: (  101/     mit)
Access: 2012-09-04 12:32:40.468849356 -0400
Modify: 2012-09-04 12:32:40.468849356 -0400
Change: 2012-09-04 12:32:40.468849356 -0400
 Birth: -
  File: `/proc/1728/fd/14' -> `/'
  Size: 64        	Blocks: 0          IO Block: 1024   symbolic link
Device: 3h/3d	Inode: 25876       Links: 1
Access: (0500/lr-x------)  Uid: (50329/jcwright)   Gid: (  101/     mit)
Access: 2012-09-04 12:32:40.468849356 -0400
Modify: 2012-09-04 12:32:40.468849356 -0400
Change: 2012-09-04 12:32:40.468849356 -0400
 Birth: -
  File: `/proc/1728/fd/15' -> `/dev/shm/ib_shmem-kvs_2025_0-eaps-80-35e.acesgrid.org-50329.tmp (deleted)'
  Size: 64        	Blocks: 0          IO Block: 1024   symbolic link
Device: 3h/3d	Inode: 25877       Links: 1
Access: (0700/lrwx------)  Uid: (50329/jcwright)   Gid: (  101/     mit)
Access: 2012-09-04 12:32:40.468849356 -0400
Modify: 2012-09-04 12:32:40.468849356 -0400
Change: 2012-09-04 12:32:40.468849356 -0400
 Birth: -
  File: `/proc/1728/fd/16' -> `/dev/shm/ib_pool-kvs_2025_0-eaps-80-35e.acesgrid.org-50329.tmp (deleted)'
  Size: 64        	Blocks: 0          IO Block: 1024   symbolic link
Device: 3h/3d	Inode: 25878       Links: 1
Access: (0700/lrwx------)  Uid: (50329/jcwright)   Gid: (  101/     mit)
Access: 2012-09-04 12:32:40.468849356 -0400
Modify: 2012-09-04 12:32:40.468849356 -0400
Change: 2012-09-04 12:32:40.468849356 -0400
 Birth: -
  File: `/proc/1728/fd/17' -> `/dev/shm/ib_shmem_coll-kvs_2025_0-eaps-80-35e.acesgrid.org-50329.tmp (deleted)'
  Size: 64        	Blocks: 0          IO Block: 1024   symbolic link
Device: 3h/3d	Inode: 25879       Links: 1
Access: (0700/lrwx------)  Uid: (50329/jcwright)   Gid: (  101/     mit)
Access: 2012-09-04 12:32:40.468849356 -0400
Modify: 2012-09-04 12:32:40.468849356 -0400
Change: 2012-09-04 12:32:40.468849356 -0400
 Birth: -
  File: `/proc/1728/fd/18' -> `/mnt/ceph/runtemplate/STDERR.0000'
  Size: 64        	Blocks: 0          IO Block: 1024   symbolic link
Device: 3h/3d	Inode: 25880       Links: 1
Access: (0700/lrwx------)  Uid: (50329/jcwright)   Gid: (  101/     mit)
Access: 2012-09-04 12:32:40.468849356 -0400
Modify: 2012-09-04 12:32:40.468849356 -0400
Change: 2012-09-04 12:32:40.468849356 -0400
 Birth: -
  File: `/proc/1728/fd/19' -> `/mnt/ceph/runtemplate/STDOUT.0000'
  Size: 64        	Blocks: 0          IO Block: 1024   symbolic link
Device: 3h/3d	Inode: 25881       Links: 1
Access: (0700/lrwx------)  Uid: (50329/jcwright)   Gid: (  101/     mit)
Access: 2012-09-04 12:32:40.468849356 -0400
Modify: 2012-09-04 12:32:40.468849356 -0400
Change: 2012-09-04 12:32:40.468849356 -0400
 Birth: -
  File: `/proc/1728/fd/2' -> `pipe:[29058]'
  Size: 64        	Blocks: 0          IO Block: 1024   symbolic link
Device: 3h/3d	Inode: 25864       Links: 1
Access: (0300/l-wx------)  Uid: (50329/jcwright)   Gid: (  101/     mit)
Access: 2012-09-04 12:32:40.468849356 -0400
Modify: 2012-09-04 12:32:40.468849356 -0400
Change: 2012-09-04 12:32:40.468849356 -0400
 Birth: -
  File: `/proc/1728/fd/20' -> `/mnt/ceph/runtemplate/stats/PO4.0000000000.txt'
  Size: 64        	Blocks: 0          IO Block: 1024   symbolic link
Device: 3h/3d	Inode: 25882       Links: 1
Access: (0700/lrwx------)  Uid: (50329/jcwright)   Gid: (  101/     mit)
Access: 2012-09-04 12:32:40.468849356 -0400
Modify: 2012-09-04 12:32:40.468849356 -0400
Change: 2012-09-04 12:32:40.468849356 -0400
 Birth: -
  File: `/proc/1728/fd/21' -> `/mnt/ceph/runtemplate/stats/NO3.0000000000.txt'
  Size: 64        	Blocks: 0          IO Block: 1024   symbolic link
Device: 3h/3d	Inode: 25883       Links: 1
Access: (0700/lrwx------)  Uid: (50329/jcwright)   Gid: (  101/     mit)
Access: 2012-09-04 12:32:40.468849356 -0400
Modify: 2012-09-04 12:32:40.468849356 -0400
Change: 2012-09-04 12:32:40.468849356 -0400
 Birth: -
  File: `/proc/1728/fd/22' -> `/mnt/ceph/runtemplate/stats/FeT.0000000000.txt'
  Size: 64        	Blocks: 0          IO Block: 1024   symbolic link
Device: 3h/3d	Inode: 25884       Links: 1
Access: (0700/lrwx------)  Uid: (50329/jcwright)   Gid: (  101/     mit)
Access: 2012-09-04 12:32:40.468849356 -0400
Modify: 2012-09-04 12:32:40.468849356 -0400
Change: 2012-09-04 12:32:40.468849356 -0400
 Birth: -
  File: `/proc/1728/fd/23' -> `/mnt/ceph/runtemplate/stats/Si.0000000000.txt'
  Size: 64        	Blocks: 0          IO Block: 1024   symbolic link
Device: 3h/3d	Inode: 25885       Links: 1
Access: (0700/lrwx------)  Uid: (50329/jcwright)   Gid: (  101/     mit)
Access: 2012-09-04 12:32:40.468849356 -0400
Modify: 2012-09-04 12:32:40.468849356 -0400
Change: 2012-09-04 12:32:40.468849356 -0400
 Birth: -
  File: `/proc/1728/fd/24' -> `/mnt/ceph/runtemplate/stats/DOP.0000000000.txt'
  Size: 64        	Blocks: 0          IO Block: 1024   symbolic link
Device: 3h/3d	Inode: 25886       Links: 1
Access: (0700/lrwx------)  Uid: (50329/jcwright)   Gid: (  101/     mit)
Access: 2012-09-04 12:32:40.468849356 -0400
Modify: 2012-09-04 12:32:40.468849356 -0400
Change: 2012-09-04 12:32:40.468849356 -0400
 Birth: -
  File: `/proc/1728/fd/25' -> `/mnt/ceph/runtemplate/stats/DON.0000000000.txt'
  Size: 64        	Blocks: 0          IO Block: 1024   symbolic link
Device: 3h/3d	Inode: 25887       Links: 1
Access: (0700/lrwx------)  Uid: (50329/jcwright)   Gid: (  101/     mit)
Access: 2012-09-04 12:32:40.468849356 -0400
Modify: 2012-09-04 12:32:40.468849356 -0400
Change: 2012-09-04 12:32:40.468849356 -0400
 Birth: -
  File: `/proc/1728/fd/26' -> `/mnt/ceph/runtemplate/stats/DOFe.0000000000.txt'
  Size: 64        	Blocks: 0          IO Block: 1024   symbolic link
Device: 3h/3d	Inode: 25888       Links: 1
Access: (0700/lrwx------)  Uid: (50329/jcwright)   Gid: (  101/     mit)
Access: 2012-09-04 12:32:40.468849356 -0400
Modify: 2012-09-04 12:32:40.468849356 -0400
Change: 2012-09-04 12:32:40.468849356 -0400
 Birth: -
  File: `/proc/1728/fd/27' -> `/mnt/ceph/runtemplate/stats/POP.0000000000.txt'
  Size: 64        	Blocks: 0          IO Block: 1024   symbolic link
Device: 3h/3d	Inode: 25889       Links: 1
Access: (0700/lrwx------)  Uid: (50329/jcwright)   Gid: (  101/     mit)
Access: 2012-09-04 12:32:40.468849356 -0400
Modify: 2012-09-04 12:32:40.468849356 -0400
Change: 2012-09-04 12:32:40.468849356 -0400
 Birth: -
  File: `/proc/1728/fd/28' -> `/mnt/ceph/runtemplate/stats/PON.0000000000.txt'
  Size: 64        	Blocks: 0          IO Block: 1024   symbolic link
Device: 3h/3d	Inode: 25890       Links: 1
Access: (0700/lrwx------)  Uid: (50329/jcwright)   Gid: (  101/     mit)
Access: 2012-09-04 12:32:40.468849356 -0400
Modify: 2012-09-04 12:32:40.468849356 -0400
Change: 2012-09-04 12:32:40.468849356 -0400
 Birth: -
  File: `/proc/1728/fd/29' -> `/mnt/ceph/runtemplate/stats/POFe.0000000000.txt'
  Size: 64        	Blocks: 0          IO Block: 1024   symbolic link
Device: 3h/3d	Inode: 25891       Links: 1
Access: (0700/lrwx------)  Uid: (50329/jcwright)   Gid: (  101/     mit)
Access: 2012-09-04 12:32:40.468849356 -0400
Modify: 2012-09-04 12:32:40.468849356 -0400
Change: 2012-09-04 12:32:40.468849356 -0400
 Birth: -
  File: `/proc/1728/fd/3' -> `socket:[22813]'
  Size: 64        	Blocks: 0          IO Block: 1024   symbolic link
Device: 3h/3d	Inode: 25865       Links: 1
Access: (0700/lrwx------)  Uid: (50329/jcwright)   Gid: (  101/     mit)
Access: 2012-09-04 12:32:40.468849356 -0400
Modify: 2012-09-04 12:32:40.468849356 -0400
Change: 2012-09-04 12:32:40.468849356 -0400
 Birth: -
  File: `/proc/1728/fd/30' -> `/mnt/ceph/runtemplate/stats/POSi.0000000000.txt'
  Size: 64        	Blocks: 0          IO Block: 1024   symbolic link
Device: 3h/3d	Inode: 25892       Links: 1
Access: (0700/lrwx------)  Uid: (50329/jcwright)   Gid: (  101/     mit)
Access: 2012-09-04 12:32:40.468849356 -0400
Modify: 2012-09-04 12:32:40.468849356 -0400
Change: 2012-09-04 12:32:40.468849356 -0400
 Birth: -
  File: `/proc/1728/fd/31' -> `/mnt/ceph/runtemplate/stats/NH4.0000000000.txt'
  Size: 64        	Blocks: 0          IO Block: 1024   symbolic link
Device: 3h/3d	Inode: 25893       Links: 1
Access: (0700/lrwx------)  Uid: (50329/jcwright)   Gid: (  101/     mit)
Access: 2012-09-04 12:32:40.468849356 -0400
Modify: 2012-09-04 12:32:40.468849356 -0400
Change: 2012-09-04 12:32:40.468849356 -0400
 Birth: -
  File: `/proc/1728/fd/32' -> `/mnt/ceph/runtemplate/stats/NO2.0000000000.txt'
  Size: 64        	Blocks: 0          IO Block: 1024   symbolic link
Device: 3h/3d	Inode: 25894       Links: 1
Access: (0700/lrwx------)  Uid: (50329/jcwright)   Gid: (  101/     mit)
Access: 2012-09-04 12:32:40.468849356 -0400
Modify: 2012-09-04 12:32:40.468849356 -0400
Change: 2012-09-04 12:32:40.468849356 -0400
 Birth: -
  File: `/proc/1728/fd/33' -> `/mnt/ceph/runtemplate/stats/DIC.0000000000.txt'
  Size: 64        	Blocks: 0          IO Block: 1024   symbolic link
Device: 3h/3d	Inode: 25895       Links: 1
Access: (0700/lrwx------)  Uid: (50329/jcwright)   Gid: (  101/     mit)
Access: 2012-09-04 12:32:40.468849356 -0400
Modify: 2012-09-04 12:32:40.468849356 -0400
Change: 2012-09-04 12:32:40.468849356 -0400
 Birth: -
  File: `/proc/1728/fd/34' -> `/mnt/ceph/runtemplate/stats/DOC.0000000000.txt'
  Size: 64        	Blocks: 0          IO Block: 1024   symbolic link
Device: 3h/3d	Inode: 25896       Links: 1
Access: (0700/lrwx------)  Uid: (50329/jcwright)   Gid: (  101/     mit)
Access: 2012-09-04 12:32:40.468849356 -0400
Modify: 2012-09-04 12:32:40.468849356 -0400
Change: 2012-09-04 12:32:40.468849356 -0400
 Birth: -
  File: `/proc/1728/fd/35' -> `/mnt/ceph/runtemplate/stats/POC.0000000000.txt'
  Size: 64        	Blocks: 0          IO Block: 1024   symbolic link
Device: 3h/3d	Inode: 25897       Links: 1
Access: (0700/lrwx------)  Uid: (50329/jcwright)   Gid: (  101/     mit)
Access: 2012-09-04 12:32:40.468849356 -0400
Modify: 2012-09-04 12:32:40.468849356 -0400
Change: 2012-09-04 12:32:40.468849356 -0400
 Birth: -
  File: `/proc/1728/fd/36' -> `/mnt/ceph/runtemplate/stats/PIC.0000000000.txt'
  Size: 64        	Blocks: 0          IO Block: 1024   symbolic link
Device: 3h/3d	Inode: 25898       Links: 1
Access: (0700/lrwx------)  Uid: (50329/jcwright)   Gid: (  101/     mit)
Access: 2012-09-04 12:32:40.468849356 -0400
Modify: 2012-09-04 12:32:40.468849356 -0400
Change: 2012-09-04 12:32:40.468849356 -0400
 Birth: -
  File: `/proc/1728/fd/37' -> `/mnt/ceph/runtemplate/stats/ALK.0000000000.txt'
  Size: 64        	Blocks: 0          IO Block: 1024   symbolic link
Device: 3h/3d	Inode: 25899       Links: 1
Access: (0700/lrwx------)  Uid: (50329/jcwright)   Gid: (  101/     mit)
Access: 2012-09-04 12:32:40.468849356 -0400
Modify: 2012-09-04 12:32:40.468849356 -0400
Change: 2012-09-04 12:32:40.468849356 -0400
 Birth: -
  File: `/proc/1728/fd/38' -> `/mnt/ceph/runtemplate/stats/O2.0000000000.txt'
  Size: 64        	Blocks: 0          IO Block: 1024   symbolic link
Device: 3h/3d	Inode: 25900       Links: 1
Access: (0700/lrwx------)  Uid: (50329/jcwright)   Gid: (  101/     mit)
Access: 2012-09-04 12:32:40.468849356 -0400
Modify: 2012-09-04 12:32:40.468849356 -0400
Change: 2012-09-04 12:32:40.468849356 -0400
 Birth: -
  File: `/proc/1728/fd/39' -> `/mnt/ceph/runtemplate/stats/ZooP.0000000000.txt'
  Size: 64        	Blocks: 0          IO Block: 1024   symbolic link
Device: 3h/3d	Inode: 25901       Links: 1
Access: (0700/lrwx------)  Uid: (50329/jcwright)   Gid: (  101/     mit)
Access: 2012-09-04 12:32:40.468849356 -0400
Modify: 2012-09-04 12:32:40.468849356 -0400
Change: 2012-09-04 12:32:40.468849356 -0400
 Birth: -
  File: `/proc/1728/fd/4' -> `/dev/infiniband/uverbs0'
  Size: 64        	Blocks: 0          IO Block: 1024   symbolic link
Device: 3h/3d	Inode: 25866       Links: 1
Access: (0700/lrwx------)  Uid: (50329/jcwright)   Gid: (  101/     mit)
Access: 2012-09-04 12:32:40.468849356 -0400
Modify: 2012-09-04 12:32:40.468849356 -0400
Change: 2012-09-04 12:32:40.468849356 -0400
 Birth: -
  File: `/proc/1728/fd/40' -> `/mnt/ceph/runtemplate/stats/ZooN.0000000000.txt'
  Size: 64        	Blocks: 0          IO Block: 1024   symbolic link
Device: 3h/3d	Inode: 25902       Links: 1
Access: (0700/lrwx------)  Uid: (50329/jcwright)   Gid: (  101/     mit)
Access: 2012-09-04 12:32:40.468849356 -0400
Modify: 2012-09-04 12:32:40.468849356 -0400
Change: 2012-09-04 12:32:40.468849356 -0400
 Birth: -
  File: `/proc/1728/fd/41' -> `/mnt/ceph/runtemplate/stats/ZooFe.0000000000.txt'
  Size: 64        	Blocks: 0          IO Block: 1024   symbolic link
Device: 3h/3d	Inode: 25903       Links: 1
Access: (0700/lrwx------)  Uid: (50329/jcwright)   Gid: (  101/     mit)
Access: 2012-09-04 12:32:40.468849356 -0400
Modify: 2012-09-04 12:32:40.468849356 -0400
Change: 2012-09-04 12:32:40.468849356 -0400
 Birth: -
  File: `/proc/1728/fd/42' -> `/mnt/ceph/runtemplate/stats/ZooSi.0000000000.txt'
  Size: 64        	Blocks: 0          IO Block: 1024   symbolic link
Device: 3h/3d	Inode: 25904       Links: 1
Access: (0700/lrwx------)  Uid: (50329/jcwright)   Gid: (  101/     mit)
Access: 2012-09-04 12:32:40.468849356 -0400
Modify: 2012-09-04 12:32:40.468849356 -0400
Change: 2012-09-04 12:32:40.468849356 -0400
 Birth: -
  File: `/proc/1728/fd/43' -> `/mnt/ceph/runtemplate/stats/Phy.0000000000.txt'
  Size: 64        	Blocks: 0          IO Block: 1024   symbolic link
Device: 3h/3d	Inode: 25905       Links: 1
Access: (0700/lrwx------)  Uid: (50329/jcwright)   Gid: (  101/     mit)
Access: 2012-09-04 12:32:40.468849356 -0400
Modify: 2012-09-04 12:32:40.468849356 -0400
Change: 2012-09-04 12:32:40.468849356 -0400
 Birth: -
  File: `/proc/1728/fd/44' -> `/mnt/ceph/runtemplate/stats/ZooC.0000000000.txt'
  Size: 64        	Blocks: 0          IO Block: 1024   symbolic link
Device: 3h/3d	Inode: 25906       Links: 1
Access: (0700/lrwx------)  Uid: (50329/jcwright)   Gid: (  101/     mit)
Access: 2012-09-04 12:32:40.468849356 -0400
Modify: 2012-09-04 12:32:40.468849356 -0400
Change: 2012-09-04 12:32:40.468849356 -0400
 Birth: -
  File: `/proc/1728/fd/45' -> `/mnt/ceph/runtemplate/stats/PP.0000000000.txt'
  Size: 64        	Blocks: 0          IO Block: 1024   symbolic link
Device: 3h/3d	Inode: 25907       Links: 1
Access: (0700/lrwx------)  Uid: (50329/jcwright)   Gid: (  101/     mit)
Access: 2012-09-04 12:32:40.468849356 -0400
Modify: 2012-09-04 12:32:40.468849356 -0400
Change: 2012-09-04 12:32:40.468849356 -0400
 Birth: -
  File: `/proc/1728/fd/46' -> `/mnt/ceph/runtemplate/stats/PAR.0000000000.txt'
  Size: 64        	Blocks: 0          IO Block: 1024   symbolic link
Device: 3h/3d	Inode: 25908       Links: 1
Access: (0700/lrwx------)  Uid: (50329/jcwright)   Gid: (  101/     mit)
Access: 2012-09-04 12:32:40.468849356 -0400
Modify: 2012-09-04 12:32:40.468849356 -0400
Change: 2012-09-04 12:32:40.468849356 -0400
 Birth: -
  File: `/proc/1728/fd/47' -> `/mnt/ceph/runtemplate/stats/Diver1.0000000000.txt'
  Size: 64        	Blocks: 0          IO Block: 1024   symbolic link
Device: 3h/3d	Inode: 25909       Links: 1
Access: (0700/lrwx------)  Uid: (50329/jcwright)   Gid: (  101/     mit)
Access: 2012-09-04 12:32:40.468849356 -0400
Modify: 2012-09-04 12:32:40.468849356 -0400
Change: 2012-09-04 12:32:40.468849356 -0400
 Birth: -
  File: `/proc/1728/fd/48' -> `/mnt/ceph/runtemplate/stats/Diver2.0000000000.txt'
  Size: 64        	Blocks: 0          IO Block: 1024   symbolic link
Device: 3h/3d	Inode: 25910       Links: 1
Access: (0700/lrwx------)  Uid: (50329/jcwright)   Gid: (  101/     mit)
Access: 2012-09-04 12:32:40.468849356 -0400
Modify: 2012-09-04 12:32:40.468849356 -0400
Change: 2012-09-04 12:32:40.468849356 -0400
 Birth: -
  File: `/proc/1728/fd/49' -> `/mnt/ceph/runtemplate/stats/Diver3.0000000000.txt'
  Size: 64        	Blocks: 0          IO Block: 1024   symbolic link
Device: 3h/3d	Inode: 25911       Links: 1
Access: (0700/lrwx------)  Uid: (50329/jcwright)   Gid: (  101/     mit)
Access: 2012-09-04 12:32:40.468849356 -0400
Modify: 2012-09-04 12:32:40.468849356 -0400
Change: 2012-09-04 12:32:40.468849356 -0400
 Birth: -
  File: `/proc/1728/fd/5' -> `socket:[29055]'
  Size: 64        	Blocks: 0          IO Block: 1024   symbolic link
Device: 3h/3d	Inode: 25867       Links: 1
Access: (0700/lrwx------)  Uid: (50329/jcwright)   Gid: (  101/     mit)
Access: 2012-09-04 12:32:40.468849356 -0400
Modify: 2012-09-04 12:32:40.468849356 -0400
Change: 2012-09-04 12:32:40.468849356 -0400
 Birth: -
  File: `/proc/1728/fd/50' -> `/mnt/ceph/runtemplate/stats/Diver4.0000000000.txt'
  Size: 64        	Blocks: 0          IO Block: 1024   symbolic link
Device: 3h/3d	Inode: 25912       Links: 1
Access: (0700/lrwx------)  Uid: (50329/jcwright)   Gid: (  101/     mit)
Access: 2012-09-04 12:32:40.468849356 -0400
Modify: 2012-09-04 12:32:40.468849356 -0400
Change: 2012-09-04 12:32:40.468849356 -0400
 Birth: -
  File: `/proc/1728/fd/51' -> `/mnt/ceph/runtemplate/stats/DICTFLX.0000000000.txt'
  Size: 64        	Blocks: 0          IO Block: 1024   symbolic link
Device: 3h/3d	Inode: 25913       Links: 1
Access: (0700/lrwx------)  Uid: (50329/jcwright)   Gid: (  101/     mit)
Access: 2012-09-04 12:32:40.468849356 -0400
Modify: 2012-09-04 12:32:40.468849356 -0400
Change: 2012-09-04 12:32:40.468849356 -0400
 Birth: -
  File: `/proc/1728/fd/52' -> `/mnt/ceph/runtemplate/stats/DICCFLX.0000000000.txt'
  Size: 64        	Blocks: 0          IO Block: 1024   symbolic link
Device: 3h/3d	Inode: 25914       Links: 1
Access: (0700/lrwx------)  Uid: (50329/jcwright)   Gid: (  101/     mit)
Access: 2012-09-04 12:32:40.468849356 -0400
Modify: 2012-09-04 12:32:40.468849356 -0400
Change: 2012-09-04 12:32:40.468849356 -0400
 Birth: -
  File: `/proc/1728/fd/53' -> `/mnt/ceph/runtemplate/stats/DICOFLX.0000000000.txt'
  Size: 64        	Blocks: 0          IO Block: 1024   symbolic link
Device: 3h/3d	Inode: 25915       Links: 1
Access: (0700/lrwx------)  Uid: (50329/jcwright)   Gid: (  101/     mit)
Access: 2012-09-04 12:32:40.468849356 -0400
Modify: 2012-09-04 12:32:40.468849356 -0400
Change: 2012-09-04 12:32:40.468849356 -0400
 Birth: -
  File: `/proc/1728/fd/54' -> `/mnt/ceph/runtemplate/stats/DICPCO2.0000000000.txt'
  Size: 64        	Blocks: 0          IO Block: 1024   symbolic link
Device: 3h/3d	Inode: 25916       Links: 1
Access: (0700/lrwx------)  Uid: (50329/jcwright)   Gid: (  101/     mit)
Access: 2012-09-04 12:32:40.468849356 -0400
Modify: 2012-09-04 12:32:40.468849356 -0400
Change: 2012-09-04 12:32:40.468849356 -0400
 Birth: -
  File: `/proc/1728/fd/55' -> `/mnt/ceph/runtemplate/stats/DICPHAV.0000000000.txt'
  Size: 64        	Blocks: 0          IO Block: 1024   symbolic link
Device: 3h/3d	Inode: 25917       Links: 1
Access: (0700/lrwx------)  Uid: (50329/jcwright)   Gid: (  101/     mit)
Access: 2012-09-04 12:32:40.468849356 -0400
Modify: 2012-09-04 12:32:40.468849356 -0400
Change: 2012-09-04 12:32:40.468849356 -0400
 Birth: -
  File: `/proc/1728/fd/56' -> `/mnt/ceph/runtemplate/stats/ChlCloer.0000000000.txt'
  Size: 64        	Blocks: 0          IO Block: 1024   symbolic link
Device: 3h/3d	Inode: 25918       Links: 1
Access: (0700/lrwx------)  Uid: (50329/jcwright)   Gid: (  101/     mit)
Access: 2012-09-04 12:32:40.468849356 -0400
Modify: 2012-09-04 12:32:40.468849356 -0400
Change: 2012-09-04 12:32:40.468849356 -0400
 Birth: -
  File: `/proc/1728/fd/57' -> `/mnt/ceph/runtemplate/stats/ChlGeide.0000000000.txt'
  Size: 64        	Blocks: 0          IO Block: 1024   symbolic link
Device: 3h/3d	Inode: 25919       Links: 1
Access: (0700/lrwx------)  Uid: (50329/jcwright)   Gid: (  101/     mit)
Access: 2012-09-04 12:32:40.468849356 -0400
Modify: 2012-09-04 12:32:40.468849356 -0400
Change: 2012-09-04 12:32:40.468849356 -0400
 Birth: -
  File: `/proc/1728/fd/58' -> `/mnt/ceph/runtemplate/stats/ChlDoney.0000000000.txt'
  Size: 64        	Blocks: 0          IO Block: 1024   symbolic link
Device: 3h/3d	Inode: 25920       Links: 1
Access: (0700/lrwx------)  Uid: (50329/jcwright)   Gid: (  101/     mit)
Access: 2012-09-04 12:32:40.468849356 -0400
Modify: 2012-09-04 12:32:40.468849356 -0400
Change: 2012-09-04 12:32:40.468849356 -0400
 Birth: -
  File: `/proc/1728/fd/59' -> `/mnt/ceph/runtemplate/stats/Shannon.0000000000.txt'
  Size: 64        	Blocks: 0          IO Block: 1024   symbolic link
Device: 3h/3d	Inode: 25921       Links: 1
Access: (0700/lrwx------)  Uid: (50329/jcwright)   Gid: (  101/     mit)
Access: 2012-09-04 12:32:40.468849356 -0400
Modify: 2012-09-04 12:32:40.468849356 -0400
Change: 2012-09-04 12:32:40.468849356 -0400
 Birth: -
  File: `/proc/1728/fd/6' -> `pipe:[29056]'
  Size: 64        	Blocks: 0          IO Block: 1024   symbolic link
Device: 3h/3d	Inode: 25868       Links: 1
Access: (0500/lr-x------)  Uid: (50329/jcwright)   Gid: (  101/     mit)
Access: 2012-09-04 12:32:40.468849356 -0400
Modify: 2012-09-04 12:32:40.468849356 -0400
Change: 2012-09-04 12:32:40.468849356 -0400
 Birth: -
  File: `/proc/1728/fd/60' -> `/mnt/ceph/runtemplate/stats/Simpson.0000000000.txt'
  Size: 64        	Blocks: 0          IO Block: 1024   symbolic link
Device: 3h/3d	Inode: 25922       Links: 1
Access: (0700/lrwx------)  Uid: (50329/jcwright)   Gid: (  101/     mit)
Access: 2012-09-04 12:32:40.468849356 -0400
Modify: 2012-09-04 12:32:40.468849356 -0400
Change: 2012-09-04 12:32:40.468849356 -0400
 Birth: -
  File: `/proc/1728/fd/7' -> `anon_inode:[infinibandevent]'
  Size: 64        	Blocks: 0          IO Block: 1024   symbolic link
Device: 3h/3d	Inode: 25869       Links: 1
Access: (0500/lr-x------)  Uid: (50329/jcwright)   Gid: (  101/     mit)
Access: 2012-09-04 12:32:40.468849356 -0400
Modify: 2012-09-04 12:32:40.468849356 -0400
Change: 2012-09-04 12:32:40.468849356 -0400
 Birth: -
  File: `/proc/1728/fd/8' -> `/dev/infiniband/uverbs0'
  Size: 64        	Blocks: 0          IO Block: 1024   symbolic link
Device: 3h/3d	Inode: 25870       Links: 1
Access: (0700/lrwx------)  Uid: (50329/jcwright)   Gid: (  101/     mit)
Access: 2012-09-04 12:32:40.468849356 -0400
Modify: 2012-09-04 12:32:40.468849356 -0400
Change: 2012-09-04 12:32:40.468849356 -0400
 Birth: -
  File: `/proc/1728/fd/9' -> `pipe:[29057]'
  Size: 64        	Blocks: 0          IO Block: 1024   symbolic link
Device: 3h/3d	Inode: 25871       Links: 1
Access: (0300/l-wx------)  Uid: (50329/jcwright)   Gid: (  101/     mit)
Access: 2012-09-04 12:32:40.468849356 -0400
Modify: 2012-09-04 12:32:40.468849356 -0400
Change: 2012-09-04 12:32:40.468849356 -0400
 Birth: -
     #+end_quote
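
Since the listing above is long, here is a Python sketch that condenses pasted `stat /proc/<pid>/fd/*` output into counts per kind of target. It assumes the ceph mount point is `/mnt/ceph`, as in the paths above, and it only understands the `File: ... -> ...` lines in the quoting style shown here (or the plain single-quote style of newer coreutils); the function names are illustrative.

```python
import re
from collections import Counter

# Matches e.g.:  File: `/proc/1728/fd/18' -> `/mnt/ceph/runtemplate/STDERR.0000'
LINK_RE = re.compile(
    r"File: [`'](?P<link>/proc/\d+/fd/\d+)' -> [`'](?P<target>.*)'\s*$"
)


def fd_targets(stat_output):
    """Map each /proc/<pid>/fd/<n> link to the target it points at."""
    targets = {}
    for line in stat_output.splitlines():
        match = LINK_RE.search(line)
        if match:
            targets[match.group("link")] = match.group("target")
    return targets


def classify(target, mount="/mnt/ceph"):
    """Bucket a link target: cephfs file, pipe, socket, anon_inode, or other."""
    if target.startswith(mount + "/"):
        return "cephfs"
    if target.startswith(("pipe:", "socket:", "anon_inode:")):
        return target.split(":", 1)[0]
    return "other"


def summarize(stat_output, mount="/mnt/ceph"):
    """Count descriptors per bucket for one pasted stat listing."""
    return Counter(classify(t, mount) for t in fd_targets(stat_output).values())


if __name__ == "__main__":
    import sys
    print(dict(summarize(sys.stdin.read())))
```

Feeding the listing for pid 1728 through `summarize` gives a `cephfs` count that can be checked against the 43 files mentioned in the summary.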

An example of the other 13 processes:
     #+begin_quote stat /proc/1729/fd/*
     [root@eaps-80-35e ~]# stat /proc/1729/fd/*
  File: `/proc/1729/fd/0' -> `/dev/infiniband/uverbs0'
  Size: 64        	Blocks: 0          IO Block: 1024   symbolic link
Device: 3h/3d	Inode: 25840       Links: 1
Access: (0700/lrwx------)  Uid: (50329/jcwright)   Gid: (  101/     mit)
Access: 2012-09-04 12:31:32.957857330 -0400
Modify: 2012-09-04 12:31:32.957857330 -0400
Change: 2012-09-04 12:31:32.957857330 -0400
 Birth: -
  File: `/proc/1729/fd/1' -> `pipe:[29061]'
  Size: 64        	Blocks: 0          IO Block: 1024   symbolic link
Device: 3h/3d	Inode: 25841       Links: 1
Access: (0300/l-wx------)  Uid: (50329/jcwright)   Gid: (  101/     mit)
Access: 2012-09-04 12:31:32.957857330 -0400
Modify: 2012-09-04 12:31:32.957857330 -0400
Change: 2012-09-04 12:31:32.957857330 -0400
 Birth: -
  File: `/proc/1729/fd/10' -> `pipe:[29058]'
  Size: 64        	Blocks: 0          IO Block: 1024   symbolic link
Device: 3h/3d	Inode: 25850       Links: 1
Access: (0500/lr-x------)  Uid: (50329/jcwright)   Gid: (  101/     mit)
Access: 2012-09-04 12:31:32.957857330 -0400
Modify: 2012-09-04 12:31:32.957857330 -0400
Change: 2012-09-04 12:31:32.957857330 -0400
 Birth: -
  File: `/proc/1729/fd/11' -> `pipe:[29061]'
  Size: 64        	Blocks: 0          IO Block: 1024   symbolic link
Device: 3h/3d	Inode: 25851       Links: 1
Access: (0300/l-wx------)  Uid: (50329/jcwright)   Gid: (  101/     mit)
Access: 2012-09-04 12:31:32.957857330 -0400
Modify: 2012-09-04 12:31:32.957857330 -0400
Change: 2012-09-04 12:31:32.957857330 -0400
 Birth: -
  File: `/proc/1729/fd/12' -> `anon_inode:[infinibandevent]'
  Size: 64        	Blocks: 0          IO Block: 1024   symbolic link
Device: 3h/3d	Inode: 25852       Links: 1
Access: (0500/lr-x------)  Uid: (50329/jcwright)   Gid: (  101/     mit)
Access: 2012-09-04 12:31:32.957857330 -0400
Modify: 2012-09-04 12:31:32.957857330 -0400
Change: 2012-09-04 12:31:32.957857330 -0400
 Birth: -
  File: `/proc/1729/fd/13' -> `pipe:[29062]'
  Size: 64        	Blocks: 0          IO Block: 1024   symbolic link
Device: 3h/3d	Inode: 25853       Links: 1
Access: (0300/l-wx------)  Uid: (50329/jcwright)   Gid: (  101/     mit)
Access: 2012-09-04 12:31:32.957857330 -0400
Modify: 2012-09-04 12:31:32.957857330 -0400
Change: 2012-09-04 12:31:32.957857330 -0400
 Birth: -
  File: `/proc/1729/fd/14' -> `/'
  Size: 64        	Blocks: 0          IO Block: 1024   symbolic link
Device: 3h/3d	Inode: 25854       Links: 1
Access: (0500/lr-x------)  Uid: (50329/jcwright)   Gid: (  101/     mit)
Access: 2012-09-04 12:31:32.957857330 -0400
Modify: 2012-09-04 12:31:32.957857330 -0400
Change: 2012-09-04 12:31:32.957857330 -0400
 Birth: -
  File: `/proc/1729/fd/15' -> `/dev/shm/ib_shmem-kvs_2025_0-eaps-80-35e.acesgrid.org-50329.tmp (deleted)'
  Size: 64        	Blocks: 0          IO Block: 1024   symbolic link
Device: 3h/3d	Inode: 25855       Links: 1
Access: (0700/lrwx------)  Uid: (50329/jcwright)   Gid: (  101/     mit)
Access: 2012-09-04 12:31:32.957857330 -0400
Modify: 2012-09-04 12:31:32.957857330 -0400
Change: 2012-09-04 12:31:32.957857330 -0400
 Birth: -
  File: `/proc/1729/fd/16' -> `/dev/shm/ib_pool-kvs_2025_0-eaps-80-35e.acesgrid.org-50329.tmp (deleted)'
  Size: 64        	Blocks: 0          IO Block: 1024   symbolic link
Device: 3h/3d	Inode: 25856       Links: 1
Access: (0700/lrwx------)  Uid: (50329/jcwright)   Gid: (  101/     mit)
Access: 2012-09-04 12:31:32.957857330 -0400
Modify: 2012-09-04 12:31:32.957857330 -0400
Change: 2012-09-04 12:31:32.957857330 -0400
 Birth: -
  File: `/proc/1729/fd/17' -> `/dev/shm/ib_shmem_coll-kvs_2025_0-eaps-80-35e.acesgrid.org-50329.tmp (deleted)'
  Size: 64        	Blocks: 0          IO Block: 1024   symbolic link
Device: 3h/3d	Inode: 25857       Links: 1
Access: (0700/lrwx------)  Uid: (50329/jcwright)   Gid: (  101/     mit)
Access: 2012-09-04 12:31:32.957857330 -0400
Modify: 2012-09-04 12:31:32.957857330 -0400
Change: 2012-09-04 12:31:32.957857330 -0400
 Birth: -
  File: `/proc/1729/fd/18' -> `/mnt/ceph/runtemplate/STDERR.0002'
  Size: 64        	Blocks: 0          IO Block: 1024   symbolic link
Device: 3h/3d	Inode: 25858       Links: 1
Access: (0700/lrwx------)  Uid: (50329/jcwright)   Gid: (  101/     mit)
Access: 2012-09-04 12:31:32.957857330 -0400
Modify: 2012-09-04 12:31:32.957857330 -0400
Change: 2012-09-04 12:31:32.957857330 -0400
 Birth: -
  File: `/proc/1729/fd/19' -> `/mnt/ceph/runtemplate/STDOUT.0002'
  Size: 64        	Blocks: 0          IO Block: 1024   symbolic link
Device: 3h/3d	Inode: 25859       Links: 1
Access: (0700/lrwx------)  Uid: (50329/jcwright)   Gid: (  101/     mit)
Access: 2012-09-04 12:31:32.957857330 -0400
Modify: 2012-09-04 12:31:32.957857330 -0400
Change: 2012-09-04 12:31:32.957857330 -0400
 Birth: -
  File: `/proc/1729/fd/2' -> `pipe:[29062]'
  Size: 64        	Blocks: 0          IO Block: 1024   symbolic link
Device: 3h/3d	Inode: 25842       Links: 1
Access: (0300/l-wx------)  Uid: (50329/jcwright)   Gid: (  101/     mit)
Access: 2012-09-04 12:31:32.957857330 -0400
Modify: 2012-09-04 12:31:32.957857330 -0400
Change: 2012-09-04 12:31:32.957857330 -0400
 Birth: -
  File: `/proc/1729/fd/3' -> `socket:[22813]'
  Size: 64        	Blocks: 0          IO Block: 1024   symbolic link
Device: 3h/3d	Inode: 25843       Links: 1
Access: (0700/lrwx------)  Uid: (50329/jcwright)   Gid: (  101/     mit)
Access: 2012-09-04 12:31:32.957857330 -0400
Modify: 2012-09-04 12:31:32.957857330 -0400
Change: 2012-09-04 12:31:32.957857330 -0400
 Birth: -
  File: `/proc/1729/fd/4' -> `anon_inode:[infinibandevent]'
  Size: 64        	Blocks: 0          IO Block: 1024   symbolic link
Device: 3h/3d	Inode: 25844       Links: 1
Access: (0500/lr-x------)  Uid: (50329/jcwright)   Gid: (  101/     mit)
Access: 2012-09-04 12:31:32.957857330 -0400
Modify: 2012-09-04 12:31:32.957857330 -0400
Change: 2012-09-04 12:31:32.957857330 -0400
 Birth: -
  File: `/proc/1729/fd/5' -> `/dev/infiniband/uverbs0'
  Size: 64        	Blocks: 0          IO Block: 1024   symbolic link
Device: 3h/3d	Inode: 25845       Links: 1
Access: (0700/lrwx------)  Uid: (50329/jcwright)   Gid: (  101/     mit)
Access: 2012-09-04 12:31:32.957857330 -0400
Modify: 2012-09-04 12:31:32.957857330 -0400
Change: 2012-09-04 12:31:32.957857330 -0400
 Birth: -
  File: `/proc/1729/fd/6' -> `socket:[29060]'
  Size: 64        	Blocks: 0          IO Block: 1024   symbolic link
Device: 3h/3d	Inode: 25846       Links: 1
Access: (0700/lrwx------)  Uid: (50329/jcwright)   Gid: (  101/     mit)
Access: 2012-09-04 12:31:32.957857330 -0400
Modify: 2012-09-04 12:31:32.957857330 -0400
Change: 2012-09-04 12:31:32.957857330 -0400
 Birth: -
  File: `/proc/1729/fd/7' -> `anon_inode:[infinibandevent]'
  Size: 64        	Blocks: 0          IO Block: 1024   symbolic link
Device: 3h/3d	Inode: 25847       Links: 1
Access: (0500/lr-x------)  Uid: (50329/jcwright)   Gid: (  101/     mit)
Access: 2012-09-04 12:31:32.957857330 -0400
Modify: 2012-09-04 12:31:32.957857330 -0400
Change: 2012-09-04 12:31:32.957857330 -0400
 Birth: -
  File: `/proc/1729/fd/8' -> `pipe:[29057]'
  Size: 64        	Blocks: 0          IO Block: 1024   symbolic link
Device: 3h/3d	Inode: 25848       Links: 1
Access: (0500/lr-x------)  Uid: (50329/jcwright)   Gid: (  101/     mit)
Access: 2012-09-04 12:31:32.957857330 -0400
Modify: 2012-09-04 12:31:32.957857330 -0400
Change: 2012-09-04 12:31:32.957857330 -0400
 Birth: -
  File: `/proc/1729/fd/9' -> `/dev/infiniband/uverbs0'
  Size: 64        	Blocks: 0          IO Block: 1024   symbolic link
Device: 3h/3d	Inode: 25849       Links: 1
Access: (0700/lrwx------)  Uid: (50329/jcwright)   Gid: (  101/     mit)
Access: 2012-09-04 12:31:32.957857330 -0400
Modify: 2012-09-04 12:31:32.957857330 -0400
Change: 2012-09-04 12:31:32.957857330 -0400
 Birth: -
   #+end_quote




On Aug 31, 2012, at 4:03 PM, Sage Weil wrote:

> Hi John,
> 
> On Fri, 31 Aug 2012, John C. Wright wrote:
>> An update,
>> While looking into how to switch my ceph cluster over to a different network (another question altogether), I discovered that during my upgrade from 0.47.2 to 0.48 on the three nodes, my 'git checkout 0.48' wasn't done correctly on two of them, so I wound up reinstalling 0.47.2 on those. I therefore had a heterogeneous cluster running 0.47.2 on the osds and 0.48 on the mds (all three running monitors).
>> 
>> So I wiped everything out and started fresh with 0.48, and still got the error, but with more info this time.
>> This MPI code is running on 14 processes on two client nodes. Each process writes to its own 'stdout' file and some other data files are created by process '0'. The program starts up and creates its initial files and begins to write to its stdout files. Normally this proceeds with writes to the stdout at periodic intervals during the run, but on the ceph volume, this freezes up after about one minute.
>> 
>> Symptoms: initially can still access ceph volume from other clients. Listing the working directory of the mpi code is very slow and soon is unresponsive but can still list other directories on the ceph volume. After another minute or so, ceph clients can no longer access the volume at all without locking up in a trace that ends in a 'fastpath' kernel call. If I CTRL-C out of the mpirun call within the first minute, everything recovers, but waiting longer than that requires a reboot of mounted nodes and a restart of ceph to clear things up.
>> 
>> Below are relevant (I hope) process traces from dmesg and ceph logs. Any help on diagnosing this would be greatly appreciated. We're hoping to use ceph as a parallel file system on a scientific workload beowulf cluster, initially with a buyer-beware policy and only for transient, reproducible data, with more general usage as ceph gets stable and reaches the 1.0 milestone.
> 
> It looks a bit like this is a problem on the MDS side of things, since you
> have both a hung request and a writer waiting for caps.  Can you generate 
> an MDS log that goes with this workload?  With
> 
> 	debug ms = 1
> 	debug mds = 20
> 
> in the [mds] section of your config?  Also, once it hangs, it would be 
> helpful to see what the hung request is (cat 
> /sys/kernel/debug/ceph/*/mdsc) and the inode number for the hung writer 
> (stat /proc/$pid/fd/NNN).  Hopefully the stat won't hang.. but if it does, 
> hopefully you can identify which file ino or filename it is some other 
> way.
> 
> Thanks!
> sage
> 
> 
>> 
>> Thanks very much.
>> 
>> -john wright

*Removed old quoted part of thread*
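
A note on the `stat /proc/$pid/fd/NNN` step Sage asks for above: plain `stat` on a `/proc/<pid>/fd/N` entry reports the inode of the procfs symlink itself (the `Inode: 258xx` values in the listing), not the inode of the file on the ceph mount. Below is a minimal Python sketch of one way to get the target's inode instead. The function name `describe_fds` and its `follow` and `proc_root` parameters are hypothetical, made up for illustration. It assumes a Linux `/proc` layout, and it assumes that the `#...` value in the `mdsc` output is the inode number in hex, which is why the sketch prints hex. Following the link can block if the mount is already hung, so that part is opt-in.

```python
import os


def describe_fds(pid, follow=False, proc_root="/proc"):
    """Return (fd, target, inode_hex) tuples for the open fds of `pid`.

    readlink() only touches procfs, so it should not block on a hung
    mount. With follow=True, regular paths are also stat()ed through the
    link to get the inode of the underlying file; that call CAN block
    if the file system is unresponsive.
    """
    fd_dir = os.path.join(proc_root, str(pid), "fd")
    rows = []
    for name in sorted(os.listdir(fd_dir), key=int):
        link = os.path.join(fd_dir, name)
        try:
            target = os.readlink(link)
        except OSError:
            continue  # fd was closed between listdir() and readlink()
        inode_hex = None
        # Skip pipes, sockets and anon inodes; only real paths are followed.
        if follow and target.startswith("/"):
            try:
                inode_hex = format(os.stat(link).st_ino, "x")
            except OSError:
                inode_hex = None
        rows.append((int(name), target, inode_hex))
    return rows
```

For example, `describe_fds(1728, follow=True)` would list each descriptor of the 43-file process with its target path and, for regular files, a hex inode to compare by eye against the `mdsc` line. On the command line, `stat -L /proc/1728/fd/NNN` follows the link in the same way.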


