On Mon, Feb 04, 2013 at 07:01:54PM +0100, Kevin Decherf wrote:
> Hey everyone,
>
> This is my first post here, to report a potential issue I found today
> using Ceph 0.56.1.
>
> The cluster configuration is, briefly: 27 OSDs of ~900GB each and
> 3 MON/MDS nodes. All nodes run Exherbo (a source-based distribution)
> with Ceph 0.56.1 and Linux 3.7.0. We only use CephFS on this cluster,
> which is mounted on ~60 clients (increasing each day). Objects are
> replicated three times, and the cluster currently holds only 7GB of
> data for 350k objects.
>
> Under certain conditions (I don't yet know which), some clients hang,
> generate CPU overloads (kworker) and are unable to perform any IO on
> Ceph. The active MDS pushes ~20Mbps in/out during the issue (less than
> 2Mbps in normal activity). I don't know whether it is directly linked,
> but we also observe a lot of missing files at the same time.
>
> The problem is similar to this one [1].
>
> A restart of the client or the MDS was enough before today, but we
> found a new behavior: the active MDS consumes a lot of CPU for 3 to
> 5 hours with ~25% of clients hanging.
>
> In the logs I found a segfault with this backtrace [2] and 100,000
> dumped events during the first hang. We observed another hang which
> produced a lot of these events (in debug mode):
>  - "mds.0.server FAIL on ESTALE but attempting recovery"
>  - "mds.0.server reply_request -116 (Stale NFS file handle)
>    client_request(client.10991:1031 getattr As #1000004bab0
>    RETRY=132)"
>
> We have no profiling tools available on these nodes, and I don't know
> what I should search for in the 35 GB log file.
>
> Note: the segmentation fault occurred only once, but the problem was
> observed four times on this cluster.
>
> Any help would be appreciated.
>
> References:
> [1] http://www.spinics.net/lists/ceph-devel/msg04903.html
> [2] ceph version 0.56.1 (e4a541624df62ef353e754391cbbb707f54b16f7)
>  1: /usr/bin/ceph-mds() [0x817e82]
>  2: (()+0xf140) [0x7f9091d30140]
>  3: (MDCache::request_drop_foreign_locks(MDRequest*)+0x21) [0x5b9dc1]
>  4: (MDCache::request_drop_locks(MDRequest*)+0x19) [0x5baae9]
>  5: (MDCache::request_cleanup(MDRequest*)+0x60) [0x5bab70]
>  6: (MDCache::request_kill(MDRequest*)+0x80) [0x5bae90]
>  7: (Server::journal_close_session(Session*, int)+0x372) [0x549aa2]
>  8: (Server::kill_session(Session*)+0x137) [0x549c67]
>  9: (Server::find_idle_sessions()+0x12a6) [0x54b0d6]
>  10: (MDS::tick()+0x338) [0x4da928]
>  11: (SafeTimer::timer_thread()+0x1af) [0x78151f]
>  12: (SafeTimerThread::entry()+0xd) [0x782bad]
>  13: (()+0x7ddf) [0x7f9091d28ddf]
>  14: (clone()+0x6d) [0x7f90909cc24d]

I found a possible cause and a way to reproduce this issue.

We now have ~90 clients for 18GB / 650k objects, and the storm occurs
when we execute an "intensive IO" command (a tar of the whole pool, or
an rsync into one folder) on one of our clients (the only one which
uses ceph-fuse; I don't know whether the problem is limited to it).
A rough sketch of this kind of workload is at the end of this message.

Any idea?

Cheers,

-- 
Kevin Decherf - @Kdecherf
GPG C610 FE73 E706 F968 612B E4B2 108A BD75 A81E 6E2F
http://kdecherf.com
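PS: in case it helps anyone trying to reproduce this, here is a minimal
sketch of the kind of workload that seems to trigger the storm on our
side: a large number of small files created on the CephFS mount and then
read back in a single tar-like pass. The mount point (/mnt/cephfs), the
file count and the file size below are placeholder assumptions, not the
actual values from our cluster.

#!/usr/bin/env python3
# Rough sketch of an "intensive IO" workload on a ceph-fuse mount:
# create many small files (heavy metadata traffic for the MDS), then
# walk and re-read the whole tree in one pass, comparable to running
# tar over the full pool.

import os
import tarfile

CEPHFS_MOUNT = "/mnt/cephfs"          # assumed ceph-fuse mount point
WORK_DIR = os.path.join(CEPHFS_MOUNT, "io-storm-test")
NUM_FILES = 10000                     # arbitrary; the real pool holds ~650k objects

def create_small_files():
    """Create many small files to generate heavy metadata load on the MDS."""
    os.makedirs(WORK_DIR, exist_ok=True)
    for i in range(NUM_FILES):
        with open(os.path.join(WORK_DIR, "file-%06d" % i), "wb") as f:
            f.write(os.urandom(4096))

def archive_tree():
    """Read back the whole tree in a single pass, similar to 'tar cf'."""
    with tarfile.open("/tmp/io-storm-test.tar", "w") as tar:
        tar.add(WORK_DIR, arcname="io-storm-test")

if __name__ == "__main__":
    create_small_files()
    archive_tree()

Running something like this from the ceph-fuse client while watching the
active MDS (ceph -s and the MDS logs) should show whether this kind of
metadata load alone is enough to trigger the hang.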