Can you dump the metadata ops in flight on each ceph-fuse when it hangs?

ceph daemon </var/run/ceph/client.whatever.asok> mds_requests
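If it helps, something like the loop below, run on each client node while the
hang is in progress, should collect that in one pass. The socket glob is an
assumption (the usual defaults under /var/run/ceph); adjust it to wherever
your "admin socket" option points:

    # dump in-flight MDS requests from every ceph-fuse admin socket on this node
    # (socket path/glob is a guess -- match it to your configuration)
    for sock in /var/run/ceph/*client*.asok; do
        echo "== $sock =="
        sudo ceph daemon "$sock" mds_requests
    done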
-Greg

On Mon, Nov 9, 2015 at 8:06 AM, Burkhard Linke
<Burkhard.Linke@xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx> wrote:
> Hi,
>
> On 11/09/2015 04:03 PM, Gregory Farnum wrote:
>>
>> On Mon, Nov 9, 2015 at 6:57 AM, Burkhard Linke
>> <Burkhard.Linke@xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx> wrote:
>>>
>>> Hi,
>>>
>>> On 11/09/2015 02:07 PM, Burkhard Linke wrote:
>>>>
>>>> Hi,
>>>
>>> *snipsnap*
>>>
>>>>
>>>> Cluster is running Hammer 0.94.5 on top of Ubuntu 14.04. Clients use
>>>> ceph-fuse with patches for improved page cache handling, but the
>>>> problem also occurs with the official hammer packages from
>>>> download.ceph.com.
>>>
>>> I've tested the same setup with clients running kernel 4.2.5 and using
>>> the kernel cephfs client. I was not able to reproduce the problem in
>>> that setup.
>>
>> What's the workload you're running, precisely? I would not generally
>> expect multiple accesses to a sqlite database to work *well*, but
>> offhand I'm not entirely certain why it would work differently between
>> the kernel and userspace clients. (Probably something to do with the
>> timing of the shared requests and any writes happening.)
>
> Using SQLite on network filesystems is somewhat challenging, especially
> if multiple instances write to the database. The reproducible test case
> does not write to the database at all; it simply extracts the table
> structure from the default database. The applications themselves only
> read from the database and do not modify anything. The underlying SQLite
> library may attempt to use locking to protect certain operations.
> According to dmesg, the processes are blocked within fuse calls:
>
> Nov 9 14:17:08 hp-s2-r2-compute kernel: [ 1081.543966] INFO: task ceph-fuse:6298 blocked for more than 120 seconds.
> Nov 9 14:17:08 hp-s2-r2-compute kernel: [ 1081.544014]       Not tainted 4.2.5-040205-generic #201510270124
> Nov 9 14:17:08 hp-s2-r2-compute kernel: [ 1081.544054] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
> Nov 9 14:17:08 hp-s2-r2-compute kernel: [ 1081.544119] ceph-fuse       D ffff881fbf8d64c0     0  6298   3262 0x00000100
> Nov 9 14:17:08 hp-s2-r2-compute kernel: [ 1081.544125]  ffff881f9768f838 0000000000000086 ffff883fb2d83700 ffff881f97b38dc0
> Nov 9 14:17:08 hp-s2-r2-compute kernel: [ 1081.544130]  0000000000001000 ffff881f97690000 ffff881fbf8d64c0 7fffffffffffffff
> Nov 9 14:17:08 hp-s2-r2-compute kernel: [ 1081.544134]  0000000000000002 ffffffff817dc300 ffff881f9768f858 ffffffff817dbb07
> Nov 9 14:17:08 hp-s2-r2-compute kernel: [ 1081.544138] Call Trace:
> Nov 9 14:17:08 hp-s2-r2-compute kernel: [ 1081.544147]  [<ffffffff817dc300>] ? bit_wait+0x50/0x50
> Nov 9 14:17:08 hp-s2-r2-compute kernel: [ 1081.544156]  [<ffffffff817deba9>] schedule_timeout+0x189/0x250
> Nov 9 14:17:08 hp-s2-r2-compute kernel: [ 1081.544166]  [<ffffffff817dc300>] ? bit_wait+0x50/0x50
> Nov 9 14:17:08 hp-s2-r2-compute kernel: [ 1081.544176]  [<ffffffff810bcb64>] ? prepare_to_wait_exclusive+0x54/0x80
> Nov 9 14:17:08 hp-s2-r2-compute kernel: [ 1081.544185]  [<ffffffff817dc0bb>] __wait_on_bit_lock+0x4b/0xa0
> Nov 9 14:17:08 hp-s2-r2-compute kernel: [ 1081.544195]  [<ffffffff810bd0e0>] ? autoremove_wake_function+0x40/0x40
> Nov 9 14:17:08 hp-s2-r2-compute kernel: [ 1081.544205]  [<ffffffff8106d962>] ? get_user_pages_fast+0x112/0x190
> Nov 9 14:17:08 hp-s2-r2-compute kernel: [ 1081.544213]  [<ffffffff812173df>] ? ilookup5_nowait+0x6f/0x90
> Nov 9 14:17:08 hp-s2-r2-compute kernel: [ 1081.544222]  [<ffffffff812f922d>] fuse_notify+0x14d/0x830
> Nov 9 14:17:08 hp-s2-r2-compute kernel: [ 1081.544230]  [<ffffffff812f85d4>] ? fuse_copy_do+0x84/0xf0
> Nov 9 14:17:08 hp-s2-r2-compute kernel: [ 1081.544239]  [<ffffffff810a4f7d>] ? ttwu_do_activate.constprop.89+0x5d/0x70
> Nov 9 14:17:08 hp-s2-r2-compute kernel: [ 1081.544248]  [<ffffffff811fc0dc>] do_iter_readv_writev+0x6c/0xa0
> Nov 9 14:17:08 hp-s2-r2-compute kernel: [ 1081.544257]  [<ffffffff811bc9d8>] ? mprotect_fixup+0x148/0x230
> Nov 9 14:17:08 hp-s2-r2-compute kernel: [ 1081.544264]  [<ffffffff811fdae9>] SyS_writev+0x59/0xf0
> Nov 9 14:19:08 hp-s2-r2-compute kernel: [ 1201.672548]       Not tainted 4.2.5-040205-generic #201510270124
> Nov 9 14:19:08 hp-s2-r2-compute kernel: [ 1201.672654] ceph-fuse       D ffff881fbf8d64c0     0  6298   3262 0x00000100
> Nov 9 14:19:08 hp-s2-r2-compute kernel: [ 1201.672665]  0000000000001000 ffff881f97690000 ffff881fbf8d64c0 7fffffffffffffff
> Nov 9 14:19:08 hp-s2-r2-compute kernel: [ 1201.672673] Call Trace:
> Nov 9 14:19:08 hp-s2-r2-compute kernel: [ 1201.672687]  [<ffffffff817dbb07>] schedule+0x37/0x80
> Nov 9 14:19:08 hp-s2-r2-compute kernel: [ 1201.672698]  [<ffffffff8101dcd9>] ? read_tsc+0x9/0x10
> Nov 9 14:19:08 hp-s2-r2-compute kernel: [ 1201.672707]  [<ffffffff817db114>] io_schedule_timeout+0xa4/0x110
> Nov 9 14:19:08 hp-s2-r2-compute kernel: [ 1201.672717]  [<ffffffff817dc335>] bit_wait_io+0x35/0x50
> Nov 9 14:19:08 hp-s2-r2-compute kernel: [ 1201.672726]  [<ffffffff8118186b>] __lock_page+0xbb/0xe0
> Nov 9 14:19:08 hp-s2-r2-compute kernel: [ 1201.672736]  [<ffffffff811934cc>] invalidate_inode_pages2_range+0x22c/0x460
> Nov 9 14:19:08 hp-s2-r2-compute kernel: [ 1201.672745]  [<ffffffff81304a80>] ? fuse_init_file_inode+0x30/0x30
> Nov 9 14:19:08 hp-s2-r2-compute kernel: [ 1201.672753]  [<ffffffff813068a6>] fuse_reverse_inval_inode+0x66/0x90
> Nov 9 14:19:08 hp-s2-r2-compute kernel: [ 1201.672761]  [<ffffffff813c8e12>] ? iov_iter_get_pages+0xa2/0x220
> Nov 9 14:19:08 hp-s2-r2-compute kernel: [ 1201.672770]  [<ffffffff812f9f0d>] fuse_dev_do_write+0x22d/0x380
> Nov 9 14:19:08 hp-s2-r2-compute kernel: [ 1201.672779]  [<ffffffff812fa41b>] fuse_dev_write+0x5b/0x80
> Nov 9 14:19:08 hp-s2-r2-compute kernel: [ 1201.672786]  [<ffffffff811fcc66>] do_readv_writev+0x196/0x250
> Nov 9 14:19:08 hp-s2-r2-compute kernel: [ 1201.672796]  [<ffffffff811fcda9>] vfs_writev+0x39/0x50
> Nov 9 14:19:08 hp-s2-r2-compute kernel: [ 1201.672803]  [<ffffffff817dfb72>] entry_SYSCALL_64_fastpath+0x16/0x75
>
> The fact that the kernel client works so far may be timing-related. I've
> also done test runs on the cluster with 20 instances of the application
> and a small dataset running in parallel, without any problems so far.
>
> Best regards,
> Burkhard
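Also, just to make sure I understand the reproducer: is it essentially the
equivalent of the loop below, i.e. a number of concurrent read-only schema
dumps against a single database on CephFS? (The path and instance count are
placeholders, not your actual setup.)

    # hypothetical sketch of the reproducer: 20 parallel read-only
    # ".schema" extractions against one SQLite database on CephFS
    for i in $(seq 1 20); do
        sqlite3 /cephfs/path/to/your.db ".schema" >/dev/null &
    done
    wait

If that does reproduce it, running the same thing with SQLite's locking
disabled (e.g. the nolock=1 URI parameter, if your SQLite build accepts URI
filenames) would at least tell us whether the locking path is involved.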