Re: cephfs: Client hp-s3-r4-compute failing to respond to capability release

On Tue, Nov 10, 2015 at 12:06 AM, Burkhard Linke <Burkhard.Linke@xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx> wrote:
> Hi,
>
> On 11/09/2015 04:03 PM, Gregory Farnum wrote:
>>
>> On Mon, Nov 9, 2015 at 6:57 AM, Burkhard Linke
>> <Burkhard.Linke@xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx> wrote:
>>>
>>> Hi,
>>>
>>> On 11/09/2015 02:07 PM, Burkhard Linke wrote:
>>>>
>>>> Hi,
>>>
>>> *snipsnap*
>>>
>>>>
>>>> Cluster is running Hammer 0.94.5 on top of Ubuntu 14.04. Clients use
>>>> ceph-fuse with patches for improved page cache handling, but the problem
>>>> also occurs with the official Hammer packages from download.ceph.com.
>>>
>>> I've tested the same setup with clients running kernel 4.2.5 and using
>>> the
>>> kernel cephfs client. I was not able to reproduce the problem in that
>>> setup.
>>
>> What's the workload you're running, precisely? I would not generally
>> expect multiple accesses to a sqlite database to work *well*, but
>> offhand I'm not entirely certain why it would work differently between
>> the kernel and userspace clients. (Probably something to do with the
>> timing of the shared requests and any writes happening.)
>
> Using SQLite on network filesystems is somewhat challenging, especially if
> multiple instances write to the database. The reproducible test case does
> not write to the database at all; it simply extracts the table structure
> from the default database. The application itself only reads from the
> database and does not modify anything. The underlying SQLite library may
> attempt to use locking to protect certain operations. According to dmesg the
> processes are blocked within fuse calls:
>
> Nov  9 14:17:08 hp-s2-r2-compute kernel: [ 1081.543966] INFO: task
> ceph-fuse:6298 blocked for more than 120 seconds.
> Nov  9 14:17:08 hp-s2-r2-compute kernel: [ 1081.544014]       Not tainted
> 4.2.5-040205-generic #201510270124
> Nov  9 14:17:08 hp-s2-r2-compute kernel: [ 1081.544054] "echo 0 >
> /proc/sys/kernel/hung_task_timeout_secs" disables this message.
> Nov  9 14:17:08 hp-s2-r2-compute kernel: [ 1081.544119] ceph-fuse       D
> ffff881fbf8d64c0     0  6298   3262 0x00000100
> Nov  9 14:17:08 hp-s2-r2-compute kernel: [ 1081.544125] ffff881f9768f838
> 0000000000000086 ffff883fb2d83700 ffff881f97b38dc0
> Nov  9 14:17:08 hp-s2-r2-compute kernel: [ 1081.544130] 0000000000001000
> ffff881f97690000 ffff881fbf8d64c0 7fffffffffffffff
> Nov  9 14:17:08 hp-s2-r2-compute kernel: [ 1081.544134] 0000000000000002
> ffffffff817dc300 ffff881f9768f858 ffffffff817dbb07
> Nov  9 14:17:08 hp-s2-r2-compute kernel: [ 1081.544138] Call Trace:
> Nov  9 14:17:08 hp-s2-r2-compute kernel: [ 1081.544147] [<ffffffff817dc300>]
> ? bit_wait+0x50/0x50
> Nov  9 14:17:08 hp-s2-r2-compute kernel: [ 1081.544156] [<ffffffff817deba9>]
> schedule_timeout+0x189/0x250
> Nov  9 14:17:08 hp-s2-r2-compute kernel: [ 1081.544166] [<ffffffff817dc300>]
> ? bit_wait+0x50/0x50
> Nov  9 14:17:08 hp-s2-r2-compute kernel: [ 1081.544176] [<ffffffff810bcb64>]
> ? prepare_to_wait_exclusive+0x54/0x80
> Nov  9 14:17:08 hp-s2-r2-compute kernel: [ 1081.544185] [<ffffffff817dc0bb>]
> __wait_on_bit_lock+0x4b/0xa0
> Nov  9 14:17:08 hp-s2-r2-compute kernel: [ 1081.544195] [<ffffffff810bd0e0>]
> ? autoremove_wake_function+0x40/0x40
> Nov  9 14:17:08 hp-s2-r2-compute kernel: [ 1081.544205] [<ffffffff8106d962>]
> ? get_user_pages_fast+0x112/0x190
> Nov  9 14:17:08 hp-s2-r2-compute kernel: [ 1081.544213] [<ffffffff812173df>]
> ? ilookup5_nowait+0x6f/0x90
> Nov  9 14:17:08 hp-s2-r2-compute kernel: [ 1081.544222] [<ffffffff812f922d>]
> fuse_notify+0x14d/0x830
> Nov  9 14:17:08 hp-s2-r2-compute kernel: [ 1081.544230] [<ffffffff812f85d4>]
> ? fuse_copy_do+0x84/0xf0
> Nov  9 14:17:08 hp-s2-r2-compute kernel: [ 1081.544239] [<ffffffff810a4f7d>]
> ? ttwu_do_activate.constprop.89+0x5d/0x70
> Nov  9 14:17:08 hp-s2-r2-compute kernel: [ 1081.544248] [<ffffffff811fc0dc>]
> do_iter_readv_writev+0x6c/0xa0
> Nov  9 14:17:08 hp-s2-r2-compute kernel: [ 1081.544257] [<ffffffff811bc9d8>]
> ? mprotect_fixup+0x148/0x230
> Nov  9 14:17:08 hp-s2-r2-compute kernel: [ 1081.544264] [<ffffffff811fdae9>]
> SyS_writev+0x59/0xf0
> Nov  9 14:19:08 hp-s2-r2-compute kernel: [ 1201.672548]       Not tainted
> 4.2.5-040205-generic #201510270124
> Nov  9 14:19:08 hp-s2-r2-compute kernel: [ 1201.672654] ceph-fuse       D
> ffff881fbf8d64c0     0  6298   3262 0x00000100
> Nov  9 14:19:08 hp-s2-r2-compute kernel: [ 1201.672665] 0000000000001000
> ffff881f97690000 ffff881fbf8d64c0 7fffffffffffffff
> Nov  9 14:19:08 hp-s2-r2-compute kernel: [ 1201.672673] Call Trace:
> Nov  9 14:19:08 hp-s2-r2-compute kernel: [ 1201.672687] [<ffffffff817dbb07>]
> schedule+0x37/0x80
> Nov  9 14:19:08 hp-s2-r2-compute kernel: [ 1201.672698] [<ffffffff8101dcd9>]
> ? read_tsc+0x9/0x10
> Nov  9 14:19:08 hp-s2-r2-compute kernel: [ 1201.672707] [<ffffffff817db114>]
> io_schedule_timeout+0xa4/0x110
> Nov  9 14:19:08 hp-s2-r2-compute kernel: [ 1201.672717] [<ffffffff817dc335>]
> bit_wait_io+0x35/0x50
> Nov  9 14:19:08 hp-s2-r2-compute kernel: [ 1201.672726] [<ffffffff8118186b>]
> __lock_page+0xbb/0xe0
> Nov  9 14:19:08 hp-s2-r2-compute kernel: [ 1201.672736] [<ffffffff811934cc>]
> invalidate_inode_pages2_range+0x22c/0x460
> Nov  9 14:19:08 hp-s2-r2-compute kernel: [ 1201.672745] [<ffffffff81304a80>]
> ? fuse_init_file_inode+0x30/0x30
> Nov  9 14:19:08 hp-s2-r2-compute kernel: [ 1201.672753] [<ffffffff813068a6>]
> fuse_reverse_inval_inode+0x66/0x90
> Nov  9 14:19:08 hp-s2-r2-compute kernel: [ 1201.672761] [<ffffffff813c8e12>]
> ? iov_iter_get_pages+0xa2/0x220
> Nov  9 14:19:08 hp-s2-r2-compute kernel: [ 1201.672770] [<ffffffff812f9f0d>]
> fuse_dev_do_write+0x22d/0x380
> Nov  9 14:19:08 hp-s2-r2-compute kernel: [ 1201.672779] [<ffffffff812fa41b>]
> fuse_dev_write+0x5b/0x80
> Nov  9 14:19:08 hp-s2-r2-compute kernel: [ 1201.672786] [<ffffffff811fcc66>]
> do_readv_writev+0x196/0x250
> Nov  9 14:19:08 hp-s2-r2-compute kernel: [ 1201.672796] [<ffffffff811fcda9>]
> vfs_writev+0x39/0x50
> Nov  9 14:19:08 hp-s2-r2-compute kernel: [ 1201.672803] [<ffffffff817dfb72>]
> entry_SYSCALL_64_fastpath+0x16/0x75
>
>
> The fact that the kernel client is working so far may be timing related.
> I've also done test runs on the cluster with 20 instances of the application
> and a small dataset running in parallel, without any problems so far.
>

It seems the hang is related to the async invalidate. Please try the following patch:
---
diff --git a/src/client/Client.cc b/src/client/Client.cc
index 0d85db2..afbb896 100644
--- a/src/client/Client.cc
+++ b/src/client/Client.cc
@@ -3151,8 +3151,6 @@ void Client::_async_invalidate(Inode *in, int64_t off, int64_t len, bool keep_ca
   ino_invalidate_cb(callback_handle, in->vino(), off, len);
 
   client_lock.Lock();
-  if (!keep_caps)
-    check_caps(in, false);
   put_inode(in);
   client_lock.Unlock();
   ldout(cct, 10) << "_async_invalidate " << off << "~" << len << (keep_caps ? " keep_caps" : "") << " done" << dendl;
@@ -3163,7 +3161,7 @@ void Client::_schedule_invalidate_callback(Inode *in, int64_t off, int64_t len,
   if (ino_invalidate_cb)
     // we queue the invalidate, which calls the callback and decrements the ref
     async_ino_invalidator.queue(new C_Client_CacheInvalidate(this, in, off, len, keep_caps));
-  else if (!keep_caps)
+  if (!keep_caps)
     check_caps(in, false);
 }
 

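To exercise the same access pattern while testing the patch, the read-only
workload described above (extracting the table structure without writing to
the database) can be approximated with a short Python sketch. This is only an
assumption about what the test case does, not the actual application; the
function name dump_schema and any database path passed to it are hypothetical.
In its default journal mode SQLite still takes a shared file lock for reads,
so even this read-only access should exercise locking on the mount.

```python
import pathlib
import sqlite3


def dump_schema(db_path):
    """Return (name, CREATE statement) pairs for all tables in db_path.

    The database is opened read-only through a file: URI, so nothing is
    written to the file. dump_schema is a hypothetical helper, not part
    of the application discussed in this thread.
    """
    # Build an absolute file: URI and request read-only mode.
    uri = pathlib.Path(db_path).resolve().as_uri() + "?mode=ro"
    conn = sqlite3.connect(uri, uri=True)
    try:
        # sqlite_master holds the schema; reading it is a plain SELECT.
        rows = conn.execute(
            "SELECT name, sql FROM sqlite_master "
            "WHERE type = 'table' ORDER BY name"
        ).fetchall()
    finally:
        conn.close()
    return rows
```

Running several copies of dump_schema in parallel against a database file on
the ceph-fuse mount should come close to the reported scenario, although the
real application may issue additional queries.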


> Best regards,
> Burkhard
>
> _______________________________________________
> ceph-users mailing list
> ceph-users@xxxxxxxxxxxxxx
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com