For those following along: it looks like the investigation of this has
moved to the tracker now, where Zheng is investigating.
-Greg

On Fri, Nov 3, 2017 at 12:48 PM, Andras Pataki
<apataki@xxxxxxxxxxxxxxxxxxxxx> wrote:
> I've tested the 12.2.1 fuse client - and unfortunately it also reproduces
> the problem. Investigating the code that accesses the file system, it
> looks like multiple processes on multiple nodes write to the same file
> concurrently, but to different byte ranges of it. Unfortunately, the
> problem only shows up some hours into the run, so I can't really run the
> MDS or fuse at a very high debug level for that long. Perhaps I could run
> fuse with a higher debug level on just the nodes in question, if that
> helps.
>
> Andras
>
> On 11/03/2017 12:29 AM, Gregory Farnum wrote:
>
> Either ought to work fine.
>
> On Thu, Nov 2, 2017 at 4:58 PM Andras Pataki
> <apataki@xxxxxxxxxxxxxxxxxxxxx> wrote:
>>
>> I'm planning to test the newer ceph-fuse tomorrow. Would it be better to
>> stay with the Jewel 10.2.10 client, or would the 12.2.1 Luminous client
>> be better (even though the back end is Jewel for now)?
>>
>> Andras
>>
>> On 11/02/2017 05:54 PM, Gregory Farnum wrote:
>>
>> Have you tested on the new ceph-fuse? This does sound vaguely familiar,
>> and it is the kind of issue I'd generally expect to have the fix
>> backported for, once it was identified.
>>
>> On Thu, Nov 2, 2017 at 11:40 AM Andras Pataki
>> <apataki@xxxxxxxxxxxxxxxxxxxxx> wrote:
>>>
>>> We've been running into a strange problem with Ceph using ceph-fuse and
>>> the filesystem. All the back-end nodes are on 10.2.10; the fuse clients
>>> are on 10.2.7.
>>>
>>> After some hours of runs, some processes get stuck waiting in fuse,
>>> like this:
>>>
>>> [root@worker1144 ~]# cat /proc/58193/stack
>>> [<ffffffffa08cd241>] wait_answer_interruptible+0x91/0xe0 [fuse]
>>> [<ffffffffa08cd653>] __fuse_request_send+0x253/0x2c0 [fuse]
>>> [<ffffffffa08cd6d2>] fuse_request_send+0x12/0x20 [fuse]
>>> [<ffffffffa08d69d6>] fuse_send_write+0xd6/0x110 [fuse]
>>> [<ffffffffa08d84d5>] fuse_perform_write+0x2f5/0x5a0 [fuse]
>>> [<ffffffffa08d8a21>] fuse_file_aio_write+0x2a1/0x340 [fuse]
>>> [<ffffffff811fdfbd>] do_sync_write+0x8d/0xd0
>>> [<ffffffff811fe82d>] vfs_write+0xbd/0x1e0
>>> [<ffffffff811ff34f>] SyS_write+0x7f/0xe0
>>> [<ffffffff816975c9>] system_call_fastpath+0x16/0x1b
>>> [<ffffffffffffffff>] 0xffffffffffffffff
>>>
>>> The cluster is healthy (all OSDs up, no slow requests, etc.). More
>>> details of my investigation efforts are in the bug report I just
>>> submitted:
>>> http://tracker.ceph.com/issues/22008
>>>
>>> It looks like the fuse client is asking for some caps that it never
>>> thinks it receives from the MDS, so the thread waiting for those caps
>>> on behalf of the writing process never wakes up. Restarting the MDS
>>> fixes the problem (since ceph-fuse then re-negotiates its caps).
>>>
>>> Any ideas/suggestions?
>>>
>>> Andras
>>

_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
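
For readers trying to reproduce the hang, here is a minimal sketch of the
access pattern Andras describes: several processes, e.g. one per node, each
writing a disjoint byte range of one shared file on a ceph-fuse mount. The
rank argument and the 4 MiB chunk size are illustrative assumptions, not
details taken from the original report.

    /* shared_write.c - minimal sketch of the reported access pattern:
     * each process writes only its own, non-overlapping region of one
     * shared file on a CephFS (ceph-fuse) mount. */
    #include <fcntl.h>
    #include <stdio.h>
    #include <stdlib.h>
    #include <string.h>
    #include <unistd.h>

    #define CHUNK (4 * 1024 * 1024)   /* 4 MiB per writer; hypothetical */

    int main(int argc, char **argv)
    {
        if (argc != 3) {
            fprintf(stderr, "usage: %s <file-on-cephfs> <rank>\n", argv[0]);
            return 1;
        }
        int rank = atoi(argv[2]);

        int fd = open(argv[1], O_WRONLY | O_CREAT, 0644);
        if (fd < 0) { perror("open"); return 1; }

        char *buf = malloc(CHUNK);
        if (!buf) { perror("malloc"); return 1; }
        memset(buf, 'A' + (rank % 26), CHUNK);

        /* Each rank writes a disjoint byte range of the same file. */
        off_t off = (off_t)rank * CHUNK;
        if (pwrite(fd, buf, CHUNK, off) != CHUNK) {
            perror("pwrite");
            return 1;
        }

        free(buf);
        return close(fd);
    }

Run one instance per node against the same CephFS path, each with a distinct
rank; the reported workload repeats writes like this for hours before the
hang appears.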
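
On running only the affected clients at a higher debug level: one way to do
that is via ceph.conf on just those nodes, or equivalently on the ceph-fuse
command line. This is a sketch; the mount point is illustrative, and
debug client = 20 is extremely verbose, so expect large logs.

    [client]
        debug client = 20
        debug ms = 1

    # or, when mounting by hand:
    ceph-fuse --debug-client=20 --debug-ms=1 /mnt/cephfs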
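
Since the symptom is a client waiting on caps it believes were never granted,
the client's own view can also be checked without restarting anything,
assuming the admin socket is enabled; the socket path below is a common
default and the exact command set varies by Ceph version.

    # on a stuck client node:
    ceph daemon /var/run/ceph/ceph-client.admin.asok mds_requests
    ceph daemon /var/run/ceph/ceph-client.admin.asok mds_sessions
    ceph daemon /var/run/ceph/ceph-client.admin.asok dump_cache

    # and on the active MDS:
    ceph daemon mds.<name> dump_ops_in_flight
    ceph daemon mds.<name> session ls

Comparing the client's view of its caps and sessions with the MDS's in-flight
ops can help show whether a cap grant was lost on the client side or never
issued by the MDS.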