On Mon, May 14, 2018 at 5:37 PM, Josef Zelenka
<josef.zelenka@xxxxxxxxxxxxxxxx> wrote:
> Hi everyone, we've encountered an unusual thing in our setup (4 nodes, 48
> OSDs, 3 monitors - ceph Jewel, Ubuntu 16.04 with kernel 4.4.0). Yesterday
> we were doing a HW upgrade of the nodes, so they went down one by one.
> The cluster was in good shape during the upgrade, as we've done this
> numerous times and we're quite sure that the redundancy wasn't broken
> while doing this. However, during this upgrade one of the clients that
> does backups to cephfs (mounted via the kernel driver) failed to write
> the backup file correctly to the cluster, with the following trace after
> we turned off one of the nodes:
>
> [2585732.529412] ffff8800baa279a8 ffffffff813fb2df ffff880236230e00 ffff8802339c0000
> [2585732.529414] ffff8800baa28000 ffff88023fc96e00 7fffffffffffffff ffff8800baa27b20
> [2585732.529415] ffffffff81840ed0 ffff8800baa279c0 ffffffff818406d5 0000000000000000
> [2585732.529417] Call Trace:
> [2585732.529505] [<ffffffff813fb2df>] ? cpumask_next_and+0x2f/0x40
> [2585732.529558] [<ffffffff81840ed0>] ? bit_wait+0x60/0x60
> [2585732.529560] [<ffffffff818406d5>] schedule+0x35/0x80
> [2585732.529562] [<ffffffff81843825>] schedule_timeout+0x1b5/0x270
> [2585732.529607] [<ffffffff810642be>] ? kvm_clock_get_cycles+0x1e/0x20
> [2585732.529609] [<ffffffff81840ed0>] ? bit_wait+0x60/0x60
> [2585732.529611] [<ffffffff8183fc04>] io_schedule_timeout+0xa4/0x110
> [2585732.529613] [<ffffffff81840eeb>] bit_wait_io+0x1b/0x70
> [2585732.529614] [<ffffffff81840c6e>] __wait_on_bit_lock+0x4e/0xb0
> [2585732.529652] [<ffffffff8118f3cb>] __lock_page+0xbb/0xe0
> [2585732.529674] [<ffffffff810c4460>] ? autoremove_wake_function+0x40/0x40
> [2585732.529676] [<ffffffff8119078d>] pagecache_get_page+0x17d/0x1c0
> [2585732.529730] [<ffffffffc056b3a8>] ? ceph_pool_perm_check+0x48/0x700 [ceph]
> [2585732.529732] [<ffffffff811907f6>] grab_cache_page_write_begin+0x26/0x40
> [2585732.529738] [<ffffffffc056a6a8>] ceph_write_begin+0x48/0xe0 [ceph]
> [2585732.529739] [<ffffffff8118fd6e>] generic_perform_write+0xce/0x1c0
> [2585732.529763] [<ffffffff8122bdb9>] ? file_update_time+0xc9/0x110
> [2585732.529769] [<ffffffffc05651c9>] ceph_write_iter+0xf89/0x1040 [ceph]
> [2585732.529792] [<ffffffff81199c19>] ? __alloc_pages_nodemask+0x159/0x2a0
> [2585732.529808] [<ffffffff8120fedb>] new_sync_write+0x9b/0xe0
> [2585732.529811] [<ffffffff8120ff46>] __vfs_write+0x26/0x40
> [2585732.529812] [<ffffffff812108c9>] vfs_write+0xa9/0x1a0
> [2585732.529814] [<ffffffff81211585>] SyS_write+0x55/0xc0
> [2585732.529817] [<ffffffff818447f2>] entry_SYSCALL_64_fastpath+0x16/0x71

Is there any hung OSD request in /sys/kernel/debug/ceph/xxxx/osdc?

> I have encountered this behavior on Luminous, but not on Jewel. Anyone who
> has a clue why the write fails? As far as I'm concerned, it should always
> work if all the PGs are available. Thanks
> Josef
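
A note on checking that debugfs file: osdc lists the OSD requests the kernel
client still has in flight, so a write stuck in ceph_write_begin/__lock_page
will usually show up there as one or more pending entries. Below is a minimal
Python sketch of how one might dump and count them on the client; the debugfs
mount point /sys/kernel/debug, the fsid directory matched by the wildcard, and
the assumption that request lines begin with a numeric tid are all assumptions
here, since the exact osdc layout differs between kernel versions.

#!/usr/bin/env python3
# Rough helper: dump /sys/kernel/debug/ceph/*/osdc and count the request
# lines still pending on a kernel Ceph client (cephfs or rbd).
# Needs root, and debugfs mounted at /sys/kernel/debug (both assumptions).
import glob
import sys

def dump_osdc():
    paths = glob.glob("/sys/kernel/debug/ceph/*/osdc")
    if not paths:
        sys.exit("no /sys/kernel/debug/ceph/*/osdc found "
                 "(debugfs not mounted, or not running as root?)")
    for path in paths:
        print("=== %s ===" % path)
        with open(path) as f:
            lines = [line.rstrip("\n") for line in f]
        # Header lines such as "REQUESTS ..." or "LINGER REQUESTS" vary by
        # kernel version; treat lines starting with a digit (the request tid)
        # as pending request entries.
        pending = [line for line in lines if line[:1].isdigit()]
        for line in lines:
            print(line)
        print("--> %d request line(s) still pending" % len(pending))

if __name__ == "__main__":
    dump_osdc()

If request lines keep sitting there while the write hangs, and they point at
OSDs on the node that was powered off, that would suggest the client is still
waiting at the RADOS level (for new OSD maps / PG peering) rather than on
anything CephFS-specific.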