Hello again!

On Linux, if you run an OSD on an ext4 filesystem, have a cephfs kernel client mount on the same system, and have no syncfs() system call (as is to be expected with libc6 < 2.14 or kernel < 2.6.39), the OSD deadlocks in sys_sync(). Only a reboot recovers the system.

After some investigation of the code, this is what I found:

In src/common/sync_filesystem.h, the function sync_filesystem() first tries a syncfs() (not available), then a btrfs ioctl sync (not available on non-btrfs), then finally a sync(). sys_sync() tries to sync all filesystems, including the journal device, the OSD storage area and the cephfs mount. Under some load, when the OSD calls sync(), the cephfs sync waits for the local OSD, which in turn is waiting for its own storage to sync, which the kernel only wants to do after the cephfs sync. Deadlock. (A sketch of this fallback chain is appended at the end of this mail.)

sync_filesystem() is called by FileStore::sync_entry() in src/os/FileStore.cc, but only on non-btrfs storage and only if filestore_fsync_flushes_journal_data is false. After forcing this option to true in the OSD config, our test cluster survived three days of heavy load (and is still running fine) instead of deadlocking all nodes within an hour.

Reproduced with 0.47.2 and kernel 3.2.18, but the related code seems unchanged in current master.

Conclusion: If you want to run an OSD and a cephfs kernel client on the same Linux server and have a libc6 before 2.14 (e.g. Debian's newest, in experimental, is 2.13) or a kernel before 2.6.39, either do not use ext4 (but btrfs is still unstable), or accept the risk of data loss through missed syncs by working around the problem with filestore_fsync_flushes_journal_data = true.

Please consider putting out a fat warning, at least at build time, if syncfs() is not available, e.g. "No syncfs() syscall, please expect a deadlock when running osd on non-btrfs together with a local cephfs mount." Even better would be a quick runtime test for a missing syncfs() and storage on non-btrfs that prints a warning if the deadlock is possible. (A sketch of such a check is appended below as well.)

As a side effect, the lockup we experienced seems to be a good way to reproduce the long-standing bug 1047: when our cluster tried to recover, all MDS instances died with those symptoms. It seems that a partial sync of the journal or data partition causes that broken state.

Amon Ott
-- 
Dr. Amon Ott
m-privacy GmbH              Tel: +49 30 24342334
Am Köllnischen Park 1       Fax: +49 30 24342336
10179 Berlin                http://www.m-privacy.de

Amtsgericht Charlottenburg, HRB 84946
Managing directors (Geschäftsführer): Dipl.-Kfm. Holger Maczkowsky, Roman Maczkowsky

GnuPG-Key-ID: 0x2DD3A649
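
Appended sketch 1: a minimal sketch of the fallback chain in sync_filesystem() as described above. This is not the actual Ceph code; the SYS_syncfs guard and the btrfs ioctl definition are assumptions about the build environment, but the order of attempts (syncfs(), btrfs ioctl, plain sync()) is what matters for the deadlock.

#include <linux/ioctl.h>
#include <sys/ioctl.h>
#include <sys/syscall.h>
#include <unistd.h>

#ifndef BTRFS_IOC_SYNC
#define BTRFS_IOC_SYNC _IO(0x94, 8)   /* btrfs "sync this filesystem" ioctl */
#endif

/* fd refers to any file or directory on the filesystem to be flushed. */
int sync_filesystem_sketch(int fd)
{
#if defined(SYS_syncfs)
  /* Preferred: sync only the filesystem containing fd (kernel >= 2.6.39). */
  if (syscall(SYS_syncfs, fd) == 0)
    return 0;
#endif
  /* Next: btrfs-specific sync ioctl; fails with ENOTTY on ext4 and friends. */
  if (ioctl(fd, BTRFS_IOC_SYNC) == 0)
    return 0;
  /* Last resort: sync() flushes *all* filesystems, including a local cephfs
   * mount -- this is the path where the deadlock described above occurs. */
  sync();
  return 0;
}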
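
Appended sketch 2: one possible form of the runtime check suggested above, warning at OSD startup if syncfs() is unavailable and the data directory is not on btrfs. The function name, probing strategy and message wording are mine for illustration, not an existing Ceph interface.

#include <cerrno>
#include <cstdio>
#include <fcntl.h>
#include <sys/statfs.h>
#include <sys/syscall.h>
#include <unistd.h>

#ifndef BTRFS_SUPER_MAGIC
#define BTRFS_SUPER_MAGIC 0x9123683E  /* f_type reported by statfs() on btrfs */
#endif

void warn_if_sync_deadlock_possible(const char *osd_data_path)
{
  int fd = open(osd_data_path, O_RDONLY);
  if (fd < 0)
    return;

  bool have_syncfs = false;
#if defined(SYS_syncfs)
  /* Probe the syscall directly: ENOSYS means the running kernel lacks it,
   * even if the headers used at build time declared it. */
  have_syncfs = (syscall(SYS_syncfs, fd) == 0 || errno != ENOSYS);
#endif

  struct statfs st;
  bool on_btrfs = (fstatfs(fd, &st) == 0 && st.f_type == BTRFS_SUPER_MAGIC);
  close(fd);

  if (!have_syncfs && !on_btrfs)
    fprintf(stderr,
            "WARNING: no syncfs() syscall and OSD data is not on btrfs; "
            "running osd together with a local cephfs mount may deadlock in sync().\n");
}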