Hello,

I'm quite concerned by this (and the silence from the devs); however,
there are a number of people doing similar things (at least with Hammer),
and you'd think they would have been bitten by this if it were a systemic
bug.

More below.

On Sat, 6 Feb 2016 11:31:51 +0100 Udo Waechter wrote:

> Hello,
>
> I am experiencing totally weird filesystem corruptions with the
> following setup:
>
> * Ceph Infernalis on Debian 8

Hammer here, might be a regression.

> * 10 OSDs (5 hosts) with spinning disks
> * 4 OSDs (1 host, with SSDs)
>
So you're running your cache tier host with a replication of 1, I presume?
What kind of SSDs/filesystem/other relevant configuration options?
Could there simply be some corruption on the SSDs that is then, of course,
eventually presented to the RBD clients?

> The SSDs are new in my setup and I am trying to set up a cache tier.
>
> Now, with the spinning disks, Ceph has been running for about a year
> without any major issues. Replacing disks and all that went fine.
>
> Ceph is used by rbd+libvirt+kvm with
>
> rbd_cache = true
> rbd_cache_writethrough_until_flush = true
> rbd_cache_size = 128M
> rbd_cache_max_dirty = 96M
>
> Also, in libvirt, I have
>
> cachemode=writeback enabled.
>
> So far so good.
>
> Now I've added the SSD cache tier to the picture with "cache-mode
> writeback".
>
> The SSD machine also has the "deadline" scheduler enabled.
>
> Suddenly, VMs start to corrupt their filesystems (all ext4) with
> "Journal failed".
> Trying to reboot the machines ends in "No bootable drive".
> Using parted and testdisk on the image mapped via rbd reveals that the
> partition table is gone.
>
Did turning the cache explicitly off (both Ceph and qemu) fix this?

> testdisk finds the proper partitions; e2fsck repairs the filesystem
> beyond use afterwards.
>
> This does not happen to all machines; it happens to those that actually
> do some or most of the I/O:
>
> elasticsearch, MariaDB+Galera, postgres, backup, git
>
> Or so I thought: yesterday one of my LDAP servers died, and that one is
> not doing any I/O.
>
> Could it be that rbd caching + qemu writeback cache + Ceph cache tier
> writeback are not playing well together?
>
> I've read through some older mails on the list where people had similar
> problems and suspected something like that.
>
Any particular references (URLs, Message-IDs)?

Regards,

Christian

> What are the proper/right settings for rbd/qemu/libvirt?
>
> libvirt: cachemode=none (writeback?)
> rbd: cache_mode = none
> SSD tier: cache-mode: writeback
>
> ?
>
> Thanks for any help,
> udo.

--
Christian Balzer        Network/Systems Engineer
chibi@xxxxxxx           Global OnLine Japan/Rakuten Communications
http://www.gol.com/
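
P.S.: To check the replication of the cache pool I asked about above,
something like the following should do. A minimal sketch; "ssd-cache" is
a hypothetical pool name, substitute your actual cache pool:

    # Replica count of the cache pool; a size of 1 means no redundancy,
    # so any single SSD failure or bit rot hits the clients directly.
    ceph osd pool get ssd-cache size
    ceph osd pool get ssd-cache min_size

    # The pool lines in the osdmap also show the tiering relationship
    # and the currently active cache-mode.
    ceph osd dump | grep ssd-cache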
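P.P.S.: For the "turning the cache explicitly off" test, this is roughly
what I had in mind. Again just a sketch, your disk definitions will of
course differ:

    # /etc/ceph/ceph.conf on the hypervisors (restart the VMs afterwards
    # so librbd picks it up):
    [client]
    rbd cache = false

    # And in the libvirt domain XML, per disk, so qemu does no caching of
    # its own either:
    <driver name='qemu' type='raw' cache='none'/>

If the corruption stops with both of these off, that would at least
narrow it down to the client-side caching rather than the cache tier.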