Hello,

this was of course discussed here in the very recent thread
"data corruption with hammer".
Read it, it contains fixes and a workaround as well.

Also from that thread:
http://tracker.ceph.com/issues/12814

You don't need to remove the cache tier to fix things.

And as also discussed here, getting the last (header) object out of the
cache requires stopping VMs in the case of RBD images, and I would suspect
something similar with cephfs.

Christian

On Fri, 25 Mar 2016 16:47:11 -0700 Blade Doyle wrote:

> Help, my Ceph cluster is losing data slowly over time. I keep finding
> files that are the same length as they should be, but all the content
> has been lost & replaced by nulls.
>
> Here is an example:
>
> (from a backup I have the original file)
>
> [root@blotter docker]# ls -lart /backup/space/docker/ceph-monitor/ceph-w-monitor.py /space/docker/ceph-monitor/ceph-w-monitor.py
> -rwxrwxrwx 1 root root 7237 Mar 12 07:34 /backup/space/docker/ceph-monitor/ceph-w-monitor.py
> -rwxrwxrwx 1 root root 7237 Mar 12 07:34 /space/docker/ceph-monitor/ceph-w-monitor.py
>
> [root@blotter docker]# sum /backup/space/docker/ceph-monitor/ceph-w-monitor.py
> 19803     8
>
> [root@blotter docker]# sum /space/docker/ceph-monitor/ceph-w-monitor.py
> 00000     8
>
> If I had to _guess_ I would blame a recent change to the writeback cache
> tier layer. I turned it off and flushed it last weekend... about the
> same time I started to notice this data loss.
>
> I disabled it using instructions from here:
> http://docs.ceph.com/docs/master/rados/operations/cache-tiering/
>
> Basically, I just set it to "forward" and then "flushed it":
>
> ceph osd tier cache-mode ssd_cache forward
> and
> rados -p ssd_cache cache-flush-evict-all
>
> After that I tried to remove the overlay, but that failed (and still
> fails) with:
>
>
> Finally, I tried to remove the cache tier from the backing pool, but that
> failed (still fails) with:
>
> $ ceph osd tier remove-overlay cephfs_data
> Error EBUSY: pool 'cephfs_data' is in use by CephFS via its tier
>
> At that point I thought, because I had set the cache-mode to "forward",
> it would be safe to just leave it as is until I had time to debug
> further.
>
> I should mention that after the cluster settled down and did some
> scrubbing, there was one inconsistent page. I ran a "ceph fix page xxx"
> command to resolve that and the health was good again.
>
> I can do some experimenting this weekend if somebody wants to help me
> through it. Otherwise I'll probably try to put the cache-tier back into
> "writeback" to see if that helps. If not, I'll recreate the entire ceph
> cluster.
>
> Thanks,
> Blade.
>
>
> P.S. My cluster is made of mixed ARM and x86_64:
>
> $ ceph version
> ceph version 0.94.3 (95cefea9fd9ab740263bf8bb4796fd864d9afe2b)
>
> # ceph version
> ceph version 0.94.6 (e832001feaf8c176593e0325c8298e3f16dfb403)
>
> etc...
>
>
> PPS:
>
> $ ceph df
> GLOBAL:
>     SIZE      AVAIL     RAW USED     %RAW USED
>     2456G     1492G         839G         34.16
> POOLS:
>     NAME                ID     USED       %USED     MAX AVAIL     OBJECTS
>     rbd                 0        139G      5.66          185G       36499
>     cephfs_data         1        235G      9.59          185G      102883
>     cephfs_metadata     2      33642k         0          185G        5530
>     ssd_cache           4           0         0          370G           0

-- 
Christian Balzer        Network/Systems Engineer
chibi@xxxxxxx           Global OnLine Japan/Rakuten Communications
http://www.gol.com/
_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
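
The symptom described in the quoted message (files that keep their length but whose content has been replaced by nulls) can be checked for across a whole tree instead of one file at a time. The following is only a minimal sketch and not something posted in the thread: the function names `is_all_nulls` and `find_nulled_files` are made up for illustration, and it assumes the backup tree mirrors the layout of the live tree. It does not talk to Ceph at all; it only reads files through the mounted filesystem.

```python
import os

CHUNK = 1 << 20  # read 1 MiB at a time so large files are not loaded whole


def is_all_nulls(path):
    """Return True if the file is non-empty and every byte in it is NUL."""
    seen_data = False
    with open(path, "rb") as f:
        while True:
            chunk = f.read(CHUNK)
            if not chunk:
                return seen_data
            seen_data = True
            if chunk.count(b"\0") != len(chunk):
                return False


def find_nulled_files(live_root, backup_root=None):
    """List files under live_root (relative paths) that consist only of NULs.

    If backup_root is given (assumed to mirror live_root), a file is only
    reported when the backup copy exists, has the same size, and is NOT
    all NULs, i.e. the live copy kept its length but lost its content.
    """
    suspects = []
    for dirpath, _dirnames, filenames in os.walk(live_root):
        for name in filenames:
            live = os.path.join(dirpath, name)
            if not os.path.isfile(live) or not is_all_nulls(live):
                continue
            rel = os.path.relpath(live, live_root)
            if backup_root is not None:
                backup = os.path.join(backup_root, rel)
                if not os.path.isfile(backup):
                    continue
                if os.path.getsize(backup) != os.path.getsize(live):
                    continue
                if is_all_nulls(backup):
                    continue  # the file was legitimately all NULs
            suspects.append(rel)
    return sorted(suspects)
```

With the paths from the example above, this would be called as `find_nulled_files("/space/docker", "/backup/space/docker")`. Comparing against the backup avoids flagging files that were all nulls to begin with.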