Re: Losing data in healthy cluster

Hello,

This was of course discussed here in the very recent thread
"data corruption with hammer".

Read it; it contains fixes and a workaround as well.
Also from that thread: http://tracker.ceph.com/issues/12814

You don't need to remove the cache tier to fix things.

And as also discussed here, getting the last (header) object out of the
cache requires stopping VMs in the case of RBD images, and I would
suspect something similar applies to CephFS.
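
For RBD that boils down to something along these lines (untested sketch;
assumes a format 2 image and your pool names, "myimage" is just an
example):

# the image id is the suffix of block_name_prefix (rbd_data.<id>),
# the matching header object is rbd_header.<id>
$ rbd -p rbd info myimage | grep block_name_prefix
# stop the VM using the image first, then flush and evict the header
$ rados -p ssd_cache cache-flush rbd_header.<id>
$ rados -p ssd_cache cache-evict rbd_header.<id>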

Christian


On Fri, 25 Mar 2016 16:47:11 -0700 Blade Doyle wrote:

> Help, my Ceph cluster is losing data slowly over time.  I keep finding
> files that are the same length as they should be, but all the content
> has been lost & replaced by nulls.
> 
> Here is an example:
> 
> (from a backup I have the original file)
> 
> [root@blotter docker]# ls -lart /backup/space/docker/ceph-monitor/ceph-w-monitor.py /space/docker/ceph-monitor/ceph-w-monitor.py
> -rwxrwxrwx 1 root root 7237 Mar 12 07:34 /backup/space/docker/ceph-monitor/ceph-w-monitor.py
> -rwxrwxrwx 1 root root 7237 Mar 12 07:34 /space/docker/ceph-monitor/ceph-w-monitor.py
> 
> [root@blotter docker]# sum /backup/space/docker/ceph-monitor/ceph-w-monitor.py
> 19803     8
> 
> [root@blotter docker]# sum /space/docker/ceph-monitor/ceph-w-monitor.py
> 00000     8
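
To hunt for other zeroed-out files without diffing everything against
the backup, a sweep like this should work (untested; assumes GNU stat,
and flags any non-empty file consisting of nothing but NUL bytes):

# compare each file against the same number of bytes from /dev/zero
$ find /space -type f -size +0c | while read -r f; do \
    head -c "$(stat -c%s "$f")" /dev/zero | cmp -s - "$f" && echo "$f"; done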
> 
> 
> 
> If I had to _guess_ I would blame a recent change to the writeback
> cache tier layer.  I turned it off and flushed it last weekend, about
> the same time I started to notice this data loss.
> 
> I disabled it using instructions from here:
> http://docs.ceph.com/docs/master/rados/operations/cache-tiering/
> 
> Basically, I just set the cache-mode to "forward" and then flushed it:
> 
> ceph osd tier cache-mode ssd_cache forward
> rados -p ssd_cache cache-flush-evict-all
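
Note that cache-flush-evict-all cannot flush or evict objects that are
still in use (anything locked or watched by a client), so it is worth
checking what is left in the cache pool afterwards:

# both should show the cache pool (nearly) empty
$ rados -p ssd_cache ls | head
$ ceph df detail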
> 
> After that I tried to remove the overlay (and then the cache tier
> itself) from the backing pool, but the first step failed (and still
> fails) with:
> 
> $ ceph osd tier remove-overlay cephfs_data
> Error EBUSY: pool 'cephfs_data' is in use by CephFS via its tier
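
That EBUSY is expected: CephFS still references the pool, so the overlay
(and the tier) cannot be removed while the filesystem is using it, and
as said above you don't need to remove it anyway. To see which pools
CephFS is holding on to:

# lists the metadata and data pools of the filesystem
$ ceph fs ls
$ ceph mds dump | grep pool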
> 
> At that point I thought, because I had set the cache-mode to "forward",
> it would be safe to just leave it as is until I had time to debug
> further.
> 
> I should mention that after the cluster settled down and did some
> scrubbing, there was one inconsistent PG.  I ran a "ceph pg repair xxx"
> command to resolve that and the health was good again.
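
For the record, the usual way to handle those is to find the PG via
"ceph health detail" and repair it, keeping in mind that repair on
Hammer essentially trusts the primary copy:

$ ceph health detail | grep inconsistent
# <pgid> is the placement group id from the line above
$ ceph pg repair <pgid>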
> 
> 
> I can do some experimenting this weekend if somebody wants to help me
> through it.  Otherwise I'll probably try to put the cache-tier back into
> "writeback" to see if that helps.  If not, I'll recreate the entire ceph
> cluster.
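
Going back to writeback would be a single command (the overlay is still
in place anyway, since removing it failed):

$ ceph osd tier cache-mode ssd_cache writeback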
> 
> Thanks,
> Blade.
> 
> 
> P.S. My cluster is made of mixed ARM and x86_64 nodes.
> $ ceph version
> ceph version 0.94.3 (95cefea9fd9ab740263bf8bb4796fd864d9afe2b)
> 
> # ceph version
> ceph version 0.94.6 (e832001feaf8c176593e0325c8298e3f16dfb403)
> 
> etc...
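
Handy with mixed nodes like that: you can ask every OSD directly which
version it is actually running:

$ ceph tell osd.* version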
> 
> 
> PPS:
> 
> $ ceph df
> GLOBAL:
>     SIZE      AVAIL     RAW USED     %RAW USED
>     2456G     1492G         839G         34.16
> POOLS:
>     NAME                ID     USED       %USED     MAX AVAIL     OBJECTS
>     rbd                 0        139G      5.66          185G       36499
>     cephfs_data         1        235G      9.59          185G      102883
>     cephfs_metadata     2      33642k         0          185G        5530
>     ssd_cache           4           0         0          370G           0


-- 
Christian Balzer        Network/Systems Engineer                
chibi@xxxxxxx   	Global OnLine Japan/Rakuten Communications
http://www.gol.com/


