Help, my Ceph cluster is losing data slowly over time. I keep finding files
that are the same length as they should be, but all the content has been
lost & replaced by nulls.
Here is an example (I still have the original file in a backup):
[root@blotter docker]# ls -lart /backup/space/docker/ceph-monitor/ceph-w-monitor.py /space/docker/ceph-monitor/ceph-w-monitor.py
-rwxrwxrwx 1 root root 7237 Mar 12 07:34 /backup/space/docker/ceph-monitor/ceph-w-monitor.py
-rwxrwxrwx 1 root root 7237 Mar 12 07:34 /space/docker/ceph-monitor/ceph-w-monitor.py
[root@blotter docker]# sum /backup/space/docker/ceph-monitor/ceph-w-monitor.py
19803 8
[root@blotter docker]# sum /space/docker/ceph-monitor/ceph-w-monitor.py
00000 8
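In case it helps anyone reproduce this, here is roughly how I am scanning for the zeroed files. It is just a sketch (it assumes GNU stat/head/cmp and that /space is the CephFS mount):

find /space -type f -size +0c -print0 |
while IFS= read -r -d '' f; do
    # flag files whose entire content compares equal to a zero-stream of the same length
    head -c "$(stat -c %s "$f")" /dev/zero | cmp -s - "$f" && echo "all nulls: $f"
done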
If I had to _guess_, I would blame a recent change to the writeback cache
tier. I turned it off and flushed it last weekend, which is about the same
time I started to notice this data loss.
I disabled it using instructions from here:
http://docs.ceph.com/docs/master/rados/operations/cache-tiering/
Basically, I just set the cache-mode to "forward" and then flushed it:
ceph osd tier cache-mode ssd_cache forward
and
rados -p ssd_cache cache-flush-evict-all
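One thing I should probably have done before leaving it in "forward" is to confirm the flush actually emptied the cache pool. I think something like this is enough to check (ssd_cache should show zero objects and the listing should print nothing):

$ ceph df | grep ssd_cache     # object count for the cache pool
$ rados -p ssd_cache ls        # should be empty if everything was evicted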
After the flush, I tried to remove the overlay (i.e., detach the cache tier
from the backing pool), but that failed, and still fails, with:
$ ceph osd tier remove-overlay cephfs_data
Error EBUSY: pool 'cephfs_data' is in use by CephFS via its tier
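For reference, the remaining steps from the cache-tiering doc that I have not been able to get past are roughly these (from memory, so treat it as a sketch; pool names are from my setup):

$ ceph fs ls                                  # double-check which pools CephFS is using
$ ceph osd tier remove-overlay cephfs_data    # the step that currently fails with EBUSY
$ ceph osd tier remove cephfs_data ssd_cache  # final step: detach the cache pool from the backing pool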
At that point I thought, because I had set the cache-mode to "forward", it
would be safe to just leave it as is until I had time to debug further.
I should mention that after the cluster settled down and did some scrubbing,
there was one inconsistent PG. I ran "ceph pg repair" on it to resolve that
and the health was good again.
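For completeness, the repair was along these lines (the actual PG id is elided; it came from the health output):

$ ceph health detail | grep inconsistent    # lists the inconsistent PG ids
$ ceph pg repair <pgid>                     # <pgid> is the id reported above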
I can do some experimenting this weekend if somebody wants to help me
through it. Otherwise I'll probably try putting the cache tier back into
"writeback" to see if that helps. If not, I'll recreate the entire Ceph
cluster.
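If I do try the "put it back" route first, I believe it is just this (the overlay was never successfully removed, so it should still be in place):

$ ceph osd tier cache-mode ssd_cache writeback

If the overlay did somehow get dropped along the way, I think it would also need "ceph osd tier set-overlay cephfs_data ssd_cache".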
Thanks,
Blade.
P.S. My cluster is a mix of ARM and x86_64 nodes.
$ ceph version
ceph version 0.94.3 (95cefea9fd9ab740263bf8bb4796fd864d9afe2b)
# ceph version
ceph version 0.94.6 (e832001feaf8c176593e0325c8298e3f16dfb403)
etc...
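(Those two version strings are from different nodes. To survey the rest of the cluster I have been doing roughly the following, though I have not double-checked how well the wildcard form behaves on hammer:)

$ ceph tell osd.* version      # ask every OSD for its version
$ ceph tell mon.<id> version   # per monitor; <id> is the mon name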
P.P.S.
$ ceph df
GLOBAL:
    SIZE      AVAIL     RAW USED     %RAW USED
    2456G     1492G         839G         34.16
POOLS:
    NAME                ID     USED       %USED     MAX AVAIL     OBJECTS
    rbd                 0      139G        5.66          185G       36499
    cephfs_data         1      235G        9.59          185G      102883
    cephfs_metadata     2      33642k         0          185G        5530
    ssd_cache           4      0              0          370G           0