On Tue, Feb 27, 2018 at 2:29 PM, Jan Pekař - Imatic <jan.pekar@xxxxxxxxx> wrote:
> I think I hit the same issue.
> I have corrupted data on CephFS, and I don't remember seeing this issue before
> Luminous (I ran the same tests before).
>
> It is on my single-node test cluster with less memory than recommended (so the
> server is swapping), but it shouldn't lose data (it never did before).
> So slow requests may appear in the log, as Florent B mentioned.
>
> My test is to take some bigger files (a few GB) and copy them to CephFS, or from
> CephFS to CephFS, and stress the cluster so that the copy stalls for a while.
> It resumes after a few seconds/minutes and everything looks OK (no error while
> copying), but the copied file may be silently corrupted.
>
> I checked the files with md5sum and compared some corrupted files in detail.
> Some 4MB blocks of data (the CephFS object size) were missing - the corrupted
> file had those blocks filled with zeroes.
>
> My idea is that something goes wrong when the cluster is under pressure and the
> client wants to write a block. The client gets an OK and continues with the next
> block, so the data is lost and the corrupted block is filled with zeroes.
>

Was your cluster near full when the issue happened?

Regards
Yan, Zheng

> I tried the kernel client (4.x) and the ceph-fuse client with the same result.
>
> I'm using erasure coding for the CephFS data pool, a cache tier, and my storage
> is a mix of BlueStore and FileStore.
>
> How can I help to debug, or what should I do to help find the problem?
>
> With regards
> Jan Pekar
>
>
> On 14.12.2017 15:41, Yan, Zheng wrote:
>>
>> On Thu, Dec 14, 2017 at 8:52 PM, Florent B <florent@xxxxxxxxxxx> wrote:
>>>
>>> On 14/12/2017 03:38, Yan, Zheng wrote:
>>>>
>>>> On Thu, Dec 14, 2017 at 12:49 AM, Florent B <florent@xxxxxxxxxxx> wrote:
>>>>>
>>>>>
>>>>> Systems are on Debian Jessie: kernel 3.16.0-4-amd64 & libfuse
>>>>> 2.9.3-15.
>>>>>
>>>>> I don't know the pattern of corruption, but according to the error
>>>>> message in Dovecot, it seems to expect data to read but reaches EOF.
>>>>>
>>>>> All seems fine using fuse_disable_pagecache (no more corruption, and
>>>>> performance increased: no more MDS slow requests on filelock
>>>>> requests).
>>>>
>>>>
>>>> I checked the ceph-fuse changes since Kraken and didn't find any clue.
>>>> It would be helpful if you could try a recent kernel version.
>>>>
>>>> Regards
>>>> Yan, Zheng
>>>
>>>
>>> The problem occurred this morning even with fuse_disable_pagecache=true.
>>>
>>> It seems to be a lock issue between the imap & lmtp processes.
>>>
>>> Dovecot uses fcntl as its locking method. Is there any change about it in
>>> Luminous? I switched to flock to see if the problem is still there...
>>>
>>
>> I don't remember any change there.
>>
>
> --
> ============
> Ing. Jan Pekař
> jan.pekar@xxxxxxxxx | +420603811737
> ----
> Imatic | Jagellonská 14 | Praha 3 | 130 00
> http://www.imatic.cz
> ============
>
--
_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
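
For reference, a minimal sketch of the kind of check Jan describes above: compare the md5sum of the original and the copy, and scan the copy for 4 MiB blocks that came back entirely zero-filled (4 MiB being the default CephFS object size). This script and its function names are illustrative only and are not something posted in the thread.

```python
#!/usr/bin/env python3
# Sketch: verify a file copied onto CephFS against its source and report
# any 4 MiB object-sized blocks in the copy that are all zeroes.
# Usage (hypothetical): python3 check_copy.py /src/bigfile /cephfs/bigfile
import hashlib
import sys

BLOCK = 4 * 1024 * 1024  # default CephFS/RADOS object size


def md5(path):
    """Return the hex md5 digest of a file, read in object-sized chunks."""
    h = hashlib.md5()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(BLOCK), b""):
            h.update(chunk)
    return h.hexdigest()


def zero_blocks(path):
    """Yield the indices of 4 MiB blocks that are entirely zero-filled."""
    with open(path, "rb") as f:
        index = 0
        for chunk in iter(lambda: f.read(BLOCK), b""):
            if chunk.count(0) == len(chunk):
                yield index
            index += 1


if __name__ == "__main__":
    original, copy = sys.argv[1], sys.argv[2]
    if md5(original) != md5(copy):
        print("checksum mismatch; all-zero blocks in copy:",
              list(zero_blocks(copy)))
    else:
        print("checksums match")
```

A zero-filled block in the copy where the source has data matches the silent-corruption pattern reported above; a checksum mismatch with no zeroed blocks would point at a different failure mode.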