Re: Corrupted files on CephFS since Luminous upgrade



On Sat, Mar 3, 2018 at 6:17 PM, Jan Pekař - Imatic <jan.pekar@xxxxxxxxx> wrote:
> On 3.3.2018 11:12, Yan, Zheng wrote:
>> On Tue, Feb 27, 2018 at 2:29 PM, Jan Pekař - Imatic <jan.pekar@xxxxxxxxx>
>> wrote:
>>> I think I hit the same issue.
>>> I have corrupted data on CephFS, and I don't remember seeing this
>>> before Luminous (I ran the same tests before).
>>> It is on my single-node test cluster with less memory than recommended
>>> (so the server is swapping), but it shouldn't lose data (it never did
>>> before). So slow requests may appear in the log, as Florent B mentioned.
>>> My test is to take some bigger files (a few GB), copy them to CephFS or
>>> from CephFS to CephFS, and stress the cluster so that the copy stalls
>>> for a while. It resumes after a few seconds or minutes and everything
>>> looks OK (no error while copying), but the copied file may be silently
>>> corrupted.
>>> I checked the files with md5sum and compared some corrupted files in
>>> detail. Some 4 MB blocks of data (the CephFS object size) were missing:
>>> in the corrupted file, those blocks were filled with zeros.
>>> My guess is that something goes wrong when the cluster is under
>>> pressure and the client wants to write a block. The client gets an OK
>>> and continues with the next block, so the data is lost and the
>>> corrupted block is filled with zeros.
>> Was your cluster near full when the issue happened?
> No, I raised the near full / backfill full limits, so the cluster was
> healthy.

Could you please run ceph-fuse with debug_ms=1, reproduce this issue,
identify the lost data, and send the ceph-fuse log to us?
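
One possible way to identify the lost data is to compare the original file
with the corrupted copy block by block and note which blocks are zero-filled
only in the copy. The following is a minimal sketch, not a tested tool: it
assumes the 4 MB object size mentioned earlier in the thread and the default
file layout without custom striping, and the function name
find_zeroed_blocks is made up for illustration.

```python
OBJECT_SIZE = 4 * 1024 * 1024  # assumed CephFS object size (4 MB, as reported in the thread)


def find_zeroed_blocks(src_path, copy_path, block_size=OBJECT_SIZE):
    """Return byte offsets of blocks that are all zeros in the copy
    but differ from the corresponding block in the source file."""
    zero_block = bytes(block_size)
    bad_offsets = []
    with open(src_path, "rb") as src, open(copy_path, "rb") as dst:
        offset = 0
        while True:
            a = src.read(block_size)
            b = dst.read(block_size)
            if not a and not b:
                break  # both files exhausted
            # A lost block shows up as a non-empty, all-zero block in the
            # copy where the source holds different data.
            if b and a != b and b == zero_block[:len(b)]:
                bad_offsets.append(offset)
            offset += block_size
    return bad_offsets
```

Calling find_zeroed_blocks("/path/to/original", "/mnt/cephfs/copy") (both
paths are placeholders) returns a list of offsets. Dividing an offset by the
object size gives the block index, which should help when looking for the
matching write in the ceph-fuse log.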

Yan, Zheng

>> Regards
>> Yan, Zheng
>>> I tried the 4.x kernel client and the ceph-fuse client, with the same
>>> result. I'm using erasure coding for the CephFS data pool, with a cache
>>> tier, and my storage is a mix of BlueStore and FileStore.
>>> How can I help debug this, or what should I do to help find the problem?
>>> With regards
>>> Jan Pekar
>>> On 14.12.2017 15:41, Yan, Zheng wrote:
>>>> On Thu, Dec 14, 2017 at 8:52 PM, Florent B <florent@xxxxxxxxxxx> wrote:
>>>>> On 14/12/2017 03:38, Yan, Zheng wrote:
>>>>>> On Thu, Dec 14, 2017 at 12:49 AM, Florent B <florent@xxxxxxxxxxx>
>>>>>> wrote:
>>>>>>> Systems are on Debian Jessie : kernel 3.16.0-4-amd64 & libfuse
>>>>>>> 2.9.3-15.
>>>>>>> I don't know the pattern of the corruption, but according to the
>>>>>>> error message in Dovecot, it expects more data to read but reaches
>>>>>>> EOF. All seems fine with fuse_disable_pagecache (no more
>>>>>>> corruption, and performance improved: no more MDS slow requests on
>>>>>>> filelock requests).
>>>>>> I checked the ceph-fuse changes since Kraken and didn't find any
>>>>>> clue. It would be helpful if you could try a recent kernel version.
>>>>>> Regards
>>>>>> Yan, Zheng
>>>>> The problem occurred this morning even with
>>>>> fuse_disable_pagecache=true.
>>>>> It seems to be a lock issue between the imap and lmtp processes.
>>>>> Dovecot uses fcntl as its locking method. Did anything change there
>>>>> in Luminous? I switched to flock to see if the problem is still there...
>>>> I don't remember any change there.
>>>> _______________________________________________
>>>> ceph-users mailing list
>>>> ceph-users@xxxxxxxxxxxxxx
>>> --
>>> ============
>>> Ing. Jan Pekař
>>> jan.pekar@xxxxxxxxx | +420603811737
>>> ----
>>> Imatic | Jagellonská 14 | Praha 3 | 130 00
>>> ============
>>> --
