On Tue, Feb 27, 2018 at 2:29 PM, Jan Pekař - Imatic <jan.pekar@xxxxxxxxx> wrote:
> I think I hit the same issue.
> I have corrupted data on CephFS, and I don't remember seeing this issue before
> Luminous (I ran the same tests before).
>
> It is on my single-node test cluster with less memory than recommended (so the
> server is swapping), but it shouldn't lose data (it never did before).
> So slow requests may appear in the log, as Florent B mentioned.
>
> My test is to take some bigger files (a few GB) and copy them to CephFS, or from
> CephFS to CephFS, and stress the cluster so that the copy stalls for a while.
> It resumes after a few seconds/minutes and everything looks OK (no error while
> copying), but the copied file may be silently corrupted.
>
> I checked the files with md5sum and compared some corrupted files in detail.
> Some 4MB blocks of data (the CephFS object size) were missing - the corrupted
> file had those blocks filled with zeroes.
>
> My idea is that something goes wrong when the cluster is under pressure and the
> client wants to write a block. The client gets an OK and continues with the next
> block, so the data is lost and the corrupted block is filled with zeroes.
>

Was your cluster near full when the issue happened?

Regards
Yan, Zheng

> I tried the kernel client (4.x) and the ceph-fuse client with the same result.
>
> I'm using erasure coding for the CephFS data pool, a cache tier, and my storage
> is a mix of BlueStore and FileStore.
>
> How can I help to debug, or what should I do to help find the problem?
>
> With regards
> Jan Pekar
>
>
> On 14.12.2017 15:41, Yan, Zheng wrote:
>>
>> On Thu, Dec 14, 2017 at 8:52 PM, Florent B <florent@xxxxxxxxxxx> wrote:
>>>
>>> On 14/12/2017 03:38, Yan, Zheng wrote:
>>>>
>>>> On Thu, Dec 14, 2017 at 12:49 AM, Florent B <florent@xxxxxxxxxxx> wrote:
>>>>>
>>>>>
>>>>> Systems are on Debian Jessie: kernel 3.16.0-4-amd64 & libfuse
>>>>> 2.9.3-15.
>>>>>
>>>>> I don't know the pattern of corruption, but according to the error
>>>>> message in Dovecot, it seems to expect data to read but reaches EOF.
>>>>>
>>>>> All seems fine using fuse_disable_pagecache (no more corruption, and
>>>>> performance increased: no more MDS slow requests on filelock
>>>>> requests).
>>>>
>>>>
>>>> I checked the ceph-fuse changes since Kraken and didn't find any clue.
>>>> It would be helpful if you could try a recent kernel version.
>>>>
>>>> Regards
>>>> Yan, Zheng
>>>
>>>
>>> The problem occurred this morning even with fuse_disable_pagecache=true.
>>>
>>> It seems to be a lock issue between the imap & lmtp processes.
>>>
>>> Dovecot uses fcntl as its locking method. Is there any change about it in
>>> Luminous? I switched to flock to see if the problem is still there...
>>>
>>
>> I don't remember any change there.
>>
>
> --
> ============
> Ing. Jan Pekař
> jan.pekar@xxxxxxxxx | +420603811737
> ----
> Imatic | Jagellonská 14 | Praha 3 | 130 00
> http://www.imatic.cz
> ============
>
--
_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
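
For reference, a minimal sketch of the kind of check Jan describes above: compare the md5sum of the original and the copy, and scan the copy for 4 MiB blocks that came back entirely zero-filled (4 MiB being the default CephFS object size). This script and its function names are illustrative only and are not something posted in the thread.

```python
#!/usr/bin/env python3
# Sketch: verify a file copied onto CephFS against its source and report
# any 4 MiB object-sized blocks in the copy that are all zeroes.
# Usage (hypothetical): python3 check_copy.py /src/bigfile /cephfs/bigfile
import hashlib
import sys

BLOCK = 4 * 1024 * 1024  # default CephFS/RADOS object size


def md5(path):
    """Return the hex md5 digest of a file, read in object-sized chunks."""
    h = hashlib.md5()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(BLOCK), b""):
            h.update(chunk)
    return h.hexdigest()


def zero_blocks(path):
    """Yield the indices of 4 MiB blocks that are entirely zero-filled."""
    with open(path, "rb") as f:
        index = 0
        for chunk in iter(lambda: f.read(BLOCK), b""):
            if chunk.count(0) == len(chunk):
                yield index
            index += 1


if __name__ == "__main__":
    original, copy = sys.argv[1], sys.argv[2]
    if md5(original) != md5(copy):
        print("checksum mismatch; all-zero blocks in copy:",
              list(zero_blocks(copy)))
    else:
        print("checksums match")
```

A zero-filled block in the copy where the source has data matches the silent-corruption pattern reported above; a checksum mismatch with no zeroed blocks would point at a different failure mode.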