On Sat, Mar 3, 2018 at 6:17 PM, Jan Pekař - Imatic <jan.pekar@xxxxxxxxx> wrote: > On 3.3.2018 11:12, Yan, Zheng wrote: >> >> On Tue, Feb 27, 2018 at 2:29 PM, Jan Pekař - Imatic <jan.pekar@xxxxxxxxx> >> wrote: >>> >>> I think I hit the same issue. >>> I have corrupted data on cephfs and I don't remember the same issue >>> before >>> Luminous (i did the same tests before). >>> >>> It is on my test 1 node cluster with lower memory then recommended (so >>> server is swapping) but it shouldn't lose data (it never did before). >>> So slow requests may appear in the log like Florent B mentioned. >>> >>> My test is to take some bigger files (few GB) and copy it to cephfs or >>> from >>> cephfs to cephfs and stress the cluster so data copying stall for a >>> while. >>> It will resume in few seconds/minutes and everything looks ok (no error >>> on >>> copying). But copied file may be corrupted silently. >>> >>> I checked wiles with MD5SUM and compared some corrupted files in detail. >>> There were missing some 4MB blocks of data (cephfs object size) - >>> corrupted >>> file had that block of data filled with zeroes. >>> >>> My idea is, that there happen something wrong when cluster is under >>> pressure >>> and client want to save the block. Client gets OK and continues with >>> another >>> block so data is lost and corrupted block is filled with zeros. >>> >> >> was your cluster near full when the issue happens ? > > > No, I changed limits to higher near full / backfill full values, so it was > healthy. > Could you please run ceph-fuse with debug_ms=1. reproduce this issue, identify lost data and send ceph-fuse log to us. Regards Yan, Zheng > >> >> Regards >> Yan, Zheng >> >>> I tried kernel client 4.x and ceph-fuse client with same result. >>> >>> I'm using erasure for cephfs data pool, cache tier and my storage is >>> bluestore and filestore mixed. >>> >>> How can I help to debug or what should I do to help to find the problem? >>> >>> With regards >>> Jan Pekar >>> >>> >>> On 14.12.2017 15:41, Yan, Zheng wrote: >>>> >>>> >>>> On Thu, Dec 14, 2017 at 8:52 PM, Florent B <florent@xxxxxxxxxxx> wrote: >>>>> >>>>> >>>>> On 14/12/2017 03:38, Yan, Zheng wrote: >>>>>> >>>>>> >>>>>> On Thu, Dec 14, 2017 at 12:49 AM, Florent B <florent@xxxxxxxxxxx> >>>>>> wrote: >>>>>>> >>>>>>> >>>>>>> >>>>>>> Systems are on Debian Jessie : kernel 3.16.0-4-amd64 & libfuse >>>>>>> 2.9.3-15. >>>>>>> >>>>>>> I don't know pattern of corruption, but according to error message in >>>>>>> Dovecot, it seems to expect data to read but reach EOF. >>>>>>> >>>>>>> All seems fine using fuse_disable_pagecache (no more corruption, and >>>>>>> performance increased : no more MDS slow requests on filelock >>>>>>> requests). >>>>>> >>>>>> >>>>>> >>>>>> I checked ceph-fuse changes since kraken, didn't find any clue. I >>>>>> would be helpful if you can try recent version kernel. >>>>>> >>>>>> Regards >>>>>> Yan, Zheng >>>>> >>>>> >>>>> >>>>> Problem occurred this morning even with fuse_disable_pagecache=true. >>>>> >>>>> It seems to be a lock issue between imap & lmtp processes. >>>>> >>>>> Dovecot uses fcntl as locking method. Is there any change about it in >>>>> Luminous ? I switched to flock to see if problem is still there... >>>>> >>>> >>>> I don't remenber there is any change. >>>> _______________________________________________ >>>> ceph-users mailing list >>>> ceph-users@xxxxxxxxxxxxxx >>>> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com >>>> >>> >>> -- >>> ============ >>> Ing. Jan Pekař >>> jan.pekar@xxxxxxxxx | +420603811737 >>> ---- >>> Imatic | Jagellonská 14 | Praha 3 | 130 00 >>> http://www.imatic.cz >>> ============ >>> -- _______________________________________________ ceph-users mailing list ceph-users@xxxxxxxxxxxxxx http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com