Hi Sage,
    Pull request is https://github.com/ceph/ceph/pull/3305.

Thanks!
Jianpeng Ma

> -----Original Message-----
> From: Sage Weil [mailto:sweil@xxxxxxxxxx]
> Sent: Wednesday, January 7, 2015 10:18 AM
> To: Ma, Jianpeng
> Cc: ceph@xxxxxxxxx; Vijayendra.Shamanna@xxxxxxxxxxx; ceph-devel@xxxxxxxxxxxxxxx
> Subject: RE: Ceph data consistency
>
> On Wed, 7 Jan 2015, Ma, Jianpeng wrote:
> > > ---------- Forwarded message ----------
> > > From: Paweł Sadowski <ceph@xxxxxxxxx>
> > > Date: 2014-12-30 21:40 GMT+08:00
> > > Subject: Re: Ceph data consistency
> > > To: Vijayendra Shamanna <Vijayendra.Shamanna@xxxxxxxxxxx>,
> > >     "ceph-devel@xxxxxxxxxxxxxxx" <ceph-devel@xxxxxxxxxxxxxxx>
> > >
> > > On 12/30/2014 01:40 PM, Vijayendra Shamanna wrote:
> > > > Hi,
> > > >
> > > > There is a sync thread (sync_entry in FileStore.cc) which triggers
> > > > periodically and executes sync_filesystem() to ensure that the data
> > > > is consistent. The journal entries are trimmed only after a
> > > > successful sync_filesystem() call.
> > >
> > > sync_filesystem() always returns zero, so the journal will be trimmed
> > > regardless. Executing sync()/syncfs() with dirty data in the disk
> > > buffers will result in data loss ("lost page write due to I/O error").
> > >
> > Hi Sage,
> >
> > From the git log, I see that sync_filesystem() originally returned the
> > result of syncfs(). That changed in commit 808c644248e486f44:
> >
> >     Improve use of syncfs.
> >     Test syncfs return value and fallback to btrfs sync and then sync.
> >
> > The author's hope was that if syncfs() hit an error, falling back to
> > sync() could recover from it. Because sync() does not return a result,
> > the function now only ever returns zero. But which errors can actually
> > be handled this way? As far as I know, none. I suggest it directly
> > return the result of syncfs().
>
> Yeah, that sounds right!
>
> sage
>
> >
> > Jianpeng Ma
> > Thanks!
> >
> > > I was doing some experiments simulating disk errors using the Device
> > > Mapper "error" target. In this setup the OSD kept writing to the
> > > broken disk without crashing.
> > > Every 5 seconds (filestore_max_sync_interval) the kernel logged that
> > > some data was discarded due to an I/O error.
> > >
> > > >
> > > > Thanks
> > > > Viju
> > > >> -----Original Message-----
> > > >> From: ceph-devel-owner@xxxxxxxxxxxxxxx
> > > >> [mailto:ceph-devel-owner@xxxxxxxxxxxxxxx] On Behalf Of Pawel Sadowski
> > > >> Sent: Tuesday, December 30, 2014 1:52 PM
> > > >> To: ceph-devel@xxxxxxxxxxxxxxx
> > > >> Subject: Ceph data consistency
> > > >>
> > > >> Hi,
> > > >>
> > > >> On our Ceph cluster we have some inconsistent PGs from time to time
> > > >> (after deep-scrub). We have some issues with disks/SATA cables/the
> > > >> LSI controller causing I/O errors from time to time (but that's not
> > > >> the point in this case).
> > > >>
> > > >> When an I/O error occurs on the OSD journal partition everything
> > > >> works as it should -> the OSD crashes, and that's OK - Ceph will
> > > >> handle that.
> > > >>
> > > >> But when an I/O error occurs on the OSD data partition during a
> > > >> journal flush, the OSD continues to work. After calling *writev*
> > > >> (in buffer::list::write_fd) the OSD does check the return code from
> > > >> that call but does NOT verify that the write actually reached the
> > > >> disk (the data is still only in memory and there is no fsync). That
> > > >> way the OSD thinks the data has been stored on disk, but it may be
> > > >> discarded (during sync the dirty page will be reclaimed and you'll
> > > >> see "lost page write due to I/O error" in dmesg).
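[Inline note, to illustrate the failure mode described just above: a
successful writev() only means the data reached the page cache, so a later
writeback failure never reaches the caller unless the caller also syncs the
same descriptor and checks the result. The helper below is only a minimal
sketch of that idea; it is not Ceph's buffer::list::write_fd, the name
write_and_sync is made up, and partial writes are ignored for brevity.]

#include <cerrno>
#include <sys/uio.h>
#include <unistd.h>

// Hypothetical helper (not Ceph's buffer::list::write_fd): write a buffer
// and then force it to stable storage so that a writeback I/O error comes
// back as an error code instead of only showing up in dmesg later.
static int write_and_sync(int fd, const void *buf, size_t len)
{
  struct iovec iov = { const_cast<void *>(buf), len };
  ssize_t r = ::writev(fd, &iov, 1);   // partial writes ignored for brevity
  if (r < 0)
    return -errno;                     // the write itself failed
  // writev() only dirtied pages in the page cache; without a sync the
  // kernel may drop them on error ("lost page write due to I/O error").
  if (::fdatasync(fd) < 0)
    return -errno;                     // writeback failure surfaces here (e.g. -EIO)
  return 0;
}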
> > > >>
> > > >> Since there is no checksumming of data, I just wanted to make sure
> > > >> that this is by design. Maybe there is a way to tell the OSD to
> > > >> call fsync after the write and have the data consistent?
> > >
> > > --
> > > PS
--
To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at http://vger.kernel.org/majordomo-info.html
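[Appended note, for reference: a minimal sketch of the idea agreed on in the
thread above, namely having sync_filesystem() report the result of syncfs()
instead of falling back to sync(), whose errors cannot be observed, and then
returning zero unconditionally. This is only an illustration under that
assumption, not the actual code in FileStore.cc or in the pull request; the
function name and signature here are simplified.]

#include <cerrno>
#include <unistd.h>

// Simplified sketch (not the real FileStore code): propagate the syncfs()
// result so that a failed sync is visible to the caller and the journal is
// not trimmed as if the data were safely on disk.
static int sync_filesystem_sketch(int fd)
{
  if (::syncfs(fd) == 0)     // Linux-specific; needs _GNU_SOURCE (default with g++)
    return 0;
  // Falling back to sync() cannot help here: sync() has no return value,
  // so any error would be swallowed and zero returned anyway.
  return -errno;
}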