On Wed, 7 Jan 2015, Ma, Jianpeng wrote:
> > ---------- Forwarded message ----------
> > From: Paweł Sadowski <ceph@xxxxxxxxx>
> > Date: 2014-12-30 21:40 GMT+08:00
> > Subject: Re: Ceph data consistency
> > To: Vijayendra Shamanna <Vijayendra.Shamanna@xxxxxxxxxxx>,
> > "ceph-devel@xxxxxxxxxxxxxxx" <ceph-devel@xxxxxxxxxxxxxxx>
> >
> > On 12/30/2014 01:40 PM, Vijayendra Shamanna wrote:
> > > Hi,
> > >
> > > There is a sync thread (sync_entry in FileStore.cc) which triggers
> > > periodically and executes sync_filesystem() to ensure that the data is
> > > consistent. The journal entries are trimmed only after a successful
> > > sync_filesystem() call.
> >
> > sync_filesystem() always returns zero, so the journal will be trimmed
> > regardless. Executing sync()/syncfs() with dirty data in the disk buffers
> > will result in data loss ("lost page write due to I/O error").
>
> Hi Sage,
>
> From the git log, I see that sync_filesystem() originally returned the
> result of syncfs(). But commit 808c644248e486f44 ("Improve use of syncfs.
> Test syncfs return value and fallback to btrfs sync and then sync.")
> changed that: the author hoped that if syncfs() hit an error, sync() might
> resolve it. Because sync() doesn't return a result, the function now always
> returns zero. But which errors can actually be handled this way? AFAIK,
> none. I suggest it directly return the result of syncfs().

Yeah, that sounds right!

sage

> Jianpeng Ma
> Thanks!
>
> > I was doing some experiments simulating disk errors using the Device
> > Mapper "error" target. In this setup the OSD kept writing to the broken
> > disk without crashing. Every 5 seconds (filestore_max_sync_interval) the
> > kernel logged that some data were discarded due to an IO error.
> >
> > > Thanks
> > > Viju
> > >> -----Original Message-----
> > >> From: ceph-devel-owner@xxxxxxxxxxxxxxx
> > >> [mailto:ceph-devel-owner@xxxxxxxxxxxxxxx] On Behalf Of Pawel Sadowski
> > >> Sent: Tuesday, December 30, 2014 1:52 PM
> > >> To: ceph-devel@xxxxxxxxxxxxxxx
> > >> Subject: Ceph data consistency
> > >>
> > >> Hi,
> > >>
> > >> On our Ceph cluster we see inconsistent PGs from time to time (after
> > >> deep-scrub). We have issues with disks/SATA cables/the LSI controller
> > >> causing occasional IO errors (but that's not the point in this case).
> > >>
> > >> When an IO error occurs on the OSD journal partition, everything works
> > >> as it should: the OSD crashes, and that's OK - Ceph will handle it.
> > >>
> > >> But when an IO error occurs on the OSD data partition during journal
> > >> flush, the OSD continues to work. After calling *writev* (in
> > >> buffer::list::write_fd) the OSD checks the return code from this call
> > >> but does NOT verify that the write actually made it to disk (the data
> > >> is still only in memory and there is no fsync). So the OSD thinks the
> > >> data has been stored on disk, but it may be discarded (during sync the
> > >> dirty page will be reclaimed and you'll see "lost page write due to
> > >> I/O error" in dmesg).
> > >>
> > >> Since there is no checksumming of data, I just wanted to make sure
> > >> that this is by design. Maybe there is a way to tell the OSD to call
> > >> fsync after the write and keep the data consistent?
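
For reference, here is a minimal sketch of the failure mode described above.
The helper and its name are hypothetical (this is not the actual
buffer::list::write_fd): a successful writev() only means the data reached
the page cache, so without an fsync()/fdatasync() the OSD never observes a
later writeback error.

#include <sys/uio.h>   // writev, struct iovec
#include <unistd.h>    // fdatasync
#include <cerrno>
#include <vector>

// Hypothetical helper: write an iovec list and make it durable. Ceph's
// buffer::list::write_fd checks writev()'s return code in a similar way,
// but performs no sync afterwards.
int write_iov_durably(int fd, std::vector<iovec> iov) {
  size_t i = 0;
  while (i < iov.size()) {
    ssize_t r = ::writev(fd, &iov[i], static_cast<int>(iov.size() - i));
    if (r < 0) {
      if (errno == EINTR)
        continue;
      return -errno;              // syscall-level error: the caller sees it
    }
    // Advance past the r bytes the kernel accepted into the page cache.
    size_t n = static_cast<size_t>(r);
    while (i < iov.size() && n >= iov[i].iov_len) {
      n -= iov[i].iov_len;
      ++i;
    }
    if (i < iov.size()) {
      iov[i].iov_base = static_cast<char*>(iov[i].iov_base) + n;
      iov[i].iov_len -= n;
    }
  }
  // A writev() that "succeeded" has only queued dirty pages. If writeback
  // later fails ("lost page write due to I/O error" in dmesg), the error is
  // only reported through a sync; drop the call below and it is lost.
  if (::fdatasync(fd) < 0)
    return -errno;                // surfaces e.g. EIO from failed writeback
  return 0;
}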
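
And to make the syncfs() discussion above concrete, a sketch of the current
versus proposed behavior, assuming the fallback chain works the way commit
808c644248e486f44 describes. This is an illustration only, not the actual
FileStore code:

#include <unistd.h>   // syncfs (glibc; g++ defines _GNU_SOURCE), sync
#include <cerrno>

// Roughly what sync_filesystem() does today: any syncfs() error is
// swallowed by the void sync() fallback, so the caller always sees 0 and
// the journal is trimmed regardless.
int sync_filesystem_current(int fd) {
  if (::syncfs(fd) == 0)
    return 0;
  // (the real code also tries a btrfs-specific sync in between; omitted)
  ::sync();            // returns void: no way to report failure
  return 0;
}

// The proposed change: propagate syncfs()'s result so that a failed sync
// keeps the journal from being trimmed.
int sync_filesystem_proposed(int fd) {
  if (::syncfs(fd) < 0)
    return -errno;     // e.g. -EIO; the caller can refuse to trim the journal
  return 0;
}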
> >
> > --
> > PS