On Tue, 18 Jun 2013, Guido Winkelmann wrote:
> On Thursday, 13 June 2013, 01:58:08, Josh Durgin wrote:
> > On 06/11/2013 11:59 AM, Guido Winkelmann wrote:
> > > - Write the data with a very large number of concurrent threads (1000+)
> >
> > Are you using rbd caching? If so, turning it off may help reproduce
> > faster if it's related to the number of individual requests (since the
> > cache may merge adjacent or overlapping requests).
>
> There shouldn't be any RBD caching involved. I'm using libvirt to start my
> VMs, and when specifying the rbd volumes in the domain definition, I use
> the cache="none" attribute.
>
> > > - In the middle of writing, take down one OSD. It seems to matter
> > > which OSD that is; so far I could only reproduce the bug by taking
> > > down the third of three OSDs.
> >
> > You're killing the OSD process, and not rebooting the host?
>
> I shut down the user-space Ceph processes with /etc/init.d/ceph stop osd.
>
> > Which filesystem are the OSDs using?
>
> BTRFS

Which kernel version? There was a recent bug (fixed in 3.9 or 3.8) that
resolved a data corruption issue on file extension.

> > > My setup is Ceph 0.61.2 on three machines, each running one OSD and
> > > one MON. The last one is also running an MDS. The ceph.conf file is
> > > attached.
> > >
> > > I have just updated to 0.61.3 and plan on rerunning the test on that.
> > > The platform is Fedora 18 in all cases with kernel 3.9.4-200.fc18.x86_64.
> >
> > If it's reproducible, it'd be great to get logs from all OSDs with
> > debug osd = 20, debug ms = 1, and debug filestore = 20.
>
> I've put those settings into the config file now, and even though I have
> been trying repeatedly for the last few days, now I cannot reproduce the
> bug anymore :(
> Maybe it was a problem with my test setup, or maybe it was caused by some
> minor thing that was fixed in 0.61.3. Worst case, it was one of those bugs
> that disappear as soon as you enable debugging.
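[For reference, the cache="none" setting mentioned above goes on the driver element of the disk definition in the libvirt domain XML. A sketch of such an rbd disk stanza; the pool/image name and monitor host are placeholders, not from Guido's actual configuration:

```xml
<disk type='network' device='disk'>
  <!-- cache='none' disables the qemu/librbd writeback cache -->
  <driver name='qemu' type='raw' cache='none'/>
  <source protocol='rbd' name='pool/image'>
    <host name='mon-host' port='6789'/>
  </source>
  <target dev='vda' bus='virtio'/>
</disk>
```
]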
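[The debug settings Josh asked for can be placed in ceph.conf, e.g. in the [osd] section so they apply to all OSDs on the host. A minimal fragment, using the exact values from the thread:

```ini
[osd]
    debug osd = 20
    debug ms = 1
    debug filestore = 20
```

As noted below, these levels produce a large volume of logs, so /var/log can fill up quickly while they are enabled.]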
> For now, I am going to stop trying to reproduce this, working under the
> assumption that it was either caused by something in my test setup (there
> was a bug in there as well, specifically a failure to check whether the
> writes succeeded...) or fixed in 0.61.3. I will also disable those debug
> settings, because they make my /var/log partitions fill up extremely fast,
> before logrotate can do anything about it.
>
> I will keep running those tests in the background, though, to see if any
> problems decide to pop up again.

Thanks!
sage

> Guido
> _______________________________________________
> ceph-users mailing list
> ceph-users@xxxxxxxxxxxxxx
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
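[As a footnote on the test-harness bug Guido mentions (writes whose status was never checked): a minimal, hypothetical sketch of a concurrent write loop that does verify each write. All names, sizes, and thread counts here are illustrative, not his actual harness:

```python
import os
import concurrent.futures


def write_chunk(path, offset, data):
    """Write one chunk at the given offset and report success.

    An OSError or a short write both count as failure, so silent
    write errors cannot slip through.
    """
    fd = os.open(path, os.O_WRONLY)
    try:
        written = os.pwrite(fd, data, offset)
        return written == len(data)  # a short write is a failure, too
    except OSError:
        return False
    finally:
        os.close(fd)


def run_test(path, chunk_size=4096, num_chunks=64, workers=32):
    """Write num_chunks chunks concurrently; return the failure count."""
    # Pre-size the target so every concurrent pwrite() offset is valid.
    with open(path, "wb") as f:
        f.truncate(chunk_size * num_chunks)
    data = os.urandom(chunk_size)
    with concurrent.futures.ThreadPoolExecutor(max_workers=workers) as pool:
        results = pool.map(
            lambda i: write_chunk(path, i * chunk_size, data),
            range(num_chunks),
        )
    return sum(1 for ok in results if not ok)
```

A harness like this would have surfaced the failed writes immediately instead of silently reporting a clean run.]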