On Tue, 18 Jun 2013, Guido Winkelmann wrote:
> On Thursday, 13 June 2013, 01:58:08, Josh Durgin wrote:
> > On 06/11/2013 11:59 AM, Guido Winkelmann wrote:
> > > - Write the data with a very large number of concurrent threads (1000+)
> >
> > Are you using rbd caching? If so, turning it off may help reproduce
> > faster if it's related to the number of individual requests (since the
> > cache may merge adjacent or overlapping requests).
>
> There shouldn't be any RBD caching involved. I'm using libvirt to start my
> VMs, and when specifying the rbd volumes in the domain definition, I use
> the cache="none" attribute.
>
> > > - In the middle of writing, take down one OSD. It seems to matter
> > > which OSD that is; so far I could only reproduce the bug by taking
> > > down the third of three OSDs.
> >
> > You're killing the OSD process, and not rebooting the host?
>
> I shut down the user-space Ceph processes with /etc/init.d/ceph stop osd.
>
> > Which filesystem are the OSDs using?
>
> BTRFS

Which kernel version? There was a recent bug (fixed in 3.9 or 3.8) that
resolved a data corruption issue on file extension.

> > > My setup is Ceph 0.61.2 on three machines, each running one OSD and
> > > one MON. The last one is also running an MDS. The ceph.conf file is
> > > attached.
> > >
> > > I have just updated to 0.61.3 and plan on rerunning the test on that.
> > > The platform is Fedora 18 in all cases with kernel 3.9.4-200.fc18.x86_64.
> >
> > If it's reproducible, it'd be great to get logs from all OSDs with
> > debug osd = 20, debug ms = 1, and debug filestore = 20.
>
> I've put those settings into the config file now, and even though I have
> been trying repeatedly for the last few days, now I cannot reproduce the
> bug anymore :(
> Maybe it was a problem with my test setup, or maybe it was caused by some
> minor thing that was fixed in 0.61.3. Worst case, it was one of those bugs
> that disappear as soon as you enable debugging.
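[For reference, the cache="none" setting mentioned above goes on the driver element of the disk definition in the libvirt domain XML. A sketch of such an rbd disk stanza; the pool/image name and monitor host are placeholders, not from Guido's actual configuration:

```xml
<disk type='network' device='disk'>
  <!-- cache='none' disables the qemu/librbd writeback cache -->
  <driver name='qemu' type='raw' cache='none'/>
  <source protocol='rbd' name='pool/image'>
    <host name='mon-host' port='6789'/>
  </source>
  <target dev='vda' bus='virtio'/>
</disk>
```
]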
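[The debug settings Josh asked for can be placed in ceph.conf, e.g. in the [osd] section so they apply to all OSDs on the host. A minimal fragment, using the exact values from the thread:

```ini
[osd]
    debug osd = 20
    debug ms = 1
    debug filestore = 20
```

As noted below, these levels produce a large volume of logs, so /var/log can fill up quickly while they are enabled.]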
> For now, I am going to stop trying to reproduce this, working under the
> assumption that it was either caused by something in my test setup (there
> was a bug in there as well, specifically a failure to check whether the
> writes succeeded...) or fixed in 0.61.3. I will also disable those debug
> settings, because they make my /var/log partitions fill up extremely fast,
> before logrotate can do anything about it.
>
> I will keep running those tests in the background, though, to see if any
> problems decide to pop up again.

Thanks!
sage

> Guido
> _______________________________________________
> ceph-users mailing list
> ceph-users@xxxxxxxxxxxxxx
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
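[As a footnote on the test-harness bug Guido mentions (writes whose status was never checked): a minimal, hypothetical sketch of a concurrent write loop that does verify each write. All names, sizes, and thread counts here are illustrative, not his actual harness:

```python
import os
import concurrent.futures


def write_chunk(path, offset, data):
    """Write one chunk at the given offset and report success.

    An OSError or a short write both count as failure, so silent
    write errors cannot slip through.
    """
    fd = os.open(path, os.O_WRONLY)
    try:
        written = os.pwrite(fd, data, offset)
        return written == len(data)  # a short write is a failure, too
    except OSError:
        return False
    finally:
        os.close(fd)


def run_test(path, chunk_size=4096, num_chunks=64, workers=32):
    """Write num_chunks chunks concurrently; return the failure count."""
    # Pre-size the target so every concurrent pwrite() offset is valid.
    with open(path, "wb") as f:
        f.truncate(chunk_size * num_chunks)
    data = os.urandom(chunk_size)
    with concurrent.futures.ThreadPoolExecutor(max_workers=workers) as pool:
        results = pool.map(
            lambda i: write_chunk(path, i * chunk_size, data),
            range(num_chunks),
        )
    return sum(1 for ok in results if not ok)
```

A harness like this would have surfaced the failed writes immediately instead of silently reporting a clean run.]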