On Wed, Oct 21, 2015 at 5:33 PM, John Spray <jspray@xxxxxxxxxx> wrote:
> On Wed, Oct 21, 2015 at 10:33 PM, John Spray <jspray@xxxxxxxxxx> wrote:
>>> John, I know you've got
>>> https://github.com/ceph/ceph-qa-suite/pull/647. I think that's
>>> supposed to be for this, but I'm not sure if you spotted any issues
>>> with it or if we need to do some more diagnosing?
>>
>> That test path is just verifying that we do handle dirs without dying
>> in at least one case -- it passes with the existing ceph code, so it's
>> not reproducing this issue.
>
> Clicked send too soon, I was about to add...
>
> Milosz mentioned that they don't have the data from the system in the
> broken state, so I don't have any bright ideas about learning more
> about what went wrong here, unfortunately.

Sorry about that; I wasn't thinking at the time and just wanted to get
this up and going as quickly as possible :( If this happens again I'll
be more careful to preserve more evidence.

I think multi-fs support in the same RADOS namespace would actually have
helped here, since it would make it easier to create a newfs and leave
the other one around for investigation.

But it makes me wonder whether the broken dir scenario could be
replicated by hand using rados calls (a rough sketch follows below the
message). There's a pretty generic ticket for "don't die on dir errors",
but I imagine the code can be audited and steps to cause a synthetic
error can be worked out.

--
Milosz Tanski
CTO
16 East 34th Street, 15th floor
New York, NY 10016

p: 646-253-9055
e: milosz@xxxxxxxxx
--
To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at http://vger.kernel.org/majordomo-info.html
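
A rough sketch of what "replicating by hand using rados calls" might look
like, using the librados Python bindings: the idea is to hand-corrupt (or
delete) a CephFS dirfrag object in the metadata pool, so the MDS hits a
synthetic error the next time it loads that directory. This is only a
sketch under assumptions, not a tested reproducer: the pool name
('metadata'), the dirfrag object name ('10000000000.00000000'), and the
dentry omap key ('somefile_head') are placeholders to be replaced with
values from the cluster under test.

    # Hypothetical sketch: corrupt one dentry in a CephFS dirfrag object
    # so the MDS sees a synthetic "bad dir" on the next directory fetch.
    import rados

    cluster = rados.Rados(conffile='/etc/ceph/ceph.conf')
    cluster.connect()
    try:
        # CephFS metadata pool -- 'metadata' is an assumed name.
        ioctx = cluster.open_ioctx('metadata')
        try:
            # Each dirfrag is a single RADOS object whose dentries live in
            # its omap. Overwrite one dentry's value with garbage so the
            # MDS fails to decode it when loading the directory.
            with rados.WriteOpCtx() as op:
                ioctx.set_omap(op, ('somefile_head',), (b'\xde\xad\xbe\xef',))
                ioctx.operate_write_op(op, '10000000000.00000000')

            # Alternatively, delete the dirfrag object outright:
            # ioctx.remove_object('10000000000.00000000')
        finally:
            ioctx.close()
    finally:
        cluster.shutdown()

Either variant should give a directory that the MDS cannot load cleanly,
which is the kind of on-disk state the "don't die on dir errors" handling
needs to survive.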