Re: Potential OSD deadlock?

Gregory Farnum <gfarnum@xxxxxxxxxx> · Tue, 22 Sep 2015 07:31:45 -0700

On Tue, Sep 22, 2015 at 7:24 AM, Robert LeBlanc <robert@xxxxxxxxxxxxx> wrote:
> -----BEGIN PGP SIGNED MESSAGE-----
> Hash: SHA256
>
> Is there some way to tell in the logs that this is happening?

You can search the OSD logs for the (mangled) name _split_collection.

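A minimal sketch of that search, assuming plain-text OSD logs; the function name and the idea of scanning line by line are illustrative only, and a plain grep for the same string does the same job:

```python
from typing import Iterable, Iterator, Tuple

# The marker string comes from the advice above; everything else is illustrative.
SPLIT_MARKER = "_split_collection"


def find_split_lines(
    lines: Iterable[str], marker: str = SPLIT_MARKER
) -> Iterator[Tuple[int, str]]:
    """Yield (line_number, text) for every log line that mentions the marker."""
    for lineno, line in enumerate(lines, start=1):
        if marker in line:
            yield lineno, line.rstrip("\n")


def scan_log(path: str) -> None:
    """Print every matching line of one log file, prefixed with its location."""
    with open(path, errors="replace") as fh:
        for lineno, text in find_split_lines(fh):
            print(f"{path}:{lineno}: {text}")
```

Call scan_log() with the path of the OSD log you want to check; the timestamps on the matching lines can then be compared with the times the I/O stalled.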
> I'm not
> seeing much I/O, CPU usage during these times. Is there some way to
> prevent the splitting? Is there a negative side effect to doing so?

Bump up the split and merge thresholds. You can search the list for
this, it was discussed not too long ago.
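
For a sense of the numbers involved: as I understand the FileStore documentation, a PG subdirectory is split once it holds more than "filestore split multiple" * abs("filestore merge threshold") * 16 files. The sketch below only does that arithmetic. The option names and the defaults of 2 and 10 are recalled from the docs for releases of this era and are not stated in this thread, so verify them against your version before relying on them.

```python
def split_threshold(split_multiple: int = 2, merge_threshold: int = 10) -> int:
    """Files per subdirectory before FileStore splits it.

    Assumes the documented formula:
        filestore_split_multiple * abs(filestore_merge_threshold) * 16
    The defaults (2 and 10) are assumptions, not values from this thread.
    """
    return split_multiple * abs(merge_threshold) * 16


if __name__ == "__main__":
    # Assumed defaults versus a hypothetical bumped-up setting.
    print("default:", split_threshold())
    print("bumped: ", split_threshold(split_multiple=8, merge_threshold=40))
```

Raising either value pushes the split point further out, at the cost of larger directories on the underlying filesystem.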

> We've had I/O block for over 900 seconds and as soon as the sessions
> are aborted, they are reestablished and complete immediately.
>
> The fio test is just a seq write, starting it over (rewriting from the
> beginning) is still causing the issue. I suspect that it is not
> having to create new files and therefore split collections. This is on
> my test cluster with no other load.

Hmm, that does make splitting seem less likely, provided you're really
not creating new objects and you're actually running fio in such a way
that it's not allocating new FS blocks (this is probably hard to set up?).

>
> I'll be doing a lot of testing today. Which log options and depths
> would be the most helpful for tracking this issue down?

If you want to go log diving "debug osd = 20", "debug filestore = 20",
"debug ms = 1" are what the OSD guys like to see. That should spit out
everything you need to track exactly what each Op is doing.
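
A small sketch that renders those three settings as a ceph.conf fragment. Placing them under an [osd] section is my assumption (the settings themselves are the ones quoted above), and the helper name is hypothetical:

```python
import configparser
import io

# Debug levels exactly as quoted above.
DEBUG_LEVELS = {"debug osd": 20, "debug filestore": 20, "debug ms": 1}


def render_debug_section(levels: dict, section: str = "osd") -> str:
    """Return an INI fragment with the given debug levels under [section]."""
    cfg = configparser.ConfigParser()
    cfg[section] = {name: str(level) for name, level in levels.items()}
    buf = io.StringIO()
    cfg.write(buf)  # writes "key = value" lines, matching ceph.conf style
    return buf.getvalue()


if __name__ == "__main__":
    print(render_debug_section(DEBUG_LEVELS), end="")
```

Expect the logs to grow quickly at these levels, so turn them back down once the stall has been captured.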
-Greg
_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com