Re: cosd multi-second stalls cause "wrongly marked me down"

Hi Sage,

Sage Weil wrote:
On Fri, 11 Mar 2011, Jim Schutt wrote:
On Fri, 2011-03-11 at 11:37 -0700, Sage Weil wrote:
On Fri, 11 Mar 2011, Jim Schutt wrote:
So none of those were osd_ping messages.

But, I still had lots of delayed acks.  Here's a couple more examples:



Hmm!  That does seem to point at the allocator, doesn't it.

Other threads are doing work during this long interval? Including freeing memory, presumably, since basically everything uses the heap one way or another. If it's the allocator, it's somehow affecting one thread only, which is pretty crazy.

Is it difficult for you to try this with tcmalloc? That'll tell us something.

I finally had a chance to rerun this testing, using
tcmalloc (from google-perftools v1.7) and libatomic_ops (v1.2-2)
against current next branch (commit a2ec936a7cd1c).

I still get lots of slow RefCountedObject::put calls.


One other possibility would be to try to catch this "in the act" and send it a SIGABRT to get a core dump. Then we can look in more detail at what this (and other) threads are up to. I'm not sure how easy this is to catch on a particular node...

I'll try this next, assuming that using an assert
in RefCountedObject::put() that "delete this" takes
less than, say, 1 second will catch the state of
the other threads at an interesting place.
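
A minimal sketch of that check, not the actual Ceph code: RefCounted below is a hypothetical stand-in for RefCountedObject, and kMaxDeleteTime and delete_time_ok() are names made up for illustration. It times "delete this" with a monotonic clock and asserts on the elapsed time, so a slow delete aborts the process (SIGABRT) and, if core dumps are enabled, leaves a core with every thread's stack.

```cpp
#include <atomic>
#include <cassert>
#include <chrono>

// Hypothetical stand-in for RefCountedObject; the real class differs.
struct RefCounted {
  std::atomic<int> nref{1};
  virtual ~RefCounted() {}
  RefCounted* get() { nref.fetch_add(1); return this; }
  void put();
};

// Assumed threshold, taken from the "say, 1 second" figure above.
static constexpr std::chrono::seconds kMaxDeleteTime{1};

// True if the measured delete time is under the threshold.
inline bool delete_time_ok(std::chrono::steady_clock::duration elapsed) {
  return elapsed < kMaxDeleteTime;
}

void RefCounted::put() {
  // fetch_sub returns the previous value, so 1 means this was the last ref.
  if (nref.fetch_sub(1) == 1) {
    auto start = std::chrono::steady_clock::now();
    delete this;
    // Only locals are touched from here on; the object is gone.
    auto elapsed = std::chrono::steady_clock::now() - start;
    // A failed assert calls abort(), which raises SIGABRT.
    // Note that building with NDEBUG compiles this check out.
    assert(delete_time_ok(elapsed));
  }
}
```

The assert only fires after the slow delete has returned, so the core shows the other threads just after the stall rather than during it.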

Does that sound OK?

-- Jim


sage


