On Wed, 30 Mar 2011, Jim Schutt wrote:
> > Hmm!  That does seem to point at the allocator, doesn't it.
> >
> > Other threads are doing work during this long interval?  Including freeing
> > memory, presumably, since basically everything uses the heap one way or
> > another.  If it's the allocator, it's somehow affecting one thread only,
> > which is pretty crazy.
> >
> > Is it difficult for you to try this with tcmalloc?  That'll tell us
> > something.
>
> I finally had a chance to rerun this testing, using
> tcmalloc (from google-perftools v1.7) and libatomic_ops (v1.2-2)
> against current next branch (commit a2ec936a7cd1c).
>
> I still get lots of slow RefCountedObject::put calls.

Argh...

We've verified there's no swapping going on, right?  The allocator isn't
touching cold pages and waiting for them to be swapped in or something?

> > One other possibility would be to try to catch this "in the act" and send it
> > a SIGABRT to get a core dump.  Then we can look in more detail at what this
> > (and other) threads are up to.  I'm not sure how easy this is to catch on a
> > particular node...
>
> I'll try this next, assuming that using an assert
> in RefCountedObject::put() that "delete this" takes
> less than, say, 1 second will catch the state of
> the other threads at an interesting place.
>
> Does that sound OK?

I was actually suggesting we try to make it dump core inside the "delete
this", by watching for a stall in progress and then sending SIGABRT to dump
core in the act.  That way we verify it really is in the allocator (and
maybe even see where).  That's a bit harder to set up, though!

Dumping right after may still yield some useful info, but I'm less
hopeful...

sage
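
For illustration only, here is a minimal sketch of the "watch for a stall in progress, then SIGABRT" idea. Everything in it is an assumption rather than existing Ceph code: `StallWatch`, `watch_loop`, and the 10 ms polling interval are hypothetical names and values. The worker would call `enter()` just before the "delete this" and `leave()` just after it; a separate watchdog thread polls and raises SIGABRT while the delete is still running, so the core (if core dumps are enabled) should show that thread mid-stall.

```cpp
#include <atomic>
#include <chrono>
#include <csignal>
#include <cstdint>
#include <thread>

// Hypothetical helper, not part of Ceph: remembers when a thread entered a
// section suspected of stalling (e.g. the "delete this" in put()).
class StallWatch {
public:
  // Worker thread: call right before the suspect operation.
  void enter() { start_ns_.store(now_ns(), std::memory_order_release); }

  // Worker thread: call right after the suspect operation returns.
  void leave() { start_ns_.store(0, std::memory_order_release); }

  // True if a thread entered more than `limit` ago and has not left yet.
  bool stalled(std::chrono::nanoseconds limit) const {
    const std::int64_t start = start_ns_.load(std::memory_order_acquire);
    return start != 0 && now_ns() - start > limit.count();
  }

private:
  static std::int64_t now_ns() {
    using namespace std::chrono;
    return duration_cast<nanoseconds>(
               steady_clock::now().time_since_epoch()).count();
  }

  // 0 means "nobody is inside the watched section".
  std::atomic<std::int64_t> start_ns_{0};
};

// Watchdog thread body (assumed 10 ms polling interval): abort the whole
// process while the stall is still in progress.
inline void watch_loop(const StallWatch& w, std::chrono::nanoseconds limit,
                       const std::atomic<bool>& stop) {
  while (!stop.load(std::memory_order_acquire)) {
    if (w.stalled(limit)) {
      std::raise(SIGABRT);  // default action terminates and dumps core
    }
    std::this_thread::sleep_for(std::chrono::milliseconds(10));
  }
}
```

This tracks a single watched section at a time; with many threads calling put() concurrently, one `StallWatch` per thread would be needed. The 1 second threshold mentioned above would be passed as `std::chrono::seconds(1)`.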