Hi Ming, I think I got it now. reset_entry fails to preserve the pj_hash_entry_buf in cached_res. (the key was a red herring, that is also set again later). If the pj_hash_entry_buf isn't preserved, pj_hash_set_np will call find_entry in hash.c and that will not find an entry, although there should be one. In find_entry, after the scan around line 177, *p_entry will contain the beginning of the bucket's list (ht->table[hash & ht->rows]) and a new entry will be taken from the entry_buf (which will also point to the beginning of the bucket's list, because that's where the old entry was and reset_entry should not change the address of that). Then, in line 218, a loop will be created: *p_entry = entry; (*p_entry is the same as entry->next given the current layout). This will create a loop in the bucket's list and either lead to the memory corruption I initially observed or maybe an infinite loop later in case of a hash collision on that hash table. I hope this analysis finally makes sense. This is the current patch: --- a/pjlib-util/src/pjlib-util/resolver.c +++ b/pjlib-util/src/pjlib-util/resolver.c @@ -710,16 +710,19 @@ static struct cached_res *alloc_entry(pj_dns_resolver *resolver) static void reset_entry(struct cached_res **p_cached) { pj_pool_t *pool; + pj_hash_entry_buf hbuf; struct cached_res *cache = *p_cached; unsigned ref_cnt; pool = cache->pool; + pj_memcpy(hbuf, cache->hbuf, sizeof(hbuf)); ref_cnt = cache->ref_cnt; pj_pool_reset(pool); cache = PJ_POOL_ZALLOC_T(pool, struct cached_res); cache->pool = pool; + pj_memcpy(cache->hbuf, hbuf, sizeof(hbuf)); cache->ref_cnt = ref_cnt; *p_cached = cache; } - Lars P.S. Now that I have had a good look at it, I think this bug should be reproducable and shouldn't be platform dependent (not sure why I had sometimes problems reproducing it). Also, if I may say so, pjsip's memory management is complicated. > thanks for your quick answer. > >> The entry update is done a few lines below the code patch you provided above: >> pj_hash_set_np(resolver->hrescache, &cache->key, sizeof(*key), hval, >> cache->hbuf, cache); > > Ah, ok. I missed that. > >> So I'm still not sure whether reset_entry() is the problem or it may be caused by something else. > > I have thought about it some more: reset_entry clears the cache entry's key. pj_hash_set_np then sorts it into the wrong place. > > In my bug-case that lead to a loop in the linked list of the hash table (next pointed to itself). The actual memory corruption happened much later at pjsua_shutdown when the memory of the hash table was reclaimed, and the same memory was freed twice due to the loop in the linked list. > > This patch fixes the problem and is much closer to your original intention: > > --- a/pjlib-util/src/pjlib-util/resolver.c > +++ b/pjlib-util/src/pjlib-util/resolver.c > @@ -710,10 +710,12 @@ static struct cached_res *alloc_entry(pj_dns_resolver *resolver) > static void reset_entry(struct cached_res **p_cached) > { > pj_pool_t *pool; > + struct res_key key; > struct cached_res *cache = *p_cached; > unsigned ref_cnt; > > pool = cache->pool; > + key = cache->key; > ref_cnt = cache->ref_cnt; > > pj_pool_reset(pool); > @@ -721,6 +723,7 @@ static void reset_entry(struct cached_res **p_cached) > cache = PJ_POOL_ZALLOC_T(pool, struct cached_res); > cache->pool = pool; > cache->ref_cnt = ref_cnt; > + cache->key = key; > *p_cached = cache; > } > > Regards, > > - Lars > >> >> On Wed, Aug 21, 2013 at 8:11 PM, Lars Immisch <lars at ibp.de> wrote: >> Hi, >> >> I've seen a memory corruption problem in pjlib-util/resolver.c that was introduced with the changes for #1686 (http://trac.pjsip.org/repos/ticket/1686) >> >> This patch fixes the problem: >> >> --- a/pjlib-util/src/pjlib-util/resolver.c >> +++ b/pjlib-util/src/pjlib-util/resolver.c >> @@ -706,24 +706,6 @@ static struct cached_res *alloc_entry(pj_dns_resolver *resolver) >> return cache; >> } >> >> -/* Re-allocate cache entry, to free cached packet */ >> -static void reset_entry(struct cached_res **p_cached) >> -{ >> - pj_pool_t *pool; >> - struct cached_res *cache = *p_cached; >> - unsigned ref_cnt; >> - >> - pool = cache->pool; >> - ref_cnt = cache->ref_cnt; >> - >> - pj_pool_reset(pool); >> - >> - cache = PJ_POOL_ZALLOC_T(pool, struct cached_res); >> - cache->pool = pool; >> - cache->ref_cnt = ref_cnt; >> - *p_cached = cache; >> -} >> - >> /* Put unused/expired cached entry to the free list */ >> static void free_entry(pj_dns_resolver *resolver, struct cached_res *cache) >> { >> @@ -816,7 +798,7 @@ PJ_DEF(pj_status_t) pj_dns_resolver_start_query( pj_dns_resolver *resolver, >> * the cache (as it has been expired). >> */ >> cache->ref_cnt--; >> - if (cache->ref_cnt <= 0) >> + if (cache->ref_cnt == 0) >> free_entry(resolver, cache); >> >> /* Must return PJ_SUCCESS */ >> @@ -1251,11 +1233,8 @@ static void update_res_cache(pj_dns_resolver *resolver, >> */ >> cache->ref_cnt--; >> cache = alloc_entry(resolver); >> - } else { >> - /* Reset cache to avoid bloated cache pool */ >> - reset_entry(&cache); >> } >> - >> + >> /* Duplicate the packet. >> * We don't need to keep the NS and AR sections from the packet, >> * so exclude from duplication. We do need to keep the Query >> -- >> >> (The second hunk is not strictly necessary, BTW, but ref_cnt is unsigned, so never < 0) >> >> I haven't looked at it very closely, but I believe the problem ist that while reset_entry reallocates the cache entry, it does not refresh the links in the hash table. >> >> There might be a better way to fix this and keep the intended memory usage optimization, but I didn't do it because the problem was hard to reproduce. >> >> - Lars -------------- next part -------------- A non-text attachment was scrubbed... Name: signature.asc Type: application/pgp-signature Size: 496 bytes Desc: Message signed with OpenPGP using GPGMail URL: <http://lists.pjsip.org/pipermail/pjsip_lists.pjsip.org/attachments/20130823/bed388f9/attachment-0001.asc>