It is great to learn another way to turn the performance up. Instead of the OSD, we verified the optimization on the RGW side, and a remarkable performance improvement was indeed observed. However, it still puzzled us why the in-memory log could impact performance so much, so our team spent some time investigating it further.

Since reducing the memory log level was the trigger, our first suspect was the memory allocation mechanism. A closer look at the log code shows:

- Memory allocation for a log entry occurs in Log::create_entry(), which is invoked in the RGW worker threads.
- Memory release for a log entry occurs in Log::entry(), which runs in a dedicated log thread.

Since TCMalloc (Thread-Caching Malloc) is used and the memory log level is high, the thread caches of the worker threads are quickly drained by the allocations, while the thread cache of the dedicated log thread quickly overflows from the releases. As a result, the worker threads frequently have to allocate memory from TCMalloc's central free list instead of their thread caches, and the dedicated log thread frequently has to return memory from its thread cache to the central free list. The efficiency of heap management is therefore seriously degraded.

To validate this theory, we wrote the following test program, and the result seems persuasive.

```
// malloc_test.cpp
#include <stdio.h>
#include <unistd.h>
#include <pthread.h>
#include <stdlib.h>
#include <string.h>
#include <assert.h>

#define PRODUCER_THREAD_NUM 256
#define CUSTOMER_THREAD_NUM 1
#define ALLOC_SIZE (1024 * 4)
#define STACK_SIZE 10240
#define LOOP 10

static pthread_mutex_t g_mutex = PTHREAD_MUTEX_INITIALIZER;

// Simple stack implementation, to avoid interference from STL memory allocation
void *g_stack[STACK_SIZE] = {0};
unsigned g_stack_top = 0;
unsigned g_process_per_loop = 0;
bool g_stop = false;

// Like the worker threads invoking Log::create_entry(), which allocates memory
void *run_producer(void *arg){
    while(!g_stop){
        bool wait = false;
        void *ptr = NULL;
        pthread_mutex_lock(&g_mutex);
        if (g_stack_top < STACK_SIZE){
            ptr = malloc(ALLOC_SIZE);
            if (ptr != NULL){
                g_stack[g_stack_top] = ptr;
                g_stack_top++;
            }
        }else{
            wait = true;
        }
        pthread_mutex_unlock(&g_mutex);
        if (wait){
            usleep(1000);
        }
    }
    return NULL;
}

// Like the dedicated log thread invoking Log::entry(), which releases memory
void *run_customer(void *arg){
    while(!g_stop){
        bool wait = false;
        void *ptr = NULL;
        pthread_mutex_lock(&g_mutex);
        if (g_stack_top > 0){
            ptr = g_stack[g_stack_top - 1];
            if (ptr != NULL){
                free(ptr);
                // bookkeeping, printed and reset every second in the main thread
                g_process_per_loop++;
            }
            g_stack_top--;
        }else{
            wait = true;
        }
        pthread_mutex_unlock(&g_mutex);
        if (wait){
            usleep(1000);
        }
    }
    return NULL;
}

void create_threads(pthread_t *tids, unsigned thread_num, void *(*start_rtn)(void *)){
    pthread_t tid = 0;
    pthread_attr_t attr;
    pthread_attr_init(&attr);
    for (unsigned i = 0; i < thread_num; ++i){
        int32_t ret = pthread_create(&tid, &attr, start_rtn, NULL);
        tids[i] = 0;
        if (0 == ret && tid != 0){
            tids[i] = tid;
        }
    }
}

void join_threads(pthread_t *tids, unsigned thread_num){
    void *ret = NULL;
    for (unsigned i = 0; i < thread_num; ++i){
        if (tids[i] != 0){
            pthread_join(tids[i], &ret);
        }
    }
}

int main(){
    pthread_t producer_tids[PRODUCER_THREAD_NUM] = {0};
    pthread_t customer_tids[CUSTOMER_THREAD_NUM] = {0};
    g_stack_top = 0;
    g_process_per_loop = 0;
    g_stop = false;

    // Create threads to allocate memory
    create_threads(producer_tids, PRODUCER_THREAD_NUM, run_producer);
    // Create threads to release memory
    create_threads(customer_tids, CUSTOMER_THREAD_NUM, run_customer);

    for (unsigned i = 0; i < LOOP; ++i){
        unsigned process_num = 0;
        pthread_mutex_lock(&g_mutex);
        process_num = g_process_per_loop;
        g_process_per_loop = 0;
        pthread_mutex_unlock(&g_mutex);
        // Print the malloc/free operation count per second
        printf("%u\n", process_num);
        sleep(1);
    }

    g_stop = true;
    join_threads(producer_tids, PRODUCER_THREAD_NUM);
    join_threads(customer_tids, CUSTOMER_THREAD_NUM);
    return 0;
}
```

Here is the result. In this scenario, glibc malloc is roughly twice as fast as TCMalloc.

```
[root@100 /data/homer]# g++ malloc_test.cpp -lpthread /usr/lib64/libtcmalloc.so.4.2.6
[root@100 /data/homer]# ./a.out
0
467400      <<<<<<<<<<< 467400 malloc/free per second when tcmalloc is used
483353
487908
477326
478355
485330
487226
482805
476528
[root@100 /data/homer]# g++ malloc_test.cpp -lpthread
[root@100 /data/homer]# ./a.out
2
857894
1021752     <<<<<<<<<< 1021752 malloc/free per second when glibc malloc is used
1038017
1024181
1020187
1044173
1039201
1042415
1043286
```

Based on this analysis, we would suggest the following best practices for improving performance with the current Ceph implementation:

- Use a large TCMALLOC_MAX_TOTAL_THREAD_CACHE_BYTES if possible.
- Use a small number of threads if possible (since TCMALLOC_MAX_TOTAL_THREAD_CACHE_BYTES is shared evenly among the threads of a process).
- Use a lower log level if possible.

Hope this analysis is helpful.

PS: if we want to keep a high memory log level, we probably need to find some way to ensure that the malloc/free of a log entry happens in the same thread when TCMalloc is used. Any idea on how to do this?
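To make the question a bit more concrete, here is a rough sketch of one possible direction (all names and types below are hypothetical, not the actual Ceph Log code): rather than strictly forcing malloc and free into the same thread, the log thread could stop freeing entries altogether and hand them back to a shared free list that the worker threads draw from, so that in steady state neither side calls malloc/free at all.

```
// recycle_sketch.cpp -- hypothetical illustration only, not actual Ceph code.
// The log thread never frees an entry; it returns it to a shared pool, and
// worker threads reuse pooled entries on their next allocation, so TCMalloc's
// thread caches are no longer drained on one side and overflowed on the other.
#include <mutex>

struct Entry {                  // stand-in for the real log entry type
    Entry *next = nullptr;
    char payload[4096];         // roughly ALLOC_SIZE in the test above
};

class EntryRecycler {
public:
    // Worker thread side: reuse a recycled entry if available,
    // otherwise allocate a fresh one.
    Entry *alloc() {
        {
            std::lock_guard<std::mutex> l(lock_);
            if (free_list_) {
                Entry *e = free_list_;
                free_list_ = e->next;
                return e;
            }
        }
        return new Entry();
    }

    // Log thread side: after the entry has been flushed, give it back to the
    // pool instead of deleting it in the log thread.
    void recycle(Entry *e) {
        std::lock_guard<std::mutex> l(lock_);
        e->next = free_list_;
        free_list_ = e;
    }

    ~EntryRecycler() {
        while (free_list_) {    // pooled entries are only freed at shutdown
            Entry *e = free_list_;
            free_list_ = e->next;
            delete e;
        }
    }

private:
    std::mutex lock_;
    Entry *free_list_ = nullptr;
};

int main() {
    EntryRecycler recycler;
    Entry *e = recycler.alloc();    // worker: allocate (or reuse) an entry
    // ... worker fills e->payload and queues it for the log thread ...
    recycler.recycle(e);            // log thread: recycle instead of free
    Entry *e2 = recycler.alloc();   // the next allocation reuses that memory
    recycler.recycle(e2);
    return 0;
}
```

A bounded pool size, or per-worker lists that return each entry to the thread that allocated it, would be obvious refinements; the sketch is only meant to illustrate the shape of the idea.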
Thanks,
Jeegn & Homer @ Tencent Cloud

2018-08-15 5:20 GMT+08:00 Paul Emmerich <paul.emmerich@xxxxxxxx>:
> I've also found these log messages to be extremely useful in several situations.
> But I guess I could enable them explicitly when debugging something.
>
>
> Paul
>
> 2018-08-14 20:57 GMT+02:00 Gregory Farnum <gfarnum@xxxxxxxxxx>:
>> [ Re-send in plain text. ]
>> On Tue, Aug 14, 2018 at 7:41 AM, Sage Weil <sweil@xxxxxxxxxx> wrote:
>>> Currently we default to 0/5, which means we collect log messages on
>>> messages sent and received and dump them on crash. The intent is that on
>>> a crash we'll have some additional useful context.
>>>
>>> In practice, I can't remember any instances where this information has
>>> been useful for debugging. And Radek's testing is suggesting we get quite
>>> a bit of additional performance on the OSD side by turning this down.
>>>
>>> Has anybody else found this information to be useful in real world
>>> debugging situations?
>>
>> I believe I've used it many times in the past, and I'm a bit surprised
>> if you haven't. Seeing the actual messages between daemons is
>> invaluable at establishing control flow!
>> That said at level 5 we may collect several other things and we could
>> perhaps turn the cost of the simple message prints a long way down. Or
>> just keep it turned up in teuthology where we actually induce failures
>> and let it be on production clusters.
>> -Greg
>
>
>
> --
> Paul Emmerich
>
> Looking for help with your Ceph cluster? Contact us at https://croit.io
>
> croit GmbH
> Freseniusstr. 31h
> 81247 München
> www.croit.io
> Tel: +49 89 1896585 90