It is great to learn another way to turn the performance up. Instead of the OSD, we verified the optimization on the RGW side, and a remarkable performance improvement was indeed observed. However, it still puzzled us why the in-memory log could impact performance so much, so our team spent some time investigating it further.

Since reducing the memory log level was the trigger, our first suspect was the memory allocation mechanism. A closer look at the log code shows:

- Memory allocation for a log entry occurs in Log::create_entry(), which is invoked in the RGW worker threads.
- Memory release for a log entry occurs in Log::entry(), which runs in a dedicated log thread.

Since TCMalloc (Thread-Caching Malloc) is used and the memory log level is high, the thread caches of the worker threads are quickly drained by the allocations, while the thread cache of the dedicated log thread quickly overflows from the releases. As a result, the worker threads frequently have to allocate memory from TCMalloc's central free list instead of their thread caches, and the dedicated log thread frequently has to return memory from its thread cache to the central free list. The efficiency of heap management is therefore seriously degraded.

To validate this theory, we wrote the following test program, and the result seems persuasive.

```
// malloc_test.cpp
#include <stdio.h>
#include <unistd.h>
#include <pthread.h>
#include <stdlib.h>
#include <string.h>
#include <assert.h>

#define PRODUCER_THREAD_NUM 256
#define CUSTOMER_THREAD_NUM 1
#define ALLOC_SIZE (1024 * 4)
#define STACK_SIZE 10240
#define LOOP 10

static pthread_mutex_t g_mutex = PTHREAD_MUTEX_INITIALIZER;

// Simple stack implementation, to avoid interference from STL memory allocation
void *g_stack[STACK_SIZE] = {0};
unsigned g_stack_top = 0;
unsigned g_process_per_loop = 0;
bool g_stop = false;

// Like the worker threads invoking Log::create_entry(), which allocates memory
void *run_producer(void *arg){
    while(!g_stop){
        bool wait = false;
        void *ptr = NULL;
        pthread_mutex_lock(&g_mutex);
        if (g_stack_top < STACK_SIZE){
            ptr = malloc(ALLOC_SIZE);
            if (ptr != NULL){
                g_stack[g_stack_top] = ptr;
                g_stack_top++;
            }
        }else{
            wait = true;
        }
        pthread_mutex_unlock(&g_mutex);
        if (wait){
            usleep(1000);
        }
    }
    return NULL;
}

// Like the dedicated log thread invoking Log::entry(), which releases memory
void *run_customer(void *arg){
    while(!g_stop){
        bool wait = false;
        void *ptr = NULL;
        pthread_mutex_lock(&g_mutex);
        if (g_stack_top > 0){
            ptr = g_stack[g_stack_top - 1];
            if (ptr != NULL){
                free(ptr);
                // bookkeeping, printed and reset every second in the main thread
                g_process_per_loop++;
            }
            g_stack_top--;
        }else{
            wait = true;
        }
        pthread_mutex_unlock(&g_mutex);
        if (wait){
            usleep(1000);
        }
    }
    return NULL;
}

void create_threads(pthread_t *tids, unsigned thread_num, void *(*start_rtn)(void *)){
    pthread_t tid = 0;
    pthread_attr_t attr;
    pthread_attr_init(&attr);
    for (unsigned i = 0; i < thread_num; ++i){
        int32_t ret = pthread_create(&tid, &attr, start_rtn, NULL);
        tids[i] = 0;
        if (0 == ret && tid != 0){
            tids[i] = tid;
        }
    }
}

void join_threads(pthread_t *tids, unsigned thread_num){
    void *ret = NULL;
    for (unsigned i = 0; i < thread_num; ++i){
        if (tids[i] != 0){
            pthread_join(tids[i], &ret);
        }
    }
}

int main(){
    pthread_t producer_tids[PRODUCER_THREAD_NUM] = {0};
    pthread_t customer_tids[CUSTOMER_THREAD_NUM] = {0};
    g_stack_top = 0;
    g_process_per_loop = 0;
    g_stop = false;

    // Create threads to allocate memory
    create_threads(producer_tids, PRODUCER_THREAD_NUM, run_producer);
    // Create threads to release memory
    create_threads(customer_tids, CUSTOMER_THREAD_NUM, run_customer);

    for (unsigned i = 0; i < LOOP; ++i){
        unsigned process_num = 0;
        pthread_mutex_lock(&g_mutex);
        process_num = g_process_per_loop;
        g_process_per_loop = 0;
        pthread_mutex_unlock(&g_mutex);
        // Print the malloc/free operation count per second
        printf("%u\n", process_num);
        sleep(1);
    }

    g_stop = true;
    join_threads(producer_tids, PRODUCER_THREAD_NUM);
    join_threads(customer_tids, CUSTOMER_THREAD_NUM);
    return 0;
}
```

Here is the result. In this scenario, glibc malloc is roughly twice as fast as TCMalloc.

```
[root@100 /data/homer]# g++ malloc_test.cpp -lpthread /usr/lib64/libtcmalloc.so.4.2.6
[root@100 /data/homer]# ./a.out
0
467400      <<<<<<<<<<< 467400 malloc/free per second when tcmalloc is used
483353
487908
477326
478355
485330
487226
482805
476528
[root@100 /data/homer]# g++ malloc_test.cpp -lpthread
[root@100 /data/homer]# ./a.out
2
857894
1021752     <<<<<<<<<< 1021752 malloc/free per second when glibc malloc is used
1038017
1024181
1020187
1044173
1039201
1042415
1043286
```

Based on this analysis, we would suggest the following best practices for improving performance with the current Ceph implementation:

- Use a large TCMALLOC_MAX_TOTAL_THREAD_CACHE_BYTES if possible.
- Use a small number of threads if possible (since TCMALLOC_MAX_TOTAL_THREAD_CACHE_BYTES is shared evenly among the threads of a process).
- Use a lower log level if possible.

Hope this analysis is helpful.

PS: if we want to keep a high memory log level, we probably need to find some way to ensure that the malloc/free of a log entry happens in the same thread when TCMalloc is used. Any idea on how to do this?
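To make the question a bit more concrete, here is a rough sketch of one possible direction (all names and types below are hypothetical, not the actual Ceph Log code): rather than strictly forcing malloc and free into the same thread, the log thread could stop freeing entries altogether and hand them back to a shared free list that the worker threads draw from, so that in steady state neither side calls malloc/free at all.

```
// recycle_sketch.cpp -- hypothetical illustration only, not actual Ceph code.
// The log thread never frees an entry; it returns it to a shared pool, and
// worker threads reuse pooled entries on their next allocation, so TCMalloc's
// thread caches are no longer drained on one side and overflowed on the other.
#include <mutex>

struct Entry {                  // stand-in for the real log entry type
    Entry *next = nullptr;
    char payload[4096];         // roughly ALLOC_SIZE in the test above
};

class EntryRecycler {
public:
    // Worker thread side: reuse a recycled entry if available,
    // otherwise allocate a fresh one.
    Entry *alloc() {
        {
            std::lock_guard<std::mutex> l(lock_);
            if (free_list_) {
                Entry *e = free_list_;
                free_list_ = e->next;
                return e;
            }
        }
        return new Entry();
    }

    // Log thread side: after the entry has been flushed, give it back to the
    // pool instead of deleting it in the log thread.
    void recycle(Entry *e) {
        std::lock_guard<std::mutex> l(lock_);
        e->next = free_list_;
        free_list_ = e;
    }

    ~EntryRecycler() {
        while (free_list_) {    // pooled entries are only freed at shutdown
            Entry *e = free_list_;
            free_list_ = e->next;
            delete e;
        }
    }

private:
    std::mutex lock_;
    Entry *free_list_ = nullptr;
};

int main() {
    EntryRecycler recycler;
    Entry *e = recycler.alloc();    // worker: allocate (or reuse) an entry
    // ... worker fills e->payload and queues it for the log thread ...
    recycler.recycle(e);            // log thread: recycle instead of free
    Entry *e2 = recycler.alloc();   // the next allocation reuses that memory
    recycler.recycle(e2);
    return 0;
}
```

A bounded pool size, or per-worker lists that return each entry to the thread that allocated it, would be obvious refinements; the sketch is only meant to illustrate the shape of the idea.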
Thanks,
Jeegn & Homer @ Tencent Cloud

2018-08-15 5:20 GMT+08:00 Paul Emmerich <paul.emmerich@xxxxxxxx>:
> I've also found these log messages to be extremely useful in several situations.
> But I guess I could enable them explicitly when debugging something.
>
>
> Paul
>
> 2018-08-14 20:57 GMT+02:00 Gregory Farnum <gfarnum@xxxxxxxxxx>:
>> [ Re-send in plain text. ]
>> On Tue, Aug 14, 2018 at 7:41 AM, Sage Weil <sweil@xxxxxxxxxx> wrote:
>>> Currently we default to 0/5, which means we collect log messages on
>>> messages sent and received and dump them on crash. The intent is that on
>>> a crash we'll have some additional useful context.
>>>
>>> In practice, I can't remember any instances where this information has
>>> been useful for debugging. And Radek's testing is suggesting we get quite
>>> a bit of additional performance on the OSD side by turning this down.
>>>
>>> Has anybody else found this information to be useful in real world
>>> debugging situations?
>>
>> I believe I've used it many times in the past, and I'm a bit surprised
>> if you haven't. Seeing the actual messages between daemons is
>> invaluable at establishing control flow!
>> That said at level 5 we may collect several other things and we could
>> perhaps turn the cost of the simple message prints a long way down. Or
>> just keep it turned up in teuthology where we actually induce failures
>> and let it be on production clusters.
>> -Greg
>
>
>
> --
> Paul Emmerich
>
> Looking for help with your Ceph cluster? Contact us at https://croit.io
>
> croit GmbH
> Freseniusstr. 31h
> 81247 München
> www.croit.io
> Tel: +49 89 1896585 90