On 2023/03/25 13:12, Tao Liu wrote:
> The primary part of the patchset introduces multithread support for the
> search cmd to improve its performance. A search operation is mainly made up
> of 2 steps: 1) readmem data into pagebuf, 2) search for specific values
> within the pagebuf. A typical workflow of search is as follows:
>
>   for addr from low to high:
>   do
>     readmem(addr, pagebuf)
>     search_value(value, pagebuf)
>     addr += pagesize
>   done
>
> There are 2 points which we can accelerate: 1) readmem doesn't have to wait
> for search_value; while search_value is working, readmem can read the next
> pagebuf at the same time. 2) higher addrs don't have to wait for lower
> addrs; they can be processed at the same time if we carefully arrange the
> output order.
>
> For point 1, we introduce zones for pagebuf, e.g. search_value can work on
> zone 0 while readmem prepares the data for zone 1. For point 2, we introduce
> multiple search_value threads, e.g. readmem will prepare 100 pages as a
> batch, then we will have 4 threads of search_value: thread 0 handles pages
> 1~25, thread 1 handles pages 26~50, thread 2 handles pages 51~75, thread 3
> handles pages 76~100.
>
> A typical workflow of multithread search implemented in this patchset is as
> follows, with thread synchronization omitted:
>
>   pagebuf[ZONE][BATCH]
>   zone_index = buf_index = 0
>   create_thread(4, search_value)
>   for addr from low to high:
>   do
>     if buf_index < BATCH
>       readmem(addr, pagebuf[zone_index][buf_index++])
>       addr += pagesize
>     else
>       start_thread(pagebuf[zone_index], 0/4 * BATCH, 1/4 * BATCH)
>       start_thread(pagebuf[zone_index], 1/4 * BATCH, 2/4 * BATCH)
>       start_thread(pagebuf[zone_index], 2/4 * BATCH, 3/4 * BATCH)
>       start_thread(pagebuf[zone_index], 3/4 * BATCH, 4/4 * BATCH)
>       zone_index++
>       buf_index = 0
>     fi
>   done
>
> readmem works in the main process and is not multi-threaded, because readmem
> not only reads data from the vmcore and decompresses it, but also walks the
> page tables if a virtual address is given. It is hard to reimplement it in a
> thread-safe way, while search_value is easier to make thread-safe. By
> carefully choosing the batch size and thread number, we can maximize the
> concurrency.
>
> The last part of the patchset replaces lseek/read with pread for kcore and
> diskdumped vmcore.
>
> Here is the performance test result chart. Please note the vmcore and
> kcore were tested separately on 2 different machines. crash-orig is
> crash compiled with clean upstream code, crash-pread is the code with only
> the pread patch applied (patch 5), crash-multi is the code with only the
> multithread patches applied (patch 1~4).
>
> ulong search:
>
> $ time echo "search abcd" | ./crash-orig vmcore vmlinux > /dev/null
> $ time echo "search abcd -f 4 -n 4" | ./crash-multi vmcore vmlinux > /dev/null
>
>              45G vmcore                         64G kcore
>              real        user        sys        real       user       sys
> crash-orig   16m56.595s  15m57.188s  0m56.698s  1m37.982s  0m51.625s  0m46.266s
> crash-pread  16m46.366s  15m55.790s  0m48.894s  1m9.179s   0m36.646s  0m32.368s
> crash-multi  16m26.713s  19m8.722s   1m29.263s  1m27.661s  0m57.789s  0m54.604s
>
> string search:
>
> $ time echo "search -c abcddbca" | ./crash-orig vmcore vmlinux > /dev/null
> $ time echo "search -c abcddbca -f 4 -n 4" | ./crash-multi vmcore vmlinux > /dev/null
>
>              45G vmcore                         64G kcore
>              real        user        sys        real       user       sys
> crash-orig   33m33.481s  32m38.321s  0m52.771s  8m32.034s  7m50.050s  0m41.478s
> crash-pread  33m25.623s  32m35.019s  0m47.394s  8m4.347s   7m35.352s  0m28.479s
> crash-multi  16m31.016s  38m27.456s  1m11.048s  5m11.725s  7m54.224s  0m44.186s
>
> Discussion:
>
> 1) Both the multithread and the pread patches improve the performance a
> bit, so if both are applied, the performance can be better.
>
> 2) Multi-thread search performs much better in search-time-intensive
> tasks, such as string search.

Thank you for the improvement! Sorry, I've not had time to look at this and
the cmd_search() code yet due to other tasks.
I think that multithreading is physically needed to speed up the search
processing for huge memory beyond a certain level, and we may have it
ultimately.

So nice work, but it's complicated, and I'm still not sure whether it should
be done first. I would like to see whether there is room for optimization,
cleanup, etc. So if you have any analysis or trial results, please let me
know.

Thanks,
Kazu

--
Crash-utility mailing list
Crash-utility@xxxxxxxxxx
https://listman.redhat.com/mailman/listinfo/crash-utility
Contribution Guidelines: https://github.com/crash-utility/crash/wiki
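For readers following the quoted proposal, the batch split (worker i takes i/N * BATCH up to (i+1)/N * BATCH) and the "carefully arrange the output order" point can be sketched in C roughly as below. This is only an illustration, not code from the patchset or from crash: worker_range, search_pages, search_batch and PAGE_WORDS are hypothetical names, the page size is assumed to be 4096 bytes, pages are counted from 0, and the workers run one after another rather than in real threads.

```c
#include <assert.h>
#include <stddef.h>

/* Assumed page size in words; crash itself takes the page size from the dump. */
#define PAGE_WORDS (4096 / sizeof(unsigned long))

/* Hypothetical helper: half-open page range [start, end) for worker i of n. */
void worker_range(size_t batch, size_t nworkers, size_t i,
                  size_t *start, size_t *end)
{
    assert(nworkers > 0 && i < nworkers);
    *start = i * batch / nworkers;
    *end = (i + 1) * batch / nworkers;
}

/*
 * Scan pages [start, end) of buf for value and store the word index of
 * each match in hits. Returns the number of hits stored (at most max_hits).
 */
size_t search_pages(const unsigned long *buf, size_t start, size_t end,
                    unsigned long value, size_t *hits, size_t max_hits)
{
    size_t n = 0;
    for (size_t w = start * PAGE_WORDS; w < end * PAGE_WORDS; w++)
        if (buf[w] == value && n < max_hits)
            hits[n++] = w;
    return n;
}

/*
 * Search one batch with nworkers slices. The slices are contiguous and
 * ascending, so appending each worker's hits in worker order keeps the
 * combined output in address order. A threaded version would give each
 * worker its own hits buffer and concatenate them after joining.
 */
size_t search_batch(const unsigned long *buf, size_t batch, size_t nworkers,
                    unsigned long value, size_t *hits, size_t max_hits)
{
    size_t total = 0;
    for (size_t i = 0; i < nworkers; i++) {
        size_t start, end;
        worker_range(batch, nworkers, i, &start, &end);
        total += search_pages(buf, start, end, value,
                              hits + total, max_hits - total);
    }
    return total;
}
```

With a batch of 100 pages and 4 workers, worker_range gives [0,25), [25,50), [50,75) and [75,100), which corresponds to the 25-page slices in the quoted description.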