Re: [RFC PATCH 1/2] add 99memdebug-ko dracut module

[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

 



Hi Xunlei,
On 11/02/16 at 05:03pm, Xunlei Pang wrote:
> The current method for memory debug is to use "rd.memdebug=[0-3]",
> it is not enough for debugging kernel modules. For example, when we
> want to find out which kernel module consumes a large amount of memory,
> "rd.memdebug=[0-3]" won't help too much.
> 
> A better way is needed to achieve this requirement, this is useful for
> kdump OOM debugging.
> 
> The principle of this patch is to use kernel trace to track slab and
> buddy allocation calls during kernel module loading(module_init), thus
> we can analyze all the trace data and get the total memory consumption.
> As for large slab allocation, it will probably fall into buddy allocation,
> thus tracing "mm_page_alloc" alone should be enough for the purpose(this
> saves quite some trace buffer memory, also large free is quite unlikey
> during module loading, we neglect those memory free events).
> 
> The trace events include memory calls under "tracing/events/":
>   kmem/mm_page_alloc
> 
> We also inpect the following events to detect the module loading:
>   module/module_load
>   module/module_put
> 
> Since we use filters to trace events, the final trace data size won't
> be too big. Users can adjust the trace buffer size via "trace_buf_size"
> kernel boot command line as needed.
> 
> We can get the module name and task pid from "module_load" event which
> also mark the beginning of the loading, and module_put called by the
> same task pid implies the end of the loading. So the memory events
> recorded in between by the same task pid are consumed by this module
> during loading(i.e. modprobe or module_init()).
> 
> With these information, we can record the total memory(the larger, the
> more precise the result will be) consumption involved by each kernel
> module loading.
> 
> Thus we introduce this dracut module to find out which kernel module
> consumes a large amount of memory during loading. It uses "rd.memdebug=4"
> as the tigger.
> 
> After applying is patch and specifying "rd.memdebug=4", during booting
> it will print out something below:
> == debug_mem for kernel modules during loading begin ==
> 0 pages consumed by "pata_acpi"
> 0 pages consumed by "ata_generic"
> 1 pages consumed by "drm"
> 0 pages consumed by "ttm"
> 0 pages consumed by "drm_kms_helper"
> 835 pages consumed by "qxl"
> 0 pages consumed by "mii"
> 6 pages consumed by "8139cp"
> 0 pages consumed by "virtio"
> 0 pages consumed by "virtio_ring"
> 9 pages consumed by "virtio_pci"
> 1 pages consumed by "8139too"
> 0 pages consumed by "serio_raw"
> 0 pages consumed by "crc32c_intel"
> 199 pages consumed by "virtio_console"
> 0 pages consumed by "libcrc32c"
> 9 pages consumed by "xfs"
> == debug_mem for kernel modules during loading end ==
> 
> From the print, we see clearly that "qxl" consumed lots of memory.
> 
> Signed-off-by: Xunlei Pang <xlpang@xxxxxxxxxx>
> ---
>  dracut.spec                                    |  1 +
>  modules.d/99memdebug-ko/memdebug-ko-parse.sh   | 93 ++++++++++++++++++++++++++
>  modules.d/99memdebug-ko/memdebug-ko-prepare.sh | 60 +++++++++++++++++
>  modules.d/99memdebug-ko/module-setup.sh        | 25 +++++++
>  4 files changed, 179 insertions(+)
>  create mode 100755 modules.d/99memdebug-ko/memdebug-ko-parse.sh
>  create mode 100755 modules.d/99memdebug-ko/memdebug-ko-prepare.sh
>  create mode 100755 modules.d/99memdebug-ko/module-setup.sh
> 
> diff --git a/dracut.spec b/dracut.spec
> index 7b9d675..3907b17 100644
> --- a/dracut.spec
> +++ b/dracut.spec
> @@ -407,6 +407,7 @@ rm -rf -- $RPM_BUILD_ROOT
>  %{dracutlibdir}/modules.d/98usrmount
>  %{dracutlibdir}/modules.d/99base
>  %{dracutlibdir}/modules.d/99fs-lib
> +%{dracutlibdir}/modules.d/99memdebug-ko
>  %{dracutlibdir}/modules.d/99shutdown
>  %attr(0644,root,root) %ghost %config(missingok,noreplace) %{_localstatedir}/log/dracut.log
>  %dir %{_sharedstatedir}/initramfs
> diff --git a/modules.d/99memdebug-ko/memdebug-ko-parse.sh b/modules.d/99memdebug-ko/memdebug-ko-parse.sh
> new file mode 100755
> index 0000000..1542658
> --- /dev/null
> +++ b/modules.d/99memdebug-ko/memdebug-ko-parse.sh
> @@ -0,0 +1,93 @@
> +# Parse the trace data to find out the modules consuming large amount of memory.
> +
> +is_trace_data_available() {
> +    # Only "rd.memdebug=4" is for the purpose.
> +    if [[ $(getargnum 0 0 4 rd.memdebug) != 4 ]]; then
> +        return 1
> +    fi

The usage of rd.memdebug is different in other place, can we integrate
it to the original rd.memdbug code?

One concern it in this patch only parse the trace data at pre-pivot
hook, so if oom happens earlier than that we will get nothing.

Original rd.memdebug being added to several different point, if we
integrate this to the framework then we can get the out put early.

> +
> +    TRACE_BASE="/sys/kernel/debug"
> +    # trace access through debugfs would be obsolete if "/sys/kernel/tracing" is available.
> +    if [[ -d "/sys/kernel/tracing" ]]; then
> +        TRACE_BASE="/sys/kernel"
> +    fi
> +
> +    if ! [[ -f "$TRACE_BASE/tracing/trace" ]]; then
> +        warn "Trace inactive, skipping kernel module memory analyzing."
> +        return 1
> +    fi

This function is only for checking if tracing is needed by checking
cmdline, so getting the TRACE_BASE logic can be moved to
parse_trace_data so that it can be a local variable..

> +
> +    return 0
> +}
> +
> +parse_trace_data() {
> +    # Indexed by task pid.
> +    declare -A current_module
> +    # Indexed by module name.
> +    declare -A module_loaded
> +    declare -A nr_alloc_pages
> +
> +    # For large trace data, parsing "tracing/trace" directly turns out to be time-consuming,
> +    # hence copy it out first, then parse the copy file instead.

Have you get any data about the time used via traceing/trace parsing, if
the time is still acceptable then it will be better to avoid the tmp
file.

> +    TMPFILE=/tmp/kdump.trace.tmp.$$$$
> +    cp $TRACE_BASE/tracing/trace $TMPFILE -f
> +    while read pid cpu flags ts function ;
> +    do
> +        # Skip comment lines
> +        if [[ $pid = "#" ]]; then
> +            continue
> +        fi
> +
> +        if [[ $function = module_load* ]]; then
> +            # One module is being loaded, save the task pid for tracking.
> +            module_name=${function#*: }
> +            # Remove the trailing after whitespace, there may be the module flags.
> +            module_name=${module_name%% *}
> +            module_names+=" $module_name"
> +            current_module[$pid]="$module_name"
> +            [[ ${module_loaded[$module_name]} ]] && warn "\"$module_name\" was loaded multiple times!"
> +            unset module_loaded[$module_name]
> +            nr_alloc_pages[$module_name]=0
> +        fi
> +    
> +        if ! [[ ${current_module[$pid]} ]]; then
> +            continue
> +        fi
> +
> +        if [[ $function = module_put* ]]; then
> +            # Mark the module as loaded when the first module_put event happens after module_load.
> +            module_loaded[${current_module[$pid]}]=1
> +            # Module has been loaded when module_put is called, untrack the task
> +            unset current_module[$pid]
> +            continue
> +        fi
> +    
> +        # Once we get here, the task is being tracked(is loading a module).
> +        # Get the module name.
> +        module_name=${current_module[$pid]}
> +    
> +        if [[ $function = mm_page_alloc* ]]; then
> +            order=$(echo $function | sed -e 's/.*order=\([0-9]*\) .*/\1/')
> +            nr_alloc_pages[$module_name]=$((${nr_alloc_pages[$module_name]}+$((2 ** $order))))
> +        fi
> +    done < $TMPFILE
> +    
> +    # Print the result out.
> +    echo "== debug_mem for kernel modules during loading begin ==" >&2
> +    for i in $module_names; do
> +        status=""
> +        if ! [[ ${module_loaded[$i]} ]]; then
> +            status="[still loading]"
> +        fi
> +        echo "${nr_alloc_pages[$i]} pages consumed by \"$i\" $status" >&2
> +    done
> +    echo "== debug_mem for kernel modules during loading end ==" >&2
> +    
> +    unset module_names
> +    unset module_loaded

unset nr_alloc_pages?

> +    rm $TMPFILE -f
> +}
> +
> +if is_trace_data_available ; then
> +    parse_trace_data
> +fi
> diff --git a/modules.d/99memdebug-ko/memdebug-ko-prepare.sh b/modules.d/99memdebug-ko/memdebug-ko-prepare.sh
> new file mode 100755
> index 0000000..134821a
> --- /dev/null
> +++ b/modules.d/99memdebug-ko/memdebug-ko-prepare.sh
> @@ -0,0 +1,60 @@
> +# Try to find out kernel modules with large total memory allocation during loading.
> +# For large slab allocation, it will fall into buddy, thus tracing "mm_page_alloc"
> +# alone should be enough for the purpose.
> +#
> +# Prepare for the trace setup.
> +
> +prepare_trace() {
> +    # Only "rd.memdebug=4" is for the purpose.
> +    if [[ $(getargnum 0 0 4 rd.memdebug) != 4 ]]; then
> +        return 0
> +    fi
> +
> +    TRACE_BASE="/sys/kernel/debug"
> +    # trace access through debugfs would be obsolete if "/sys/kernel/tracing" is available.
> +    if [[ -d "/sys/kernel/tracing" ]]; then
> +        TRACE_BASE="/sys/kernel"
> +    fi

Move TRACE_BASE as a local variable, maybe a function..

> +    
> +    # old debugfs case.
> +    if ! [[ -d "$TRACE_BASE/tracing" ]]; then
> +        mount none -t debugfs $TRACE_BASE
> +    # new tracefs case.
> +    elif ! [[ -f "$TRACE_BASE/tracing/trace" ]]; then
> +        mount none -t tracefs "$TRACE_BASE/tracing"
> +    fi
> +    
> +    if ! [[ -f "$TRACE_BASE/tracing/trace" ]]; then
> +        warn "Mount trace failed for kernel module memory analyzing."
> +        return 1
> +    fi
> +    
> +    MATCH_EVENTS="module:module_put module:module_load kmem:mm_page_alloc"

Do we have module_init event? what is the difference between module_load
and module_init?

> +    SET_EVENTS=$(echo $(cat $TRACE_BASE/tracing/set_event))
> +    # Check if trace was properly setup, prepare it if not.
> +    if [[ $(cat $TRACE_BASE/tracing/tracing_on) != 1 ]] || \
> +        [[ "$SET_EVENTS" != "$MATCH_EVENTS" ]]; then
> +        # Set our trace events.
> +        echo $MATCH_EVENTS > $TRACE_BASE/tracing/set_event
> +    
> +        # There are three kinds of known applications for module loading:
> +        # "systemd-udevd", "modprobe" and "insmod".
> +        # Set them to the mm_page_alloc event filter.
> +        # NOTE: Some kernel may not support this format of filter, anyway
> +        #       the operation will fail and it doesn't matter.
> +        page_alloc_filter="comm == systemd-udevd || comm == modprobe || comm == insmod"
> +        echo $page_alloc_filter > $TRACE_BASE/tracing/events/kmem/mm_page_alloc/filter
> +        
> +        # Set the number of comm-pid.
> +        # Thanks to filters, 4096 is big enough(also well supported).
> +        echo 4096 > $TRACE_BASE/tracing/saved_cmdlines_size
> +
> +        # Enable and clear trace data for the first time.
> +        echo 1 > $TRACE_BASE/tracing/tracing_on
> +        echo > $TRACE_BASE/tracing/trace
> +    fi
> +
> +    return 0
> +}
> +
> +prepare_trace
> diff --git a/modules.d/99memdebug-ko/module-setup.sh b/modules.d/99memdebug-ko/module-setup.sh
> new file mode 100755
> index 0000000..d6a7f0c
> --- /dev/null
> +++ b/modules.d/99memdebug-ko/module-setup.sh
> @@ -0,0 +1,25 @@
> +#!/bin/bash
> +
> +# called by dracut
> +check() {
> +    # this is very small, always add it.
> +    # It has no effect if no "rd.memdebug=4" is specified, also
> +    # this stays consistent with "rd.memdebug=[0-3]".
> +    return 0
> +}
> +
> +# called by dracut
> +depends() {
> +    return 0
> +}
> +
> +# called by dracut
> +install() {
> +    # Prepare trace at the early stage before any ko modules is loaded.
> +    inst_hook cmdline 00 "$moddir/memdebug-ko-prepare.sh"
> +
> +    # Before switching root to the real root, rootfs should be mounted
> +    # and all dependent ko modules are expected to be loaded.
> +    # We can parse trace data at this stage.
> +    inst_hook pre-pivot 99 "$moddir/memdebug-ko-parse.sh"
> +}
> -- 
> 1.8.3.1
> 
> --
> To unsubscribe from this list: send the line "unsubscribe initramfs" in
> the body of a message to majordomo@xxxxxxxxxxxxxxx
> More majordomo info at  http://vger.kernel.org/majordomo-info.html
--
To unsubscribe from this list: send the line "unsubscribe initramfs" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at  http://vger.kernel.org/majordomo-info.html



[Index of Archives]     [Linux Kernel]     [Linux USB Devel]     [Linux Audio Users]     [Yosemite News]     [Linux SCSI]

  Powered by Linux