Shuah Khan <skhan@xxxxxxxxxxxxxxxxxxx> writes:

> Add a new system state document to the admin-guide. This document is
> intended to be used as a guide on how to gather higher level information
> about a system and its run-time activity.
>
> Signed-off-by: Shuah Khan <skhan@xxxxxxxxxxxxxxxxxxx>
> ---
> Changes since v1:
> -- Addressed review comments
>
>  Documentation/admin-guide/index.rst        |   1 +
>  Documentation/admin-guide/system-state.rst | 350 +++++++++++++++++++++
>  2 files changed, 351 insertions(+)
>  create mode 100644 Documentation/admin-guide/system-state.rst
>
> diff --git a/Documentation/admin-guide/index.rst b/Documentation/admin-guide/index.rst
> index f475554382e2..541372672c55 100644
> --- a/Documentation/admin-guide/index.rst
> +++ b/Documentation/admin-guide/index.rst
> @@ -66,6 +66,7 @@ subsystems expectations will be found here.
>     :maxdepth: 1
>
>     workload-tracing
> +   system-state
>
>  The rest of this manual consists of various unordered guides on how to
>  configure specific aspects of kernel behavior to your liking.
> diff --git a/Documentation/admin-guide/system-state.rst b/Documentation/admin-guide/system-state.rst
> new file mode 100644
> index 000000000000..2a6fdf85c35c
> --- /dev/null
> +++ b/Documentation/admin-guide/system-state.rst
> @@ -0,0 +1,350 @@
> +.. SPDX-License-Identifier: (GPL-2.0+ OR CC-BY-4.0)
> +
> +===========================================================
> +Discovering system calls and features supported on a system
> +===========================================================
> +
> +:Author: Shuah Khan <skhan@xxxxxxxxxxxxxxxxxxx>
> +:maintained-by: Shuah Khan <skhan@xxxxxxxxxxxxxxxxxxx>

Rather than adding lines like this, I think everybody would be better
served with a MAINTAINERS file entry.  get_maintainer.pl doesn't know
about these lines.

> +Key Points
> +==========
> +
> + * System state includes system calls, features, static and dynamic
> +   modules enabled in the kernel configuration.
> + * Supported system calls and Kernel features are architecture dependent.
> + * auditd, checksyscalls.sh, and get_feat.pl tools can be used to discover
> +   static system state.
> + * Understanding Linux kernel hardening configurations options and making
> +   sure they are enabled will make a system more secure.
> + * Employing run-time tracing can shed light on the dynamic system state.
> + * Workloads could change the system state by loading and unloading dynamic
> +   modules and tuning system parameters.

So what I'm missing, before this even, is a paragraph saying what this
document is actually for.  Who is the intended audience, and why might
they want to read this document?

> +System State Visualization
> +==========================
> +
> +The kernel system state can be viewed as a combination of static and
> +dynamic features and modules. Let’s first define what static and dynamic
> +system states are and then explore how we can visualize the static and
> +dynamic system parts of the kernel.
> +
> +Static System View comprises system calls, features, static and dynamic
> +modules enabled in the kernel configuration. Supported system calls

So the "static system view" includes *dynamic* modules?  Fine if that's
what you intended, but it reads a bit strangely.

> +and Kernel features are architecture dependent. System call numbering is
> +different on different architectures. We can get the supported system call
> +information using auditd utilities.
> +
> +ausyscall –dump prints out the supported system calls on a system and allows

Some clever software turned your "--" into an em-dash here.

> +mapping syscall names and numbers. You can install the auditd package on
> +Debian based systems::
> +
> +    sudo apt-get install auditd
> +
> +scripts/checksyscalls.sh can be used to check if current architecture is
> +missing any system calls compared to i386.
> +
> +scripts/get_feat.pl can be used to list the Kernel feature support matrix
> +for an architecture.
> +
> +Dynamic System View comprises system calls, ioctls invoked, and subsystems
> +used during the runtime. A workload could load and unload modules and also
> +change the dynamic system configuration to suit its needs by tuning system
> +parameters.
> +
> +What is the methodology?
> +========================
> +
> +The first step is gathering the default system state such as the dynamic
> +and static modules loaded on the system. lsmod command prints out the

*The* lsmod command

> +dynamically loaded modules on a system. Statically configured modules can
> +be found in the kernel configuration file.
> +
> +The next step is discovering system activity during run-time. You can do so
> +by enabling event tracing and then running your favorite application. After
> +a period of time, gather the event logs, and kernel messages.

Might your intended readers need a hint on enabling tracing?  A cross
reference to the appropriate docs if nothing else.  [Later I see you get
to this; adding an "as described below" would help here.]

> +Once you have the necessary information, you can extract the system call
> +numbers from the event trace log and map them to the supported system calls.
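
For what it's worth, the extraction step in that last quoted paragraph
could be sketched roughly as follows.  This is only a sketch: it assumes
the trace was collected with the raw_syscalls:sys_enter event enabled,
whose trace lines contain the text "sys_enter: NR <number> (...)", and the
function name count_syscall_numbers is made up for illustration.

```shell
# Hypothetical helper: count how often each syscall number shows up in
# a trace dump read from stdin.  Assumes raw_syscalls:sys_enter lines
# of the form "... sys_enter: NR 257 (...)".
count_syscall_numbers() {
    awk '
        /sys_enter: NR [0-9]+/ {
            # The syscall number is the field right after "NR".
            for (i = 1; i < NF; i++)
                if ($i == "NR") { count[$(i + 1)]++; break }
        }
        END { for (nr in count) print nr, count[nr] }
    ' | sort -n
}
```

Feeding it the trace file (for example with "count_syscall_numbers <
/sys/kernel/debug/tracing/trace" as root) should give one "number count"
pair per line, which can then be looked up in the ausyscall --dump output.
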
> +
> +Finding supported system calls
> +==============================
> +
> +As mentioned earlier, ausyscall prints out supported system calls
> +on a system and allows mapping syscalls names and numbers::
> +
> +    ausyscall --dump
> +
> +You can look for specific system calls as shown in the below::
> +
> +    ausyscall open
> +    open               2
> +    mq_open            240
> +    openat             257
> +    perf_event_open    298
> +    open_by_handle_at  304
> +    open_tree          428
> +    fsopen             430
> +    pidfd_open         434
> +    openat2            437
> +
> +    ausyscall time
> +
> +    getitimer          36
> +    setitimer          38
> +    gettimeofday       96
> +    times              100
> +    rt_sigtimedwait    128
> +    utime              132
> +    adjtimex           159
> +    settimeofday       164
> +    time               201
> +    semtimedop         220
> +    timer_create       222
> +    timer_settime      223
> +    timer_gettime      224
> +    timer_getoverrun   225
> +    timer_delete       226
> +    clock_settime      227
> +    clock_gettime      228
> +    utimes             235
> +    mq_timedsend       242
> +    mq_timedreceive    243
> +    futimesat          261
> +    utimensat          280
> +    timerfd_create     283
> +    timerfd_settime    286
> +    timerfd_gettime    287
> +    clock_adjtime      305
> +
> +Finding unsupported system calls
> +================================
> +
> +As mentioned earlier, scripts/checksyscalls.sh checks missing system calls
> +on current architecture compared to i386. Example run::
> +
> +    checksyscalls.sh gcc
> +    warning: #warning syscall mmap2 not implemented [-Wcpp]
> +    warning: #warning syscall truncate64 not implemented [-Wcpp]
> +    warning: #warning syscall ftruncate64 not implemented [-Wcpp]
> +    warning: #warning syscall fcntl64 not implemented [-Wcpp]
> +    warning: #warning syscall sendfile64 not implemented [-Wcpp]
> +    warning: #warning syscall statfs64 not implemented [-Wcpp]
> +    warning: #warning syscall fstatfs64 not implemented [-Wcpp]
> +    warning: #warning syscall fadvise64_64 not implemented [-Wcpp]
> +
> +Let's check this against ausyscall now::
> +
> +    ausyscall map
> +    mmap               9
> +    munmap             11
> +    mremap             25
> +    remap_file_pages   216
> +
> +    ausyscall trunc
> +    truncate           76
> +    ftruncate          77
> +
> +As you can see, ausyscall shows mmap2, truncate64, and ftruncate64 aren't
> +implemented on this system. This matches what checksyscalls.sh shows.
> +
> +Finding supported features
> +==========================
> +
> +scripts/get_feat.pl can be used to list the Kernel feature support matrix
> +for an architecture::
> +
> +    get_feat.pl list
> +    get_feat.pl list –arch=arm64 lists

Lost the "--" again here

> +This scripts parses Documentation/features to find the support status

script (singular)

> +information. It can be used to validate the contents of the files under
> +Documentation/features or simply list them::
> +
> +    --arch Outputs features for an specific architecture, optionally filtering
> +           for a single specific feature.
> +    --feat or --feature Output features for a single specific feature.
> +
> +Here is how you can find if stackprotector and hread-info-in-task features

and *thread*-info-in-task

> +are supported::
> +
> +    scripts/get_feat.pl --arch=arm64 --feat=stackprotector list
> +    #
> +    # Kernel feature support matrix of the 'arm64' architecture:
> +    #
> +    debug/ stackprotector : ok | HAVE_STACKPROTECTOR #
> +    arch supports compiler driven stack overflow protection
> +
> +    scripts/get_feat.pl --feat=thread-info-in-task list
> +    #
> +    # Kernel feature support matrix of the 'x86' architecture:
> +    #
> +    core/ thread-info-in-task : ok | THREAD_INFO_IN_TASK #
> +    arch makes use of the core kernel facility to embed thread_info in
> +    task_struct
> +
> +Finding kernel module status
> +============================
> +
> +lsmod command shows the kernel modules that are currently loaded. This
> +program displays the contents of /proc/modules. Let's pick uvcvideo

*The* lsmod                                    *the* uvcvideo

> +module which is found on most laptops::
> +
> +    lsmod | grep uvc
> +    uvcvideo 126976 0
> +    videobuf2_vmalloc 20480 1 uvcvideo
> +    uvc 16384 1 uvcvideo
> +    videobuf2_v4l2 36864 1 uvcvideo
> +    videodev 315392 2 videobuf2_v4l2,uvcvideo
> +    videobuf2_common 65536 4 videobuf2_vmalloc,videobuf2_v4l2,uvcvideo,videobuf2_memops
> +    mc 77824 4 videodev,videobuf2_v4l2,uvcvideo,videobuf2_common
> +
> +You can see that lsmod shows uvcvideo and the modules it depends on and how
> +many modules are using them. videobuf2_common is in use by 4 other modules.
> +In other words, this is the reference count for this module and rmmod will
> +refuse to unload it as long as the reference count is > 0.
> +
> +You can get the same information from /proc.modules::
> +
> +    less /proc/modules | grep uvc

why not just "grep uvc /proc/modules" ?

> +    uvcvideo 126976 0 - Live 0x0000000000000000
> +    videobuf2_vmalloc 20480 1 uvcvideo, Live 0x0000000000000000
> +    uvc 16384 1 uvcvideo, Live 0x0000000000000000
> +    videobuf2_v4l2 36864 1 uvcvideo, Live 0x0000000000000000
> +    videodev 315392 2 uvcvideo,videobuf2_v4l2, Live 0x0000000000000000
> +    videobuf2_common 65536 4 uvcvideo,videobuf2_vmalloc,videobuf2_memops,videobuf2_v4l2, Live 0x0000000000000000
> +    mc 77824 4 uvcvideo,videobuf2_v4l2,videodev,videobuf2_common, Live 0x0000000000000000
> +
> +The information is similar with a few more extra fields. The address is the
> +base address for the module in kernel virtual memory space. When run as a
> +normal user, the address is all zeros. The same command when run as root will
> +be as follows::
> +
> +    sudo less /proc/modules | grep uvc
> +    uvcvideo 126976 0 - Live 0xffffffffc1c8b000
> +    videobuf2_vmalloc 20480 1 uvcvideo, Live 0xffffffffc167f000
> +    uvc 16384 1 uvcvideo, Live 0xffffffffc0ab0000
> +    videobuf2_v4l2 36864 1 uvcvideo, Live 0xffffffffc0a28000
> +    videodev 315392 2 uvcvideo,videobuf2_v4l2, Live 0xffffffffc16e9000
> +    videobuf2_common 65536 4 uvcvideo,videobuf2_vmalloc,videobuf2_memops,videobuf2_v4l2, Live 0xffffffffc094d000
> +    mc 77824 4 uvcvideo,videobuf2_v4l2,videodev,videobuf2_common, Live 0xffffffffc15eb000
> +
> +Let's check what modinfo shows that is important for us::
> +
> +    /sbin/modinfo uvcvideo
> +    filename: /lib/modules/6.3.0-rc2/kernel/drivers/media/usb/uvc/uvcvideo.ko
> +    license: GPL
> +    description: USB Video Class driver
> +    depends: videobuf2-v4l2,videodev,mc,uvc,videobuf2-common,videobuf2-vmalloc
> +    retpoline: Y
> +    intree: Y
> +    name: uvcvideo
> +    vermagic: 6.3.0-rc2 SMP preempt mod_unload modversions
> +    sig_id: PKCS#7
> +    signer: Build time autogenerated kernel key
> +
> +This tells us that this module is built intree and the signed with a build
> +time autogenerated key.
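
As an aside on the reference counts discussed in this section, a small
filter over /proc/modules can list just the modules that are currently
pinned.  The following is a sketch only; it relies on the field order
visible in the quoted output (name, size, reference count, users, state,
address), and the function name modules_in_use is made up for
illustration.

```shell
# Hypothetical helper: print "name refcount" for every module whose
# reference count is non-zero, i.e. the modules rmmod would refuse to
# unload.  Reads /proc/modules-format text on stdin.
modules_in_use() {
    # Field 3 is the reference count; "+ 0" forces a numeric compare.
    awk '$3 + 0 > 0 { print $1, $3 }'
}
```

It could be run as "modules_in_use < /proc/modules"; no root access is
needed since the address column is not used.
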
> +
> +Let's do one last sanity check on the system to see if the following two
> +command outputs match::
> +
> +    ps ax | wc -l
> +    ls -d /proc/* | grep [0-9]|wc -l
> +
> +If they don't match, examine your system closely. kernel rootkits install
> +their own ps, find, etc. utilities to mask their activity. The outputs
> +match on my system. Do they on yours?

This would assume that there is no other activity on the system, of
course.  Worth saying to avoid unnecessary panic.

> +Is my system as secure as it could be?
> +======================================
> +
> +Linux kernel supports several hardening options to make system secure.

*The* Linux kernel                       ... to make *the* system secure

the whole document could use a pass for article use

> +kconfig-hardened-check tool sanity checks kernel configuration for
> +security. You can clone the latest kconfig-hardened-check repository::
> +
> +    git clone https://github.com/a13xp0p0v/kconfig-hardened-check.git
> +    cd kconfig-hardened-check
> +    bin/kconfig-hardened-check --config <config file> --cmdline /proc/cmdline

Should you say what <config file> is?

> +This will generate detailed report of kernel security configuration and
> +command line options that are enabled (OK) and the ones that aren't (FAIL)
> +and a summary line at the end::
> +
> +    [+] Config check is finished: 'OK' - 100 / 'FAIL' - 100
> +
> +You will have to analyze the information to determine which options make
> +sense to enable on your system.
> +
> +Understanding system run-time activity
> +======================================
> +
> +Enabling event tracing gives insight into system run-time activity. This is
> +a good way to identify which parts of the kernel are used at a higher level
> +while system is in and/or while a specific workload/process is running.
> +
> +Event tracing depends on the CONFIG_EVENT_TRACING option enabled. You can
> +enable event tracing before starting workload/process. Event tracing allows
> +you to dynamically enable and disable tracing on supported/available events.
> +You can find available events, tracers, and filter functions in the following
> +files::
> +
> +    /sys/kernel/debug/tracing/available_events
> +    /sys/kernel/debug/tracing/available_filter_functions
> +    /sys/kernel/debug/tracing/available_tracers
> +
> +Now this is how you can enable tracing::
> +
> +    sudo echo 1 > /sys/kernel/debug/tracing/events/enable
> +
> +Once the workload/process stops or when you decide you have the status you
> +need, you can disable event tracing::
> +
> +    sudo echo 0 > /sys/kernel/debug/tracing/events/enable
> +
> +You can find the tracing information in the file::
> +
> +    /sys/kernel/debug/tracing
> +
> +Here is the information shown in this file::
> +
> +    cat trace
> +    # tracer: nop
> +    #
> +    # entries-in-buffer/entries-written: 0/0   #P:16
> +    #
> +    #                                _-----=> irqs-off/BH-disabled
> +    #                               / _----=> need-resched
> +    #                              | / _---=> hardirq/softirq
> +    #                              || / _--=> preempt-depth
> +    #                              ||| / _-=> migrate-disable
> +    #                              |||| /     delay
> +    #           TASK-PID     CPU#  |||||  TIMESTAMP  FUNCTION
> +    #              | |         |   |||||     |         |
> +

That looks like the header, certainly not "the information" found in the
file.  Including some actual output would make the following discussion
more comprehensible.

> +Analyzing traces
> +================
> +
> +You will be able map the functions to system calls and other kernel features
> +to get insight into the overall system activity while a workload/process is
> +running.
> +
> +Map the NR (syscal) numbers from the trace to syscalls from the syscalls dump.

(syscall)

> +Categorize system calls and map them to Linux subsystems.

Not sure what that sentence is trying to tell readers.  Again, who is the
audience; will a readership that needs to be told how to install auditd be
able to make sense of this and act on it?

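
To make the "map the NR numbers" step concrete, something along these
lines might work.  It is a sketch under two assumptions: that a saved
syscall table has one "number name" pair per line (which is how ausyscall
--dump output appears to be laid out, after a one-line header), and that
the numbers to translate arrive on stdin as "number count" pairs.  The
function name map_syscall_names is made up for illustration.

```shell
# Hypothetical helper: translate syscall numbers into names.
#   $1    - file holding "number name" pairs (e.g. saved ausyscall --dump)
#   stdin - "number count" pairs taken from a trace
# Prints "number name count"; numbers missing from the table are
# reported as "unknown".
map_syscall_names() {
    awk '
        FILENAME == ARGV[1] {
            # Skip header or other non-numeric lines in the table.
            if ($1 ~ /^[0-9]+$/) name[$1] = $2
            next
        }
        { print $1, (($1 in name) ? name[$1] : "unknown"), $2 }
    ' "$1" -
}
```

With the table saved via "ausyscall --dump > syscalls.txt", the counts
gathered from a trace could then be piped through "map_syscall_names
syscalls.txt".  Grouping the resulting names by subsystem would still be a
manual step.
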
> +Conclusion
> +==========
> +
> +This document is intended to be used as a guide on how to gather higher level
> +information about a system and its run-time activity. The approach described
> +in this document helps us get insight into supported system calls, features,
> +assess how secure a system is, and its run-time activity.
> +
> +References
> +==========
> +
> + * https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/scripts/checksyscalls.sh
> + * https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/scripts/get_feat.pl
> + * https://github.com/a13xp0p0v/kconfig-hardened-check
> + * https://docs.kernel.org/trace/index.html

Thanks,

jon