Re: improve ps performance

"panfy.fnst" <panfy.fnst@xxxxxxxxxxxxxx> · Sun, 21 Sep 2014 16:08:02 +0800

On 09/21/2014 02:57 AM, Dave Anderson wrote:
    
      
----- Original Message -----

      
        On 09/20/2014 03:15 AM, Dave Anderson wrote:

        
          ----- Original Message -----

          
            Hello Pan,

I've updated the patch I attached yesterday with a change that
caches the most-recent tgid search result.  From ~70% to ~90% of
the time, either the last tgid entry or the very next one in the
tgid_array is the one being searched for, so it's not necessary
to call bsearch() every time.  "help -t" will show the cache-hit
statistics.

Thanks,
   Dave

          
          Hello Pan,

This patch as written needs to be made less restrictive for use
on a live system.

When running on a live system that has many tasks constantly
forking/exec'ing, the "ps" command may occasionally fail like so:

   crash>  ps
        PID    PPID  CPU       TASK        ST  %MEM     VSZ    RSS  COMM
         0      0   0  ffffffff81c13440  RU   0.0       0      0
         [swapper/0]
         0      0   1  ffff88021282d330  RU   0.0       0      0
         [swapper/1]
   >      0      0   2  ffff88021282dac0  RU   0.0       0      0
   >      [swapper/2]
         0      0   3  ffff88021282e250  RU   0.0       0      0
         [swapper/3]
         1      0   1  ffff880212828000  IN   0.0   50140   3120  systemd
         2      0   3  ffff880212828790  IN   0.0       0      0
         [kthreadd]
   ... [ cut ] ...
      7578  27670   0  ffff8801f45e3c80  DE   0.0       0      0  cc
      7622  27668   1  ffff880210ee3c80  ZO   0.0       0      0  info
      7629  27667   1  ffff8801075bd330  DE   0.0       0      0  rev
      7631  27680   0  ffff8801075bf170  ZO   0.0       0      0  printenv
      7635  27685   3  ffff880108bbe9e0  ZO   0.0       0      0  ypwhich
   ps: bsearch for tgid failed: task: ffff880210ee6250 tgid: 7654
   crash>

Without this patch, the search for the matching tgid would not generate
an error at all, but just quietly continue.

The problem is due to the task.tgid may change on a live system, or more
likely, the task itself may have been re-used.

I would like to fix it simply ignoring tgid bsearch failures on live
systems,
and just use the RSS stats stored in the per-tgid mm_struct.

Does that work for you?

Dave


.


        
        ok!
But I don't understand the meaning of "

fix it simply ignoring tgid bsearch failures on live systems,
and just use the RSS stats stored in the per-tgid mm_struct.

", if tgid may be changed, the tgid_array is useless on live systems.

      
      Well, in this case, it may be true for a particular task if the task struct
had been re-used in between the time the arrays were created and the time
that the "ps" command gets around to reading and displaying its various
statistics.  And so the command may read invalid data w/respect to that task.

But let's be clear -- that kind of behavior is, and always has been, an 
unavoidable circumstance when running the crash utility on live systems, or
when looking at a "live" dump.

It's not just the "ps" command, but any command that displays data that
is subject to the "shifting sands" syndrome, where the kernel data is
constantly being modified while the crash command is running.  

So the idea is to not just cancel the whole command with an error(FATAL...)
if such an anomoly occurs on a live system.


      
        And what is the "RSS stats stored in the per-tgid mm_struct" used for?

      
      Sorry -- I meant to quietly skip the checking of the other tasks in the
task group, and simply use whatever is stored in the mm_struct pointed to
by the original task.  Without your patch, if the tgid was not found, the
command would just continue.  With your patch applied, it would be OK 
do the error(FATAL) in the case of a static dumpfile.  But in the case of
a live system (or live dump), it's not worth killing the command at that
point. 

Clear?

Dave


      
        More clearly, please.
   thanks,
      Pan


      
      .


    
    Oh, Ok. 

      Can I do it like this patch which just ignore tgid besarch
      failures? 

    
  

From 42d81bfc739a8e1cf6d3b5a2c55b79e0a766d72d Mon Sep 17 00:00:00 2001
From: panfengyun <panfy.fnst@xxxxxxxxxxxxxx>
Date: Sun, 21 Sep 2014 15:46:17 +0800
Subject: [PATCH] ignore tgid bsearch failures

it's not worth canceling the whole command with an error(FATAL...)
when such an anomoly occurs on a live system. So it should be imply
ignored tgid bsearch failures on live systems, and just use the RSS
stats stored in the per-tgid mm_struct.
---
 crash-7.0.8/memory.c |   91 ++++++++++++++++++++++++-------------------------
 1 files changed, 45 insertions(+), 46 deletions(-)

diff --git a/crash-7.0.8/memory.c b/crash-7.0.8/memory.c
index 518c917..c046b10 100755
--- a/crash-7.0.8/memory.c
+++ b/crash-7.0.8/memory.c
@@ -4187,54 +4187,53 @@ get_task_mem_usage(ulong task, struct task_mem_usage *tm)
 				tg = (struct tgid_context *)bsearch(&tgid, tgid_array, RUNNING_TASKS(), 
 					sizeof(struct tgid_context), sort_by_tgid);
 
-			if (tg == NULL)
-				error(FATAL, "bsearch for tgid failed: task: %lx tgid: %ld\n", 
-					task, tgid.tgid);
-
-			/* find the first element which has the same tgid */
-			first = tg;
-			while ((first > tgid_array) && ((first - 1)->tgid == first->tgid)) 
-				first--;
-
-			/* find the last element which have same tgid */
-			last = tg;
-			while ((last < (tgid_array + (RUNNING_TASKS() - 1))) && 
-				(last->tgid == (last + 1)->tgid))
-				last++;
-
-			while (first <= last)
+			if (tg != NULL)
 			{
-				/* count 0 -> filepages */
-				if (!readmem(first->task +
-					OFFSET(task_struct_rss_stat) +
-					OFFSET(task_rss_stat_count), KVADDR,
-					&sync_rss,
-					sizeof(int),
-					"task_struct rss_stat MM_FILEPAGES",
-					RETURN_ON_ERROR))
-						continue;
+				/* find the first element which has the same tgid */
+				first = tg;
+				while ((first > tgid_array) && ((first - 1)->tgid == first->tgid)) 
+					first--;
+
+				/* find the last element which have same tgid */
+				last = tg;
+				while ((last < (tgid_array + (RUNNING_TASKS() - 1))) && 
+					(last->tgid == (last + 1)->tgid))
+					last++;
+
+				while (first <= last)
+				{
+					/* count 0 -> filepages */
+					if (!readmem(first->task +
+						OFFSET(task_struct_rss_stat) +
+						OFFSET(task_rss_stat_count), KVADDR,
+						&sync_rss,
+						sizeof(int),
+						"task_struct rss_stat MM_FILEPAGES",
+						RETURN_ON_ERROR))
+							continue;
 
-				rss += sync_rss;
-
-				/* count 1 -> anonpages */
-				if (!readmem(first->task +
-					OFFSET(task_struct_rss_stat) +
-					OFFSET(task_rss_stat_count) +
-					sizeof(int),
-					KVADDR, &sync_rss,
-					sizeof(int),
-					"task_struct rss_stat MM_ANONPAGES",
-					RETURN_ON_ERROR))
-						continue;
+					rss += sync_rss;
+
+					/* count 1 -> anonpages */
+					if (!readmem(first->task +
+						OFFSET(task_struct_rss_stat) +
+						OFFSET(task_rss_stat_count) +
+						sizeof(int),
+						KVADDR, &sync_rss,
+						sizeof(int),
+						"task_struct rss_stat MM_ANONPAGES",
+						RETURN_ON_ERROR))
+							continue;
 
-				rss += sync_rss;
+					rss += sync_rss;
 
-				if(first == last)
-					break;
-				first++;
-			}
+					if(first == last)
+						break;
+					first++;
+				}
 
-			tt->last_tgid = last;
+				tt->last_tgid = last;
+			}
 		}
 
 		/*  
-- 
1.7.1

--
Crash-utility mailing list
Crash-utility@xxxxxxxxxx
https://www.redhat.com/mailman/listinfo/crash-utility