RE: cannot run trace-cmd split in parallel

Sharon Gabay <Sharon.Gabay@xxxxxxxxxxxx> · Mon, 10 Jul 2023 06:44:53 +0000

Hi!

Running in parallel is needed because I'm using trace-cmd split to split a big job (analyzing multiple frames) between several processes to speed it up.

Unfortunately I don't currently build trace-cmd, so I can't try the patch, but writing this email made me think of another solution. I'm now specifying output files in separate temporary directories (/tmp/1, /tmp/2 ...) and it works perfectly!

I think it would be useful to have two fixes:
- make trace-cmd create the temporary output ("tmp.0.0") using either a uuid, or the name of the output file itself, or maybe add some suffix to it. In short, avoid collisions.
- I'm not sure why, but "trace-cmd split -o /tmp/a" actually writes to /tmp/a.1. If possible, it would be best to write to the exact name specified by the user.
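
For the first point, one possible approach is sketched below. It assumes POSIX mkstemp() is available; make_unique_temp() and the ".tmp.<cpu>.XXXXXX" naming are made up for illustration and are not how trace-cmd names its temp files today.

```c
#define _POSIX_C_SOURCE 200809L
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>

/*
 * Hypothetical helper, not trace-cmd code: create a uniquely named
 * temp file derived from the full output file name.
 * On success, path holds the generated name and an open fd is returned.
 * Returns -1 if the name does not fit in path or mkstemp() fails.
 */
static int make_unique_temp(const char *output_file, int cpu,
			    char *path, size_t len)
{
	int n = snprintf(path, len, "%s.tmp.%d.XXXXXX", output_file, cpu);

	if (n < 0 || (size_t)n >= len)
		return -1;

	/* mkstemp() replaces XXXXXX and creates the file exclusively */
	return mkstemp(path);
}
```

Because mkstemp() creates the file exclusively, two processes should never end up with the same name, even when they share an output directory. The caller is responsible for close() and unlink() when done.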

Thanks!
Sharon

-----Original Message-----
From: Steven Rostedt <rostedt@xxxxxxxxxxx> 
Sent: Sunday, 09 July 2023 22:16
To: Sharon Gabay <Sharon.Gabay@xxxxxxxxxxxx>
Cc: "linux-trace-users@xxxxxxxxxxxxxxx" <linux-trace-users@xxxxxxxxxxxxxxx>
Subject: Re: cannot run trace-cmd split in parallel

On Sun, 9 Jul 2023 15:39:46 +0000
Sharon Gabay <Sharon.Gabay@xxxxxxxxxxxx> wrote:

> Hi!
> 
> I've been having this very strange issue for as long as I've been 
> using "trace-cmd split". Actually I didn't write about it until now 
> because it's so strange I was sure the blame is on the user 😊
> 
> When I use "trace-cmd split" in parallel, I get randomly invalid 
> output. This happens specifically when I use the start/end arguments.

I have to admit that I never thought about running it in parallel.

> 
> To reproduce, take any trace.dat (as far as I can tell), and run the 
> following command, which is 5 nearly identical command lines run in 
> parallel. The only difference is in the start/end arguments. Without 
> this difference, the issue does not reproduce.

> 
> trace-cmd split -i trace.dat -o /tmp/out1 <start> <end> &
> trace-cmd split -i trace.dat -o /tmp/out2 <start+1> <end+1> &
> trace-cmd split -i trace.dat -o /tmp/out3 <start+2> <end+2> &
> trace-cmd split -i trace.dat -o /tmp/out4 <start+3> <end+3> &
> trace-cmd split -i trace.dat -o /tmp/out5 <start> <end>
> 
> If you compare the output of the first and last commands, which are 
> exactly the same, the output is different:
> diff /tmp/out1.1 /tmp/out5.1
> 
> But it's not consistent. Every run behaves differently, so you 
> might need a few runs to see this. It might also not happen at all, I 
> guess. Statistics.
> 
> You can expect the diff to be different if you see this in the
> stdout/stderr:

> …
> libtracecmd: No such file or directory
>   can not stat '/tmp/.tmp.tmp.0'
> trace-cmd: No such file or directory
>   Failed to append tracing data
> 
> libtracecmd: No such file or directory
>   can not stat '/tmp/.tmp.tmp.0'
> trace-cmd: No such file or directory
>   Failed to append tracing data
> 

> It looks as if trace-cmd is using some file to store temporary 
> data, and the same filename is used by all processes.

It is supposed to base the temp file names on the output file, but it appears that I got the dirname() and basename() calls backwards: dirname() truncated the output file string, so basename() then returned the same thing as the dir name. This causes all the temp files to be named after the directory, and you will hit this conflict whenever your output files share the same directory!
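
To illustrate the ordering problem outside of trace-cmd, here is a minimal sketch. The helper names split_dir_first() and split_base_first() are made up for this example, and the behavior assumes a libc (such as glibc) whose dirname() edits the string in place, which POSIX allows but does not require.

```c
#define _POSIX_C_SOURCE 200809L
#include <libgen.h>	/* POSIX dirname() and basename() */
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

/*
 * Hypothetical helper: dirname() first, then basename() on the same
 * buffer (the order used before the patch).
 * Returns 0 on success, -1 if the copy cannot be allocated.
 */
static int split_dir_first(const char *path, char *dir, char *base,
			   size_t len)
{
	char *copy = strdup(path);

	if (!copy)
		return -1;
	/* dirname() may overwrite the last '/' in copy with '\0' */
	snprintf(dir, len, "%s", dirname(copy));
	/* basename() now only sees what dirname() left behind */
	snprintf(base, len, "%s", basename(copy));
	free(copy);
	return 0;
}

/*
 * Hypothetical helper: basename() first, while the buffer still holds
 * the full path, then dirname() (the order used by the patch).
 */
static int split_base_first(const char *path, char *dir, char *base,
			    size_t len)
{
	char *copy = strdup(path);

	if (!copy)
		return -1;
	snprintf(base, len, "%s", basename(copy));
	snprintf(dir, len, "%s", dirname(copy));
	free(copy);
	return 0;
}
```

With glibc, split_dir_first() on "/tmp/out1" is expected to give dir "/tmp" and base "tmp", which matches the '/tmp/.tmp.tmp.0' name in the error output, while split_base_first() gives dir "/tmp" and base "out1". The sketch copies each result right away because POSIX also lets both functions return pointers to static storage.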

> 
> Can anyone help me understand this weird behavior?
> 

Can you try this patch to see if it fixes your situation?

-- Steve

diff --git a/tracecmd/trace-split.c b/tracecmd/trace-split.c
index 1daa847d..57c4e64f 100644
--- a/tracecmd/trace-split.c
+++ b/tracecmd/trace-split.c
@@ -367,8 +367,8 @@ static double parse_file(struct tracecmd_input *handle,
 	int fd;
 
 	output = strdup(output_file);
-	dir = dirname(output);
 	base = basename(output);
+	dir = dirname(output);
 
 	ohandle = tracecmd_copy(handle, output_file, TRACECMD_FILE_CMD_LINES, 0, NULL);