Okay, I have two related issues, one regarding comparing text files and one regarding the contents of a single text file. In both cases, I'm mostly working with transcripts of conversations I had with an AI language model that I'm trying to clean up.

The first issue is mostly caused by sometimes saving a transcript at a dozen points in the conversation. Let's say we have two versions of a file, A and B. Ideally, B contains everything contained in A plus some extra content not found in A. Since A has no unique content, it can be deleted safely. By extension, ideally, if I have a dozen versions of a given file, the above would hold for every link in the chain, and I could just do a wc on the files and delete all but the longest one. The problem is, I can't be sure A doesn't have content not found in B, and on top of that, the file names aren't always descriptive, so it isn't obvious when I should even try comparing the contents of two files.

I suspect diff has an option or set of options to detect when one or both of a pair of files have unique contents, but diff's lack of batch processing would make using it a bit of a pain even just on the file pairs I know to be similar. Is there either a utility that will compare every pair of files in a directory, looking for content found in one but not the other and deleting files with no unique content, or a way to have a bash script loop through a directory with diff to do something similar? Does something like

    for file1 in *.txt; do for file2 in *.txt; do diff "$file1" "$file2"; done; done

i.e., nested for loops over the same set of files, even work in bash? I honestly don't know, as I don't think I've ever written a script that had to loop through a cartesian product of input files instead of a single set. (The first sketch at the end of this message is roughly what I'm picturing.)

The other issue is that the AI language model in question likes repeating itself... I might get a dozen responses that are half new and half quoting part of the previous response, leading to a dozen copies of some paragraphs. I know the uniq command can find and remove duplicate lines in a file, but it only works if the duplicates are adjacent, and sorting the file to make the duplicates adjacent would destroy any semblance of the file having an order... Plus, I'm more interested in finding duplicates at the paragraph level, not the line level, and while some of the files only have line breaks at the end of each paragraph, others have line breaks mid-paragraph... Also, it would be nice if, instead of just deleting the duplicate paragraphs, the tool I use to automate tracking them down replaced each duplicate with a marker indicating the starting line number of the original and the first 40 or so characters of the paragraph, to make it easier to either move the duplicated paragraph to one of its later occurrences or decide to keep some of the duplicates for one reason or another. (The second sketch at the end of this message shows roughly the behavior I have in mind.) Anyone know of any tools for locating repeated content in a file without the limitations of uniq?

For either issue, I would prefer a command-line solution.
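To make the first question more concrete, here is an untested sketch of what I'm imagining: for every ordered pair of distinct .txt files, check whether the first file has any line not also present in the second, using grep's fixed-string, whole-line matching as a stand-in for whatever diff options would do the job properly. The line-level containment test (which ignores ordering and how many times a line repeats) and the "candidate for deletion" message are just my assumptions about what "no unique content" should mean.

    #!/bin/bash
    # Untested sketch: flag files whose every line also appears in some
    # other file in the directory, as candidates for deletion.
    shopt -s nullglob
    for file1 in *.txt; do
        for file2 in *.txt; do
            [ "$file1" = "$file2" ] && continue
            # grep -F fixed strings, -x whole-line matches, -v invert the
            # match, -f take the patterns from file2. No lines selected
            # (exit status 1) means every line of file1 also occurs
            # somewhere in file2.
            if ! grep -Fxvf "$file2" "$file1" > /dev/null; then
                echo "$file1 has no lines missing from $file2; candidate for deletion"
            fi
        done
    done

Being quadratic in the number of files, this would crawl on a huge directory, but for a dozen or so transcripts that shouldn't matter; whether grep is really the right containment test here is the part I'm unsure about.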
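And for the second issue, here is an untested awk sketch of roughly the output I'm describing, assuming paragraphs are separated by a single blank line. The marker format, the whitespace normalization (so mid-paragraph line breaks don't hide duplicates), and the output file name are just my guesses at what I'd want, not anything an existing tool does.

    awk '
    BEGIN { RS = ""; FS = "\n"; line = 1 }
    {
        # Normalize whitespace so mid-paragraph line breaks do not hide duplicates.
        key = $0
        gsub(/[ \t\n]+/, " ", key)
        if (key in first_line) {
            # Replace the duplicate with a marker: line number of the
            # original plus the first 40 characters of the paragraph.
            print "[duplicate of line " first_line[key] ": " substr(key, 1, 40) "...]\n"
        } else {
            first_line[key] = line
            print $0 "\n"
        }
        # Advance the line counter past this paragraph and the blank line
        # after it (this drifts if paragraphs are separated by more than
        # one blank line).
        line += NF + 1
    }
    ' transcript.txt > transcript.deduped.txt

If nothing ready-made exists, something along those lines run on each file might be good enough, but I'd rather not reinvent the wheel if there's already a tool that handles this properly.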