Hello, just today I heard about rmlint:

https://rmlint.readthedocs.io/en/latest/tutorial.html
https://rmlint.readthedocs.io/en/latest/

I haven't tried it yet, but it looks like it could do what you want.

Cheers,
Didier
--
Didier Spaier
Slint maintainer

On 28/12/2022 at 02:00, Linux for blind general discussion wrote:
> Okay, I have two related issues, one regarding comparing text files
> and one regarding the contents of a single text file. In both cases,
> I'm mostly working with transcripts of conversations I had with an AI
> language model that I'm trying to clean up.
>
> The first issue is mostly caused by sometimes saving a transcript at
> a dozen points in the conversation. Let's say we have two versions of
> a file, A and B.
>
> Ideally, B contains everything contained in A plus some extra content
> not found in A. Since A has no unique content, it can be deleted
> safely.
>
> By extension, ideally, if I have a dozen versions of a given file, the
> above would hold for every link in the chain, and I could just do a wc
> on the files and delete all but the longest file.
>
> The problem is, I can't be sure A doesn't have contents not found in
> B, and on top of that, the file names aren't always descriptive, so it
> isn't obvious when I should even try comparing the contents of two
> files.
>
> I suspect diff has an option or set of options to detect when one or
> both of a pair of files have unique contents, but diff's lack of batch
> processing would make using it a bit of a pain even just running it
> on the file pairs I know to be similar.
>
> Is there either a utility that will compare every pair of files in a
> directory looking for contents found in one but not the other,
> deleting files with no unique content, or a way to have a bash script
> loop through a directory with diff to do something similar?
>
> Does something like
>
> for file1 in *.txt file2 in *.txt; do
>     diff $file1 $file2
> done
>
> or nesting for loops of this sort even work in bash? I honestly don't
> know, as I don't think I've ever written a script that had to loop
> through a Cartesian product of input files instead of a single set.
>
> The other issue is that the AI language model in question likes
> repeating itself... I might get a dozen responses that are half new
> and half quoting part of the previous response, leading to a dozen
> copies of some paragraphs.
>
> I know the uniq command can find and remove duplicate lines in a
> file, but it only works if the duplicates are adjacent, and sorting
> the file to make the duplicates adjacent would destroy any semblance
> of the files having an order... Plus, I'm more interested in finding
> duplicates at the paragraph level, not the line level, and while some
> of the files only have line breaks at the end of a paragraph, others
> have line breaks mid-paragraph... Also, it would be nice if, instead
> of just deleting the duplicate paragraphs, the tool I use to automate
> tracking them down replaced each duplicate with a marker giving the
> starting line number of the original and the first 40 or so
> characters of the paragraph, to make it easier to either move the
> duplicated paragraph to one of its later occurrences or decide to
> keep some of the duplicates for one reason or another.
>
> Anyone know of any tools for locating repeated content in a file
> without the limitations of uniq?
>
> And for either issue, I would prefer a command line solution.
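
For reference, a minimal rmlint invocation might look like the sketch below. Note that rmlint looks for byte-identical duplicate files, so two transcripts that merely overlap would not be flagged; the directory name here is just a placeholder.

# Scan a directory for exact duplicate files; rmlint only reports,
# it does not delete anything by itself.
rmlint ~/transcripts

# rmlint writes a removal script into the current directory;
# review it before running it.
less rmlint.sh
sh rmlint.sh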
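On the "does A contain anything B lacks" question, one possible sketch uses grep's fixed-string, whole-line matching. It treats each file as a set of lines, so paragraphs that were re-wrapped between saves will not match; A.txt and B.txt are placeholder names.

# Print every line of A that does not appear verbatim anywhere in B.
# No output means A has no unique lines and is probably safe to delete.
grep -vxFf B.txt A.txt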
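And yes, nested for loops do work in bash once the snippet in the quoted message is written as two separate for statements. A rough version of the batch comparison, assuming all the transcripts are *.txt files in the current directory:

#!/bin/bash
# Compare every ordered pair of .txt files and report when one file
# has no lines that the other lacks (comm needs sorted input).
for file1 in *.txt; do
    for file2 in *.txt; do
        [ "$file1" = "$file2" ] && continue
        # Count lines that appear only in file1.
        only_in_1=$(comm -23 <(sort "$file1") <(sort "$file2") | wc -l)
        if [ "$only_in_1" -eq 0 ]; then
            echo "$file1 has nothing that isn't already in $file2"
        fi
    done
done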
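For the duplicate-paragraph issue, a sketch using awk's paragraph mode (RS set to the empty string) could replace repeats with the kind of marker described above. It only catches paragraphs that are byte-identical, so copies wrapped differently would first need a pass through fmt or similar, and the line-number bookkeeping assumes single blank lines between paragraphs; transcript.txt is a placeholder name.

awk 'BEGIN { RS = ""; ORS = "\n\n"; line = 1 }
{
    if ($0 in seen) {
        # Replace the repeat with a marker: the line number where the
        # paragraph first appeared plus its first 40 characters.
        print "[duplicate of line " seen[$0] ": " substr($0, 1, 40) "...]"
    } else {
        seen[$0] = line
        print
    }
    # Advance the running line count: lines in this paragraph plus the
    # blank separator line (approximate if blank lines are doubled).
    line += gsub(/\n/, "\n") + 2
}' transcript.txt > transcript.deduped.txt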