Re: Reducing redundancy in a collection of text files.

Hello,

Just today I heard about rmlint:
https://rmlint.readthedocs.io/en/latest/tutorial.html
https://rmlint.readthedocs.io/en/latest/

I haven't tried it yet, but it looks like it could do what you want.
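From a quick look at the tutorial, a first run seems to be as simple
as this (untested; the directory name is just a placeholder, and if I
read the docs right, rmlint only reports what it finds and writes an
rmlint.sh script that you review and run yourself):

# Scan a directory of transcripts for duplicate files.
rmlint ~/transcripts
# Look over the generated removal script before executing it.
less rmlint.sh
sh rmlint.sh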

Cheers,
Didier
-- 
Didier Spaier
Slint maintainer

On 28/12/2022 at 02:00, Linux for blind general discussion wrote:
> Okay, I have two related issues, one regarding comparing text files
> and one regarding the contents of a single text file, and in both
> cases, I'm mostly working with transcripts of conversations I had with
> an AI language model that I'm trying to clean up.
> 
> For the first issue, mostly caused by sometimes saving a transcript at
> a dozen points in the conversation, let's say we have two versions of
> a file, A and B.
> 
> Ideally, B contains everything contained in A plus some extra content
> not found in A. Since A has no unique content, it can be deleted
> safely.
> 
> By extension, ideally, if I have a dozen versions of a given file, the
> above would hold for every link in the chain, and I could just do a wc
> on the files and delete all but the longest file.
> 
> Problem is, I can't be sure A doesn't have contents not found in B,
> and on top of that, the file names aren't always descriptive, so it
> isn't obvious when I should even try comparing the contents of two
> files.
> 
> I suspect diff has an option or set of options to detect when one or
> both of a pair of files have unique contents, but diff's lack of batch
> processing would make using such options a bit of a pain, even just
> running it on the file pairs I know to be similar.
> 
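For two files you already suspect are related, comm can answer that
directly. A rough sketch with placeholder file names (it compares
whole lines, so paragraphs that were rewrapped between saves will
still show up as unique):

# Lines that appear in old.txt but not in new.txt.
# comm needs sorted input, hence the process substitutions.
comm -23 <(sort old.txt) <(sort new.txt)

# Empty output means old.txt has nothing unique and can probably go.
# Plain diff shows the same thing: "<" lines exist only in the first
# file, ">" lines only in the second.
diff old.txt new.txt | grep '^<'
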
> Is there either a utility that will compare every pair of files in a
> directory, looking for contents found in one but not the other and
> deleting files with no unique content, or a way to have a bash script
> loop through a directory with diff to do something similar?
> 
> Does something like
> 
> for file1 in *.txt file2 in *.txt; do
> diff $file1 $file2
> done
> 
> or nesting for loops of this sort even work in bash? I honestly don't
> know as I don't think I've ever written a script that had to loop
> through a Cartesian product of input files instead of a single set.
> 
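To answer the syntax question: that exact form won't work, but two
nested for loops do. A rough, untested sketch that only reports
candidates instead of deleting anything:

#!/bin/bash
# Compare every ordered pair of different .txt files in the
# current directory.
for a in *.txt; do
    for b in *.txt; do
        [ "$a" = "$b" ] && continue
        # If $a has no lines that are missing from $b, flag it.
        if [ -z "$(comm -23 <(sort "$a") <(sort "$b"))" ]; then
            echo "$a: no unique lines; everything is also in $b"
        fi
    done
done

It is quadratic in the number of files, but for a directory of
transcripts that should not matter.
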
> The other issue is that the AI language model in question likes
> repeating itself... I might get a dozen responses that are half new
> and half quoting part of the previous response, leading to a dozen
> copies of some paragraphs.
> 
> I know the uniq command can find and remove duplicate lines in a file,
> but it only works if the duplicates are adjacent, and sorting the file
> to make the duplicates adjacent would destroy any semblance of the
> files having an order... Plus, I'm more interested in finding
> duplicates at the paragraph level, not the line level, and while some
> of the files only have line breaks at the end of a paragraph, others
> have line breaks mid-paragraph... Also, it would be nice if, instead
> of just deleting the duplicate paragraphs, the tool I use to automate
> tracking them down replaced each duplicate with a marker indicating
> the starting line number of the original and the first 40 or so
> characters of the paragraph, to make it easier to either move the
> duplicated paragraph to one of its later occurrences or decide to
> keep some of the duplicates for one reason or another.
> 
> Anyone know of any tools for locating repeated content in a file
> without the limitations of uniq?
> 
> And for either issue, I would prefer a command line solution.
> 
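I don't know of a ready-made tool for that, but awk's paragraph mode
gets close. A rough sketch that keeps the first copy of each
blank-line-separated paragraph and replaces later copies with a
marker; it records the paragraph number rather than the line number,
only catches exact repeats, and assumes you have first normalised the
files so that paragraphs are separated by blank lines:

awk 'BEGIN { RS=""; ORS="\n\n" }
{
    if ($0 in seen) {
        # Point back at the first occurrence instead of repeating it.
        print "[duplicate of paragraph " seen[$0] ": " substr($0, 1, 40) "...]"
    } else {
        seen[$0] = NR    # NR counts paragraphs when RS is empty
        print
    }
}' transcript.txt > deduped.txt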

_______________________________________________
Blinux-list mailing list
Blinux-list@xxxxxxxxxx
https://listman.redhat.com/mailman/listinfo/blinux-list



