Re: Reducing redundancy in a collection of text files.

Original Poster here,

An update on my end: I finally figured out something to type into
Google that gave me something useful... and I came across the cmp
command.

It does a byte-by-byte comparison of two files and reports the byte
offset and line number of the first difference. That's potentially
useful on its own, but what's been most helpful is that, if you run cmp
on a pair of files that are identical except that one continues past
the end of the other, it reports EOF on the shorter file, which solves
the problem of weeding out files that are just truncated versions of a
longer one.
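
For example, I'm thinking something like this should tell me whether a
file A is just a truncated copy of a file B (untested, and I believe -s
and -n are GNU extensions to cmp):

# compare only the first $(wc -c < A) bytes; exit status 0 means every
# byte of A also starts B, i.e. A is a prefix of B
if cmp -s -n "$(wc -c < A)" A B; then
    echo "A has nothing that B lacks"
fi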

Also, I've learned that diff, diff3, and sdiff have capabilities for
merging different versions of a file, though I haven't dug into the
details yet.
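
From a quick skim of the man pages, the merge options look something
like this (untested, and the file names are just placeholders):

# three-way merge: take MINE and fold in the changes between OLD and YOURS
diff3 -m MINE OLD YOURS > merged.txt

# interactively merge two files side by side, writing the result out
sdiff -o merged.txt file1.txt file2.txt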

I'll give comm a look though.
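If I'm reading its man page right, comm wants sorted input, so I'm
guessing that checking whether A has anything B lacks would look
something like this (untested):

# print lines that appear only in A; no output means A has no unique lines
comm -23 <(sort A) <(sort B)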

On 1/5/23, Linux for blind general discussion <blinux-list@xxxxxxxxxx> wrote:
> You can look at the comm command for comparing files.
> It does not solve your problem, but it might help you get there.
> Regards, Willem
>
>
> On Wed, 28 Dec 2022, Linux for blind general discussion wrote:
>
>> Okay, I have two related issues, one regarding comparing text files
>> and one regarding the contents of a single text file, and in both
>> cases, I'm mostly working with transcripts of conversations I had with
>> an AI language model that I'm trying to clean up.
>>
>> For the first issue, which mostly comes from sometimes saving a
>> transcript at a dozen points in the same conversation, let's say we
>> have two versions of a file, A and B.
>>
>> Ideally, B contains everything in A plus some extra content not found
>> in A. In that case, A has no unique content and can be deleted safely.
>>
>> By extension, ideally, if I have a dozen versions of a given file, the
>> above would hold for every link in the chain, and I could just do a wc
>> on the files and delete all but the longest one.
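>>
>> In the ideal case, something as simple as
>>
>> wc -c *.txt | sort -n
>>
>> would sort them by size, putting the longest near the bottom.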
>>
>> Problem is, I can't be sure A doesn't have contents not found in B,
>> and on top of that, the file names aren't always descriptive, so it
>> isn't obvious when I should even try comparing the contents of two
>> files.
>>
>> I suspect diff has an option or set of options to detect when one or
>> both of a pair of files have unique contents, but diff's lack of batch
>> processing would make using it a bit of a pain, even just on the file
>> pairs I already know to be similar.
>>
>> Is there either a utility that will compare every pair of files in a
>> directory, looking for content found in one but not the other and
>> deleting files with no unique content, or a way to have a bash script
>> loop over a directory with diff to do something similar?
>>
>> Does something like
>>
>> for file1 in *.txt; do
>>   for file2 in *.txt; do
>>     diff "$file1" "$file2"
>>   done
>> done
>>
>> even work in bash with nested for loops like that? I honestly don't
>> know, as I don't think I've ever written a script that had to loop
>> through the Cartesian product of the input files instead of a single
>> set.
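>>
>> If it does, I'm picturing something along these lines, though this is
>> an untested sketch and it treats "no unique content" as "diff reports
>> no lines found only in file1":
>>
>> for file1 in *.txt; do
>>   for file2 in *.txt; do
>>     [ "$file1" = "$file2" ] && continue
>>     # in diff's default output, lines prefixed with '<' come only from
>>     # the first file; if there are none, file1 adds nothing over file2
>>     if ! diff "$file1" "$file2" | grep -q '^<'; then
>>       echo "$file1 looks safe to delete in favour of $file2"
>>     fi
>>   done
>> done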
>>
>> The other issue is that the AI language model in question likes
>> repeating itself... I might get a dozen responses that are half new
>> and half quoting part of the previous response, leading to a dozen
>> copies of some paragraphs.
>>
>> I know the uniq command can find and remove duplicate lines in a file,
>> but it only works if the duplicates are adjacent, and sorting the file
>> to make the duplicates adjacent would destroy any semblance of the
>> files having an order. Besides, I'm more interested in finding
>> duplicates at the paragraph level, not the line level, and while some
>> of the files only have line breaks at the end of each paragraph,
>> others have line breaks mid-paragraph. Also, it would be nice if,
>> instead of just deleting the duplicate paragraphs, the tool I use to
>> automate tracking them down replaced each duplicate with a marker
>> giving the starting line number of the original and the first 40 or so
>> characters of the paragraph, to make it easier to either move the
>> duplicated paragraph to one of its later occurrences or decide to keep
>> some of the duplicates for one reason or another.
>>
>> Anyone know of any tools for locating repeated content in a file
>> without the limitations of uniq?
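>>
>> The closest I've come up with on my own is an awk sketch along these
>> lines (untested, it assumes paragraphs are separated by blank lines,
>> and it only records the first 40 characters, not the line number of
>> the original):
>>
>> awk 'BEGIN { RS = ""; ORS = "\n\n" }
>>      seen[$0]++ { print "[dup of: " substr($0, 1, 40) "...]"; next }
>>      { print }' transcript.txt > deduped.txt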
>>
>> And for either issue, I would prefer a command line solution.
>>

_______________________________________________
Blinux-list mailing list
Blinux-list@xxxxxxxxxx
https://listman.redhat.com/mailman/listinfo/blinux-list



