Re: Reducing redundancy in a collection of text files.

Tim here.  Strangely, I didn't see the original posting, so I'm
following up on the reply that came in today.

The for-loop doesn't work well unless the two file names follow a
consistent naming scheme or you have the original cached somewhere.

If you need to diff multiple files, you need a way to associate each
one with its counterpart. In the example below, I assume they are named
"file_orig.txt" and "file_new.txt", which would turn your example into

  $ for fname in *_orig.txt ; do diff "$fname" "${fname%orig.txt}new.txt" ; done

which will find all the "*_orig.txt" files and compare them with
their corresponding "*_new.txt" versions.
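
If the names follow no pattern at all, the nested loop asked about in
the original post does work in bash.  Here is a minimal sketch, assuming
"no unique content" means "every line of one file also appears somewhere
in the other" (so line order and repeated lines are ignored).  The
function names is_subset and report_redundant are made up for
illustration:

```shell
# Hypothetical helper: succeed when every line of file $1 also
# appears as a whole line somewhere in file $2.
#   -F fixed strings, -x whole-line match, -v invert the match,
#   -q quiet, -f FILE read the patterns from FILE
is_subset() {
    ! grep -Fxvq -f "$2" -- "$1"
}

# Compare every ordered pair of *.txt files in the current directory
# and report the files that add nothing beyond some other file.
# This only prints; it never deletes anything.
report_redundant() {
    local a b
    for a in *.txt; do
        for b in *.txt; do
            if [ "$a" = "$b" ]; then
                continue    # skip comparing a file with itself
            fi
            if is_subset "$a" "$b"; then
                printf '%s adds nothing beyond %s\n' "$a" "$b"
            fi
        done
    done
}
```

Run report_redundant from the directory holding the transcripts.  Two
files with the same lines will each be reported against the other, so
check the output before deleting anything.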

Version-control is great for this sort of thing, whether git,
mercurial, fossil, or even svn, cvs, or rcs.  In this case, you
only have one filename, and the version control software manages
tracking the old-vs-new.  For example:

  $ git init .
  $ git add file1.txt file2.txt
  $ git commit -m "Initial checkin"
  $ edit file1.txt
  $ edit file2.txt
  $ git diff --word-diff=plain
  $ git commit -m "Made some satisfactory edits"
  $ edit file1.txt
  $ edit file2.txt
  $ git diff --word-diff=plain
  $ git commit -m "Made more edits"

That said, if you're *not* tracking them in version-control, you
can use diff or wdiff.  For prose, I find that `wdiff` is a little
easier to read

  $ wdiff file1.txt file2.txt

especially since you can set the text that is used for markers
around the text that was added/removed.  This lets you do crazy
things, such as feeding the output to espeak's SSML markup mode:

  $ wdiff \
     --start-delete='<s><prosody pitch="-8st">' \
     --end-delete='</prosody></s>' \
     --start-insert='<voice gender="female">' \
     --end-insert='</voice>' \
     file_orig.txt file_new.txt | espeak -m

so that deleted text gets spoken in a low voice and inserted text
gets spoken in a female voice.
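
For the second question in the original post (repeated paragraphs
within a single file), awk can track paragraphs where uniq cannot.
This is a rough sketch, assuming paragraphs are separated by blank
lines and that two paragraphs count as duplicates when they match after
collapsing whitespace, so mid-paragraph line breaks don't matter.  The
function name dedupe_paragraphs and the marker format are made up for
illustration:

```shell
# Hypothetical helper: print file $1 with each repeated paragraph
# replaced by a marker giving the line number where the paragraph
# first started and its first 40 characters.
dedupe_paragraphs() {
    awk '
    function flush(   key) {
        if (para == "") return
        key = para
        gsub(/[ \t\n]+/, " ", key)      # ignore differences in line wrapping
        sub(/^ /, "", key)
        sub(/ $/, "", key)
        if (key in first) {
            printf "[duplicate of paragraph at line %d: %s]\n", first[key], substr(key, 1, 40)
        } else {
            first[key] = start          # remember where it first appeared
            print para
        }
        para = ""
    }
    /^[ \t]*$/ { flush(); print; next } # a blank line ends the paragraph
    {
        if (para == "") start = NR      # first line of a new paragraph
        para = (para == "" ? $0 : para "\n" $0)
    }
    END { flush() }
    ' "$1"
}
```

It writes to standard output and leaves the input untouched, so
something like "dedupe_paragraphs chat.txt > chat_clean.txt" lets you
review the result before replacing anything.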

Hopefully that puts a few tools in your belt that you can use to
make your work easier.

-tim


On 2023-01-05 12:19, Linux for blind general discussion wrote:
> You can look at the comm command for comparing files.
> It does not solve your problem, but might help you get there.
> Regards, Willem
> 
> 
> On Wed, 28 Dec 2022, Linux for blind general discussion wrote:
> 
> >Okay, I have two related issues, one regarding comparing text files
> >and one regarding the contents of a single text file, and in both
> >cases, I'm mostly working with transcripts of conversations I had with
> >an AI language model that I'm trying to clean up.
> >
> >For the first issue, mostly caused by sometimes saving a transcript at
> >a dozen points in the conversation, let's say we have two versions of
> >a file A and B.
> >
> >Ideally, B contains everything contained in A plus some extra content
> >not found in A. Since A has no unique content, it can be deleted
> >safely.
> >
> >By extension, ideally, if I have a dozen versions of a given file, the
> >above would hold for every link in the chain, and I could just do a wc
> >on the files and delete all but the longest file.
> >
> >Problem is, I can't be sure A doesn't have contents not found in B,
> >and on top of that, the file names aren't always descriptive, so it
> >isn't obvious when I should even try comparing the contents of two
> >files.
> >
> >I suspect diff has an option or set of options to detect when one or
> >both of a pair of files have unique contents, but diff's lack of batch
> >processing would make using such a bit of a pain even just running it
> >on the file pairs I know to be similar.
> >
> >Is there either a utility that will compare every pair of files in a
> >directory looking for contents found in one but not the other,
> >deleting files with no unique content or a way to have a bash script
> >loop through a directory with diff to do something similar?
> >
> >Does something like
> >
> >for file 1 in *.txt file2 in *.txt; do
> >diff $file1 $file2
> >done
> >
> >or nesting for loops of this sort even work in bash? I honestly don't
> >know as I don't think I've ever written a script that had to loop
> >through a cartesian product of input files instead of a single set.
> >
> >The other issue is that the AI language model in question likes
> >repeating itself... I might get a dozen responses that are half new
> >and half quoting part of the previous response, leading to a dozen
> >copies of some paragraphs.
> >
> >I know the uniq command can find and remove duplicate lines in a file,
> >but it only works if the duplicates are adjacent, and sorting the file
> >to make the duplicates adjacent would destroy any semblance of the
> >files having an order... plus, I'm more interested in finding
> >duplicates at the paragraph level, not the line level and while some
> >of the files only have line breaks at the end of the paragraph, others
> >have line breaks mid paragraph... Also, it would be nice if, instead
> >of just deleting the duplicate paragraphs, the tool I use to automate
> >tracking them down replaced the duplicates with a marker indicating
> >the starting line number of the original and the first 40 or so
> >characters of the paragraph to facilitate wanting to either move the
> >duplicated paragraph to one of the later occurrences or deciding to
> >keep some of the duplicates for one reason or another.
> >
> >Anyone know of any tools for locating repeated content in a file
> >without the limitations of uniq?
> >
> >And for either issue, I would prefer a command line solution.
> >
> >_______________________________________________
> >Blinux-list mailing list
> >Blinux-list@xxxxxxxxxx
> >https://listman.redhat.com/mailman/listinfo/blinux-list
> >
> >
> 



