Re: WIP: searching all of lore

Konstantin Ryabitsev <konstantin@xxxxxxxxxxxxxxxxxxx> wrote:
> On Thu, Nov 26, 2020 at 07:45:43PM +0000, Eric Wong wrote:
> > Requires Tor, for now:
> > 
> > http://rskvuqcfnfizkjg6h5jvovwb3wkikzcwskf54lfpymus6mxrzw67b5ad.onion/all/
> > http://lore.czquwvybam4bgbro.onion/all/
> 
> Thanks for this work, Eric, things are looking good in my tests, though
> I uncovered a bunch of problems with b4 when used with torsocks. :)
> 
> When grabbing t.mbox.gz threads from /all, it appears to properly
> reconstitute follow-ups from multiple mailing lists, correct?

Yup, though some duplicates appear due to different mailing list-added
trailers.  Maybe some of the PublicInbox::Filter::* stuff (currently
only for -mda + -watch) can be applied to the indexing phase to better
dedupe and drop trailers.

> Is there a
> way to "weight" different sources, so that when the same message-id
> exist in multiple places, we can prefer one source over another?

It indexes based on the order it iterates through the inboxes
and messages.  That usually follows the order in the config file,
especially if indexing is delayed.  Of course it's possible a
message can show up in a low-priority source first due to
network latency or outages (something I'm too familiar with :<).
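
To illustrate the "first seen wins" behavior described above, here is a
minimal sketch in Python.  It is not public-inbox code; the function
name, the inbox names, and the Message-IDs are all made up for the
example, and it assumes the only thing that decides which copy is kept
is iteration order.

```python
from typing import Dict, Iterable, List, Tuple

# Hypothetical model: each source is (inbox_name, message_ids_in_arrival_order).
Source = Tuple[str, List[str]]


def index_first_seen(sources: Iterable[Source]) -> Dict[str, str]:
    """Map each Message-ID to the first inbox it was seen in."""
    chosen: Dict[str, str] = {}
    for inbox, message_ids in sources:
        for mid in message_ids:
            # setdefault keeps the existing entry, so later copies of
            # the same Message-ID never displace the first one.
            chosen.setdefault(mid, inbox)
    return chosen


# Inboxes listed in (assumed) config-file order: preferred source first.
config_order: List[Source] = [
    ("preferred-list", ["<a@example>", "<b@example>"]),
    ("other-list", ["<b@example>", "<c@example>"]),
]
by_config = index_first_seen(config_order)

# If the preferred copy is delayed, the other list's copy is seen first.
delayed: List[Source] = [
    ("other-list", ["<b@example>", "<c@example>"]),
    ("preferred-list", ["<a@example>", "<b@example>"]),
]
by_arrival = index_first_seen(delayed)
```

In the first run <b@example> is attributed to "preferred-list"; in the
second it ends up under "other-list", which is the latency/outage case
mentioned above.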

I have an idea to fix that via --reindex which *might*
allow performance improvements on the Xapian side, too.

--reindex is another mind twister when dealing with multiple
histories compared to normal inboxes and will need a new
approach.  Been working on that and my head hurts :x

> For
> example, this is useful when we're trying to do DKIM validation and some
> lists are known to mess that up, while others do the right thing.

Right, though I think it's somewhat less necessary given how sensitive
PublicInbox::ContentHash is compared to just using the Message-ID to
dedupe...

One bad thing about it being too sensitive is that NNTP speedups couldn't
rely solely on content hashing because of mailing list trailers, as noted
yesterday:

https://public-inbox.org/meta/20201130194201.GA6687@dcvr/
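
A rough sketch of why a content hash is more sensitive than Message-ID
matching when lists append trailers.  This is NOT the actual
PublicInbox::ContentHash algorithm; the fields hashed, the
content_digest and strip_trailer helpers, and the "-- " trailer marker
are assumptions chosen only to show the effect.

```python
import hashlib


def content_digest(subject: str, body: str) -> str:
    """Hash a couple of message fields (illustrative choice of fields)."""
    h = hashlib.sha256()
    h.update(subject.encode("utf-8"))
    h.update(b"\0")  # separator so field boundaries are unambiguous
    h.update(body.encode("utf-8"))
    return h.hexdigest()


def strip_trailer(body: str, marker: str = "\n-- \n") -> str:
    """Drop everything from a (hypothetical) trailer marker onward."""
    head, _sep, _rest = body.partition(marker)
    return head


subject = "Re: WIP: searching all of lore"
original = "Looks good to me.\n"
# Same message as delivered by a list that appends a footer.
via_list = original + "\n-- \nTo unsubscribe, mail list-request@example\n"

raw_a = content_digest(subject, original)
raw_b = content_digest(subject, via_list)

stripped_a = content_digest(subject, strip_trailer(original))
stripped_b = content_digest(subject, strip_trailer(via_list))
```

Both copies would share one Message-ID, yet raw_a and raw_b differ, so
a pure content hash treats them as distinct messages.  After stripping
the assumed trailer the digests match, which is the kind of
normalization a filter at indexing time would need to do.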

> Thanks again,

You're welcome :>