On Tue, Jun 28 2022, Pavel Rappo wrote:

> On Tue, Jun 28, 2022 at 12:58 PM Ævar Arnfjörð Bjarmason
> <avarab@xxxxxxxxx> wrote:
>
> <snip>
>
>> But eventually you'll simply run into the regex engine being slow
>
> Since I know very little about git internals, I was under a naive
> impression that a significant, if not comparable to that of regex,
> portion of pickaxe's time is spent on computing diffs between
> revisions. So I assumed that there was a way to pre-compute those
> diffs.

Yes and no, maybe sort of :)

Firstly, -S doesn't involve a diff, it's comparing the raw pre-post
image, and seeing how many times we match.

-G does involve computing the diff. On the one hand we're fast at making
diffs, but that really shouldn't be significant compared to the speed of
a regex engine.

The other side of this is that we're really stupid about how we invoke
the regex engine, historical reasons, backwards compatibility & all
that, but we:

 * Aren't compiling the regex once, and using it N times in some cases
   (I have some local patches to fix this)

 * Are computing matches one line at a time, when we could e.g. point
   PCRE to an entire diff with the right line-split options.

 * Are often doing needless work, e.g. in v2.33 I solved an issue with
   us continuing to create diffs when we could abort early (see
   f97fe358576 (pickaxe -G: don't special-case create/delete,
   2021-04-12)), which resulted in some speed-up.

Some of these are tricky to fix.

> <snip>
>
>> 2. Stick that into Lucene with trigram indexing, e.g. ElasticSearch
>>    might make this easy.
>
> <snip>
>
>> For someone familiar with the tools involved that should be about a day
>> to get to a rough hacky solution, it's mostly gluing existing OTS
>> software together.
>
> <snip>
>
> I'll see what I can do with external systems. You see, I initially
> came from a similar repository exposed through OpenGrok. But I think
> that something was wrong with the index or query syntax because I
> couldn't find the things that I knew were there. I was able to secure
> a git repo that was close to that of OpenGrok as I found pickaxe to be
> robust albeit slow alternative for my searches.

This is the first time I've heard of OpenGrok, so no idea, sorry.

One common pitfall with search indexes is that they tend to have a
blacklist of words ("stop words"), e.g. Lucene will have "for", "or" and
other common English words as part of its defaults, so if you're trying
to e.g. find when you altered a for-loop you might silently be getting
no results.
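
To make the stop-word pitfall concrete, here's a toy sketch in Python.
It is not how Lucene or OpenGrok are implemented; the STOP_WORDS set,
the tokenizer and the function names are all made up for illustration.
It only shows how a query consisting of a stop word can come back empty
without any error:

```python
import re

# Hypothetical stop-word list for illustration only; a real analyzer
# ships (and lets you configure) its own.
STOP_WORDS = {"a", "an", "and", "for", "if", "in", "of", "or", "the", "to"}


def tokenize(text):
    """Lowercase, split on non-word characters, drop stop words."""
    words = re.findall(r"[a-z0-9_]+", text.lower())
    return [w for w in words if w not in STOP_WORDS]


def build_index(docs):
    """Map each surviving token to the set of document ids containing it."""
    index = {}
    for doc_id, text in docs.items():
        for token in tokenize(text):
            index.setdefault(token, set()).add(doc_id)
    return index


def search(index, query):
    """Return the ids of documents that contain every query token."""
    tokens = tokenize(query)
    if not tokens:
        # The whole query was stop words: no error, just no results.
        return set()
    return set.intersection(*(index.get(t, set()) for t in tokens))


docs = {
    "commit1": "for (i = 0; i < n; i++) total += values[i];",
    "commit2": "while (i < n) total += values[i++];",
}
index = build_index(docs)

print(search(index, "for"))    # empty, although commit1 contains "for"
print(search(index, "while"))  # finds commit2
```

If an external index behaves like this, the usual way out is to
configure the analyzer with an empty stop-word list for source code,
assuming the tool in question exposes such a setting.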