Re: Code Search for Fedora

On Tue, 18 Nov 2014 13:00:22 -0800
Michael Stapelberg <michael+fedora@xxxxxxxxxxxxx> wrote:

> Hey,

Greetings. 

> Recently I’ve been talking to Hannes (cc'ed) about whether Fedora
> would be interested in having the equivalent of
> http://codesearch.debian.net/¹.
> 
> The project came to life as my Bachelor of Science Thesis² and aims to
> provide fast regular expression search over a big corpus, in this case
> 140 GB of source code of all software included in the Debian main
> distribution (as opposed to non-free or contrib, which we excluded
> because of licensing concerns). It is based on the work Russ Cox
> published, which in turn resembles the work he did on Google Code
> Search when he was an intern there in 2006.
> 
> So, what’s this discussion about?
> 
> What I’m offering is setting up/running a public version of Code
> Search for Fedora. It needs to be public because I want the open
> source community as a whole to profit from it, and also I’m told you
> have somewhat comparable tools internally anyway :).

We have talked about a code search type application several times in
the past, but never got as far as coding. 

Some things to note about our infrastructure: 

Everything we use must be under a free license: 
https://fedoraproject.org/wiki/Infrastructure_Licensing
(which I don't think will be a problem, just noting it. ;) 

We have a process for bringing up new applications, called "Request For
Resources": 
https://fedoraproject.org/wiki/Request_For_Resources?rd=Infrastructure/RFR

Through this process we make sure that more than one person knows how
the application works and can fix it, that it's monitored properly,
etc.

> 
> My motivation comes from multiple places:
> 
> 1) I’m fairly sure Fedora packages a slightly different set of
> software than Debian, so running both DCS (Debian Code Search) and FCS
> (Fedora Code Search) would enlarge the amount of searchable software.

Probably true. Also, possibly differing versions...

> 2) I’m interested in my work having a positive effect on the world (or
> at least the open source community), and running multiple instances of
> Code Search reduces its dependency on any single distribution, thereby
> increasing its reliability and scope.

Reasonable. 
 
> 3) Last but not least, I intend to try Fedora on one of my computers
> to broaden my horizons. I figured getting in contact with some of you
> while working on this project may be a good way to get a foot in the
> door of the community and see whether I like it around here.

Welcome. :) Hope you like it.

> In terms of what I’d need in order to make this project a success,
> there are some hardware requirements (aside from, of course, time and
> motivation):
> 
> The in-memory index and searchable source code can be sharded across
> an almost arbitrary number of machines. Some sharding is necessary
> because the index of a single shard must stay below 2 GB. At the
> moment, we are running 6 index-backend VMs, each serving a 1.8G
> in-memory index and about 40G of source code (including partial
> indexes). In order to grep through the source quickly, the source is
> stored on local SSDs (as opposed to a network block storage volume,
> or even regular HDDs).
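
To make the 2 GB per-shard index limit described above concrete, here is
a minimal sketch of one way packages could be spread across shards. It
is only an illustration: the first-fit-decreasing strategy, the name
pack_shards, and the package names and index sizes are all assumptions
made up for the example, not how Code Search actually assigns packages.

```python
# Hypothetical sketch: distribute per-package index sizes across shards so
# that no shard's index reaches the 2 GB limit mentioned in the mail.
# Sizes are in MB (integers) to avoid floating-point surprises.

SHARD_LIMIT_MB = 2048  # each shard's index must stay below 2 GB


def pack_shards(index_sizes_mb, limit_mb=SHARD_LIMIT_MB):
    """First-fit-decreasing: put each package in the first shard with room."""
    shards = []  # each shard: {"used_mb": int, "packages": [str, ...]}
    # Largest packages first; ties broken by name for a stable result.
    ordered = sorted(index_sizes_mb.items(), key=lambda kv: (-kv[1], kv[0]))
    for name, size in ordered:
        if size >= limit_mb:
            raise ValueError(f"{name} alone reaches the per-shard index limit")
        for shard in shards:
            if shard["used_mb"] + size < limit_mb:
                shard["used_mb"] += size
                shard["packages"].append(name)
                break
        else:
            # No existing shard has room, so open a new one.
            shards.append({"used_mb": size, "packages": [name]})
    return shards


if __name__ == "__main__":
    # Made-up index sizes, purely for demonstration.
    example = {"kernel": 1200, "libreoffice": 900, "firefox": 700, "glibc": 300}
    for number, shard in enumerate(pack_shards(example)):
        print(number, shard["used_mb"], shard["packages"])
```

Fed with real per-package index sizes, the number of shards returned
would give a rough estimate of how many index-backend VMs are needed.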

We currently don't have any SSDs. ;(

> In addition to the actual data, we also need a web frontend to serve
> and combine this data, and we have one more VM which scrapes
> monitoring information and shows nice graphs about how the whole
> system behaves.
> 
> So, in total, we run 8 VMs, of which 6 are equipped with 4 cores, 4G
> of RAM (for 2G of index + 2G page cache for grepping files) and 40G
> SSD volumes each. The web frontend uses 4 cores and 2G of RAM, and
> also an SSD for caching entire query results. The monitoring VM needs
> just one core and 2G of RAM.
> 
> Does that sound reasonable and feasible? I’m not sure what kind of
> hardware you have available for projects like this one, and currently
> we’re sponsored by Rackspace because Debian doesn’t have that sort of
> hardware easily available.

Well, we don't have any virthosts with SSDs currently, so that could
be a hang-up. We do have virthosts with memory and SAS disks.
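
For planning purposes, the numbers quoted above can be added up as
follows. This is just a sketch of the arithmetic: the VMSpec class and
its field names are made up for illustration, and the frontend's SSD
size is left as None because the mail does not state it.

```python
# Sketch: total up the VM requirements quoted in the mail.
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class VMSpec:  # hypothetical helper, not part of Code Search
    role: str
    count: int
    cores: int
    ram_gb: int
    ssd_gb: Optional[int]  # None = SSD needed, but size not stated


FLEET = [
    VMSpec("index-backend", count=6, cores=4, ram_gb=4, ssd_gb=40),
    VMSpec("web-frontend", count=1, cores=4, ram_gb=2, ssd_gb=None),
    VMSpec("monitoring", count=1, cores=1, ram_gb=2, ssd_gb=0),
]


def totals(fleet):
    """Sum VMs, cores, RAM, and the SSD space whose size is known."""
    return {
        "vms": sum(v.count for v in fleet),
        "cores": sum(v.count * v.cores for v in fleet),
        "ram_gb": sum(v.count * v.ram_gb for v in fleet),
        "known_ssd_gb": sum(v.count * v.ssd_gb for v in fleet
                            if v.ssd_gb is not None),
    }


if __name__ == "__main__":
    print(totals(FLEET))
```

That comes to 8 VMs with 29 cores and 28G of RAM in total, plus 240G of
SSD for the index backends and whatever the frontend cache needs.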
 
> I feel like this email is long enough already, so I’ll just ask a
> general: what do you think? Do you need any more information? Please
> just ask, and keep me CC'ed, since I’m not subscribed to this list.

I think before we go looking into hardware requirements, we should
discuss the software. What's it written in? Is there a group of people
working on it, or just you?

We would want it packaged up as RPMs for deployment, preferably for
EPEL 7 (to work on RHEL 7 hosts).

Would you be open to changes in code/architecture to better fit our
setup?

Again, welcome... 

kevin

