Hey,

Recently I’ve been talking to Hannes (cc'ed) about whether Fedora would be interested in having the equivalent of http://codesearch.debian.net/ ¹.

The project came to life as my Bachelor of Science thesis² and aims to provide fast regular expression search over a big corpus, in this case 140 GB of source code of all software included in the Debian main distribution (as opposed to non-free or contrib, which we excluded because of licensing concerns). It is based on the work Russ Cox published, which in turn resembles the work he did on Google Code Search when he was an intern there in 2006.

So, what’s this discussion about? What I’m offering is to set up and run a public version of Code Search for Fedora. It needs to be public because I want the open source community as a whole to profit from it, and also I’m told you have somewhat comparable tools internally anyway :).

My motivation comes from multiple places:

1) I’m fairly sure Fedora packages a slightly different set of software than Debian, so running both DCS (Debian Code Search) and FCS (Fedora Code Search) would increase the amount of searchable software.

2) I’m interested in my work having a positive effect on the world (or at least the open source community), and running multiple instances of Code Search reduces its dependency on any single distribution, thereby increasing its reliability and scope.

3) Last but not least, I intend to try Fedora on one of my computers to broaden my horizons. I figured getting in contact with some of you while working on this project may be a good way to get a foot in the door of the community and see whether I like it around here.

In terms of what I’d need in order to make this project a success, there are some hardware requirements (aside from, of course, time and motivation):

The in-memory index and the searchable source code can be sharded across an almost arbitrary number of computers. Some sharding is necessary because the index of a single shard must stay below 2 GB. At the moment, we are running 6 index-backend VMs, each serving a 1.8 GB in-memory index and about 40 GB of source code (including partial indexes). In order to grep through the source quickly, the source is stored on local SSDs (as opposed to a network block storage volume, or even regular HDDs).

In addition to the actual data, we also need a web frontend to serve and combine this data, and we have one more VM which scrapes monitoring information and shows nice graphs about how the whole system behaves.

So, in total, we run 8 VMs, of which 6 are equipped with 4 cores, 4 GB of RAM (2 GB for the index + 2 GB of page cache for grepping files) and a 40 GB SSD volume each. The web frontend uses 4 cores and 2 GB of RAM, and also an SSD for caching entire query results. The monitoring VM needs just one core and 2 GB of RAM.

Does that sound reasonable and feasible? I’m not sure what kind of hardware you have available for projects like this one. Currently we’re sponsored by Rackspace because Debian doesn’t have that sort of hardware easily available.

I feel like this email is long enough already, so I’ll just ask in general: what do you think? Do you need any more information? Please just ask, and keep me CC'ed, since I’m not subscribed to this list.

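
To make the totals easier to check, here is a minimal sketch that tallies the VM figures given above. The class and field names are made up for illustration and nothing like this exists in the Code Search code base. Disk sizes that this email does not state (the web frontend's cache SSD, the monitoring VM's disk) are left out rather than guessed.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class VMSpec:
    """One group of identical VMs (hypothetical helper, for illustration only)."""
    role: str
    count: int
    cores: int
    ram_gb: int
    ssd_gb: Optional[int]  # None means the size is not stated in this email


# Figures taken from the description above.
FLEET = [
    VMSpec("index backend", count=6, cores=4, ram_gb=4, ssd_gb=40),
    VMSpec("web frontend", count=1, cores=4, ram_gb=2, ssd_gb=None),
    VMSpec("monitoring", count=1, cores=1, ram_gb=2, ssd_gb=None),
]


def totals(fleet):
    """Sum up VMs, cores, RAM and the SSD space whose size is actually stated."""
    return {
        "vms": sum(v.count for v in fleet),
        "cores": sum(v.count * v.cores for v in fleet),
        "ram_gb": sum(v.count * v.ram_gb for v in fleet),
        "ssd_gb_stated": sum(
            v.count * v.ssd_gb for v in fleet if v.ssd_gb is not None
        ),
    }


if __name__ == "__main__":
    for key, value in totals(FLEET).items():
        print(f"{key}: {value}")
```

That works out to 8 VMs with 29 cores and 28 GB of RAM in total, plus 240 GB of SSD for the index backends alone.
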
Thanks in advance,
Best regards,
Michael Stapelberg

¹ Note that there is a rather big redesign in progress, both architecturally and visually: https://people.debian.org/~stapelberg//2014/11/09/upcoming-debian-codesearch.html
  So, in case you browse around on the current version and conclude that it sucks, just wait for the update and everything will be awesome ;).

² http://codesearch.debian.net/research/