Code Search for Fedora

Tue Nov 18 21:00:22 UTC 2014

Hey,

Recently I’ve been talking to Hannes (cc'ed) about whether Fedora
would be interested in having the equivalent of
http://codesearch.debian.net/¹

The project came to life as my Bachelor of Science thesis² and aims to
provide fast regular expression search over a big corpus, in this case
140 GB of source code of all software included in the Debian main
distribution (as opposed to non-free or contrib, which we excluded
because of licensing concerns). It is based on the work Russ Cox
published, which in turn resembles the work he did on Google Code
Search when he was an intern there in 2006.

So, what’s this discussion about?

What I’m offering is setting up/running a public version of Code
Search for Fedora. It needs to be public because I want the open
source community as a whole to profit from it, and also I’m told you
have somewhat comparable tools internally anyway :).

My motivation comes from multiple places:

1) I’m fairly sure Fedora packages a slightly different set of
software than Debian, so running both DCS (Debian Code Search) and FCS
(Fedora Code Search) would enlarge the amount of searchable software.

2) I’m interested in my work having a positive effect on the world (or
at least the open source community), and running multiple instances of
Code Search reduces its dependency on any single distribution, thereby
increasing its reliability and scope.

3) Last but not least, I intend to try Fedora on one of my computers
to broaden my horizons. I figured getting in contact with some of you
while working on this project may be a good way to get a foot in the
door of the community and see whether I like it around here.

In terms of what I’d need in order to make this project a success,
there are some hardware requirements (aside from, of course, time and
motivation):

The in-memory index and searchable source code can be sharded across
an almost arbitrary number of machines. Some sharding is necessary
because the index of a single shard must stay below 2 GB. At the
moment, we are running 6 index-backend VMs, each serving 1.8G of
in-memory index and about 40G of source code (including partial
indexes). In order to grep through the source quickly, the source is
stored on local SSDs (as opposed to a network block storage volume, or
even regular HDDs).
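
To illustrate the size constraint, here is a minimal sketch of how
packages could be spread over shards so that no shard's index reaches
the 2 GB limit. This is not the actual Debian Code Search code: the
first-fit-decreasing strategy, the function name assign_shards and the
per-package index size estimates are all assumptions made for the
example.

```python
# Hypothetical sketch: the real sharding logic is not described in this mail.
MAX_INDEX_BYTES = 2 * 1024**3  # per-shard index must stay below 2 GB


def assign_shards(index_sizes, limit=MAX_INDEX_BYTES):
    """Group packages into shards whose summed index size stays below limit.

    index_sizes maps a package name to its (assumed) index size in bytes.
    Uses first-fit decreasing: largest packages first, each one placed
    into the first shard that still has room.
    """
    shards = []  # each entry is [used_bytes, [package names]]
    ordered = sorted(index_sizes.items(), key=lambda kv: (-kv[1], kv[0]))
    for name, size in ordered:
        if size >= limit:
            raise ValueError(f"{name} alone exceeds the per-shard limit")
        for shard in shards:
            if shard[0] + size < limit:
                shard[0] += size
                shard[1].append(name)
                break
        else:
            # No existing shard has room, so open a new one.
            shards.append([size, [name]])
    return [names for _, names in shards]
```

With a toy limit of 8 and sizes a=5, b=4, c=3, d=2, this yields two
shards: a with d, and b with c. A real setup would also have to keep
the source code for each shard within the local SSD capacity.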

In addition to the actual data, we also need a web frontend to serve
and combine this data, and we have one more VM which scrapes
monitoring information and shows nice graphs about how the whole
system behaves.

So, in total, we run 8 VMs, of which 6 are equipped with 4 cores, 4G
of RAM (for 2G of index + 2G page cache for grepping files) and 40G
SSD volumes each. The web frontend uses 4 cores and 2G of RAM, and
also an SSD for caching entire query results. The monitoring VM needs
just one core and 2G of RAM.
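
For capacity planning, the figures above can be added up as follows.
This is only a small sketch: the VM class and the totals function are
made up for the example, and the SSD size of the web frontend is not
stated above, so it is counted as 0 here.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class VM:
    role: str
    count: int
    cores: int
    ram_gb: int
    ssd_gb: int  # 0 where no size is given in the mail


# Numbers taken from the paragraph above.
FLEET = [
    VM("index-backend", count=6, cores=4, ram_gb=4, ssd_gb=40),
    VM("web-frontend", count=1, cores=4, ram_gb=2, ssd_gb=0),  # SSD size unknown
    VM("monitoring", count=1, cores=1, ram_gb=2, ssd_gb=0),
]


def totals(fleet):
    """Sum up VM count, cores, RAM and SSD space across all roles."""
    return {
        "vms": sum(v.count for v in fleet),
        "cores": sum(v.count * v.cores for v in fleet),
        "ram_gb": sum(v.count * v.ram_gb for v in fleet),
        "ssd_gb": sum(v.count * v.ssd_gb for v in fleet),
    }


print(totals(FLEET))
```

That comes to 8 VMs with 29 cores, 28G of RAM and 240G of SSD space
for the index backends, plus whatever the frontend cache needs.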

Does that sound reasonable and feasible? I’m not sure what kind of
hardware you have available for projects like this one, and currently
we’re sponsored by Rackspace because Debian doesn’t have that sort of
hardware easily available.

I feel like this email is long enough already, so I’ll just ask in
general: what do you think? Do you need any more information? Please
just ask, and keep me CC'ed, since I’m not subscribed to this list.

Thanks in advance,
Best regards,
Michael Stapelberg

¹ Note that there is a rather big redesign in progress, both
architecturally and visually:
https://people.debian.org/~stapelberg//2014/11/09/upcoming-debian-codesearch.html

So, in case you browse around on the current version and conclude that
it sucks, just wait for the update and everything will be awesome ;).

² http://codesearch.debian.net/research/