On Tue, 18 Nov 2014 13:00:22 -0800
Michael Stapelberg <michael+fedora(a)stapelberg.ch> wrote:
Recently I’ve been talking to Hannes (cc'ed) about whether
would be interested in having the equivalent of
The project came to life as my Bachelor of Science Thesis² and aims to
provide fast regular expression search over a big corpus, in this case
140 GB of source code of all software included in the Debian main
distribution (as opposed to non-free or contrib, which we excluded
because of licensing concerns). It is based on the work Russ Cox
published, which in turn resembles the work he did on Google Code
Search when he was an intern there in 2006.
So, what’s this discussion about?
What I’m offering is setting up/running a public version of Code
Search for Fedora. It needs to be public because I want the open
source community as a whole to profit from it, and also I’m told you have
somewhat comparable tools internally anyway :).
We have talked about a code search type application several times in
the past, but never got as far as coding.
Some things to note about our infrastructure:
Everything we use must be under a free license:
(which I don't think will be a problem, just noting it. ;)
We have a process for bringing up new applications, called "Request For
Through this process we make sure there's more than one person that
knows how the application works and can fix it, it's monitored right,
My motivation comes from multiple places:
1) I’m fairly sure Fedora packages a slightly different set of
software than Debian, so running both DCS (Debian Code Search) and FCS
(Fedora Code Search) would enlarge the amount of searchable software.
Probably true. Also, possibly differing versions...
2) I’m interested in my work having a positive effect on the world
(at least the open source community), and running multiple instances of
Code Search reduces its dependency on any single distribution, thereby
increasing its reliability and scope.
3) Last but not least, I intend to try Fedora on one of my computers
to broaden my horizons. I figured getting in contact with some of you
while working on this project may be a good way to set foot in the
community and see whether I like it around here.
Welcome. :) Hope you like it.
In terms of what I’d need in order to make this project a success,
there are some hardware requirements (aside from, of course, time and
The in-memory index and searchable source code can be sharded across an
almost arbitrary number of computers. Some sharding is necessary
because the index of a single shard must stay below 2 GB.
At the moment, we are running 6 different
index-backend VMs, each serving 1.8G in-memory indexes and about 40G
of source code (including partial indexes). In order to grep through
the source quickly, the source is stored on local SSDs (as opposed to
a network block storage volume, or even regular HDDs).
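
As a rough illustration of the shard-size constraint described above, the sketch below computes the smallest number of shards that keeps every index strictly under the limit. The function name `min_shards` is hypothetical, and the even split of the index across shards is an assumption; the mail does not say how DCS actually distributes packages.

```python
import math


def min_shards(total_index_gb: float, max_shard_gb: float = 2.0) -> int:
    """Smallest shard count so each shard's index is strictly below the limit.

    Assumes the index splits evenly across shards (an assumption made for
    this sketch, not something the mail states).
    """
    if total_index_gb <= 0:
        return 0
    # floor(x) + 1 is the smallest n with total / n < limit (strict "<").
    return math.floor(total_index_gb / max_shard_gb) + 1


# Figures from the mail: 6 backends, each serving a 1.8G in-memory index.
current_total_gb = 6 * 1.8
print(min_shards(current_total_gb))
```

With the figures quoted in the mail, the current six backends are also the minimum that respects the 2 GB limit under this even-split assumption.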
We currently don't have any SSDs. ;(
In addition to the actual data, we also need a web frontend to serve
and combine this data, and we have one more VM which scrapes
monitoring information and shows nice graphs about how the whole
So, in total, we run 8 VMs, of which 6 are equipped with 4 cores, 4G
of RAM (for 2G of index + 2G page cache for grepping files) and 40G
SSD volumes each. The web frontend uses 4 cores and 2G of RAM, and
also an SSD for caching entire query results. The monitoring VM needs
just one core and 2G of RAM.
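
To make the request easier to size, the per-VM figures above can be added up as in the sketch below. The `VM` type and the role names are hypothetical labels for this illustration. The frontend's SSD size is not stated in the mail and no SSD is mentioned for the monitoring VM, so the SSD total covers only the index backends.

```python
from typing import NamedTuple, Optional


class VM(NamedTuple):
    role: str
    count: int
    cores: int
    ram_gb: int
    ssd_gb: Optional[int]  # None when the mail gives no size


# Figures taken from the mail; role names are labels made up for this sketch.
FLEET = [
    VM("index-backend", 6, 4, 4, 40),
    VM("web-frontend", 1, 4, 2, None),  # has an SSD, size not stated
    VM("monitoring", 1, 1, 2, None),    # no SSD mentioned
]


def totals(fleet):
    """Sum the resources across all VMs, skipping unstated SSD sizes."""
    return {
        "vms": sum(v.count for v in fleet),
        "cores": sum(v.count * v.cores for v in fleet),
        "ram_gb": sum(v.count * v.ram_gb for v in fleet),
        "ssd_gb_stated": sum(
            v.count * v.ssd_gb for v in fleet if v.ssd_gb is not None
        ),
    }


print(totals(FLEET))
```

That comes to 8 VMs with 29 cores and 28G of RAM in total, plus 240G of SSD for the backends and whatever the frontend cache needs.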
Does that sound reasonable and feasible? I’m not sure what kind of
hardware you have available for projects like this one, and currently
we’re sponsored by Rackspace because Debian doesn’t have that sort of
hardware easily available.
Well, we don't have any virthosts with SSDs currently, so that could
be a hangup. We do have virthosts and memory/SAS disks.
I feel like this email is long enough already, so I’ll just ask a
general question: what do you think? Do you need any more information? Please
just ask, and keep me CC'ed, since I’m not subscribed to this list.
I think before we go looking into hardware requirements, we should
discuss the software. What's it written in? Is there a group of people
who work on it, or just you?
We would want it packaged up as RPMs for deployment, preferably for
EPEL 7 (to work on RHEL 7 hosts).
Would you be open to changes in code/architecture to meet our setup