Systematically crawling Fedoraproject.org repositories

Mike McGrath mmcgrath at redhat.com
Thu Sep 2 16:55:42 UTC 2010


On Thu, 2 Sep 2010, Ben St. John wrote:

> On Thu, Sep 2, 2010 at 5:34 PM, Mike McGrath <mmcgrath at redhat.com> wrote:
> > On Thu, 2 Sep 2010, Pascal Minnerup wrote:
> >
> >> Dear Fedora team,
> >>
> >> We on the Google Code Search project (www.google.com/codesearch) want to improve the quality of our index, and as part of that, would like to systematically crawl the Fedora
> >> git repositories at fedoraproject.org, which we consider one of the major hosts of open source code. Our crawlers throttle their bandwidth, which should ensure that we don't
> >> overload your web servers.
> >>
> >> 1. Is it okay for you if we systematically crawl your git repositories for new source code?
> >>
> >> 2. How would you recommend we get the repository directories? Our current approach would be to get the git repositories of recently updated packages from this page:
> >> http://pkgs.fedoraproject.org/gitweb/?o=age.
> >>
> >> 3. Are there any particular times or actions we should _avoid_?
> >>
> >> 4. Is there any particular person we should talk to in the future?
> >>
> >> An answer to these questions would be very helpful in improving the presence of Fedora code files in Code Search. We look forward to hearing from you.
> >>
> >
> > Thanks for contacting us. We really don't know how our servers would react
> > to that, but I'm OK with it provided we can contact you to change things
> > later if things do go south.
> >
> >        -Mike
>
> Of course! We'll try to give you a heads-up the first time we crawl
> it, so if you do notice anything strange, you'll know who to blame!
>

I'm not sure if you're already looking at the fedorahosted repos, but we
have several web-based repos at:

http://git.fedorahosted.org/git/
http://hg.fedorahosted.org/hg/
http://bzr.fedorahosted.org/bzr/
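
For what it's worth, the kind of per-host pacing discussed earlier in this
thread could look roughly like the Python sketch below. It is only an
illustration: the HostThrottle class, the REPO_INDEXES name, and the
one-second default interval are assumptions made up for this example, not
anything the Code Search crawler or the Fedora servers are known to use.

```python
import time
from typing import Callable, Dict
from urllib.parse import urlsplit

# Index pages taken from the list above.
REPO_INDEXES = [
    "http://git.fedorahosted.org/git/",
    "http://hg.fedorahosted.org/hg/",
    "http://bzr.fedorahosted.org/bzr/",
]


class HostThrottle:
    """Enforce a minimum delay between requests to the same host.

    Hypothetical helper; the default interval is an assumed value.
    """

    def __init__(
        self,
        min_interval: float = 1.0,
        clock: Callable[[], float] = time.monotonic,
        sleep: Callable[[float], None] = time.sleep,
    ) -> None:
        self.min_interval = min_interval  # seconds between hits per host
        self._clock = clock               # injectable for testing
        self._sleep = sleep               # injectable for testing
        self._last: Dict[str, float] = {}  # host -> time of last request

    def wait(self, url: str) -> float:
        """Block until the host of `url` may be hit again.

        Returns the number of seconds actually slept.
        """
        host = urlsplit(url).netloc
        last = self._last.get(host)
        delay = 0.0
        if last is not None:
            elapsed = self._clock() - last
            delay = max(0.0, self.min_interval - elapsed)
        if delay > 0:
            self._sleep(delay)
        # Record the moment the caller is released to make its request.
        self._last[host] = self._clock()
        return delay
```

A crawler would call throttle.wait(url) right before each fetch. Because the
delay is tracked per host, the git, hg, and bzr servers are paced
independently of one another.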

	-Mike

