EPEL Django deprecated

Karsten 'quaid' Wade kwade at redhat.com
Sat Jun 8 01:21:28 UTC 2013


On 06/07/2013 05:08 PM, Stephen John Smoogen wrote:
> On 7 June 2013 13:48, Matthew Miller <mattdm at fedoraproject.org> wrote:
> 
>> On Fri, Jun 07, 2013 at 01:31:36PM -0600, Stephen John Smoogen wrote:
>>> The easiest way I could see is just to get a better sampling method,
>>> which would be to have funding for a mirror that we then put into
>>> mirrormanager, so we know that this is a sample versus requested info.
>>> (Basically we would see what packages are downloaded directly and then
>>> extend that sample from the amount of downloads to the 500,000 systems
>>> that check in via mirrormanager.) The problems involved are paying for
>>> systems, storage, and bandwidth for such items.
>>
>> Maybe one of the mirrors would be able to provide logs?
>>
>>
> Possibly. In the past mirror admins have not wanted to do so for many
> reasons (can't keep logs longer than 24 hours for policy reasons; can't
> hand over logs without a formal agreement, and then only with as much
> redacted as possible; "if we do it for X then we have to do it for
> everyone, so no thank you"). When I was at my university gig, it had to
> go up 4 levels of management before I gave up at the sub-CIO level.
> 
> I have tried looking at the top-level mirrors, but most of the data is
> swamped out by other sites mirroring and by lots of people doing
> development work and pointing to repos directly. This led to some
> strange statistics: even after pulling out most of the noise, various
> packages would "stand out" until I realized they were pulled in for
> cross-compiles and such (or by the site that likes to do partial
> mirrors every couple of hours but always pulls in the same 4 packages
> each time, even when it pulls in others). I am expecting that other
> mirrors are going to run into the same thing, which means that the data
> a lot of sites could give out (just the URLs per day, rather than IP
> address plus URL) would have a lot of weird noise. That noise makes,
> say, zvbi show up high because it is both the last package mirrored on
> the server and a dependency of 8 other packages (not literally true,
> but I can't remember the package that actually showed up a ton).
> 
> In either case, it is what got me to realize that a dedicated mirror is
> needed to allow for better statistics of this sort, because the data can
> be cleaned as needed rather than pre-cleaned and then reconstructed.

Compelling information, thanks. I might still want to pursue improving
the data collection across an existing mirror network, but for now I
like your idea of inserting a tracking-mirror into the system.

I've been doing a lot of thinking lately about mirroring, logs, and
anonymity. This is because I think we want to get more data about EPEL
usage without raising privacy or other legal concerns. My impetus is
simple: EPEL is an enormously important and popular part of the Fedora
Project to all of us, and my job is helping make such projects wildly
successful. :) To figure out what wild success means and track our
progress, we need a better handle on usage.

A tracking-mirror could go something like this:

* Logs are rotated out to the trash regularly, e.g. every 24 hours.[1]
* Data is gathered from logs in real time in an anonymous fashion, so
nothing non-anonymous is inserted into the database. No connection is
retained between the data in the database and the logs not yet thrown away.
* The log data gathering process attempts to cleanse in real time before
writing to the database. (This aligns with your idea, yes?)
* Work closely with the cleansing tools for a period of time to get a
handle on the sort of confusions you've experienced; see if programmatic
predictions can help keep watch in the future (e.g. alert on unusual
spikes in traffic to a small package set with certain patterns such as
near-each-other-alphabetically or used-together-often-as-dependencies.)
* We use statistical analysis to extrapolate wider conclusions.
* We make it possible to grow this tracking-mirror network within the
existing mirror network to improve the dataset.
* Throughout, code and configurations are dealt with transparently so it
is clear to community members not only that a better quality of tracking
is happening, but what the results of that tracking are (the analysis
itself, actions taken from the analysis that benefits users), and that
all details are there showing the protection of privacy.
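
To make the real-time cleansing step concrete, here is a rough sketch of
what the collector might do. Everything here is illustrative, not a real
design: the log format is assumed to be Apache common log format, and the
field handling is just one way to ensure nothing client-identifying
survives past the parse.

```python
import re
from collections import Counter

# Hypothetical sketch: each raw access-log line is reduced to an
# anonymous (day, package) pair before anything touches the database.
# Assumes Apache common log format; the pattern is illustrative only.
LOG_PATTERN = re.compile(
    r'^(?P<ip>\S+) \S+ \S+ \[(?P<day>[^:]+)[^\]]*\] '
    r'"GET (?P<path>\S+) HTTP/[\d.]+"'
)

def anonymize(line):
    """Return a (day, rpm-filename) pair with no client data, or None."""
    m = LOG_PATTERN.match(line)
    if not m or not m.group("path").endswith(".rpm"):
        return None
    package = m.group("path").rsplit("/", 1)[-1]
    return (m.group("day"), package)  # the client IP is never retained

def tally(lines):
    """Aggregate anonymous download counts per (day, package)."""
    counts = Counter()
    for line in lines:
        record = anonymize(line)
        if record:
            counts[record] += 1
    return counts
```

The point of the shape above is that the IP address exists only inside
the parse of a single line; only aggregate counts ever reach storage.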

I'm interested in championing this idea to get the resources (server,
bandwidth, people, code, etc.) to make at least the initial mirror
happen. With the right plan, I could see getting things in place pretty
quickly, e.g. by September.

- Karsten

[1] We could consider sending logs directly to /dev/null after data
collection if we felt the collection was sufficient. The main risk
there is reducing the ability to troubleshoot. It's an interesting
thought exercise, at least, to find a way toward dropping non-anonymous
information without even a millisecond of retention: for example,
pulling anonymous data into the dataset, then cleansing the stream for
privacy before writing it to the log. It might be sufficient for
troubleshooting to know the class C IP block but drop the specific IP
address.
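
That class-C idea is simple enough to show in a few lines. This is only
a sketch of the masking step, assuming IPv4 and using Python's stdlib
ipaddress module; the function name is made up here:

```python
import ipaddress

# Hypothetical sketch: keep only the /24 network (the old "class C"
# block) of each client address, so the specific host is never written
# to the on-disk log in the first place.
def mask_to_block(ip_string):
    """Zero the host octet of an IPv4 address before it is logged."""
    network = ipaddress.ip_network(ip_string + "/24", strict=False)
    return str(network.network_address)
```

So a client at 192.0.2.117 would be logged as 192.0.2.0: enough to spot
a misbehaving network during troubleshooting, without retaining the
individual host.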
-- 
Karsten 'quaid' Wade
http://TheOpenSourceWay.org  .^\  http://community.redhat.com
@quaid (identi.ca/twitter/IRC)  \v'  gpg: AD0E0C41
