I couldn't sleep so I wrote up some ideas that have been churning in my
head. I've discussed some of this with Remy but figure it'd be even
better to talk about it in public. :)
I think our mailing list history is a rich source for community
participation metrics, and I'd like to have some analysis beyond just
raw post counts, which don't really tell us much. Here's what I'm
thinking, in rough form....
I think we'd cover: fedora-council-discuss (including previous
incarnations), devel, test, server, cloud, workstation, marketing,
design, commops, docs, trans, (and others?); but not lists specific to
an individual program (e.g. anaconda). I'm not sure about
specific SIGs other than those associated with Fedora editions; the
information would be valuable to have but probably separately.
Likewise, the same info would be useful for the users list, but also
probably separately. (In any case, it'd be nice to be able to drill
down and separate things by list.)
We'd filter out known bots and automated posts. Possibly also ticket
traffic, but I'm actually kind of thinking that that's useful for some
places like FESCo where that's the _majority_ of the conversation.
So, then... collect and graph:
Users by Month
--------------
Probably adequate to consider "user" to be an e-mail address. Could
make a database of known users with multiple aliases -- like me at
@mattdm.org previously and @fedoraproject.org now.
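A minimal sketch of that alias database, assuming a hand-maintained table is good enough. The addresses in ALIASES and the name canonical_user are hypothetical placeholders, not real data:

```python
# Hypothetical alias table: every known old address maps to one
# canonical address. The entries here are placeholders.
ALIASES = {
    "old@example.org": "new@example.org",
}

def canonical_user(address):
    """Return the identifier to count a poster under."""
    # E-mail addresses are compared case-insensitively here, which is
    # an assumption that holds for most real-world mail systems.
    address = address.strip().lower()
    return ALIASES.get(address, address)
```

Everything below could then count canonical_user(address) instead of the raw From address.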
1. Measure new posters each month -- identifiers never before seen.
Possibly have a threshold of at least three posts, to filter out
"drive-throughs".
1a. Graph is simply new-users-per-month over time.
1b. Additional line on that graph: of those new users, how many also post
at least once ever again in a different month? (But count them in the
month in which they first posted.)
1c. Additional line: of those new users, how many also post in at least six
separate months after? (Same.)
Rationale: This will identify if there are times when we gain more new
contributors, and trends in contributor growth. Obviously 1b and 1c are only
valid in retrospect.
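Roughly, 1a through 1c could be computed like this. This is only a sketch: it assumes posts arrive as (address, (year, month)) pairs, and it reads the threshold as total posts ever, which is just one possible interpretation. The function and key names are made up for illustration.

```python
from collections import defaultdict

def month_index(year, month):
    """Map a calendar month to a consecutive integer."""
    return year * 12 + (month - 1)

def new_poster_stats(posts, min_posts=1):
    """posts: iterable of (address, (year, month)).

    Returns {(year, month): {"new", "returned", "six_plus"}}, where each
    user is counted in the month of their first post.
    """
    months_by_user = defaultdict(set)
    count_by_user = defaultdict(int)
    for address, (year, month) in posts:
        months_by_user[address].add(month_index(year, month))
        count_by_user[address] += 1

    stats = defaultdict(lambda: {"new": 0, "returned": 0, "six_plus": 0})
    for address, months in months_by_user.items():
        # Assumed reading of the "drive-through" threshold: total posts.
        if count_by_user[address] < min_posts:
            continue
        first = min(months)
        later_months = len(months) - 1  # distinct months after the first
        year, month0 = divmod(first, 12)
        row = stats[(year, month0 + 1)]
        row["new"] += 1                 # line 1a
        if later_months >= 1:
            row["returned"] += 1        # line 1b
        if later_months >= 6:
            row["six_plus"] += 1        # line 1c
    return dict(stats)
```

Passing min_posts=3 would apply the drive-through filter.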
2. Categorize users into
2a. "New": only posted this month
2b. "Onboarding": also active sometime in prior six months but not before
2c. "Active": active in _all_ prior six months
2d. "Old School": active previously but _not_ in every one of previous six
months.
Graph those four lines by month as percentage of posts that month.
Rationale: another view on new users, but focuses on longevity rather than
growth and looks back rather than forward.
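Here is one way the four categories in #2 might be assigned, as a sketch. It assumes months are consecutive integers (for example year * 12 + month) and that posts are (address, month) pairs. The definitions overlap for someone whose first post was exactly six months ago and who has posted every month since; this sketch calls that person "Active", which is my assumption and not something settled above.

```python
from collections import Counter, defaultdict

def categorize(active_months, current, window=6):
    """Classify one user for month `current` from their set of active months."""
    prior = {m for m in active_months if m < current}
    if not prior:
        return "New"                      # 2a: nothing before this month
    if set(range(current - window, current)) <= prior:
        return "Active"                   # 2c: every one of the prior six
    if min(prior) >= current - window:
        return "Onboarding"               # 2b: first seen inside the window
    return "Old School"                   # 2d: seen earlier, with gaps

def category_post_share(posts, current):
    """posts: iterable of (address, month). Percentage of this month's posts
    written by users in each category."""
    months_by_user = defaultdict(set)
    posts_now = Counter()
    for address, month in posts:
        months_by_user[address].add(month)
        if month == current:
            posts_now[address] += 1
    share = Counter()
    for address, n in posts_now.items():
        share[categorize(months_by_user[address], current)] += n
    total = sum(share.values())
    if total == 0:
        return {}
    return {name: 100.0 * n / total for name, n in share.items()}
```

Running category_post_share once per month gives the four lines for the graph.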
3. Categorize users by number of posts that month into percentile buckets:
"Prolific", "Involved", "Average", "Low", "Single Post".
Graph per month percentage of posts by each percentile bucket. Could also do
a visualization combining with #2.
Rationale: How much are the lists dominated by very active individuals?
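A sketch of the bucketing in #3. The cutoffs are not decided anywhere above, so the ones in CUTOFFS (top 10%, 35%, 75% of multi-post users by rank) are placeholders to be tuned, and the function names are invented for illustration.

```python
from collections import Counter

# Placeholder cutoffs: cumulative fraction of multi-post users, ranked
# from most to fewest posts. These values are assumptions.
CUTOFFS = [(0.10, "Prolific"), (0.35, "Involved"), (0.75, "Average"), (1.00, "Low")]

def bucket_users(post_counts):
    """post_counts: {address: posts this month}. Returns {address: bucket}."""
    buckets = {a: "Single Post" for a, n in post_counts.items() if n == 1}
    multi = sorted((a for a, n in post_counts.items() if n > 1),
                   key=lambda a: (-post_counts[a], a))
    for rank, address in enumerate(multi):
        fraction = (rank + 1) / len(multi)
        for limit, name in CUTOFFS:
            if fraction <= limit:
                buckets[address] = name
                break
    return buckets

def bucket_post_share(post_counts):
    """Percentage of this month's posts written by each bucket."""
    buckets = bucket_users(post_counts)
    total = sum(post_counts.values())
    if total == 0:
        return {}
    share = Counter()
    for address, n in post_counts.items():
        share[buckets[address]] += n
    return {name: 100.0 * n / total for name, n in share.items()}
```

Note that ranking breaks ties by address, so two users with the same post count can land in different buckets; bucketing by post-count thresholds instead would avoid that.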
Threads
-------
Probably adequate to consider subject line as thread identifier; could also
use actual In-Reply-To / References headers.
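If we go with subject lines, they need a little normalizing first. A sketch, assuming only "Re:" and "Fwd:"/"Fw:" prefixes matter (list tags and localized prefixes are not handled); thread_key is a made-up name:

```python
import re

# Strip any run of leading "Re:" / "Fwd:" / "Fw:" prefixes.
_PREFIX = re.compile(r"^\s*((re|fwd?)\s*:\s*)+", re.IGNORECASE)

def thread_key(subject):
    """Normalize a subject line into a thread identifier."""
    stripped = _PREFIX.sub("", subject)
    # Collapse whitespace and ignore case so minor edits still match.
    return " ".join(stripped.split()).lower()
```

Unrelated threads that happen to reuse a subject (for example recurring meeting announcements) would get merged this way, which is one argument for the header-based approach.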
4. For each month, categorize threads into buckets:
4a. Single posts
4b. Short threads: 1 to 4 replies
4c. Normal threads: 5 to 19 replies
4d. Megathreads! 20 or more replies.
Graph that per month as the percentage of posts which fall in each
category.
Rationale: Single posts with no replies (assuming we've filtered out
automated reports) are discouraging. Megathreads can indicate community
passion and big issues, but can also be indicative of communication
problems.
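The bucketing in #4 is simple enough to sketch directly. This assumes each post in the month has already been tagged with a thread identifier, and that "replies" means posts in the thread minus the first one; the names are illustrative.

```python
from collections import Counter

def thread_bucket(replies):
    """Bucket a thread by its number of replies."""
    if replies == 0:
        return "Single post"
    if replies < 5:
        return "Short"
    if replies < 20:
        return "Normal"
    return "Megathread"

def post_share_by_thread_bucket(thread_ids):
    """thread_ids: one thread identifier per post in the month.
    Returns the percentage of posts falling in each thread bucket."""
    sizes = Counter(thread_ids)
    posts = Counter()
    for size in sizes.values():
        posts[thread_bucket(size - 1)] += size
    total = sum(posts.values())
    if total == 0:
        return {}
    return {name: 100.0 * n / total for name, n in posts.items()}
```

A thread that spans two months gets counted by its size within each month here; counting by total thread size instead is a choice we'd need to make.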
5. Further breakdown of megathreads into
5a. Megathreads with a majority of posts by two participants
5b. Megathreads with a majority of posts by fewer than five (or some other
small threshold) participants
5c. Megathreads with more participants than that
This could be simply an addition to the graph from #4 rather than
charted separately.
It might also be interesting (but much more work) to separate out
threads which start with a high number of participants but devolve to
5a over time.
Rationale: 5a is usually unhealthy.
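For #5, one possible test is "what is the smallest group of participants who wrote a strict majority of the posts?" A sketch under that assumption, with a thread dominated by a single person folded into 5a; the function name and the return labels are just for illustration:

```python
from collections import Counter

def megathread_class(authors, small=5):
    """authors: the author of each post in one megathread, in any order.
    Returns "5a", "5b", or "5c"."""
    if not authors:
        raise ValueError("thread has no posts")
    counts = sorted(Counter(authors).values(), reverse=True)
    half = len(authors) / 2
    running = 0
    needed = 0
    # Add the most prolific participants until they hold a strict majority.
    for n in counts:
        running += n
        needed += 1
        if running > half:
            break
    if needed <= 2:
        return "5a"
    if needed < small:
        return "5b"
    return "5c"
```

Catching threads that start broad and devolve into 5a would mean running the same check over a sliding window of posts, which this does not attempt.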
6. For each month where a megathread exists, "color" based on percentage of
megathread participants in each category from #2 (and maybe #3 as well).
Rationale: helps identify character of megathreads.
7. It might be nice to run sentiment analysis across all posts each month. I
don't know much about this field, though.
Also
----
These same concepts would be useful for similar charts for other fedmsg
activity. For example: package maintenance, bodhi karma, etc. In fact,
it might be useful to have aggregate charts for different areas of
activity; for example, a QA version of charts 1 through 3 which
calculates values for posts, bodhi karma activity, and test
participation, and then just lumps them together into one "QA
activity" score (without even worrying about mapping user ids to match
between categories).
--
Matthew Miller
<mattdm(a)fedoraproject.org>
Fedora Project Leader