Good Morning Everyone,
While planning work, the CPE team has realized that a number of our initiatives actually start with a research phase to find the most appropriate technical solution. This leads to some issues with planning as without knowing the technical solution we want to take, it's hard to evaluate the amount of work needed and thus the time it'll take to do it.
In order to help with this, we're creating a small sub-team in CPE, called the ARC team, for Advance Reconnaissance Crew*. The goal of this team will be to investigate what we believe to be the possible technical solutions for initiatives and advise the team on which one they believe is the most appropriate. To this end, we will reach out when we start looking for ideas, as you may have ideas that we did not think of.
The first investigation, led by Will Woods, Mark O'Brien and me, will be around datanommer and datagrepper.
datanommer is an application listening to fedmsg and filling a (postgresql) database with all the messages passing on the bus. datagrepper is a web application exposing these messages and offering a way to filter or search them. It is available at: https://apps.fedoraproject.org/datagrepper/
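To make the "filter or search" part concrete, here is a minimal sketch of how a client might build a datagrepper search URL. The base URL comes from the message above; the `/raw` endpoint and the `topic`, `delta`, `rows_per_page` and `page` parameter names are assumptions based on datagrepper's public API, and the topic name is made up for illustration. The sketch only builds the URL and sends no request.

```python
from urllib.parse import urlencode

# Base URL from the message above; "/raw" and the parameter names below
# are assumptions about datagrepper's public API, not confirmed here.
DATAGREPPER_URL = "https://apps.fedoraproject.org/datagrepper/raw"


def build_query_url(topic=None, delta=None, rows_per_page=20, page=1):
    """Build a datagrepper search URL without sending any request."""
    params = {"rows_per_page": rows_per_page, "page": page}
    if delta is not None:
        params["delta"] = delta  # look back this many seconds from now
    if topic is not None:
        params["topic"] = topic  # restrict results to one message topic
    # Sort the parameters so the resulting URL is deterministic.
    return DATAGREPPER_URL + "?" + urlencode(sorted(params.items()))


# Hypothetical topic name, used only for illustration.
url = build_query_url(topic="org.fedoraproject.prod.example.topic", delta=3600)
print(url)
```

Any replacement for datagrepper would change this URL shape, so existing clients built this way would need to be adjusted.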
Currently our ideas are:
for datanommer:
- port it to fedora-messaging
- adjust it to whichever solution we choose to replace datagrepper
for datagrepper:
- keep it as is
- Replace by
  - postgrest https://postgrest.org/
  - prest https://github.com/prest/prest
  - kinto https://docs.kinto-storage.org/en/stable/
  - Swagger/OpenAPI https://swagger.io/
- Add support for Graphql
for the postgresql server:
- Split messages per year in different tables
  - Unite them using a postgresql view
- Kick out the old messages per year
  - Keep the current year + n-1 in the current DB
  - Kick the other to another DB?
  - Kick the other to a tarball somewhere?
- Output the database daily dump to file / year
- TimescaleDB, a postgresql plugin for time-series data
  - https://alibaba-cloud.medium.com/postgresql-time-series-database-plug-in-tim...
  - https://dev.t-matix.com/blog/postgresql-as-a-time-series-database/
  - https://docs.timescale.com/latest/introduction
- Make the msg field in the message table be a JSON field
Would you have any other ideas of things we could look at?
Looking forward to your input,
Thanks, Pierre, Will and Mark
* Our notes and documentation are hosted at: https://fedora-arc.readthedocs.io/en/latest/index.html
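The "split messages per year and unite them using a view" idea from the list above can be sketched as follows. This is only an illustration under stated assumptions: SQLite stands in for PostgreSQL so the example runs anywhere, and the table, column and view names are hypothetical rather than datanommer's actual schema.

```python
import sqlite3

# SQLite stands in for PostgreSQL here so the sketch is self-contained.
# Table, column and view names are hypothetical.
conn = sqlite3.connect(":memory:")

YEARS = (2019, 2020, 2021)
for year in YEARS:
    conn.execute(f"CREATE TABLE messages_{year} (ts TEXT, topic TEXT, msg TEXT)")

# One view unites every per-year table, so readers can still query everything.
union_sql = " UNION ALL ".join(f"SELECT * FROM messages_{y}" for y in YEARS)
conn.execute(f"CREATE VIEW messages AS {union_sql}")


def insert_message(ts, topic, msg):
    """Route a message to the table matching the year of its ISO timestamp."""
    year = int(ts[:4])
    conn.execute(
        f"INSERT INTO messages_{year} (ts, topic, msg) VALUES (?, ?, ?)",
        (ts, topic, msg),
    )


insert_message("2019-06-01T12:00:00", "example.topic", "{}")
insert_message("2020-03-15T08:30:00", "example.topic", "{}")
insert_message("2021-01-18T16:25:09", "example.topic", "{}")

# The view sees all years; a single table sees only its own year.
total = conn.execute("SELECT COUNT(*) FROM messages").fetchone()[0]
recent = conn.execute("SELECT COUNT(*) FROM messages_2021").fetchone()[0]
print(total, recent)
```

With this layout, a query that only needs recent data can target a single year's table, while the view keeps the full history queryable in one place.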
On Mon, Jan 18, 2021 at 04:25:09PM +0100, Pierre-Yves Chibon wrote:
Would you have any other ideas of things we could look at?
Perhaps this? https://pagure.io/fedora-infrastructure/issue/9580
There are definitely some unknowns there, like "maybe it would be better to implement DiscourseConnect rather than improving Discourse's oauth2 implementation".
On Mon, Jan 18, 2021 at 10:38:42AM -0500, Matthew Miller wrote:
On Mon, Jan 18, 2021 at 04:25:09PM +0100, Pierre-Yves Chibon wrote:
Would you have any other ideas of things we could look at?
Perhaps this? https://pagure.io/fedora-infrastructure/issue/9580
That's not the answer I was expecting :D The question was: are there other ideas of things we could look at for replacing/upgrading/improving datanommer and datagrepper?
Pierre
On Mon, Jan 18, 2021 at 05:10:33PM +0100, Pierre-Yves Chibon wrote:
On Mon, Jan 18, 2021 at 04:25:09PM +0100, Pierre-Yves Chibon wrote:
Would you have any other ideas of things we could look at?
Perhaps this? https://pagure.io/fedora-infrastructure/issue/9580
That's not an answer I was expecting :D The question was on: are there other ideas of things we could look at for replacing/upgrading/improving datanommer and datagrepper?
Ohhhh I was taking it more generally.
For Datagrepper, I'd love to expand my use case of creating these graphs:
https://mattdm.org/fedora/fedora-contributor-trends/
for which the current implementation is horrific.
On Mon, Jan 18, 2021 at 04:25:09PM +0100, Pierre-Yves Chibon wrote:
Good Morning Everyone,
...snip...
Currently our ideas are:
for datanommer:
- port it to fedora-messaging
- adjust it to whichever solution we chose to replace datagrepper
for datagrepper:
- keep it as is
- Replace by
- postgrest https://postgrest.org/
- prest https://github.com/prest/prest
- kinto https://docs.kinto-storage.org/en/stable/
- Swagger/OpenAPI https://swagger.io/
Doing any of those means existing queries no longer work, right? That's kind of a pain. ;(
- Add support for Graphql
- for the postgresql server
- Split messages per year in different table
- Unite them using a postgresql view
I've long wanted to do this. ;)
It might be worth making it non default to query more than the most recent year. Most queries won't need anything that old...
- Kick out the old messages per year
  - Keep the current year + n-1 in the current DB
  - Kick the other to another DB?
  - Kick the other to a tarball somewhere?
I would prefer to keep it queryable... there are some things that may want to query the entire backhistory.
- Output the database daily dump to file / year
- TimescaleDB a postgresql plugin for time-series data
  - https://alibaba-cloud.medium.com/postgresql-time-series-database-plug-in-timescaledb-deployment-practices-6a07e246eb0d
  - https://dev.t-matix.com/blog/postgresql-as-a-time-series-database/
  - https://docs.timescale.com/latest/introduction
- Make the msg field in the message table be a JSON field
If you do this would you convert all the old messages?
Would you have any other ideas of things we could look at?
I'm not sure if there's any compression we could use here, but it would be nice if the data took up less room. :)
Thanks for looking into this!
kevin
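As a rough way to explore the compression question raised above, one could measure how well a serialized message compresses. The sketch below uses Python's standard zlib module on a made-up message; every field name and value is hypothetical, and the result says nothing about how PostgreSQL itself stores or compresses data.

```python
import json
import zlib

# A made-up message; all field names and values are hypothetical.
message = {
    "topic": "org.example.build.state.change",
    "timestamp": 1610983509,
    "msg": {
        "name": "example-package",
        "old": "BUILDING",
        "new": "COMPLETE",
        "owner": "example-user",
        "tags": ["f33", "f33-updates", "f33-testing"],
    },
}

# Serialize to bytes, then compress at the highest zlib level.
raw = json.dumps(message, sort_keys=True).encode("utf-8")
compressed = zlib.compress(raw, 9)

# Round-trip check: decompressing must give back the original content.
restored = json.loads(zlib.decompress(compressed).decode("utf-8"))

print(len(raw), len(compressed))
```

Single small messages tend to compress less well than batches of similar messages, so a real measurement would need a representative sample of the stored data.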
On Mon, Jan 18, 2021 at 05:41:35PM -0800, Kevin Fenzi wrote:
On Mon, Jan 18, 2021 at 04:25:09PM +0100, Pierre-Yves Chibon wrote:
Good Morning Everyone,
...snip...
Currently our ideas are:
for datanommer:
- port it to fedora-messaging
- adjust it to whichever solution we chose to replace datagrepper
for datagrepper:
- keep it as is
- Replace by
- postgrest https://postgrest.org/
- prest https://github.com/prest/prest
- kinto https://docs.kinto-storage.org/en/stable/
- Swagger/OpenAPI https://swagger.io/
Doing any of those means existing queries no longer work, right? That's kind of a pain. ;(
Yes, that will need to be taken into account when a decision is made on which of these solutions to pick.
- Add support for Graphql
- for the postgresql server
- Split messages per year in different table
- Unite them using a postgresql view
I've long wanted to do this. ;)
It might be worth making it non default to query more than the most recent year. Most queries won't need anything that old...
- Kick out the old messages per year
  - Keep the current year + n-1 in the current DB
  - Kick the other to another DB?
  - Kick the other to a tarball somewhere?
I would prefer to keep it queryable... there's some things that may want to query the entire backhistory.
Fair. Again, the idea was to list every idea here and then we can weigh them against each other.
- Output the database daily dump to file / year
- TimescaleDB a postgresql plugin for time-series data
  - https://alibaba-cloud.medium.com/postgresql-time-series-database-plug-in-timescaledb-deployment-practices-6a07e246eb0d
  - https://dev.t-matix.com/blog/postgresql-as-a-time-series-database/
  - https://docs.timescale.com/latest/introduction
- Make the msg field in the message table be a JSON field
If you do this would you convert all the old messages?
Yes
Would you have any other ideas of things we could look at?
I'm not sure if there's any compression we could use here, but it would be nice if the data took up less room. :)
Worth looking into as well.
Thanks for looking into this!
Hopefully it'll be productive and successful!
Pierre
On Mon, Jan 18, 2021 at 04:25:09PM +0100, Pierre-Yves Chibon wrote:
Good Morning Everyone,
While planning work, the CPE team has realized that a number of our initiatives actually start with a research phase to find the most appropriate technical solution. This leads to some issues with planning as without knowing the technical solution we want to take, it's hard to evaluate the amount of work needed and thus the time it'll take to do it.
In order to help with this, we're creating a small sub-team in CPE, called the ARC team for Advance Reconnaissance Crew*. The goal of this team will be to investigate what we believe to be the possible technical solutions for initiatives and advise the team on what they believe would be the appropriate solution. To this end, we will reach out when we start looking for ideas as you may have ideas that we did not think about.
The first investigation, led by Will Woods, Mark O'Brien and I, will be around datanommer and datagrepper.
datanommer is an application listening to fedmsg and filling a (postgresql) database with all the messages passing on the bus. datagrepper is a web application exposing these messages and offering a way to filter or search them. available at: https://apps.fedoraproject.org/datagrepper/
Currently our ideas are:
for datanommer:
- port it to fedora-messaging
- adjust it to whichever solution we chose to replace datagrepper
for datagrepper:
- keep it as is
- Replace by
- postgrest https://postgrest.org/
- prest https://github.com/prest/prest
- kinto https://docs.kinto-storage.org/en/stable/
- Swagger/OpenAPI https://swagger.io/
- Add support for Graphql
for the postgresql server
- Split messages per year in different table
- Unite them using a postgresql view
- Kick out the old messages per year
- Keep the current year + n-1 in the current DB
- Kick the other to another DB?
- Kick the other to a tarball somewhere?
- Output the database daily dump to file / year
- TimescaleDB a postgresql plugin for time-series data
- Make the msg field in the message table be a JSON field
Would you have any other ideas of things we could look at?
Just as a follow-up to this thread, our findings can be found at: https://fedora-arc.readthedocs.io/en/latest/datanommer_datagrepper/index.htm... and I've also presented them in a blog post at: http://blog.pingoured.fr/index.php?post/2021/02/26/datanommer/datagrepper-in...
Hoping this helps, Pierre
infrastructure@lists.fedoraproject.org