Big Data SIG Meeting Minutes: 2013-03-07
Bruno Mahé
bruno at bmahe.net
Sun Mar 10 01:35:36 UTC 2013
On 03/07/2013 10:40 AM, Robyn Bergeron wrote:
> Howdy,
>
> Minutes from today's meeting follow below.
>
> Had a really interesting conversation - though I think we didn't quite
> get through much of the agenda :) We did discuss some categories,
> things already packaged, etc. - I'm on duty to make a sub-wiki-page so
> we can collectively document some of these things.
>
> Also had some talk around packaging Apache Bigtop as well as the
> intersections with that community - as many of their folks apparently
> use Fedora (YAY).
>
> Anyway: Keep your intros, thoughts coming - I suspect we'll be
> gathering more humans in the coming weeks.
>
> Thanks for coming!
>
> -Robyn
>
>
> Minutes: http://meetbot.fedoraproject.org/fedora-meeting-1/2013-03-07/big_data_sig.2013-03-07-16.59.html
> Full logs: http://meetbot.fedoraproject.org/fedora-meeting-1/2013-03-07/big_data_sig.2013-03-07-16.59.log.html
>
Hi,
I am Bruno and am very interested in cloud and big data technologies.
As promised (although a little bit later than planned), I am following
up on the discussion from the IRC meeting.
One of the projects I work on intersects closely with this SIG. This
project is Apache Bigtop (http://bigtop.apache.org/).
The goal of Apache Bigtop is threefold:
1/ Provide top notch packages for Apache Hadoop related projects
2/ Provide a point of integration and testing for all these projects
3/ Provide means to reliably deploy a complete stack.
Apache Bigtop was donated by Cloudera to the Apache Software Foundation
and is now the upstream of CDH (Cloudera's distribution), the Ubuntu
Hadoop packages (https://launchpad.net/~hadoop-ubuntu) and HDP 1.X from
Hortonworks (I haven't checked 2.X and am not sure to what extent the
packages have been modified).
We also use Fedora a lot in Apache Bigtop, and I believe this SIG and
Apache Bigtop could benefit from each other, at least in some areas.
For instance, Apache Bigtop provides live USB/CD images of Fedora with
Apache Hadoop (and a bunch of other projects) pre-installed.
We also use BoxGrinder to build CentOS VMs from a Fedora build slave.
Right now, the list of projects Apache Bigtop supports is:
* jsvc
* tomcat
* bigtop-utils -> Misc. tools, such as auto-detecting a JVM
* crunch -> Library for writing/testing/running MapReduce pipelines
* datafu -> Collection of libraries for Pig
* flume -> distributed, reliable, and available service for efficiently
collecting, aggregating, and moving large amounts of log data
* giraph -> Graph processing on top of Apache Hadoop
* hadoop
* hama -> Apache Hama is a pure BSP (Bulk Synchronous Parallel)
computing framework
* hbase -> column-oriented database
* hive -> Hive is a data warehouse system for Hadoop that facilitates
easy data summarization, ad-hoc queries, and the analysis of large
datasets stored in Hadoop compatible file systems. It lets you run
SQL-like queries
* hue -> browser-based desktop interface for interacting with Apache Hadoop
* mahout -> Library for machine learning and data mining on top of
Apache Hadoop
* oozie -> workflow scheduling
* pig -> High level language to process data
* solr -> search platform
* sqoop -> Tool to transfer data between Apache Hadoop and relational
databases
* whirr -> set of libraries for running cloud services
* zookeeper -> server which enables highly reliable distributed coordination
* HCatalog -> In the process of being integrated into Apache Bigtop.
This is a service for managing metadata about data
All these projects have packages for Debian/Ubuntu/SLES11/Fedora/CentOS.
They also have tests to exercise integration points between all of them
(e.g. Hive can use HBase, which sits on top of HDFS). And in order to
run these tests, we also have a test framework.
Before we can test for integration, we have to ensure the packages can
be properly installed/upgraded/removed, with the right users, ulimits,
rights and so forth. To that end, a large chunk of the tests and
testing framework is dedicated to testing the packages themselves.
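To give a rough idea of what a "right users" check can look like, here
is a minimal sketch in Python. It is not Bigtop's actual test code: the
function names, the sample passwd line and the expected home directory
are all made up for illustration. It parses passwd(5)-style text and
reports problems with a service account.

```python
# Hypothetical sketch, not taken from the Bigtop test framework.
# It checks that a service account exists with the expected home directory.

def parse_passwd(text):
    """Parse passwd(5)-style text into {user: {"uid", "home", "shell"}}."""
    users = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        fields = line.split(":")
        if len(fields) != 7:
            continue  # skip malformed lines
        name, _pw, uid, _gid, _gecos, home, shell = fields
        users[name] = {"uid": int(uid), "home": home, "shell": shell}
    return users


def check_service_user(users, name, expected_home):
    """Return a list of problems for the account (an empty list means OK)."""
    if name not in users:
        return ["user %s is missing" % name]
    problems = []
    actual_home = users[name]["home"]
    if actual_home != expected_home:
        problems.append(
            "user %s has home %s, expected %s" % (name, actual_home, expected_home)
        )
    return problems


# Made-up sample data standing in for /etc/passwd after a package install.
SAMPLE = (
    "root:x:0:0:root:/root:/bin/bash\n"
    "hdfs:x:496:493:Hadoop HDFS:/var/lib/hadoop-hdfs:/bin/bash\n"
)

if __name__ == "__main__":
    users = parse_passwd(SAMPLE)
    print(check_service_user(users, "hdfs", "/var/lib/hadoop-hdfs"))
```

On a real machine one would feed it the contents of /etc/passwd after
installing the package, and again after removing it, rather than the
sample string.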
And finally, regarding the deployment, we have the following:
* BoxGrinder recipe people can use and modify to suit their needs
* kickstart file to build a live Fedora image
* Puppet recipes to deploy all these services. I am not sure if I
committed them, but I also had some Puppet recipes which would
install, set up and integrate all these services with Ganglia and
Nagios automatically. We routinely use these recipes to automatically
deploy a cluster on EC2, run tests and get test reports
From my experience with these projects, some of the pain points Fedora
would have in integrating these packages are:
* None of them check for OpenJDK compatibility. They would welcome
patches, they would love to support OpenJDK, but no one has had the time
or resources to ensure compatibility. The same applies to (Open)JDK 7
* All these projects move fast and can have quite a few dependencies
(which can also change over time). Note also that most of them use Maven
(so at least it would be uniform). In Apache Bigtop we sidestepped this
issue by not packaging dependencies separately. We had to make a choice
between supporting more distributions or packaging the dependencies and
we picked the former.
* All these projects just ask users to disable SELinux. So there is no
integration with SELinux at this time
* Apache Hadoop also had issues with IPv6. So they used to ask users to
disable it. I am not sure if this is still true.
I hope this gives some overview, but feel free to ask any questions.
Thanks,
Bruno