Big Data SIG Meeting Minutes: 2013-03-07

Sun Mar 10 01:35:36 UTC 2013

On 03/07/2013 10:40 AM, Robyn Bergeron wrote:
> Howdy,
>
> Minutes from today's meeting follow below.
>
> Had a really interesting conversation - though I think we didn't quite
> get through much of the agenda :) We did discuss some categories,
> things already packaged, etc. - I'm on duty to make a sub-wiki-page so
> we can collectively document some of these things.
>
> Also had some talk around packaging Apache Bigtop as well as the
> intersections with that community - as many of their folks apparently
> use Fedora (YAY).
>
> Anyway: Keep your intros, thoughts coming - I suspect we'll be
> gathering more humans in the coming weeks.
>
> Thanks for coming!
>
> -Robyn
>
>
> Minutes: http://meetbot.fedoraproject.org/fedora-meeting-1/2013-03-07/big_data_sig.2013-03-07-16.59.html
> Full logs: http://meetbot.fedoraproject.org/fedora-meeting-1/2013-03-07/big_data_sig.2013-03-07-16.59.log.html
>

Hi,

I am Bruno and am very interested in cloud and big data technologies.

As promised (although a little later than planned), I am following
up on the discussion from the IRC meeting.

One of the projects I work on intersects closely with this SIG. This
project is Apache Bigtop (http://bigtop.apache.org/).
The goal of Apache Bigtop is threefold:
1/ Provide top-notch packages for Apache Hadoop-related projects
2/ Provide a point of integration and testing for all these projects
3/ Provide means to reliably deploy a complete stack.

Apache Bigtop was donated by Cloudera to the Apache Software Foundation
and is now the upstream of CDH (Cloudera's distribution), the Ubuntu
Hadoop packages (https://launchpad.net/~hadoop-ubuntu) and HDP 1.X from
Hortonworks (I haven't checked 2.X and am not sure to what extent they
have been modified).

We also use Fedora a lot within Apache Bigtop, and I believe this SIG
and Apache Bigtop could benefit from each other, at least in some areas.
For instance, Apache Bigtop provides live USB/CD images of Fedora with
Apache Hadoop (and a bunch of other projects) pre-installed.
We also use BoxGrinder to build CentOS VMs from a Fedora build slave.

Right now, the list of projects Apache Bigtop supports is:
* jsvc
* tomcat
* bigtop-utils -> Misc. tools, such as auto-detecting a JVM
* crunch -> Library for writing/testing/running MapReduce pipelines
* datafu -> Collection of libraries for Pig
* flume -> distributed, reliable, and available service for efficiently 
collecting, aggregating, and moving large amounts of log data
* giraph -> Graph processing on top of Apache Hadoop
* hadoop
* hama -> Apache Hama is a pure BSP (Bulk Synchronous Parallel) 
computing framework
* hbase -> columnar database
* hive -> Hive is a data warehouse system for Hadoop that facilitates
easy data summarization, ad-hoc queries, and the analysis of large
datasets stored in Hadoop-compatible file systems. It lets you run
SQL-like queries
* hue -> browser-based desktop interface for interacting with Apache Hadoop
* mahout -> Library for machine learning and data mining on top of 
Apache Hadoop
* oozie -> workflow scheduling
* pig -> High level language to process data
* solr -> search platform
* sqoop -> Tool to transfer data between Apache Hadoop and relational
databases
* whirr -> set of libraries for running cloud services
* zookeeper -> server which enables highly reliable distributed coordination
* HCatalog -> In the process of being integrated into Apache Bigtop.
This is a service to manage metadata about the data

All these projects have packages for Debian/Ubuntu/SLES11/Fedora/CentOS.
They also have tests to exercise the integration points between them
(e.g. Hive can use HBase, which sits on top of HDFS), and we have a test
framework to run these tests.
Before we can test for integration, we have to ensure the packages can
be properly installed/upgraded/removed, with the right users, ulimits,
permissions and so forth. To that end, a large chunk of the tests and
testing framework is dedicated to testing the packages themselves.
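
To make the idea of a package-level check a bit more concrete, here is a
minimal sketch in Python. It is not Apache Bigtop's actual test code: the
function names, the service users ("hdfs", "mapred") and the limit value
are all assumptions made up for illustration. It only shows the kind of
post-install assertion described above, run against passwd-style and
limits.conf-style text.

```python
def missing_users(passwd_text, expected_users):
    """Return the expected service users absent from passwd-style text."""
    present = {
        line.split(":", 1)[0]
        for line in passwd_text.splitlines()
        if line and not line.startswith("#")
    }
    return [user for user in expected_users if user not in present]


def nofile_limit(limits_text, user):
    """Return the numeric 'nofile' limit for user, or None if not set.

    Expects limits.conf-style lines: <domain> <type> <item> <value>.
    Non-numeric values such as 'unlimited' are ignored in this sketch.
    """
    for line in limits_text.splitlines():
        fields = line.split()
        if (len(fields) == 4 and fields[0] == user
                and fields[2] == "nofile" and fields[3].isdigit()):
            return int(fields[3])
    return None


if __name__ == "__main__":
    # Hypothetical sample data; on a real system one would read
    # /etc/passwd and the relevant limits file after installing a package.
    sample_passwd = (
        "root:x:0:0:root:/root:/bin/bash\n"
        "hdfs:x:496:493:Hadoop HDFS:/var/lib/hadoop-hdfs:/bin/bash\n"
    )
    sample_limits = "# illustrative values only\nhdfs - nofile 32768\n"

    print(missing_users(sample_passwd, ["hdfs", "mapred"]))  # ['mapred']
    print(nofile_limit(sample_limits, "hdfs"))               # 32768
```

A real test would run checks like these after each install, upgrade and
removal, and fail when a user is missing or a limit differs from what
the package is supposed to set.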

And finally, regarding deployment, we have the following:
* BoxGrinder recipes people can use and modify to suit their needs
* Kickstart file to build a live Fedora image
* Puppet recipes to deploy all these services. I am not sure if I
committed them, but I also had some Puppet recipes which would
install/setup and integrate all these services with Ganglia and Nagios
automatically. We routinely use these recipes to automatically deploy a
cluster on EC2, run tests and get test reports

From my experience with these projects, some of the pain points Fedora
would have in integrating these packages are:
* None of them check for OpenJDK compatibility. They would welcome
patches, they would love to support OpenJDK, but no one has had the time
or resources to ensure compatibility. The same applies to (Open)JDK 7
* All these projects move fast and can have quite a few dependencies
(which can also change over time). Note also that most of them use Maven
(so at least it would be uniform). In Apache Bigtop we sidestepped this
issue by not packaging dependencies separately. We had to make a choice
between supporting more distributions or packaging the dependencies and
we picked the former.
* All these projects just ask users to disable SELinux. So there is no
integration with SELinux at this time
* Apache Hadoop also had issues with IPv6. So they used to ask users to
disable it. I am not sure if this is still true.

I hope this gives some overview, but feel free to ask any questions.

Thanks,
Bruno