[Event Report] RootConf 2016 :: Learning From Failure

Wednesday, 20 April 2016

=============================
RootConf 2016 :: Learning From Failure
=============================
Blog  ::
---------
https://medium.com/@ramkrsna/rootconf-2016-learning-from-failure-241f3767...

Report
----------
RootConf https://rootconf.in/2016/ is an annual conference in
Bangalore, India, organised by the wonderful folks at HasGeek
https://hasgeek.com/. This year it was held on the 14th and 15th of
April 2016. RootConf is dedicated to #devops in general, but it drew
folks from all streams of the locally thriving IT and startup
industry. This was my first RootConf, although I have attended other
conferences organised by HasGeek, such as The Fifth Elephant
https://fifthelephant.in/2016/ and JSFoo https://jsfoo.in/2015/ . HasGeek
gets “Hacker Culture”
http://www.catb.org/jargon/html/introduction.html, right from
distributing CAT-5e UTP LAN cables as lanyards, to giving out YubiKeys
to every participant in a very professionally packed sponsor kit, to
having a well-defined code of conduct
https://rootconf.in/2016/code-of-conduct . This year's conference had
over 250 attendees.

We also had a bunch of volunteers from the Fedora India
https://lists.fedoraproject.org/admin/lists/india.lists.fedoraproject.org/
project participating in the conference. The Fedora Project was a
community sponsor of the conference. I am a systems engineer who has
spent most of my career dealing with Application Binary Interfaces
https://access.redhat.com/articles/rhel-abi-compatibility and writing
tooling around Software Assurance https://access.redhat.com/ecosystem ,
staying close to the platform. For me this was a good chance to learn
how developer operations folks and system administrators tackle
operating system constraints with popular open source solutions.

This year's theme was “Learning from failure”, told by devops folks who
face these issues day in and day out. Most of the talks stuck to the
theme, while a few patterns related to architecture, people/process
hacks and devops emerged. Talks about failures
https://azure.microsoft.com/en-us/blog/final-root-cause-analysis-and-impr...
were educational and were also narrative stories
https://aws.amazon.com/message/67457/ which many developers and devops
could relate to. Interestingly, most of the projects the speakers said
they use in production did not come from a Red Hat, Oracle or
Canonical, but rather from companies like ClusterHQ, HashiCorp,
Twitter, LinkedIn, Etsy and Netflix: products like Kafka, ZooKeeper,
Flume, Mesos, etcd, Serf, Chaos Monkey, StatsD and many more, which
just work in distributed production environments. So without going
into the specifics of each talk, here is the general gist of the talks
and of the discussions around the Fedora booth. Videos of all the talks
are being uploaded to HasGeek TV :: https://hasgeek.tv/rootconf/2016

Patterns ::
-------------
1. Failure and Embracing Risk.
2. A truce between developers, devops and System Administrators defines
the culture, irrespective of the size of your company.
3. Pay attention to Configurations, Error handling and Monitoring.

Quote ::
------------
"We can generalize to economic growth. The problem is that these
discussions of “growth” are made by people who have never taken risks"
- Nassim Nicholas Taleb
Source :: https://www.facebook.com/nntaleb/posts/10153701042473375

Pretty much everything is distributed these days, including your
favourite project of the month. My personal take is that no matter
what you do, you always have to make a trade-off between stability and
agility. In my experience most outages happen due to configuration
bugs and improper error handling. Yes, programming languages and
language extensions do matter to an extent, if and only if you have
verified their proper usage, but understanding your hardware is
important too. I would emphasize asking developers to speed up the
code. One such pattern is to first identify the bottleneck: CPU, local
disk I/O or external resources. Understanding RAM helps as well, e.g.
avoiding random reads and accessing memory sequentially could speed
things up. You could also use CPU vector instructions. This video by
Ulrich Drepper is a good introduction to what I mean by CPU
utilization https://youtu.be/DXPfE2jGqg0
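
To make the sequential versus random access point concrete, here is a
minimal Python sketch that sums the same array twice, once in index
order and once in a shuffled order. The function names are made up for
illustration, and this is only a rough experiment: in Python the
interpreter overhead hides much of the cache effect, so the gap is
usually far smaller than what you would see in C, and the timings vary
from machine to machine.

```python
import random
import time
from array import array


def sum_in_order(data, order):
    """Sum data[i] for every index i, visiting indices in the given order."""
    total = 0
    for i in order:
        total += data[i]
    return total


def compare_access_patterns(n=1_000_000, seed=0):
    """Return {pattern: (sum, seconds)} for sequential and random access."""
    data = array("q", range(n))          # contiguous block of 64-bit ints
    sequential = list(range(n))          # indices 0, 1, 2, ...
    shuffled = sequential[:]
    random.Random(seed).shuffle(shuffled)  # same indices, random order

    results = {}
    for name, order in (("sequential", sequential), ("random", shuffled)):
        start = time.perf_counter()
        total = sum_in_order(data, order)
        results[name] = (total, time.perf_counter() - start)
    return results


if __name__ == "__main__":
    for name, (total, seconds) in compare_access_patterns().items():
        print(f"{name:>10}: sum={total} time={seconds:.3f}s")
```

Both runs do exactly the same amount of arithmetic, so any difference
in time comes from the order in which memory is touched.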

Speed up your code on a single node by understanding the hardware you
have invested in; simple, efficient code can be amazingly competitive.
Code that is 10x faster could translate to 10x fewer servers to invest
in. Clusterize in the end if needed; clusters are hard and should be a
last resort. The paper “Simple Testing Can Prevent Most Critical
Failures: An Analysis of Production Failures in Distributed
Data-Intensive Systems”
https://www.usenix.org/conference/osdi14/technical-sessions/presentation/...
is a great read on proper error handling. Having a good process around
test and staging environments is vital irrespective of the size of the
organization.
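
As a small illustration of what improper error handling can look like,
here is a hedged Python sketch that contrasts a handler that silently
swallows an error with one that logs it and fails loudly. The
configuration scenario, the ConfigError class and both function names
are hypothetical and are not taken from the paper or from any talk.

```python
import logging

logger = logging.getLogger("config_loader")  # hypothetical logger name


class ConfigError(Exception):
    """Raised when a configuration value cannot be used (illustrative)."""


def parse_port_swallowing(raw):
    """Anti-pattern: the error is caught and ignored, so None leaks out."""
    try:
        return int(raw)
    except ValueError:
        pass  # empty handler: the caller never learns the value was bad


def parse_port_strict(raw):
    """Validate the value, log the problem and fail early with context."""
    try:
        port = int(raw)
    except (TypeError, ValueError) as exc:
        logger.error("invalid port value %r", raw)
        raise ConfigError(f"port must be an integer, got {raw!r}") from exc
    if not 1 <= port <= 65535:
        logger.error("port %d out of range", port)
        raise ConfigError(f"port must be in 1..65535, got {port}")
    return port


if __name__ == "__main__":
    print(parse_port_swallowing("80a"))  # bad input passes through quietly
    print(parse_port_strict("8080"))
```

The strict version costs a few extra lines, but a bad configuration
value is caught in test or staging instead of surfacing later as an
outage.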

Fedora Project, CentOS, Storage and Project Atomic
-----------------------------------------------------------------------------

Folks who visited the Fedora booth were curious about Fedora Atomic
Workstation and Ansible right after Kushal’s talk on Fedora Cloud, as
well as containers on CentOS, persistent storage in containers and
OpenShift.
They were quite curious about Project Atomic because of the stickers.
I was quite surprised that many in the devops community did not know
about Project Atomic. We spoke about the contributions to the
Kubernetes project, Docker and also the Open Container Initiative. For
the folks who asked about the Atomic Host, here are the links ::

- Atomic Host for Fedora :: https://getfedora.org/en/cloud/download/atomic.html
- Atomic Host for CentOS :: https://wiki.centos.org/SpecialInterestGroup/Atomic
- Atomic Host for RHEL ::
https://access.redhat.com/documentation/en/red-hat-enterprise-linux-atomi...
- For the curious minds, you could also check the Fedora effort around
the layered Docker build services.

Pictures from the Event ::
--------------------------------------

Here are a few pictures from the conference hosted on Flickr ::
https://www.flickr.com/photos/ramkrsna/albums/72157665033034013

Swag ::
-----------
The Fedora buttons were an instant hit; for some reason folks wanted
more buttons. The DVDs, on the other hand, were frowned upon by a
few, although a couple of students did pick up a few. I guess it has
a lot to do with the fact that most laptops no longer ship with DVD
drives. A few folks requested USB sticks; hopefully we will have some
budget for them next time around.

I would like to thank the Fedora India community and Red Hat India for
supporting me in getting to this conference.

Books and Papers and further reading ::
----------------------------------------------------------
Drift into failure
http://www.amazon.com/Drift-into-Failure-Sidney-Dekker/dp/1409422216/ref=...
How complex systems fail
http://web.mit.edu/2.75/resources/random/How%20Complex%20Systems%20Fail.pdf
Notes on Distributed Systems for Young Bloods
https://www.somethingsimilar.com/2013/01/14/notes-on-distributed-systems-...
gRPC, a high performance, open source, general RPC framework that puts
mobile and HTTP/2 first
http://www.grpc.io/
Revisiting Distributed Synchronous SGD
http://arxiv.org/pdf/1604.00981v2
What Bugs Live in the Cloud
http://ucare.cs.uchicago.edu/pdf/socc14-cbs.pdf
Maglev: A Fast and Reliable Software Network Load Balancer
https://www.usenix.org/system/files/conference/nsdi16/nsdi16-paper-eisenb...
Paxos Quorum Leases: Fast Reads Without Sacrificing Writes
http://www.pdl.cmu.edu/PDL-FTP/associated/CMU-PDL-14-105.pdf

regards

-- 
Ramakrishna Reddy
GPG Key ID: 31FF0090
Fingerprint =  18D7 3FC1 784B B57F C08F  32B9 4496 B2A1 31FF 0090
