OK, so I've been mulling this over after some talks with Jay last week, and thought I'd put something in writing. This isn't intended to be a solution; rather, the intent is to capture thoughts on what may be needed for the design so we can then discuss the best implementation.
It is not clear to me where to draw the line between the components responsible for various pieces of functionality, so I will list nearly everything I can think of here and let the ensuing discussions sort out where each piece should live.
In the absence of hard requirements, I have listed some assumptions I'm starting with to frame the discussion. If you feel any of these are incorrect, please speak up so we don't spend too much time building on bad assumptions.
Assumptions
-----------
1) For the sake of this discussion, I am listing a lot of functionality that may not be in the initial release. It is best to consider it now and make sure the design can handle it. Some of this functionality will most likely not end up in the monitoring component, but it still needs to be discussed, if for no other reason than to explain what the trade-offs are...
2) The use of this monitoring data will need to cover one to many users.
   a) There is no real upper bound on the number of users at this point in time. However, we should decide on the order of magnitude (1K, 100K, etc.).
   b) It must be able to provide data to each individual user concurrently.
   c) It must provide the capability to arrange users and aggregate data in groups and in a hierarchical manner. (This is an example of functionality that some Deltacloud component should provide; I mention it here because it could impact the design even if the "ownership" ends up elsewhere. Best to decide "where does this functionality belong?" in this discussion and understand its ramifications.)
3) Billing does not imply a simple end-of-month run, but rather the ability to monitor usage and calculate a running "balance" (though it does not need to consider payments beyond "reset" functionality).
   a) Down the road we will want to provide some enforcement of spending (Jay is only budgeted to spend $100; don't let him go over it). As Jay has pointed out, this is most likely outside the scope of Spectre (but this is why we need to understand the entire problem), except that:
      1. Spectre needs to collect whatever data the portal requires for billing.
      2. Archiving this information beyond the time a provider retains it needs to be accounted for by Spectre and be configurable per provider.
4) Data collection from the various clouds will be a "pull" event. In my mind, pull is preferred because it allows the app to control the load it generates; a push model could lead to server overload.
   a) It also needs to be dynamic, so as the user adds additional VMs, the data for those is gathered as well. So "poll" is still technically what is happening; we just need a dynamic polling mechanism.
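To make the dynamic-polling idea concrete, here's a rough Python sketch (all names here are made up for illustration, not a proposed API): the poller re-reads the VM list on every cycle, so newly added VMs get picked up automatically, and the sleep interval keeps the load under our control.

```python
import time

class DynamicPoller:
    """Hypothetical pull-based poller. The callables are placeholders for
    whatever cloud-specific module actually lists VMs and fetches stats."""

    def __init__(self, list_active_vms, fetch_stats, interval_secs=60):
        self.list_active_vms = list_active_vms  # returns current VM ids
        self.fetch_stats = fetch_stats          # pulls stats for one VM
        self.interval_secs = interval_secs

    def poll_once(self):
        """One pull cycle: refresh the VM list, then fetch each VM's data.
        New VMs appear here without any re-registration step."""
        results = {}
        for vm_id in self.list_active_vms():
            results[vm_id] = self.fetch_stats(vm_id)
        return results

    def run(self):
        while True:
            self.poll_once()
            time.sleep(self.interval_secs)  # we control the load we generate
```

The point is just that "dynamic" falls out naturally from re-listing on each cycle rather than maintaining a fixed registration list.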
5) For monitoring, data will be collected on all "active" Deltacloud users. Here an "active" user is one who is logged in, or one who belongs to a group that someone is monitoring.
   a) This is just a "stake in the ground." We really should investigate whether it's more efficient to get all the data all the time or only as needed.
   b) If we only get data as needed, we will need some extra logic to determine if there are "holes" in the data and catch up as needed.
   ** Jay has mentioned that some flexibility should be built in here; there may be providers who charge for accessing the data based on the access rate.
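The "holes" check in (b) could be as simple as comparing consecutive timestamps against the expected sampling interval. A minimal sketch, assuming timestamps in seconds and a fixed interval (both assumptions, since we haven't defined the data model yet):

```python
def find_gaps(timestamps, interval):
    """Return (start, end) pairs wherever consecutive samples are more
    than one expected interval apart; these are the spans to backfill.
    Assumes `timestamps` is sorted and expressed in seconds."""
    gaps = []
    for prev, cur in zip(timestamps, timestamps[1:]):
        if cur - prev > interval:
            gaps.append((prev, cur))
    return gaps
```

Whatever catch-up logic we build would then re-pull just those spans, which also matters for providers who charge by access rate.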
6) In general, I can see a need for "business logic" specific to each vendor's cloud. This will include billing rates, types of data available, etc. I'm wondering if there is already a "global" way to handle this in Deltacloud...
   a) At some level, the data retrieval / monitoring aspects will need cloud-specific knowledge. For instance, CloudWatch keeps its data for two weeks. Exactly where in the design this information needs to be kept is still to be determined. My main goal in thinking of a business layer is to allow vendor-specific "modules" to be created and used without any changes to the underlying stats layer.
7) It is assumed that any Red Hat cloud product will be treated like JAC (Just Another Cloud).
8) Users: this could be one of the more complicated pieces to get right. My initial thought is that a single user can have many different cloud associations. However, it's not clear if we need to track the case where a single cloud account is shared between users and we need to account for each user's usage. Again, please keep in mind that this document is intended to cover all the functionality we see down the road, so this may not be an initial release goal but we may need to plan for it.
So my thoughts are that user management, tracking, etc. is done under some other Deltacloud component, and Spectre will just need to be able to store and retrieve data based on some type of unique user / cloud identifier. (Hey, it's hard to pull implementation out of design.)
However, we still need to define things like a data request. Should we require the caller to be very specific about the user / cloud combinations? And lots of other stuff like that...
9) Spectre will need to be able to provide data to a caller for a specific cloud in a manner that is more efficient than recursively walking a list of users. Thus Spectre must be able to retrieve data based on user or cloud or a combination of both.
10) I would expect the Spectre functionality to be provided in a manner that would allow it to be distributed, federated, or used as a "module" by some other application. My initial thought would be a web service type of architecture, but let's not get ahead of ourselves...
Straw man proposal (http://en.wikipedia.org/wiki/Straw_man_proposal)
------------------
In general, we need to define something that stores and retrieves data, with clear APIs for both the top and bottom layers. For this discussion, I propose that we view Spectre as three distinct layers:
1) On the "top" level, we need to define methods for data insertion and retrieval.
2) The middle layer is basically the traffic cop; it is also the least defined.
3) The bottom level provides the interaction with the data store(s). For this level, we need to define an API that we will call to interact with the data store(s).
I will dive a little deeper into these layers below.
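As a very rough Python sketch of how the three layers could relate (everything here is illustrative; the class and field names are made up, and the in-memory store just stands in for RRD / MySQL / whatever we pick):

```python
class InMemoryStore:
    """Stand-in for the bottom layer: a data-store-specific module that
    speaks the (yet to be defined) intermediate format."""

    def __init__(self):
        self.records = []

    def write(self, record):
        self.records.append(record)

    def query(self, user=None, cloud=None):
        # Filter by user and/or cloud; None means "don't filter on this".
        return [r for r in self.records
                if (user is None or r["user"] == user)
                and (cloud is None or r["cloud"] == cloud)]

class StatsService:
    """Middle layer, the "traffic cop": validates input and passes it
    through to whatever store module it was configured with."""

    def __init__(self, store):
        self.store = store

    def insert(self, record):
        if "user" not in record or "cloud" not in record:
            raise ValueError("record must identify user and cloud")
        self.store.write(record)

    def retrieve(self, **criteria):
        return self.store.query(**criteria)
```

The top-level insertion/retrieval APIs would then be thin entry points over `StatsService`, and swapping the store never touches the middle layer.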
I am assuming that specific interfaces will need to be created for both the top and bottom layers in order to interface with different cloud vendors or data stores. For the sake of discussion, I will refer to these as "modules," although the intent is to facilitate the discussion, not to imply a specific implementation.
One of the main goals of the resulting architecture is that a new vendor-specific module can be created for either the top or bottom layer without impacting the design / code of the middle stats layer.
Let's try this bottom up.
-------------------------
The goal of defining an API at the bottom of this stack is to allow different data stores to be used. With a clearly defined API, anyone should be able to write their own interface layer to the data store. Examples of data stores would be RRD, memcache, MySQL, etc.
This would seem to imply that the data store specific module must be able to translate a request for "Mark's EC2 usage data from last week" into the specific language needed to access its data store. It will also need to format the response from its data store into a common (but yet to be defined) intermediate data format and pass it "up the stack".
Configuration data (user credentials, db name, directory structure, etc.) - there are two ways to go here:
a) Each data store module should be responsible for its own required configuration information. This should be loaded by the module when it is initialized. In other words, the data service should not need to know any of the details.
b) Alternatively, the API should include a "required" set of calls that the service can use to retrieve and store the configuration data the module needs (the module indicates it needs dbname, userid, pwd, etc.). This way the overall service would still control the configuration. The parameters would need to be supplied by the module so that the data service could remain data store agnostic.
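Option (b) could look something like this sketch (the class, key names, and `required_config` call are all hypothetical, just to show the shape of "module declares, service supplies"):

```python
class MySQLStoreModule:
    """Hypothetical data store module for option (b): it declares which
    configuration keys it needs, so the service can gather them without
    knowing anything MySQL-specific."""

    @staticmethod
    def required_config():
        # The service queries this, collects the values from wherever
        # configuration lives, and passes them back at init time.
        return ["dbname", "userid", "pwd", "host"]

    def __init__(self, config):
        missing = [k for k in self.required_config() if k not in config]
        if missing:
            raise ValueError("missing config keys: %s" % missing)
        self.config = config
```

The service stays data store agnostic: it only ever sees a list of key names and a dict of values, never the meaning of "dbname" or "host".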
Not sure if we should allow the use of multiple modules simultaneously. It would be nice to use a memcache-like mechanism for some things to avoid more expensive lookups... This could be something like a write-through cache. Lots to discuss...
Middle Layer ------------
This is the traffic cop of the data service. It could be fairly lightweight, just taking input, validating it, and passing it through. For data requests, it could do some coalescing of requests, provide the ability to look in a local data cache, etc.
One major area that will need to be looked at is security. This layer would seem to be the place to implement any security. Not sure how much we want or need.
If this layer does any work on the data, it will operate on the "intermediate format".
Top Layer ---------
This is the layer that is called to store or retrieve data. My initial thought is that while these two types of functionality sit at the same level, they are drastically different in what they do, so I'll treat them separately.
Data Input API ("mystery data collection module") --------------
This provides an API for storing data in the data service. It will be called by the cloud-specific modules. My thought is that these modules will be used to pull data from the cloud and push it into the data store.
It is the responsibility of the module to take the data from the cloud and translate it into the intermediate data format.
It should be possible for many different modules to access this API in parallel.
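As a sketch of the translation step, here's what a cloud-specific input module might do with a CloudWatch-style datapoint (the intermediate field names are placeholders, since we haven't defined that format yet):

```python
def to_intermediate(user_id, cloud, native_point):
    """Translate one native datapoint (here, a CloudWatch-style dict)
    into a hypothetical common intermediate record. The cloud-specific
    module owns this mapping; nothing below this layer sees native data."""
    return {
        "user": user_id,
        "cloud": cloud,
        "metric": native_point["MetricName"],
        "value": float(native_point["Average"]),
        "timestamp": native_point["Timestamp"],
    }
```

Each vendor module would carry its own version of this mapping, which is what keeps cloud-specific knowledge out of the stats layer.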
Data Retrieval
--------------
There is clearly a need to provide data back to a caller in different formats, and this API will need to support that. I think the main design decision is how to structure it. I am inclined to stick with the module design here as well. This will allow end users to add their own modules, and it will allow additional levels of data processing to be added without polluting the main API. For instance, you could create a module to compute rolling averages.
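The rolling-average example is about as simple as a processing module gets; something like this could sit on top of the raw retrieval call without the main API knowing it exists:

```python
def rolling_average(values, window):
    """Trailing average over the last `window` points. An example of the
    kind of post-processing a retrieval module could add on top of raw
    data, rather than baking it into the main API."""
    out = []
    for i in range(len(values)):
        chunk = values[max(0, i - window + 1): i + 1]
        out.append(sum(chunk) / len(chunk))
    return out
```

Dropping or replacing a module like this costs nothing, which is exactly the maintainability argument below.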
So this layer will need some thought to choose the right solution for future needs and maintainability. (Hint: it's easier to drop support for a module than to change the main API down the road...)
This level must be able to translate a request for "Mark's EC2 usage data from last week" into the intermediate language.
It should go without saying that this API must support concurrent access.
Higher Level questions
----------------------
So some design questions that will hopefully lead us to pick the right solution...
1) Do we need to provide sync APIs, async APIs, or both?
2) Do we support a data stream vs. a "one shot"? (For instance, do we provide a call to allow a continuous stream of data into or out of Spectre?)
3) How long should we be storing data?
Next Steps
----------
1) Start discussions based on the above content.
2) Identify vendors and investigate the requirements for getting data from different clouds (EC2, VMware, RHEV-M, Rackspace?).
3) Look at the high level questions and build requirements.
deltacloud-devel@lists.fedorahosted.org