OK, so I've been mulling this over after some talks with Jay last week. Thought I'd put something in writing. This isn't intended to be a solution; rather, the intent is to capture thoughts on what may be needed for the design, and then we can discuss the best implementation.
It is not clear to me where to draw the line for the component responsible for various pieces of functionality. I will list *most everything* I can think of here and the ensuing discussions can help sort out where that functionality should lie.
In lieu of hard requirements, I have listed some assumptions that I'm starting with to frame the discussion. If you feel that any of these are incorrect, please speak up so we don't spend too much time on bad assumptions.
Assumptions
- For the sake of this discussion, I am listing a lot of functionality that may not be in the initial release. Best to consider it now and make sure the design can handle it. Also, some of this functionality will most likely not end up in the monitoring component, but it still needs to be discussed, if for no other reason than to explain what the trade-offs are...
- The use of this monitoring data will need to cover one to many users. a) There is no real upper bound to the number of users at this point in time; however, we should decide on the order of magnitude (1K, 100K, etc.). b) We must be able to provide data to each individual user concurrently. c) We must provide the capability to arrange users and aggregate data in groups and in a hierarchical manner. (This is an example of functionality that some Deltacloud component should provide; I mention it here because it could impact the design even if the "ownership" ends up elsewhere. Best to decide "where does this functionality belong?" in this discussion and understand its ramifications.)
[jg] I believe this one is outside the scope of spectre, and would be handled by pools/users/permissions in deltacloud portal (though perhaps there could be some simplified representation here of this, if it makes sense.) I agree that it may have an impact on our design here though.
- Billing does not imply a simple end-of-the-month run, but rather the ability to monitor usage and calculate a running "balance" (though it does not need to consider payments outside of "reset" functionality). a) Down the road we will want to be able to provide some enforcement of spending (Jay is only budgeted to spend $100, don't let him go over it). As Jay has pointed out, this is most likely outside the scope of Spectre (but this is why we need to understand the entire problem), except for: 1. Spectre needs to collect whatever data the portal requires for billing. 2. Archiving of this information beyond the time a provider may do so needs to be accounted for by Spectre and configurable per provider.
- Data collection from the various clouds will be a "pull" event. In my mind, pull is preferred because it allows the app to control the load it generates. With a push model, it could lead to server overload.
[jg] I had not considered this as a potential issue. There had been some talk of hoping for push to be supported by some vendors in the future. Maybe it is just my web background, but polling feels less efficient to me.
a) It also needs to be dynamic, so as the user adds additional VMs, the data for those is also gathered. "Poll" is still technically what is happening; we just need a dynamic polling mechanism.
[jg] Some kind of trigger may be a good idea, but could we not also get new vms and such on the next scheduled 'poll' of information?
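To make the "dynamic polling mechanism" idea concrete, here is a minimal sketch. All class and method names (`DynamicPoller`, `list_instances`, `fetch_stats`) are invented for discussion, not a real API; the point is just that each scheduled poll re-lists the instances, so newly added VMs are picked up on the next cycle without a separate trigger.

```ruby
# Hypothetical sketch: on each polling cycle we re-discover the current
# set of VMs, then fetch stats for each. New VMs show up automatically.
class DynamicPoller
  def initialize(cloud)
    @cloud = cloud # object that knows how to talk to one provider
  end

  # One polling cycle: discover current VMs, then fetch stats for each.
  def poll_once
    @cloud.list_instances.map do |vm_id|
      { vm: vm_id, stats: @cloud.fetch_stats(vm_id) }
    end
  end
end

# Toy in-memory "cloud" used only to exercise the poller.
class FakeCloud
  def initialize; @vms = ["vm-1"]; end
  def add_vm(id); @vms << id; end
  def list_instances; @vms.dup; end
  def fetch_stats(_id); { cpu: 0.5 }; end
end

cloud  = FakeCloud.new
poller = DynamicPoller.new(cloud)
first  = poller.poll_once
cloud.add_vm("vm-2")      # user launches a new VM between polls
second = poller.poll_once # the next scheduled poll picks it up
```

This matches Jay's point: no trigger is strictly required as long as discovery happens as part of the regular poll.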
- For monitoring, data will be collected on all "active" Deltacloud users. Here "active" user implies that the user is logged in, or someone is monitoring the group that the user is part of. a) This is just a "stake in the ground." We really should investigate whether it's more efficient to get all the data all the time or only as needed. b) If we only get data as needed, we will need some extra logic to determine if there are "holes" in the data and catch up as needed. ** Jay has mentioned that some flexibility should be built in here; there may be providers who charge for accessing the data based on the access rate.
- In general, I can see a need for "business logic" specific to each vendor's cloud. This will include billing rates, types of data available, etc. I'm wondering if there is a "global" way to handle this in Deltacloud already... a) At some level, the data retrieval / monitoring aspects will need to have cloud-specific knowledge. For instance, CloudWatch keeps the data for two weeks. Exactly where in the design this information needs to be kept still needs to be determined. My main goal in thinking of a business layer is to allow for vendor-specific "modules" to be created and used w/o any changes to the underlying stats layer.
[jg] Agreed, I see this potentially varying a vast amount from one provider to the next. However, I think the business logic for spectre should be limited to retrieving the information available, with any provider-specific logic being more for aggregation or other data views (though perhaps there could be provider-specific modules on the retrieval side of the api). Billing I see being handled in the portal (or possibly even the framework, since that has provider-specific drivers also, so it could encapsulate billing logic in a common api that the portal could use).
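One way to picture the vendor-specific "modules" idea: each provider module answers the same questions (retention window, available metrics), so the stats layer never needs provider-specific branches. Everything below is hypothetical naming for discussion; the two-week figure for CloudWatch comes from the point above, the rest is made up.

```ruby
# Illustrative only: provider modules expose a common interface so the
# stats layer stays provider-agnostic.
module ProviderModule
  class EC2CloudWatch
    def retention_days; 14; end # CloudWatch keeps data ~2 weeks
    def available_metrics; [:cpu, :network_in, :network_out]; end
  end

  class GenericCloud
    def retention_days; 7; end # made-up default for illustration
    def available_metrics; [:cpu]; end
  end
end

# The stats layer only talks to the common interface, e.g. to decide
# whether data must be archived before the provider drops it:
def must_archive_before?(provider, age_in_days)
  age_in_days >= provider.retention_days
end

ec2 = ProviderModule::EC2CloudWatch.new
```

Whether these modules live in Spectre or in the framework's existing drivers (per Jay's comment) is exactly the open question.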
- It is assumed that any Red Hat cloud product will be treated like JAC (Just Another Cloud).
- Users: this could be one of the more complicated pieces to get right. My initial thought is that a single user can have many different cloud associations. However, it's not clear if we need to track the case where a single cloud account could be shared between users and we need to account for each user's usage. Again, please keep in mind that this document is intended to cover all the functionality we see down the road, so it may not be an initial release goal, but we may need to plan for it.
So my thoughts are that the user management, tracking, etc. is done under some other Deltacloud component, and Spectre will just need to be able to store and retrieve data based on some type of unique user / cloud identifier. (Hey, it's hard to pull implementation out of design.)
[jg] To help clarify (and make sure I am right), I believe with what we have so far on the portal side of the design, we would have cloud account X. As far as the provider is concerned, X is the only account. Portal, however, may have N accounts associated with X. Each of these accounts can have 0 or more VMs (and possibly other things), so on the monitoring side, if we can find things based on account X and some identifier for the actual item requested, I think that will get us what we need w/o building in too much business logic that is already handled elsewhere.
However, we still need to define things like a data request. Should we require the caller to be very specific about the user / cloud combinations? And lots of other stuff like that...
- Spectre will need to be able to provide data to a caller for a specific
cloud in a manner that is more efficient than recursively walking a list of users. Thus Spectre must be able to retrieve data based on user or cloud or a combination of both.
- I would expect that the Spectre functionality is provided in a manner that would allow for it to be distributed, federated, or used as a "module" by some other application. My initial thought would be a web-service type of architecture, but let's not get ahead of ourselves...
[jg] Agreed on both points.
Straw man proposal (http://en.wikipedia.org/wiki/Straw_man_proposal)
In general, we need to define something that stores and retrieves data, that has clear APIs for both the top and bottom layers. For this discussion, I would propose that we view Spectre with three distinct layers:
- On the "top" layer, we need to define methods for data insertion and retrieval.
- The middle layer is basically the traffic cop; it is also the least defined.
- The bottom layer provides the interaction with the data store(s). For the bottom layer, we need to define an API that we will call to interact with the data store(s).
I will dive a little deeper into these layers below.
I am assuming that specific interfaces for both the top and bottom layers will need to be created in order to interface to different cloud vendors or data stores. For the sake of discussion, I will refer to these as "modules" although the intent is to help facilitate the discussion, not to imply a specific implementation.
One of the main goals of the resultant architecture is that a new vendor specific module can be created for either the top or bottom layers without impacting design / code of the middle Stats layer.
Let's try this bottom-up.
The goal of defining an API at the bottom of this stack is to allow for different data stores to be used. By clearly defining the API, anyone should be able to write their own interface layer to the data store. Examples of data stores would be RRD, memcache, MySQL, etc.
[jg] I think I agree with this sentiment, but believe you are proposing one level 'down' from what I originally thought. So, we already have the concept of provider 'drivers' for collecting the info. Let's say, for the sake of argument, that we extended our existing 'stats' module to include write functionality. I think the driver would then call stats to save whatever it had collected (maybe too implementation-specific, but not sure how else to describe). Stats in turn would call the actual 'save' against whatever datastore is enabled (this being the additional layer I think you are describing). Is this what you meant?
This would seem to imply that the data-store-specific module must be able to translate a request for "Mark's EC2 usage data from last week" into the specific language needed to access its data store. It will need to format the response from its data store back into a common (but yet to be defined) intermediate data format and pass it "up the stack".
[jg] I was not really thinking of multiple potential languages, mainly just ruby, but this seems like a reasonable requirement as long as we keep it as simple as possible so we don't get bogged down in implementation. First thought would be xml, as most languages can handle that.
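A rough sketch of that bottom-layer contract: every data store module accepts the same common request object and returns rows in a shared intermediate format (shown here as plain hashes, with the store being a trivial in-memory array; `CommonRequest` and all field names are invented for this discussion, and whether the wire format ends up as XML is still open).

```ruby
# Hypothetical common request passed down to any data store module.
CommonRequest = Struct.new(:user, :cloud, :from, :to)

# A toy data store module: translates the common request into its own
# lookup, then hands rows back "up the stack" in the shared
# intermediate format (a hash per sample, for illustration).
class InMemoryStore
  def initialize(rows); @rows = rows; end

  def query(req)
    @rows.select { |r| r[:user] == req.user && r[:cloud] == req.cloud }
         .map    { |r| { user: r[:user], cloud: r[:cloud], cpu: r[:cpu] } }
  end
end

store = InMemoryStore.new([
  { user: "mark", cloud: "ec2", cpu: 0.7 },
  { user: "jay",  cloud: "ec2", cpu: 0.2 },
])
rows = store.query(CommonRequest.new("mark", "ec2", nil, nil))
```

An RRD or MySQL module would implement the same `query(req)` method against its own storage; the stats layer above never sees the difference.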
Configuration data (user credentials, db name, directory structure, etc.)
- Two ways to go here:
a) Each data store module should be responsible for the required configuration information. This should be loaded by the module when it is initialized. In other words, the data service should not need to know any of the details. b) Alternatively, the API should include a "required" set of calls that the service can call to retrieve the configuration data the module requires and store the information. (The module indicates it needs dbname, userid, pwd, etc.) This way the overall service would still be used for controlling the configuration. The parameters would need to be supplied by the module so that the data service could remain data-store agnostic.
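Option (b) could look something like the following sketch, where a module advertises the config keys it needs and the service supplies them, so the service stays data-store agnostic. The class and key names are illustrative only.

```ruby
# Hypothetical data store module that declares its required config keys
# so the overall service can collect and supply them (option b).
class MysqlStoreModule
  def required_config; [:dbname, :userid, :pwd]; end

  def configure(values)
    missing = required_config - values.keys
    raise "missing config: #{missing.inspect}" unless missing.empty?
    @config = values
  end

  def configured?; !@config.nil?; end
end

mod = MysqlStoreModule.new
mod.configure(dbname: "stats", userid: "svc", pwd: "secret")
```

With option (a) the same `configure` work would instead happen inside the module's initializer, reading its own config source.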
Not sure if we should allow the use of multiple modules simultaneously. It would be nice to use a memcache-like mechanism for some things to avoid more expensive lookups.... This could be something like a write through cache. Lots to discuss...
[jg] Initially I thought just one for storage, but upon further reflection, it could be really nice to be able to 'layer' this, imo. So perhaps check a cache of some sort first, then move down to true storage layer if data is not found.
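The 'layered' / write-through-cache idea could be as simple as this sketch: check a fast layer first, fall back to the true storage layer on a miss, and write to both on save. Plain hashes stand in for memcache and the real store; all names are hypothetical.

```ruby
# Illustrative write-through layering: a fast cache in front of the
# real data store. Hashes stand in for both layers here.
class LayeredStore
  def initialize(cache, backing)
    @cache, @backing = cache, backing
  end

  def write(key, value) # write-through: both layers get the data
    @cache[key] = value
    @backing[key] = value
  end

  def read(key) # cache first; on a miss, fill the cache from backing
    @cache.fetch(key) { @cache[key] = @backing[key] }
  end
end

cache, backing = {}, {}
store = LayeredStore.new(cache, backing)
store.write("mark/ec2", 42)
backing["jay/ec2"] = 7 # present only in the backing store
```

The open questions above (how many layers, which data is worth caching) are orthogonal to this shape.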
Middle Layer
This is the traffic cop of the data service. It could be fairly lightweight, just taking input, validating it, and passing it through. For data requests, it could do some coalescing of requests, provide the ability to look in a local data cache, etc.
One major area that will need to be looked at is security. This layer would seem to be the location where any security would be implemented. Not sure how much we want or need.
If this layer does any work on the data, it will be on the "intermediate format".
Top Layer
This is the layer that is called to store or retrieve data. My initial thoughts are that while these two types of functionality are at the same level, they are drastically different in what they do, so I'll treat them separately.
[jg] One thought, since this is what everything would call, if we have auth
in the middle layer, then top needs to accept credentials and pass them down to middle layer.
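Jay's point about credentials passing through could be sketched like this, with the top layer accepting credentials and simply handing them to the middle layer, which owns the auth check. Token-based auth here is an arbitrary stand-in; nothing about the real auth scheme is decided.

```ruby
# Hypothetical layering: the middle layer owns any auth decision; the
# top layer just accepts credentials and passes them down.
class MiddleLayer
  def initialize(valid_tokens); @valid = valid_tokens; end

  def handle(request, token)
    raise "unauthorized" unless @valid.include?(token)
    { ok: true, request: request }
  end
end

class TopLayer
  def initialize(middle); @middle = middle; end

  def retrieve(request, credentials)
    @middle.handle(request, credentials) # pass-through, no auth logic here
  end
end

top = TopLayer.new(MiddleLayer.new(["good-token"]))
```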
Data Input API ("mystery data collection module")
This provides an API for storing data in the data service. It will be called by the cloud specific module. My thoughts are that these modules will be used to pull data from the cloud and push it into the data store.
It is the responsibility of the module to take the data from the cloud and translate it into the intermediate data format.
It should be possible for many different modules to access this API in parallel.
Data Retrieval
There is clearly a need to provide data back to a caller in different formats. This API will need to support that. I think the main design decision is how to structure this. I am almost thinking that sticking with the module design will work well. This will allow end users to add their own modules. It will also allow more levels of data processing to be added w/o polluting the main API. For instance, you could create a module to compute rolling averages.
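The rolling-average example could be one such pluggable module, layered on top of whatever the base retrieval call returns, without touching the main API. The class name and the shape of the input are invented for illustration.

```ruby
# Hypothetical retrieval module: post-processes samples returned by the
# base retrieval API into rolling averages, without changing that API.
class RollingAverageModule
  def initialize(window); @window = window; end

  # samples: oldest-first array of numeric samples.
  def process(samples)
    samples.each_cons(@window).map { |w| w.sum / w.size.to_f }
  end
end

avg3   = RollingAverageModule.new(3)
result = avg3.process([1, 2, 3, 4, 5]) # => [2.0, 3.0, 4.0]
```

Other end-user modules (unit conversion, downsampling, format translation) would plug in the same way.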
So this layer will need some thought to choose the right solution for future needs and maintainability. (Hint: it's easier to drop support for a module than to change the main API down the road...)
This level must be able to translate a request for "Mark's EC2 usage data from last week" into the intermediate language.
It should go w/o saying that this API must support concurrent access.
Higher Level questions
So some design questions that will hopefully lead us to pick the right solution...
- do we need to provide synch APIs, asynch APIs or both ?
[jg] my inclination would be both.
- do we support a data stream vs. "one shot"? (For instance, do we provide a call to allow a continuous stream of data into or out of Spectre?)
[jg] I think this would be nice to allow, so all clients do not have to poll.
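A stream interface could be as little as a subscribe/publish pair next to the one-shot calls, so clients register a block and receive each new sample as it arrives instead of polling. Sketch only; names are made up.

```ruby
# Hypothetical streaming interface: clients subscribe a block and get
# each new sample pushed to them, instead of polling one-shot calls.
class StatStream
  def initialize; @subscribers = []; end
  def subscribe(&blk); @subscribers << blk; end
  def publish(sample); @subscribers.each { |s| s.call(sample) }; end
end

stream   = StatStream.new
received = []
stream.subscribe { |s| received << s } # continuous stream of data out
stream.publish(cpu: 0.9)               # each new sample is pushed
stream.publish(cpu: 0.4)
```

A one-shot call would instead return a single result set and be done; both could share the same underlying retrieval path.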
- how long should we be storing data ?
Next Steps
- Start discussions based on the above content
- Identify vendors and investigate the requirements for getting data from different clouds (EC2, VMware, RHEV-M, Rackspace?)
- Look at high-level questions, build requirements
spectre-devel mailing list spectre-devel@lists.fedorahosted.org https://fedorahosted.org/mailman/listinfo/spectre-devel