This patch is a first try at using condor as a job management system. It removes the 'taskomatic' utilities and replaces them with 'condormatic' calls that drive condor through its command-line interfaces (no qmf, gsoap, etc.).
On startup of the server (and on any changes after it is running), a pile of 'classads' is created. These define each possible startup location for the set of image/hardware profiles that exist and are usable, along with the backend info condor needs to start an instance on the given provider.
For each instance that you start, a job is created in condor. Condor then matches the hardware profile and image to a provider and starts an instance on that provider. When you stop or destroy that instance, the job is removed (which isn't really how we want it to go, but..).
This patch requires that you have our custom hacked-up condor installed. You can get it at:
http://people.redhat.com/clalance/condor-dcloud
Be sure to read the README. Chris has written up very good instructions on how to set up condor.
In general, everything here basically works. There are, however, several known bugs and deficiencies:
- To 'stop' a job in condor we should be using 'hold' instead of removing the job. This is creating a few different problems.
- After stopping an instance, the condor job is removed but the instance continues to exist in deltacloud; a subsequent 'start' then fails.
- I'm only matching on image and hardware profiles, not realms, and I'm ignoring quotas too.
- We are still reaching directly into the DeltaCloud API to get the list of available actions for each instance. Maybe this is fine; I'm not sure.
- Classads are synced to condor on startup and on any changes to the hardware profile and image records. However, if you restart condor you won't have any classads in it to match against, and your jobs will fail.
- We're still using 'on-demand' syncing of states from condor to the aggregator, e.g. when you list the instances it updates the state of each instance from condor at that time. There is no event logging.
- There's no 'reboot' as yet in condor. Not sure how we'll deal with that just yet.
- We've kept the tasks model and usage, but they are quasi-meaningless. The task table needs to turn into an event/audit log table.
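For the first item, one possible direction (a hypothetical sketch, not part of this patch) is a `condor_hold`/`condor_release` pair instead of removing the job; the command construction would mirror the patch's existing `condor_rm` call. The helper name is made up, and only the command-string composition is shown:

```ruby
# Hypothetical sketch: build condor_hold/condor_release command strings that
# mirror the patch's condor_rm invocation, so 'stop' would pause the job
# instead of deleting it. condor_hold/condor_release are real condor CLI
# tools; the helper name and constraint shape are assumptions from the patch.
def condor_job_cmd(tool, job_id)
  # Match on the Cmd attribute, just as condormatic_instance_stop does.
  "#{tool} -constraint 'Cmd == \"#{job_id}\"' 2>&1"
end

puts condor_job_cmd("condor_hold", "job_myinstance_42")
puts condor_job_cmd("condor_release", "job_myinstance_42")
```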
Many of these problems have fixes in progress or will be addressed in future patches.
Signed-off-by: Ian Main <imain@redhat.com>
---
 src/app/controllers/instance_controller.rb        |   17 +-
 src/app/controllers/pool_controller.rb            |    5 +-
 src/app/models/hardware_profile_observer.rb       |    9 +
 src/app/models/image_observer.rb                  |    9 +
 src/app/util/condormatic.rb                       |  232 +++++++++++++++++++++
 src/config/environment.rb                         |    2 +-
 src/config/initializers/condor_classads_sync.rb   |    8 +
 src/db/migrate/20090804142049_create_instances.rb |    1 +
 8 files changed, 275 insertions(+), 8 deletions(-)
 create mode 100644 src/app/models/hardware_profile_observer.rb
 create mode 100644 src/app/models/image_observer.rb
 create mode 100644 src/app/util/condormatic.rb
 create mode 100644 src/config/initializers/condor_classads_sync.rb
diff --git a/src/app/controllers/instance_controller.rb b/src/app/controllers/instance_controller.rb
index 039ed3a..5664ec5 100644
--- a/src/app/controllers/instance_controller.rb
+++ b/src/app/controllers/instance_controller.rb
@@ -19,7 +19,7 @@
 # Filters added to this controller apply to all controllers in the application.
 # Likewise, all the methods added will be available for all controllers.
-require 'util/taskomatic'
+require 'util/condormatic'
 class InstanceController < ApplicationController
   before_filter :require_user
@@ -96,8 +96,7 @@ class InstanceController < ApplicationController
                                  :task_target => @instance,
                                  :action => InstanceTask::ACTION_CREATE})
     if @task.save
-      task_impl = Taskomatic.new(@task,logger)
-      task_impl.instance_create
+      condormatic_instance_create(@task)
       flash[:notice] = "Instance added."
       redirect_to :controller => "pool", :action => 'show', :id => @instance.pool_id
     else
@@ -124,8 +123,16 @@ class InstanceController < ApplicationController
       raise ActionError.new("#{action} cannot be performed on this instance.")
     end
-    task_impl = Taskomatic.new(@task,logger)
-    task_impl.send "instance_#{action}"
+    case action
+    when 'stop'
+      condormatic_instance_stop(@task)
+    when 'destroy'
+      condormatic_instance_destroy(@task)
+    when 'start'
+      condormatic_instance_create(@task)
+    else
+      raise ActionError.new("Sorry, action '#{action}' is currently not supported by condor backend.")
+    end
     alert = "#{@instance.name}: #{action} was successfully queued."
     flash[:notice] = alert
diff --git a/src/app/controllers/pool_controller.rb b/src/app/controllers/pool_controller.rb
index e687c0b..9d53862 100644
--- a/src/app/controllers/pool_controller.rb
+++ b/src/app/controllers/pool_controller.rb
@@ -20,6 +20,7 @@
 # Likewise, all the methods added will be available for all controllers.
 require 'util/taskomatic'
+require 'util/condormatic'
 class PoolController < ApplicationController
   before_filter :require_user
@@ -36,8 +37,8 @@ class PoolController < ApplicationController
     #FIXME: clean this up, many error cases here
     @pool = Pool.find(params[:id])
     require_privilege(Privilege::INSTANCE_VIEW,@pool)
-    # pass nil into Taskomatic as we're not working off a task here
-    Taskomatic.new(nil,logger).pool_refresh(@pool)
+    # Go to condor and sync the database to the real instance states
+    condormatic_instances_sync_states
     @pool.reload
     @instances = @pool.instances
   end
diff --git a/src/app/models/hardware_profile_observer.rb b/src/app/models/hardware_profile_observer.rb
new file mode 100644
index 0000000..c924bdb
--- /dev/null
+++ b/src/app/models/hardware_profile_observer.rb
@@ -0,0 +1,9 @@
+class HardwareProfileObserver < ActiveRecord::Observer
+
+  def after_save(hwp)
+    condormatic_classads_sync
+  end
+end
+
+HardwareProfileObserver.instance
+
diff --git a/src/app/models/image_observer.rb b/src/app/models/image_observer.rb
new file mode 100644
index 0000000..68a5b85
--- /dev/null
+++ b/src/app/models/image_observer.rb
@@ -0,0 +1,9 @@
+class ImageObserver < ActiveRecord::Observer
+
+  def after_save(image)
+    condormatic_classads_sync
+  end
+end
+
+ImageObserver.instance
+
diff --git a/src/app/util/condormatic.rb b/src/app/util/condormatic.rb
new file mode 100644
index 0000000..7ec6e01
--- /dev/null
+++ b/src/app/util/condormatic.rb
@@ -0,0 +1,232 @@
+#
+# Copyright (C) 2010 Red Hat, Inc.
+# Written by Ian Main <imain@redhat.com>
+#
+# This program is free software; you can redistribute it and/or modify
+# it under the terms of the GNU General Public License as published by
+# the Free Software Foundation; version 2 of the License.
+#
+# This program is distributed in the hope that it will be useful,
+# but WITHOUT ANY WARRANTY; without even the implied warranty of
+# MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
+# GNU General Public License for more details.
+#
+# You should have received a copy of the GNU General Public License
+# along with this program; if not, write to the Free Software
+# Foundation, Inc., 51 Franklin Street, Fifth Floor, Boston,
+# MA 02110-1301, USA.  A copy of the GNU General Public License is
+# also available at http://www.gnu.org/copyleft/gpl.html.
+
+def condormatic_instance_create(task)
+
+  begin
+    instance = task.instance
+    # FIXME: We should be using the realm name and matching it in condor.
+    realm = instance.realm.external_key rescue nil
+
+    job_name = "job_#{instance.name}_#{instance.id}"
+
+    # I use the 2>&1 to get stderr and stdout together because popen3 does
+    # not support the ability to get the exit value of the command in ruby 1.8.
+    pipe = IO.popen("condor_submit 2>&1", "w+")
+    pipe.puts "universe = grid\n"
+    Rails.logger.info "universe = grid\n"
+    pipe.puts "executable = #{job_name}\n"
+    Rails.logger.info "executable = #{job_name}\n"
+    pipe.puts "grid_resource = dcloud $$(provider_url) $$(username) $$(password) $$(image_key) #{instance.name} NULL $$(hardwareprofile_key)\n"
+    Rails.logger.info "grid_resource = dcloud $$(provider_url) $$(username) $$(password) $$(image_key) #{instance.name} NULL $$(hardwareprofile_key)\n"
+    pipe.puts "log = #{job_name}.log\n"
+    Rails.logger.info "log = #{job_name}.log\n"
+    pipe.puts "requirements = hardwareprofile == \"#{instance.hardware_profile.id}\" && image == \"#{instance.image.id}\"\n"
+    Rails.logger.info "requirements = hardwareprofile == \"#{instance.hardware_profile.id}\" && image == \"#{instance.image.id}\"\n"
+    pipe.puts "notification = never\n"
+    Rails.logger.info "notification = never\n"
+    pipe.puts "queue\n"
+    Rails.logger.info "queue\n"
+    pipe.close_write
+    out = pipe.read
+    pipe.close
+
+    Rails.logger.info "$? (return value?) is #{$?}"
+    raise ("Error calling condor_submit: #{out}") if $? != 0
+
+    instance.condor_job_id = job_name
+    instance.save!
+
+  rescue Exception => ex
+    task.state = Task::STATE_FAILED
+    Rails.logger.error ex.message
+    Rails.logger.error ex.backtrace
+  else
+    # FIXME: We're kinda lying here.. we don't know the state for the task
+    # but I don't think that matters so much as we are just going to use the
+    # 'task' table as a kind of audit log.
+    task.state = Task::STATE_PENDING
+  end
+  task.instance.save!
+end
+
+# JobStatus for condor jobs:
+#
+#  0  Unexpanded      U
+#  1  Idle            I
+#  2  Running         R
+#  3  Removed         X
+#  4  Completed       C
+#  5  Held            H
+#  6  Submission_err  E
+#
+
+def condor_to_instance_state(state_val)
+  case state_val
+  when '0'
+    return Instance::STATE_PENDING
+  when '1'
+    return Instance::STATE_PENDING
+  when '2'
+    return Instance::STATE_RUNNING
+  when '3'
+    return Instance::STATE_STOPPED
+  when '4'
+    return Instance::STATE_STOPPED
+  when '5'
+    return Instance::STATE_CREATE_FAILED
+  when '6'
+    return Instance::STATE_CREATE_FAILED
+  else
+    return Instance::STATE_PENDING
+  end
+end
+
+def condormatic_instances_sync_states
+
+  begin
+    # I'm not going to do the 2&>1 trick here since we are parsing the output
+    # and I'm afraid we'll get a warning or something on stderr and it'll mess
+    # up the xml parsing.
+    pipe = IO.popen("condor_q -xml")
+    xml = pipe.read
+    pipe.close
+
+    raise ("Error calling condor_q -xml") if $? != 0
+
+    # Set them all to 'stopped' because if they aren't in the condor
+    # queue as jobs then they are not running, pending or anything else.
+    instances = Instance.find(:all)
+    instances.each do |instance|
+      instance.state = Instance::STATE_STOPPED
+      instance.save!
+    end
+
+    def find_value_int(job_ele, attrib)
+      if job_ele.attributes['n'] == attrib
+        cmd = job_ele.elements.each('i') do |i|
+          return i.text
+        end
+      end
+      return nil
+    end
+
+    def find_value_str(job_ele, attrib)
+      if job_ele.attributes['n'] == attrib
+        cmd = job_ele.elements.each('s') do |s|
+          return s.text
+        end
+      end
+      return nil
+    end
+
+    doc = REXML::Document.new(xml)
+    doc.elements.each('classads/c') do |jobs_ele|
+      job_name = nil
+      job_state = nil
+
+      jobs_ele.elements.each('a') do |job_ele|
+        value = find_value_str(job_ele, 'Cmd')
+        job_name = value if value != nil
+        value = find_value_int(job_ele, 'JobStatus')
+        job_state = value if value != nil
+      end
+
+      Rails.logger.info "job name is #{job_name}"
+      Rails.logger.info "job state is #{job_state}"
+
+      instance = Instance.find(:first, :conditions => {:condor_job_id => job_name})
+
+      if instance
+        instance.state = condor_to_instance_state(job_state)
+        instance.save!
+        Rails.logger.info "Instance state updated to #{condor_to_instance_state(job_state)}"
+      end
+    end
+  rescue Exception => ex
+    Rails.logger.error ex.message
+    Rails.logger.error ex.backtrace
+  end
+end
+
+def condormatic_instance_stop(task)
+  instance = task.instance
+
+  Rails.logger.info("calling condor_rm -constraint 'Cmd == \"#{instance.condor_job_id}\"' 2>&1")
+  pipe = IO.popen("condor_rm -constraint 'Cmd == \"#{instance.condor_job_id}\"' 2>&1")
+  out = pipe.read
+  pipe.close
+
+  Rails.logger.info("condor_rm return status is #{$?}")
+  Rails.logger.error("Error calling condor_rm (exit code #{$?}) on job: #{out}") if $? != 0
+end
+
+def condormatic_instance_destroy(task)
+  instance = task.instance
+
+  Rails.logger.info("calling condor_rm -constraint 'Cmd == \"#{instance.condor_job_id}\"' 2>&1")
+  pipe = IO.popen("condor_rm -constraint 'Cmd == \"#{instance.condor_job_id}\"' 2>&1")
+  out = pipe.read
+  pipe.close
+
+  Rails.logger.info("condor_rm return status is #{$?}")
+  Rails.logger.error("Error calling condor_rm (exit code #{$?}) on job: #{out}") if $? != 0
+end
+
+
+def condormatic_classads_sync
+
+  index = 0
+  providers = Provider.find(:all)
+  Rails.logger.info "Syncing classads.."
+
+  providers.each do |provider|
+    provider.cloud_accounts.each do |account|
+      provider.images.each do |image|
+        provider.hardware_profiles.each do |hwp|
+          pipe = IO.popen("condor_advertise UPDATE_STARTD_AD 2>&1", "w+")
+
+          pipe.puts "Name=\"provider_combination_#{index}\""
+          pipe.puts 'MyType="Machine"'
+          pipe.puts 'Requirements=true'
+          pipe.puts "\n# Stuff needed to match:"
+          pipe.puts "hardwareprofile=\"#{hwp.aggregator_hardware_profiles[0].id}\""
+          pipe.puts "image=\"#{image.aggregator_images[0].id}\""
+          pipe.puts "\n# Backend info to complete this job:"
+          pipe.puts "image_key=\"#{image.external_key}\""
+          pipe.puts "hardwareprofile_key=\"#{hwp.external_key}\""
+          pipe.puts "provider_url=\"#{account.provider.url}\""
+          pipe.puts "username=\"#{account.username}\""
+          pipe.puts "password=\"#{account.password}\""
+          pipe.close_write
+
+          out = pipe.read
+          pipe.close
+
+          Rails.logger.error "Unable to submit condor classad: #{out}" if $? != 0
+
+          index += 1
+        end
+      end
+    end
+
+    Rails.logger.info "done"
+  end
+end
+
diff --git a/src/config/environment.rb b/src/config/environment.rb
index 919a710..eb11f17 100644
--- a/src/config/environment.rb
+++ b/src/config/environment.rb
@@ -50,7 +50,7 @@ Rails::Initializer.run do |config|
   config.gem "gnuplot"
   config.gem "scruffy"
-  config.active_record.observers = :instance_observer, :task_observer
+  config.active_record.observers = :instance_observer, :task_observer, :hardware_profile_observer, :image_observer
   # Only load the plugins named here, in the order given. By default, all plugins
   # in vendor/plugins are loaded in alphabetical order.
   # :all can be used as a placeholder for all plugins not explicitly named
diff --git a/src/config/initializers/condor_classads_sync.rb b/src/config/initializers/condor_classads_sync.rb
new file mode 100644
index 0000000..9165f75
--- /dev/null
+++ b/src/config/initializers/condor_classads_sync.rb
@@ -0,0 +1,8 @@
+require 'util/condormatic'
+
+puts "Syncing condor classads.."
+# This pulls all the possible classad matches from the database and puts
+# them on condor on startup.
+condormatic_classads_sync
+puts "Done."
+
diff --git a/src/db/migrate/20090804142049_create_instances.rb b/src/db/migrate/20090804142049_create_instances.rb
index 335b93f..42706e1 100644
--- a/src/db/migrate/20090804142049_create_instances.rb
+++ b/src/db/migrate/20090804142049_create_instances.rb
@@ -32,6 +32,7 @@ class CreateInstances < ActiveRecord::Migration
       t.string :public_address
       t.string :private_address
       t.string :state
+      t.string :condor_job_id
       t.integer :lock_version, :default => 0
       t.integer :acc_pending_time, :default => 0
       t.integer :acc_running_time, :default => 0
This patch is a first try at using condor as a job management system. It removes the 'taskomatic' utilities and replaces them with 'condormatic' calls that drive condor through its command-line interfaces (no qmf, gsoap, etc.).
On startup of the server (and on any changes after it is running), a pile of 'classads' is created. These define each possible startup location for the set of image/hardware profiles that exist and are usable, along with the backend info condor needs to start an instance on the given provider.
For each instance that you start, a job is created in condor. Condor then matches the hardware profile and image to a provider and starts an instance on that provider. When you stop or destroy that instance, the job is removed (which isn't really how we want it to go, but..).
This patch requires that you have our custom hacked-up condor installed. You can get it at:
http://people.redhat.com/clalance/condor-dcloud
Be sure to read the README. Chris has written up very good instructions on how to set up condor.
REVISED: This patch, together with the new condor, fixes a number of the bugs that were in the previous patch. It adds realm matching support and fixes the start/stop issues we were seeing. So most things basically work now and I think it's generally usable. Probably the biggest outstanding usability bug is that we do not keep long-running jobs for stateful instances.
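The realm matching mentioned here comes down to composing the submit-file requirements expression, with the realm clause appended only when the instance has a realm. A minimal standalone sketch of that composition (the method name is hypothetical; the expression format follows the patch):

```ruby
# Sketch of how the patch composes the condor 'requirements' line, appending
# the realm clause only when a realm is present. Method name is hypothetical.
def requirements_expr(hwp_id, image_id, realm_name = nil)
  expr = "requirements = hardwareprofile == \"#{hwp_id}\" && image == \"#{image_id}\""
  expr += " && realm == \"#{realm_name}\"" if realm_name
  expr
end

puts requirements_expr(3, 7)             # no realm clause
puts requirements_expr(3, 7, "us-east")  # realm clause appended
```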
The outstanding bugs are now limited to:
- To 'stop' a job in condor we should be using 'hold' instead of removing the job. This is creating a few different problems.
- We are still reaching directly into the DeltaCloud API to get the list of available actions for each instance. Maybe this is fine; I'm not sure.
- Quotas are not yet implemented.
- Classads are synced to condor on startup and on any changes to the hardware profile and image records. However, if you restart condor you won't have any classads in it to match against, and your jobs will fail.
- We're still using 'on-demand' syncing of states from condor to the aggregator, e.g. when you list the instances it updates the state of each instance from condor at that time. There is no event logging.
- There's no 'reboot' as yet in condor. Not sure how we'll deal with that just yet.
- We've kept the tasks model and usage, but they are quasi-meaningless. The task table needs to turn into an event/audit log table.
Signed-off-by: Ian Main <imain@redhat.com>
---
 src/app/controllers/instance_controller.rb        |   17 +-
 src/app/controllers/pool_controller.rb            |    5 +-
 src/app/models/hardware_profile_observer.rb       |    9 +
 src/app/models/image_observer.rb                  |    9 +
 src/app/util/condormatic.rb                       |  241 +++++++++++++++++++++
 src/config/environment.rb                         |    2 +-
 src/config/initializers/condor_classads_sync.rb   |    8 +
 src/db/migrate/20090804142049_create_instances.rb |    1 +
 8 files changed, 284 insertions(+), 8 deletions(-)
 create mode 100644 src/app/models/hardware_profile_observer.rb
 create mode 100644 src/app/models/image_observer.rb
 create mode 100644 src/app/util/condormatic.rb
 create mode 100644 src/config/initializers/condor_classads_sync.rb
diff --git a/src/app/controllers/instance_controller.rb b/src/app/controllers/instance_controller.rb
index 039ed3a..5664ec5 100644
--- a/src/app/controllers/instance_controller.rb
+++ b/src/app/controllers/instance_controller.rb
@@ -19,7 +19,7 @@
 # Filters added to this controller apply to all controllers in the application.
 # Likewise, all the methods added will be available for all controllers.
-require 'util/taskomatic'
+require 'util/condormatic'
 class InstanceController < ApplicationController
   before_filter :require_user
@@ -96,8 +96,7 @@ class InstanceController < ApplicationController
                                  :task_target => @instance,
                                  :action => InstanceTask::ACTION_CREATE})
     if @task.save
-      task_impl = Taskomatic.new(@task,logger)
-      task_impl.instance_create
+      condormatic_instance_create(@task)
       flash[:notice] = "Instance added."
       redirect_to :controller => "pool", :action => 'show', :id => @instance.pool_id
     else
@@ -124,8 +123,16 @@ class InstanceController < ApplicationController
       raise ActionError.new("#{action} cannot be performed on this instance.")
     end
-    task_impl = Taskomatic.new(@task,logger)
-    task_impl.send "instance_#{action}"
+    case action
+    when 'stop'
+      condormatic_instance_stop(@task)
+    when 'destroy'
+      condormatic_instance_destroy(@task)
+    when 'start'
+      condormatic_instance_create(@task)
+    else
+      raise ActionError.new("Sorry, action '#{action}' is currently not supported by condor backend.")
+    end
     alert = "#{@instance.name}: #{action} was successfully queued."
     flash[:notice] = alert
diff --git a/src/app/controllers/pool_controller.rb b/src/app/controllers/pool_controller.rb
index e687c0b..9d53862 100644
--- a/src/app/controllers/pool_controller.rb
+++ b/src/app/controllers/pool_controller.rb
@@ -20,6 +20,7 @@
 # Likewise, all the methods added will be available for all controllers.
 require 'util/taskomatic'
+require 'util/condormatic'
 class PoolController < ApplicationController
   before_filter :require_user
@@ -36,8 +37,8 @@ class PoolController < ApplicationController
     #FIXME: clean this up, many error cases here
     @pool = Pool.find(params[:id])
     require_privilege(Privilege::INSTANCE_VIEW,@pool)
-    # pass nil into Taskomatic as we're not working off a task here
-    Taskomatic.new(nil,logger).pool_refresh(@pool)
+    # Go to condor and sync the database to the real instance states
+    condormatic_instances_sync_states
     @pool.reload
     @instances = @pool.instances
   end
diff --git a/src/app/models/hardware_profile_observer.rb b/src/app/models/hardware_profile_observer.rb
new file mode 100644
index 0000000..c924bdb
--- /dev/null
+++ b/src/app/models/hardware_profile_observer.rb
@@ -0,0 +1,9 @@
+class HardwareProfileObserver < ActiveRecord::Observer
+
+  def after_save(hwp)
+    condormatic_classads_sync
+  end
+end
+
+HardwareProfileObserver.instance
+
diff --git a/src/app/models/image_observer.rb b/src/app/models/image_observer.rb
new file mode 100644
index 0000000..68a5b85
--- /dev/null
+++ b/src/app/models/image_observer.rb
@@ -0,0 +1,9 @@
+class ImageObserver < ActiveRecord::Observer
+
+  def after_save(image)
+    condormatic_classads_sync
+  end
+end
+
+ImageObserver.instance
+
diff --git a/src/app/util/condormatic.rb b/src/app/util/condormatic.rb
new file mode 100644
index 0000000..5c1b582
--- /dev/null
+++ b/src/app/util/condormatic.rb
@@ -0,0 +1,241 @@
+#
+# Copyright (C) 2010 Red Hat, Inc.
+# Written by Ian Main <imain@redhat.com>
+#
+# This program is free software; you can redistribute it and/or modify
+# it under the terms of the GNU General Public License as published by
+# the Free Software Foundation; version 2 of the License.
+#
+# This program is distributed in the hope that it will be useful,
+# but WITHOUT ANY WARRANTY; without even the implied warranty of
+# MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
+# GNU General Public License for more details.
+#
+# You should have received a copy of the GNU General Public License
+# along with this program; if not, write to the Free Software
+# Foundation, Inc., 51 Franklin Street, Fifth Floor, Boston,
+# MA 02110-1301, USA.  A copy of the GNU General Public License is
+# also available at http://www.gnu.org/copyleft/gpl.html.
+
+def condormatic_instance_create(task)
+
+  begin
+    instance = task.instance
+    realm = instance.realm rescue nil
+
+    job_name = "job_#{instance.name}_#{instance.id}"
+
+    # I use the 2>&1 to get stderr and stdout together because popen3 does
+    # not support the ability to get the exit value of the command in ruby 1.8.
+    pipe = IO.popen("condor_submit 2>&1", "w+")
+    pipe.puts "universe = grid\n"
+    Rails.logger.info "universe = grid\n"
+    pipe.puts "executable = #{job_name}\n"
+    Rails.logger.info "executable = #{job_name}\n"
+    pipe.puts "grid_resource = dcloud $$(provider_url) $$(username) $$(password) $$(image_key) #{instance.name} $$(realm_key) $$(hardwareprofile_key)\n"
+    Rails.logger.info "grid_resource = dcloud $$(provider_url) $$(username) $$(password) $$(image_key) #{instance.name} $$(realm_key) $$(hardwareprofile_key)\n"
+    pipe.puts "log = #{job_name}.log\n"
+    Rails.logger.info "log = #{job_name}.log\n"
+
+    requirements = "requirements = hardwareprofile == \"#{instance.hardware_profile.id}\" && image == \"#{instance.image.id}\""
+    requirements += " && realm == \"#{realm.name}\"" if realm != nil
+    requirements += "\n"
+
+    pipe.puts requirements
+    Rails.logger.info requirements
+
+    pipe.puts "notification = never\n"
+    Rails.logger.info "notification = never\n"
+    pipe.puts "queue\n"
+    Rails.logger.info "queue\n"
+    pipe.close_write
+    out = pipe.read
+    pipe.close
+
+    Rails.logger.info "$? (return value?) is #{$?}"
+    raise ("Error calling condor_submit: #{out}") if $? != 0
+
+    instance.condor_job_id = job_name
+    instance.save!
+
+  rescue Exception => ex
+    task.state = Task::STATE_FAILED
+    Rails.logger.error ex.message
+    Rails.logger.error ex.backtrace
+  else
+    # FIXME: We're kinda lying here.. we don't know the state for the task
+    # but I don't think that matters so much as we are just going to use the
+    # 'task' table as a kind of audit log.
+    task.state = Task::STATE_PENDING
+  end
+  task.instance.save!
+end
+
+# JobStatus for condor jobs:
+#
+#  0  Unexpanded      U
+#  1  Idle            I
+#  2  Running         R
+#  3  Removed         X
+#  4  Completed       C
+#  5  Held            H
+#  6  Submission_err  E
+#
+
+def condor_to_instance_state(state_val)
+  case state_val
+  when '0'
+    return Instance::STATE_PENDING
+  when '1'
+    return Instance::STATE_PENDING
+  when '2'
+    return Instance::STATE_RUNNING
+  when '3'
+    return Instance::STATE_STOPPED
+  when '4'
+    return Instance::STATE_STOPPED
+  when '5'
+    return Instance::STATE_CREATE_FAILED
+  when '6'
+    return Instance::STATE_CREATE_FAILED
+  else
+    return Instance::STATE_PENDING
+  end
+end
+
+def condormatic_instances_sync_states
+
+  begin
+    # I'm not going to do the 2&>1 trick here since we are parsing the output
+    # and I'm afraid we'll get a warning or something on stderr and it'll mess
+    # up the xml parsing.
+    pipe = IO.popen("condor_q -xml")
+    xml = pipe.read
+    pipe.close
+
+    raise ("Error calling condor_q -xml") if $? != 0
+
+    # Set them all to 'stopped' because if they aren't in the condor
+    # queue as jobs then they are not running, pending or anything else.
+    instances = Instance.find(:all)
+    instances.each do |instance|
+      instance.state = Instance::STATE_STOPPED
+      instance.save!
+    end
+
+    def find_value_int(job_ele, attrib)
+      if job_ele.attributes['n'] == attrib
+        cmd = job_ele.elements.each('i') do |i|
+          return i.text
+        end
+      end
+      return nil
+    end
+
+    def find_value_str(job_ele, attrib)
+      if job_ele.attributes['n'] == attrib
+        cmd = job_ele.elements.each('s') do |s|
+          return s.text
+        end
+      end
+      return nil
+    end
+
+    doc = REXML::Document.new(xml)
+    doc.elements.each('classads/c') do |jobs_ele|
+      job_name = nil
+      job_state = nil
+
+      jobs_ele.elements.each('a') do |job_ele|
+        value = find_value_str(job_ele, 'Cmd')
+        job_name = value if value != nil
+        value = find_value_int(job_ele, 'JobStatus')
+        job_state = value if value != nil
+      end
+
+      Rails.logger.info "job name is #{job_name}"
+      Rails.logger.info "job state is #{job_state}"
+
+      instance = Instance.find(:first, :conditions => {:condor_job_id => job_name})
+
+      if instance
+        instance.state = condor_to_instance_state(job_state)
+        instance.save!
+        Rails.logger.info "Instance state updated to #{condor_to_instance_state(job_state)}"
+      end
+    end
+  rescue Exception => ex
+    Rails.logger.error ex.message
+    Rails.logger.error ex.backtrace
+  end
+end
+
+def condormatic_instance_stop(task)
+  instance = task.instance
+
+  Rails.logger.info("calling condor_rm -constraint 'Cmd == \"#{instance.condor_job_id}\"' 2>&1")
+  pipe = IO.popen("condor_rm -constraint 'Cmd == \"#{instance.condor_job_id}\"' 2>&1")
+  out = pipe.read
+  pipe.close
+
+  Rails.logger.info("condor_rm return status is #{$?}")
+  Rails.logger.error("Error calling condor_rm (exit code #{$?}) on job: #{out}") if $? != 0
+end
+
+def condormatic_instance_destroy(task)
+  instance = task.instance
+
+  Rails.logger.info("calling condor_rm -constraint 'Cmd == \"#{instance.condor_job_id}\"' 2>&1")
+  pipe = IO.popen("condor_rm -constraint 'Cmd == \"#{instance.condor_job_id}\"' 2>&1")
+  out = pipe.read
+  pipe.close
+
+  Rails.logger.info("condor_rm return status is #{$?}")
+  Rails.logger.error("Error calling condor_rm (exit code #{$?}) on job: #{out}") if $? != 0
+end
+
+
+def condormatic_classads_sync
+
+  index = 0
+  providers = Provider.find(:all)
+  Rails.logger.info "Syncing classads.."
+
+  providers.each do |provider|
+    provider.cloud_accounts.each do |account|
+      provider.images.each do |image|
+        provider.hardware_profiles.each do |hwp|
+          provider.realms.each do |realm|
+            pipe = IO.popen("condor_advertise UPDATE_STARTD_AD 2>&1", "w+")
+
+            pipe.puts "Name=\"provider_combination_#{index}\""
+            pipe.puts 'MyType="Machine"'
+            pipe.puts 'Requirements=true'
+            pipe.puts "\n# Stuff needed to match:"
+            pipe.puts "hardwareprofile=\"#{hwp.aggregator_hardware_profiles[0].id}\""
+            pipe.puts "image=\"#{image.aggregator_images[0].id}\""
+            pipe.puts "realm=\"#{realm.name}\""
+            pipe.puts "\n# Backend info to complete this job:"
+            pipe.puts "image_key=\"#{image.external_key}\""
+            pipe.puts "hardwareprofile_key=\"#{hwp.external_key}\""
+            pipe.puts "realm_key=\"#{realm.external_key}\""
+            pipe.puts "provider_url=\"#{account.provider.url}\""
+            pipe.puts "username=\"#{account.username}\""
+            pipe.puts "password=\"#{account.password}\""
+            pipe.close_write
+
+            out = pipe.read
+            pipe.close
+
+            Rails.logger.error "Unable to submit condor classad: #{out}" if $? != 0
+
+            index += 1
+          end
+        end
+      end
+    end
+
+    Rails.logger.info "done"
+  end
+end
+
diff --git a/src/config/environment.rb b/src/config/environment.rb
index 919a710..eb11f17 100644
--- a/src/config/environment.rb
+++ b/src/config/environment.rb
@@ -50,7 +50,7 @@ Rails::Initializer.run do |config|
   config.gem "gnuplot"
   config.gem "scruffy"
-  config.active_record.observers = :instance_observer, :task_observer
+  config.active_record.observers = :instance_observer, :task_observer, :hardware_profile_observer, :image_observer
   # Only load the plugins named here, in the order given. By default, all plugins
   # in vendor/plugins are loaded in alphabetical order.
   # :all can be used as a placeholder for all plugins not explicitly named
diff --git a/src/config/initializers/condor_classads_sync.rb b/src/config/initializers/condor_classads_sync.rb
new file mode 100644
index 0000000..9165f75
--- /dev/null
+++ b/src/config/initializers/condor_classads_sync.rb
@@ -0,0 +1,8 @@
+require 'util/condormatic'
+
+puts "Syncing condor classads.."
+# This pulls all the possible classad matches from the database and puts
+# them on condor on startup.
+condormatic_classads_sync
+puts "Done."
+
diff --git a/src/db/migrate/20090804142049_create_instances.rb b/src/db/migrate/20090804142049_create_instances.rb
index 335b93f..42706e1 100644
--- a/src/db/migrate/20090804142049_create_instances.rb
+++ b/src/db/migrate/20090804142049_create_instances.rb
@@ -32,6 +32,7 @@ class CreateInstances < ActiveRecord::Migration
       t.string :public_address
       t.string :private_address
      t.string :state
+      t.string :condor_job_id
       t.integer :lock_version, :default => 0
       t.integer :acc_pending_time, :default => 0
       t.integer :acc_running_time, :default => 0
Minor errors first:
warning: 4 lines add whitespace errors.
src/app/models/hardware_profile_observer.rb:1: new blank line at EOF.
src/app/models/image_observer.rb:1: new blank line at EOF.
src/app/util/condormatic.rb:1: new blank line at EOF.
src/config/initializers/condor_classads_sync.rb:1: new blank line at EOF.
More troublesome: the call in the initializer kills the ability to run a migration on a new database, because it tries to select from a table that doesn't exist yet. The solution may be to put this in an after-initializer block: http://guides.rubyonrails.org/configuring.html#using-an-after-initializer
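One shape that fix might take (an untested sketch against Rails 2.x-style configuration, per the guide linked above; the `table_exists?` guard is my own assumption, not something from the patch):

```ruby
# Hypothetical sketch: instead of calling condormatic_classads_sync at
# initializer load time, defer it to after_initialize so migrations on an
# empty database can still run. Untested against this codebase.
Rails::Initializer.run do |config|
  config.after_initialize do
    require 'util/condormatic'
    # Only sync if the tables the sync reads from actually exist yet.
    if ActiveRecord::Base.connection.table_exists?('providers')
      condormatic_classads_sync
    end
  end
end
```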
I bypassed that with a simple comment, then added a provider (fine). However, when I added an account, I got the following error:
RuntimeError (Called id for nil, which would mistakenly be 4 -- if you really wanted the id of nil, use object_id):
  app/util/condormatic.rb:216:in `condormatic_classads_sync'
  /usr/lib/ruby/gems/1.8/gems/will_paginate-2.3.14/lib/will_paginate/finder.rb:168:in `method_missing'
  /usr/lib/ruby/gems/1.8/gems/will_paginate-2.3.14/lib/will_paginate/finder.rb:168:in `method_missing'
  app/util/condormatic.rb:208:in `condormatic_classads_sync'
  /usr/lib/ruby/gems/1.8/gems/will_paginate-2.3.14/lib/will_paginate/finder.rb:168:in `method_missing'
  /usr/lib/ruby/gems/1.8/gems/will_paginate-2.3.14/lib/will_paginate/finder.rb:168:in `method_missing'
  app/util/condormatic.rb:207:in `condormatic_classads_sync'
  /usr/lib/ruby/gems/1.8/gems/will_paginate-2.3.14/lib/will_paginate/finder.rb:168:in `method_missing'
  /usr/lib/ruby/gems/1.8/gems/will_paginate-2.3.14/lib/will_paginate/finder.rb:168:in `method_missing'
  app/util/condormatic.rb:206:in `condormatic_classads_sync'
  /usr/lib/ruby/gems/1.8/gems/will_paginate-2.3.14/lib/will_paginate/finder.rb:168:in `method_missing'
  /usr/lib/ruby/gems/1.8/gems/will_paginate-2.3.14/lib/will_paginate/finder.rb:168:in `method_missing'
  app/util/condormatic.rb:205:in `condormatic_classads_sync'
  app/util/condormatic.rb:204:in `each'
  app/util/condormatic.rb:204:in `condormatic_classads_sync'
  app/models/image_observer.rb:4:in `after_save'
  /usr/lib/ruby/1.8/observer.rb:185:in `notify_observers'
  /usr/lib/ruby/1.8/observer.rb:184:in `each'
  /usr/lib/ruby/1.8/observer.rb:184:in `notify_observers'
  app/models/cloud_account.rb:114:in `populate_realms_and_images'
  app/models/cloud_account.rb:101:in `each'
  app/models/cloud_account.rb:101:in `populate_realms_and_images'
  app/models/cloud_account.rb:88:in `populate_realms_and_images'
  app/controllers/cloud_accounts_controller.rb:42:in `create'
  haml (2.2.24) rails/./lib/sass/plugin/rails.rb:20:in `process'
I'll poke around more, see if I can figure out the issue, but can't test much further till this is resolved.
On Tue, 2010-06-29 at 15:23 -0400, Jason Guiditta wrote:
Minor errors first:
warning: 4 lines add whitespace errors.
src/app/models/hardware_profile_observer.rb:1: new blank line at EOF.
src/app/models/image_observer.rb:1: new blank line at EOF.
src/app/util/condormatic.rb:1: new blank line at EOF.
src/config/initializers/condor_classads_sync.rb:1: new blank line at EOF.
Believe it or not, through many years of my being a software developer this was a requirement :)
eg:
http://www.microchip.com/forums/m439495-print.aspx
So I stubbornly continue to do this.
More troublesome: the call in the initializer kills the ability to run a migration on a new database, as it tries to select from a non-existent table. The solution may be to put this in an after_initializer: http://guides.rubyonrails.org/configuring.html#using-an-after-initializer
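Something along these lines might work -- a rough sketch against the Rails 2.3-era `Rails::Initializer`, untested, with the logging call purely illustrative:

```ruby
# src/config/environment.rb -- hypothetical sketch, not the actual patch.
Rails::Initializer.run do |config|
  # ... existing configuration ...

  # Defer the classad sync until after Rails (and the models) are fully
  # loaded, and swallow errors so rake db:migrate can still run against
  # a database that doesn't exist yet.
  config.after_initialize do
    begin
      condormatic_classads_sync
    rescue Exception => ex
      # Likely fails under rake tasks before the schema exists; that's ok.
      RAILS_DEFAULT_LOGGER.warn("classad sync skipped: #{ex.message}") if defined?(RAILS_DEFAULT_LOGGER)
    end
  end
end
```

That keeps migrations runnable while still syncing classads on normal server startup.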
I bypassed that with a simple comment, then added a provider (fine). However, when I added an account, I got the following error:
So it appears I have not rebased to the latest 'next'. I'll do that, resolve issues and post a new patch. I don't think using an initializer is so much the problem as that the function/method it calls is borked.
Thanks!
Ian
Ian Main wrote:
On Tue, 2010-06-29 at 15:23 -0400, Jason Guiditta wrote:
Minor errors first:
warning: 4 lines add whitespace errors.
src/app/models/hardware_profile_observer.rb:1: new blank line at EOF.
src/app/models/image_observer.rb:1: new blank line at EOF.
src/app/util/condormatic.rb:1: new blank line at EOF.
src/config/initializers/condor_classads_sync.rb:1: new blank line at EOF.
Believe it or not, through many years of my being a software developer this was a requirement :)
eg:
http://www.microchip.com/forums/m439495-print.aspx
So I stubbornly continue to do this.
By all means keep ending your files w/ newline -- 'no newline' confuses diff, and I'm not sure emacs will even let you save a file without it adding a newline at the end for you.
However I think the whitespace error noted above by git was blank empty lines at the end of the file -- i.e. 2 or more newlines at the end of the file. That's different and, I believe, not there in many of our files right now.
Scott
On Wed, 2010-06-30 at 08:54 -0400, Scott Seago wrote:
Ian Main wrote:
On Tue, 2010-06-29 at 15:23 -0400, Jason Guiditta wrote:
Minor errors first:
warning: 4 lines add whitespace errors.
src/app/models/hardware_profile_observer.rb:1: new blank line at EOF.
src/app/models/image_observer.rb:1: new blank line at EOF.
src/app/util/condormatic.rb:1: new blank line at EOF.
src/config/initializers/condor_classads_sync.rb:1: new blank line at EOF.
Believe it or not, through many years of my being a software developer this was a requirement :)
eg:
http://www.microchip.com/forums/m439495-print.aspx
So I stubbornly continue to do this.
By all means keep ending your files w/ newline -- 'no newline' confuses diff, and I'm not sure emacs will even let you save a file without it adding a newline at the end for you.
However I think the whitespace error noted above by git was blank empty lines at the end of the file -- i.e. 2 or more newlines at the end of the file. That's different and, I believe, not there in many of our files right now.
Scott
Yeah, it's all good, I'll remove them for the next patch. :) It's just a habit I picked up along the way.
Ian
This patch is a first try at using condor as a job management system. This removes the usage of the 'taskomatic' utilities and replaces them with 'condormatic' calls that use the command line interfaces (no qmf or gsoap etc) to condor.
On startup of the server (and on any changes after it's running), a pile of 'classads' are created. These define each possible startup location for a given set of image/hardware profiles that exist and are usable, as well as the backend info condor needs to start an instance on the given provider.
For each instance that you start, a job will be created in condor. Condor will then match the hardware profile and image to a provider and can then start an instance on that provider. When you stop or destroy that instance, the job will be removed (which isn't really how we want it to go but..).
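To make the matching concrete, here's a rough sketch (illustrative helper name and arguments -- not the patch's exact code) of the kind of submit description condormatic feeds to condor_submit. The job's `requirements` expression is what condor matches against the advertised provider classads:

```ruby
# Hypothetical sketch: build a condor submit description for an instance.
# The job is matched on hardwareprofile/image (and optionally realm); the
# $$(...) references are filled in from the matched provider classad.
def build_submit_description(job_name, hwp_id, image_id, realm_name = nil)
  requirements = "hardwareprofile == \"#{hwp_id}\" && image == \"#{image_id}\""
  requirements += " && realm == \"#{realm_name}\"" unless realm_name.nil?

  desc = "universe = grid\n"
  desc << "executable = #{job_name}\n"
  desc << "grid_resource = dcloud $$(provider_url) $$(username) $$(password) $$(image_key) $$(realm_key) $$(hardwareprofile_key)\n"
  desc << "log = #{job_name}.log\n"
  desc << "requirements = #{requirements}\n"
  desc << "notification = never\n"
  desc << "queue\n"
  desc
end
```

The resulting text would be written to `condor_submit` on stdin; omitting the realm simply drops that clause from the requirements.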
This patch requires that you have our custom hacked up condor installed. You can get this at:
http://people.redhat.com/clalance/condor-dcloud
Be sure to read the README. Chris has written up very good instructions on how to set up condor.
REVISED: This patch plus the new condor fixes a number of the bugs that were in the previous patch. This patch adds realm matching support and fixes the start/stop issues we were seeing. So most things basically work now and I think it's generally usable. Probably the biggest outstanding bug for usability is that we do not keep long-running jobs for stateful instances.
The outstanding bugs are now limited to:
- To 'stop' a job in condor we should be using 'hold' instead of removing the job. This is creating a few different problems.
- We are still reaching directly to the DeltaCloud API to get a list of available actions for each instance. Maybe this is fine, I'm not sure.
- Quotas are not yet implemented.
- Classads are sync'd to condor on startup and on any changes to the hardware profile and image records. However, if you restart condor you won't have any classads in it to match against and your jobs will fail.
- We're still using 'on-demand' syncing of states from condor to the aggregator. e.g. when you list the instances it updates the states of each instance from condor at that time. There is no event logging.
- There's no 'reboot' as yet in condor. Not sure how we'll deal with that just yet.
- We've kept the tasks model and usage but they are quasi-meaningless. The task table needs to turn into an event/audit log table.
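For the first item, the hold-based flow we want for 'stop' might look something like this from the shell (standard condor tools; the `Cmd` constraint mirrors how condormatic tags its jobs, and the job name is illustrative):

```shell
# 'stop': put the job on hold instead of removing it, so condor keeps it.
condor_hold -constraint 'Cmd == "job_myinstance_1"'

# A later 'start' would then just release the held job.
condor_release -constraint 'Cmd == "job_myinstance_1"'

# Only an actual 'destroy' would remove the job outright.
condor_rm -constraint 'Cmd == "job_myinstance_1"'
```

These commands need a running condor installation, so treat this as a sketch of the intended flow rather than something you can paste in as-is.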
Signed-off-by: Ian Main <imain@redhat.com>
---
 src/app/controllers/cloud_accounts_controller.rb  |    1 +
 src/app/controllers/instance_controller.rb        |   17 +-
 src/app/controllers/pool_controller.rb            |    6 +-
 src/app/controllers/provider_controller.rb        |    1 +
 src/app/util/condormatic.rb                       |  240 +++++++++++++++++++++
 src/config/environment.rb                         |   12 +-
 src/db/migrate/20090804142049_create_instances.rb |    1 +
 7 files changed, 270 insertions(+), 8 deletions(-)
 create mode 100644 src/app/util/condormatic.rb
diff --git a/src/app/controllers/cloud_accounts_controller.rb b/src/app/controllers/cloud_accounts_controller.rb
index 96eb9af..9343f62 100644
--- a/src/app/controllers/cloud_accounts_controller.rb
+++ b/src/app/controllers/cloud_accounts_controller.rb
@@ -45,6 +45,7 @@ class CloudAccountsController < ApplicationController
     else
       render :action => "new"
     end
+    condormatic_classads_sync
   end
 
   def edit
diff --git a/src/app/controllers/instance_controller.rb b/src/app/controllers/instance_controller.rb
index 039ed3a..5664ec5 100644
--- a/src/app/controllers/instance_controller.rb
+++ b/src/app/controllers/instance_controller.rb
@@ -19,7 +19,7 @@
 # Filters added to this controller apply to all controllers in the application.
 # Likewise, all the methods added will be available for all controllers.
 
-require 'util/taskomatic'
+require 'util/condormatic'
 
 class InstanceController < ApplicationController
   before_filter :require_user
@@ -96,8 +96,7 @@ class InstanceController < ApplicationController
                                  :task_target => @instance,
                                  :action => InstanceTask::ACTION_CREATE})
     if @task.save
-      task_impl = Taskomatic.new(@task,logger)
-      task_impl.instance_create
+      condormatic_instance_create(@task)
       flash[:notice] = "Instance added."
       redirect_to :controller => "pool", :action => 'show', :id => @instance.pool_id
     else
@@ -124,8 +123,16 @@ class InstanceController < ApplicationController
       raise ActionError.new("#{action} cannot be performed on this instance.")
     end
-    task_impl = Taskomatic.new(@task,logger)
-    task_impl.send "instance_#{action}"
+    case action
+    when 'stop'
+      condormatic_instance_stop(@task)
+    when 'destroy'
+      condormatic_instance_destroy(@task)
+    when 'start'
+      condormatic_instance_create(@task)
+    else
+      raise ActionError.new("Sorry, action '#{action}' is currently not supported by condor backend.")
+    end
 
     alert = "#{@instance.name}: #{action} was successfully queued."
     flash[:notice] = alert
diff --git a/src/app/controllers/pool_controller.rb b/src/app/controllers/pool_controller.rb
index e687c0b..ad2d42b 100644
--- a/src/app/controllers/pool_controller.rb
+++ b/src/app/controllers/pool_controller.rb
@@ -20,6 +20,7 @@
 # Likewise, all the methods added will be available for all controllers.
 
 require 'util/taskomatic'
+require 'util/condormatic'
 
 class PoolController < ApplicationController
   before_filter :require_user
@@ -36,8 +37,8 @@ class PoolController < ApplicationController
     #FIXME: clean this up, many error cases here
     @pool = Pool.find(params[:id])
     require_privilege(Privilege::INSTANCE_VIEW,@pool)
-    # pass nil into Taskomatic as we're not working off a task here
-    Taskomatic.new(nil,logger).pool_refresh(@pool)
+    # Go to condor and sync the database to the real instance states
+    condormatic_instances_sync_states
     @pool.reload
     @instances = @pool.instances
   end
@@ -153,4 +154,5 @@ class PoolController < ApplicationController
     end
     redirect_to :action => 'show', :id => @pool.id
   end
+  condormatic_classads_sync
 end
diff --git a/src/app/controllers/provider_controller.rb b/src/app/controllers/provider_controller.rb
index 8ba84fe..59c9393 100644
--- a/src/app/controllers/provider_controller.rb
+++ b/src/app/controllers/provider_controller.rb
@@ -39,6 +39,7 @@ class ProviderController < ApplicationController
       flash[:notice] = "Provider added."
       redirect_to :action => "show", :id => @provider
     end
+    condormatic_classads_sync
   end
 
   def destroy
diff --git a/src/app/util/condormatic.rb b/src/app/util/condormatic.rb
new file mode 100644
index 0000000..e994504
--- /dev/null
+++ b/src/app/util/condormatic.rb
@@ -0,0 +1,240 @@
+#
+# Copyright (C) 2010 Red Hat, Inc.
+# Written by Ian Main <imain@redhat.com>
+#
+# This program is free software; you can redistribute it and/or modify
+# it under the terms of the GNU General Public License as published by
+# the Free Software Foundation; version 2 of the License.
+#
+# This program is distributed in the hope that it will be useful,
+# but WITHOUT ANY WARRANTY; without even the implied warranty of
+# MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
+# GNU General Public License for more details.
+#
+# You should have received a copy of the GNU General Public License
+# along with this program; if not, write to the Free Software
+# Foundation, Inc., 51 Franklin Street, Fifth Floor, Boston,
+# MA  02110-1301, USA.  A copy of the GNU General Public License is
+# also available at http://www.gnu.org/copyleft/gpl.html.
+
+def condormatic_instance_create(task)
+
+  begin
+    instance = task.instance
+    realm = instance.realm rescue nil
+
+    job_name = "job_#{instance.name}_#{instance.id}"
+
+    # I use the 2>&1 to get stderr and stdout together because popen3 does not support
+    # the ability to get the exit value of the command in ruby 1.8.
+    pipe = IO.popen("condor_submit 2>&1", "w+")
+    pipe.puts "universe = grid\n"
+    Rails.logger.info "universe = grid\n"
+    pipe.puts "executable = #{job_name}\n"
+    Rails.logger.info "executable = #{job_name}\n"
+    pipe.puts "grid_resource = dcloud $$(provider_url) $$(username) $$(password) $$(image_key) #{instance.name} $$(realm_key) $$(hardwareprofile_key)\n"
+    Rails.logger.info "grid_resource = dcloud $$(provider_url) $$(username) $$(password) $$(image_key) #{instance.name} $$(realm_key) $$(hardwareprofile_key)\n"
+    pipe.puts "log = #{job_name}.log\n"
+    Rails.logger.info "log = #{job_name}.log\n"
+
+    requirements = "requirements = hardwareprofile == \"#{instance.hardware_profile.id}\" && image == \"#{instance.image.id}\""
+    requirements += " && realm == \"#{realm.name}\"" if realm != nil
+    requirements += "\n"
+
+    pipe.puts requirements
+    Rails.logger.info requirements
+
+    pipe.puts "notification = never\n"
+    Rails.logger.info "notification = never\n"
+    pipe.puts "queue\n"
+    Rails.logger.info "queue\n"
+    pipe.close_write
+    out = pipe.read
+    pipe.close
+
+    Rails.logger.info "$? (return value?) is #{$?}"
+    raise ("Error calling condor_submit: #{out}") if $? != 0
+
+    instance.condor_job_id = job_name
+    instance.save!
+
+  rescue Exception => ex
+    task.state = Task::STATE_FAILED
+    Rails.logger.error ex.message
+    Rails.logger.error ex.backtrace
+  else
+    # FIXME: We're kinda lying here.. we don't know the state for the task but I don't think that matters so much
+    # as we are just going to use the 'task' table as a kind of audit log.
+    task.state = Task::STATE_PENDING
+  end
+  task.instance.save!
+end
+
+# JobStatus for condor jobs:
+#
+# 0  Unexpanded      U
+# 1  Idle            I
+# 2  Running         R
+# 3  Removed         X
+# 4  Completed       C
+# 5  Held            H
+# 6  Submission_err  E
+#
+
+def condor_to_instance_state(state_val)
+  case state_val
+  when '0'
+    return Instance::STATE_PENDING
+  when '1'
+    return Instance::STATE_PENDING
+  when '2'
+    return Instance::STATE_RUNNING
+  when '3'
+    return Instance::STATE_STOPPED
+  when '4'
+    return Instance::STATE_STOPPED
+  when '5'
+    return Instance::STATE_CREATE_FAILED
+  when '6'
+    return Instance::STATE_CREATE_FAILED
+  else
+    return Instance::STATE_PENDING
+  end
+end
+
+def condormatic_instances_sync_states
+
+  begin
+    # I'm not going to do the 2&>1 trick here since we are parsing the output
+    # and I'm afraid we'll get a warning or something on stderr and it'll mess
+    # up the xml parsing.
+    pipe = IO.popen("condor_q -xml")
+    xml = pipe.read
+    pipe.close
+
+    raise ("Error calling condor_q -xml") if $? != 0
+
+    # Set them all to 'stopped' because if they aren't in the condor
+    # queue as jobs then they are not running, pending or anything else.
+    instances = Instance.find(:all)
+    instances.each do |instance|
+      instance.state = Instance::STATE_STOPPED
+      instance.save!
+    end
+
+    def find_value_int(job_ele, attrib)
+      if job_ele.attributes['n'] == attrib
+        cmd = job_ele.elements.each('i') do |i|
+          return i.text
+        end
+      end
+      return nil
+    end
+
+    def find_value_str(job_ele, attrib)
+      if job_ele.attributes['n'] == attrib
+        cmd = job_ele.elements.each('s') do |s|
+          return s.text
+        end
+      end
+      return nil
+    end
+
+    doc = REXML::Document.new(xml)
+    doc.elements.each('classads/c') do |jobs_ele|
+      job_name = nil
+      job_state = nil
+
+      jobs_ele.elements.each('a') do |job_ele|
+        value = find_value_str(job_ele, 'Cmd')
+        job_name = value if value != nil
+        value = find_value_int(job_ele, 'JobStatus')
+        job_state = value if value != nil
+      end
+
+      Rails.logger.info "job name is #{job_name}"
+      Rails.logger.info "job state is #{job_state}"
+
+      instance = Instance.find(:first, :conditions => {:condor_job_id => job_name})
+
+      if instance
+        instance.state = condor_to_instance_state(job_state)
+        instance.save!
+        Rails.logger.info "Instance state updated to #{condor_to_instance_state(job_state)}"
+      end
+    end
+  rescue Exception => ex
+    Rails.logger.error ex.message
+    Rails.logger.error ex.backtrace
+  end
+end
+
+def condormatic_instance_stop(task)
+  instance = task.instance
+
+  Rails.logger.info("calling condor_rm -constraint 'Cmd == \"#{instance.condor_job_id}\"' 2>&1")
+  pipe = IO.popen("condor_rm -constraint 'Cmd == \"#{instance.condor_job_id}\"' 2>&1")
+  out = pipe.read
+  pipe.close
+
+  Rails.logger.info("condor_rm return status is #{$?}")
+  Rails.logger.error("Error calling condor_rm (exit code #{$?}) on job: #{out}") if $? != 0
+end
+
+def condormatic_instance_destroy(task)
+  instance = task.instance
+
+  Rails.logger.info("calling condor_rm -constraint 'Cmd == \"#{instance.condor_job_id}\"' 2>&1")
+  pipe = IO.popen("condor_rm -constraint 'Cmd == \"#{instance.condor_job_id}\"' 2>&1")
+  out = pipe.read
+  pipe.close
+
+  Rails.logger.info("condor_rm return status is #{$?}")
+  Rails.logger.error("Error calling condor_rm (exit code #{$?}) on job: #{out}") if $? != 0
+end
+
+
+def condormatic_classads_sync
+
+  index = 0
+  providers = Provider.find(:all)
+  Rails.logger.info "Syncing classads.."
+
+  providers.each do |provider|
+    provider.cloud_accounts.each do |account|
+      provider.images.each do |image|
+        provider.hardware_profiles.each do |hwp|
+          provider.realms.each do |realm|
+            pipe = IO.popen("condor_advertise UPDATE_STARTD_AD 2>&1", "w+")
+
+            pipe.puts "Name=\"provider_combination_#{index}\""
+            pipe.puts 'MyType="Machine"'
+            pipe.puts 'Requirements=true'
+            pipe.puts "\n# Stuff needed to match:"
+            pipe.puts "hardwareprofile=\"#{hwp.aggregator_hardware_profiles[0].id}\""
+            pipe.puts "image=\"#{image.aggregator_images[0].id}\""
+            pipe.puts "realm=\"#{realm.name}\""
+            pipe.puts "\n# Backend info to complete this job:"
+            pipe.puts "image_key=\"#{image.external_key}\""
+            pipe.puts "hardwareprofile_key=\"#{hwp.external_key}\""
+            pipe.puts "realm_key=\"#{realm.external_key}\""
+            pipe.puts "provider_url=\"#{account.provider.url}\""
+            pipe.puts "username=\"#{account.username}\""
+            pipe.puts "password=\"#{account.password}\""
+            pipe.close_write
+
+            out = pipe.read
+            pipe.close
+
+            Rails.logger.error "Unable to submit condor classad: #{out}" if $? != 0
+
+            index += 1
+          end
+        end
+      end
+    end
+
+    Rails.logger.info "done"
+  end
+end
diff --git a/src/config/environment.rb b/src/config/environment.rb
index 919a710..3b49fbc 100644
--- a/src/config/environment.rb
+++ b/src/config/environment.rb
@@ -24,6 +24,7 @@ RAILS_GEM_VERSION = '>= 2.3.2' unless defined? RAILS_GEM_VERSION
 # Bootstrap the Rails environment, frameworks, and default configuration
 require File.join(File.dirname(__FILE__), 'boot')
+require 'util/condormatic'
 
 Rails::Initializer.run do |config|
   # Settings in config/environments/* take precedence over those specified here
@@ -80,5 +81,14 @@ Rails::Initializer.run do |config|
   # config.i18n.load_path += Dir[Rails.root.join('my', 'locales', '*.{rb,yml}')]
   # config.i18n.default_locale = :de
-
+  config.after_initialize do
+    begin
+      # This pulls all the possible classad matches from the database and puts
+      # them on condor on startup.  Note that this can fail because this is run on startup
+      # even for rake db:migrate etc. which won't work since the database doesn't exist
+      # yet.
+      condormatic_classads_sync
+    rescue Exception => ex
+    end
+  end
 end
diff --git a/src/db/migrate/20090804142049_create_instances.rb b/src/db/migrate/20090804142049_create_instances.rb
index 335b93f..42706e1 100644
--- a/src/db/migrate/20090804142049_create_instances.rb
+++ b/src/db/migrate/20090804142049_create_instances.rb
@@ -32,6 +32,7 @@ class CreateInstances < ActiveRecord::Migration
       t.string :public_address
       t.string :private_address
       t.string :state
+      t.string :condor_job_id
       t.integer :lock_version, :default => 0
       t.integer :acc_pending_time, :default => 0
       t.integer :acc_running_time, :default => 0
On Wed, 2010-06-30 at 10:16 -0700, Ian Main wrote:
This patch is a first try at using condor as a job management system. This removes the usage of the 'taskomatic' utilities and replaces them with 'condormatic' calls that use the command line interfaces (no qmf or gsoap etc) to condor.
On startup of the server (and on any changes after it's running), a pile of 'classads' are created. These define each possible startup location for a given set of image/hardware profiles that exist and are usable, as well as the backend info condor needs to start an instance on the given provider.
For each instance that you start, a job will be created in condor. Condor will then match the hardware profile and image to a provider and can then start an instance on that provider. When you stop or destroy that instance, the job will be removed (which isn't really how we want it to go but..).
This patch requires that you have our custom hacked up condor installed. You can get this at:
http://people.redhat.com/clalance/condor-dcloud
Be sure to read the README. Chris has written up very good instructions on how to set up condor.
REVISED: This patch plus the new condor fixes a number of the bugs that were in the previous patch. This patch adds realm matching support and fixes the start/stop issues we were seeing. So most things basically work now and I think it's generally usable. Probably the biggest outstanding bug for usability is that we do not keep long-running jobs for stateful instances.
The outstanding bugs are now limited to:
- To 'stop' a job in condor we should be using 'hold' instead of removing the job. This is creating a few different problems.
- We are still reaching directly to the DeltaCloud API to get a list of available actions for each instance. Maybe this is fine, I'm not sure.
- Quotas are not yet implemented.
- Classads are sync'd to condor on startup and on any changes to the hardware profile and image records. However, if you restart condor you won't have any classads in it to match against and your jobs will fail.
- We're still using 'on-demand' syncing of states from condor to the aggregator. e.g. when you list the instances it updates the states of each instance from condor at that time. There is no event logging.
- There's no 'reboot' as yet in condor. Not sure how we'll deal with that just yet.
- We've kept the tasks model and usage but they are quasi-meaningless. The task table needs to turn into an event/audit log table.
ACK, this works for me. A couple of minor notes we talked about, just want to make sure they don't get forgotten. It would be great to update the directions Chris has with the 'yum local' bit, and add at least a comment that for dev, 'ALLOW_WRITE = *' is what you want.
Lastly, and not directly related to the patch - this will be confusing for people checking out next unless we get some docs up on the site, especially since 'contribute' directions lean toward using 'next'. So we need some docs, or at least to do that adapter idea I have been pushing, with the default leaving it to taskomatic, and a comment saying how to enable condor.
-j
On Wed, 2010-06-30 at 20:26 +0000, Jason Guiditta wrote:
ACK, this works for me. A couple of minor notes we talked about, just want to make sure they don't get forgotten. It would be great to update the directions Chris has with the 'yum local' bit, and add at least a comment that for dev, 'ALLOW_WRITE = *' is what you want.
Lastly, and not directly related to the patch - this will be confusing for people checking out next unless we get some docs up on the site, especially since 'contribute' directions lean toward using 'next'. So we need some docs, or at least to do that adapter idea I have been pushing, with the default leaving it to taskomatic, and a comment saying how to enable condor.
-j
Well, I'm going to keep pushing back on that one.. :) I don't think it's a good idea because all the devs will just want to use the built-in taskomatic one and we'll get much less testing of the condor back end. Also there will be divergent implementations and hence behaviors.. and we'll have to maintain two implementations.
The only downside to just using condor is installation difficulty, but I really don't think it's that onerous to install and setup condor, certainly no more difficult than any other component we are using.
I will wait to push this until we can get the website updated with instructions. With a few changes it'll basically just be a yum localinstall command and copy in the local config. It's going to be very easy.
Ian
This patch is now pushed to next, so you'll require condor to be installed as per:
http://people.redhat.com/clalance/condor-dcloud/README
I also just posted a documentation patch which adds information on how to setup and configure condor for the aggregator. This should be up on the 'contributor' page on deltacloud.org in the next day or so.
Ian
deltacloud-devel@lists.fedorahosted.org