We have a somewhat particular RHQ setup where we monitor a large number of resources remotely from a single agent. Per agent, we have roughly 25000 scheduled measurements, with roughly 1500 measurements collected per minute. Since most of the metrics are collected with the same interval (10 minutes), this causes the following problem: when the agent is started (t=0), it schedules all these metrics in the same interval [0s,30s]. However, because of the large number of measurements, the agent is not able to collect all of them in that 30s interval and reschedules the remaining ones to the next interval in the original schedule, i.e. to [10m,10m+30s]. The same thing happens again in the interval [10m,10m+30s]: most of the measurements are rescheduled to the next interval [20m,20m+30s], and so forth. This means that some metrics are never collected (and are reported as "late" in the metrics of the RHQ agent).
Note that the issue only occurs after restarting the agent. When the resources are originally added to the inventory, the corresponding measurement schedules are spread more or less randomly and the agent is able to collect all of them.
To solve that issue with RHQ 3.0, I applied the following patch:
Index: src/main/java/org/rhq/core/pc/measurement/MeasurementManager.java
===================================================================
--- src/main/java/org/rhq/core/pc/measurement/MeasurementManager.java (revision 141630)
+++ src/main/java/org/rhq/core/pc/measurement/MeasurementManager.java (revision 141631)
@@ -484,6 +484,13 @@
             this.scheduledRequests.offer(scheduledMeasurement);
         }
     }
+
+    public synchronized void reschedule(Set<ScheduledMeasurementInfo> scheduledMeasurementInfos, long interval) {
+        for (ScheduledMeasurementInfo scheduledMeasurement : scheduledMeasurementInfos) {
+            scheduledMeasurement.setNextCollection(scheduledMeasurement.getNextCollection() + interval);
+            this.scheduledRequests.offer(scheduledMeasurement);
+        }
+    }
 
     /**
      * Sends the given measurement report to the server, if this plugin container has server services that it can
Index: src/main/java/org/rhq/core/pc/measurement/MeasurementCollectorRunner.java
===================================================================
--- src/main/java/org/rhq/core/pc/measurement/MeasurementCollectorRunner.java (revision 141630)
+++ src/main/java/org/rhq/core/pc/measurement/MeasurementCollectorRunner.java (revision 141631)
@@ -71,7 +71,7 @@
                 log.debug("Measurement collection is falling behind... Missed requested time by [" + (System.currentTimeMillis() - requests.iterator().next().getNextCollection()) + "ms]");
-            this.measurementManager.reschedule(requests);
+            this.measurementManager.reschedule(requests, 30000L);
             return report;
         }
The idea is that instead of rescheduling the measurement according to the original schedule (e.g. from [0s,30s] to [10m,10m+30s]), it should simply be rescheduled to the next interval (from [0s,30s] to [30s,60s]).
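To make the difference concrete, here is a small toy simulation. It is only a sketch and not the RHQ code: the class name, the fixed per-slot capacity, the ID-based ordering of due measurements and the numbers (1000 measurements, 100 collected per 30s slot) are all assumptions chosen for illustration.

```java
/** Toy model of the two rescheduling strategies; not the actual RHQ agent code. */
final class RescheduleSimulation {
    static final long SLOT_MS = 30_000L;      // length of one collection slot
    static final long INTERVAL_MS = 600_000L; // 10-minute collection interval

    /**
     * Simulates an agent that can collect at most capacityPerSlot measurements
     * per slot. Measurements that are due but not collected are pushed back by
     * lateDelayMs. Returns how many measurements were never collected.
     */
    static int neverCollected(int measurements, int capacityPerSlot,
                              long horizonMs, long lateDelayMs) {
        long[] nextCollection = new long[measurements]; // everything is due at t=0
        boolean[] collected = new boolean[measurements];
        for (long now = 0; now < horizonMs; now += SLOT_MS) {
            int budget = capacityPerSlot;
            for (int id = 0; id < measurements; id++) {
                if (nextCollection[id] > now) {
                    continue; // not due yet
                }
                if (budget > 0) {
                    budget--;
                    collected[id] = true;
                    nextCollection[id] += INTERVAL_MS; // normal schedule
                } else {
                    nextCollection[id] += lateDelayMs; // late: push back
                }
            }
        }
        int missed = 0;
        for (boolean c : collected) {
            if (!c) {
                missed++;
            }
        }
        return missed;
    }

    public static void main(String[] args) {
        long oneHour = 3_600_000L;
        // Original behaviour: late measurements jump a full interval ahead.
        System.out.println(neverCollected(1000, 100, oneHour, INTERVAL_MS)); // prints 900
        // Patched behaviour: late measurements move to the next 30s slot.
        System.out.println(neverCollected(1000, 100, oneHour, SLOT_MS));     // prints 0
    }
}
```

In this model the original strategy keeps collecting the same 100 measurements every 10 minutes, while the patched strategy spreads the backlog over the following slots and stays spread out afterwards.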
We are currently in the process of upgrading to RHQ 4.4. I haven't tested the patch with that version yet, but after looking at the code I think it still applies. I would like to get some feedback about the approach: is it a valid way to solve the issue, or are there better ways to do that?
Andreas
Andreas,
Can you write a BZ on this, explaining the problem (as you did here) and attach your patch to it? I'd like to track this. I might have time to look at this myself.
Thanks, John
_______________________________________________
rhq-users mailing list
rhq-users@lists.fedorahosted.org
https://fedorahosted.org/mailman/listinfo/rhq-users
Done. See BZ 834019.
Thanks,
Andreas
On Wed, Jun 20, 2012 at 3:50 PM, John Mazzitelli mazz@redhat.com wrote:
Andreas,
Can you write a BZ on this, explaining the problem (as you did here) and attach your patch to it? I'd like to track this. I might have time to look at this myself.
Thanks, John
Avoiding these processing peaks is something we've been looking at. In RHQ 4.4 we introduce a different approach to availability collection that spreads out processing in an attempt to not swamp the agent with avail checks for all resources at the same time.
As for metric collections, we need a similar solution, and that is the direction in which your patch is going. The subtle thing here is that, to be efficient, we want to group metric requests for the same resource as much as possible. That way the component code may be able to perform the collection more efficiently. I'm not very familiar with the code as it stands today, but that is something to think about. One possibility would be to stagger the initial collections more randomly to begin with; future scheduling at fixed intervals would then presumably remain staggered.
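A rough sketch of that staggering idea, assuming one random start offset per resource so that metrics of the same resource still fall due together. The class and method names are hypothetical and do not correspond to anything in the RHQ code base.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.Random;

/** Hypothetical helper: picks a random first-collection offset per resource. */
final class StaggeredStart {
    /**
     * Returns one offset in [0, intervalMs) for each distinct resource ID.
     * All metrics of a resource would share that offset, which keeps requests
     * for the same resource grouped while spreading resources over the interval.
     */
    static Map<Integer, Long> initialOffsets(int[] resourceIds, long intervalMs, Random random) {
        Map<Integer, Long> offsets = new HashMap<>();
        for (int resourceId : resourceIds) {
            if (!offsets.containsKey(resourceId)) {
                offsets.put(resourceId, (long) (random.nextDouble() * intervalMs));
            }
        }
        return offsets;
    }

    public static void main(String[] args) {
        int[] resourceIds = {101, 101, 102, 103, 103, 103}; // made-up IDs, one entry per metric
        Map<Integer, Long> offsets = initialOffsets(resourceIds, 600_000L, new Random());
        for (Map.Entry<Integer, Long> entry : offsets.entrySet()) {
            System.out.println("resource " + entry.getKey() + " starts at +" + entry.getValue() + "ms");
        }
    }
}
```

With fixed intervals afterwards, each resource would keep its offset, so the load would stay spread over the whole interval instead of piling up in the first 30 seconds.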
Of course, the other way to go in the interim is to stagger collection intervals as much as possible. Instead of every 10 minutes for almost all metrics, set some to 7, 11, 13 minutes, etc. Perhaps that would help.
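As a quick back-of-the-envelope check (just arithmetic, with a hypothetical helper class), the snippet below computes how often a set of intervals all fall due at the same moment, assuming they all start at t=0.

```java
/** Hypothetical helper: how often do several collection intervals line up? */
final class IntervalOverlap {
    static long gcd(long a, long b) {
        return b == 0 ? a : gcd(b, a % b);
    }

    static long lcm(long a, long b) {
        return a / gcd(a, b) * b;
    }

    /** Minutes between moments at which all the given intervals fall due together. */
    static long coincidencePeriod(long... intervalsMinutes) {
        long period = 1;
        for (long interval : intervalsMinutes) {
            period = lcm(period, interval);
        }
        return period;
    }

    public static void main(String[] args) {
        System.out.println(coincidencePeriod(10, 10, 10)); // prints 10
        System.out.println(coincidencePeriod(7, 11, 13));  // prints 1001
    }
}
```

So metrics on 7, 11 and 13 minute intervals would only all coincide every 1001 minutes, although any two of them still meet more often (for example 7 and 11 every 77 minutes).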
Thanks for the suggested patch.
On 6/20/2012 9:13 AM, Andreas Veithen wrote: