On 25 Oct 2018, at 17:46, thierry bordaz <tbordaz(a)redhat.com>
wrote:
On 10/11/2018 12:57 AM, William Brown wrote:
> On Wed, 2018-10-10 at 16:26 +0200, thierry bordaz wrote:
>
>> Hi William,
>>
>> Thanks for starting this discussion.
>> Your email raises several aspects (how, for whom, ...) and I think a
>> way to start would be to write down what we want.
>> One need is, for a given workload, to determine where we are spending
>> time, as a way to decide where to invest.
>> Another need is to collect metrics at the operation level.
>>
> Aren't these very similar? The time we invest is generally on improving
> a plugin or a small part of an operation, to make the operation as a
> whole faster.
>
The tools used may be similar, but the difference is in the expected results.
For example, I was just discussing with a user who reported:
[24/Oct/2018:12:10:55.012908141 -0800] conn=2400 op=1 MODRDN dn="<DN_one>" newsuperior="(null)"
[24/Oct/2018:12:11:01.711604553 -0800] conn=2400 op=1 RESULT err=1 tag=109 nentries=0 etime=6.1301230184 csn=5bd0d1cf000000010000
I tried the same modrdn in my test environment, and it completed with no error and no latency:
[24/Oct/2018:12:14:03.665479821 -0800] conn=138 op=1 MODRDN dn="<DN_one>" newsuperior="(null)"
[24/Oct/2018:12:14:03.749121724 -0800] conn=138 op=1 RESULT err=0 tag=109 nentries=0 etime=0.0083774655 csn=5bd0d28b000000010000
So here the expected result is not to improve performance, but to have a diagnostic
method/tool to understand what is going on in production compared to a test environment.
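As a quick illustration of what can already be compared from the access logs, here is a small sketch (not an official tool, just an assumption about how one might script this) that extracts the server-reported etime from RESULT lines like the ones above, so the production and test runs can be compared side by side:

```python
import re

# Pull conn/op/err/etime out of 389-ds access log RESULT lines.
# The regex assumes the standard access log format shown above.
ETIME_RE = re.compile(r'conn=(\d+) op=(\d+) RESULT err=(\d+) .*etime=([\d.]+)')

def result_etimes(lines):
    """Yield (conn, op, err, etime_seconds) for each RESULT line."""
    for line in lines:
        m = ETIME_RE.search(line)
        if m:
            yield (int(m.group(1)), int(m.group(2)),
                   int(m.group(3)), float(m.group(4)))

prod = ('[24/Oct/2018:12:11:01.711604553 -0800] conn=2400 op=1 RESULT '
        'err=1 tag=109 nentries=0 etime=6.1301230184 csn=5bd0d1cf000000010000')
test = ('[24/Oct/2018:12:14:03.749121724 -0800] conn=138 op=1 RESULT '
        'err=0 tag=109 nentries=0 etime=0.0083774655 csn=5bd0d28b000000010000')

for conn, op, err, etime in result_etimes([prod, test]):
    print(f"conn={conn} op={op} err={err} etime={etime:.4f}s")
```

The limitation, of course, is that etime only tells you the total elapsed time, not where inside the operation it was spent.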
Yes. This is a perfect example of why we should provide logging, not stap scripts. The
user could then enable an access log level that emits something like:
MODRDN dn= result ….
RESULT err=1 tag ….
PROFILE {
    start_op: <time>,
    events: [
        start_aci: <time>,
        aci_log: ….,
        end_aci: <time>,
        start_pre_plugins: <time>,
        start_memberof: <time>,
        end_memberof: <time>,
        end_plugins: <time>,
        start_backend: <time>,
        end_backend: <time>
    ],
    end_op: <time>
}
We would know exactly what’s wrong with that operation.
It is probably a good idea to include access log write timings too, since I suspect our log
subsystem is the problem. This way admins can send us profiling reports of the server, with
everything we need to diagnose the issue for each operation. Structuring this log with
JSON means we can write tools to parse it, etc.
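To show how such a JSON profile record could be consumed, here is a sketch in Python. The record layout and every field name in it are my assumptions extrapolated from the sketch above, not an implemented format:

```python
import json

# Hypothetical JSON profile record following the sketch above; the
# field names and the flat start_*/end_* event layout are assumptions.
record = json.loads("""
{
  "op": "MODRDN",
  "conn": 2400,
  "start_op": 0.000000,
  "events": {
    "start_aci": 0.000010,
    "end_aci": 0.000200,
    "start_memberof": 0.000250,
    "end_memberof": 6.100000,
    "start_backend": 6.100100,
    "end_backend": 6.130000
  },
  "end_op": 6.130123
}
""")

def phase_durations(rec):
    """Pair up start_*/end_* events and return seconds spent per phase."""
    ev = rec["events"]
    phases = {}
    for key, start in ev.items():
        if key.startswith("start_"):
            name = key[len("start_"):]
            end = ev.get("end_" + name)
            if end is not None:
                phases[name] = end - start
    return phases

# Sort phases by time spent, most expensive first.
for name, secs in sorted(phase_durations(record).items(),
                         key=lambda kv: -kv[1]):
    print(f"{name:10s} {secs:.6f}s")
```

With made-up numbers like these, memberof immediately stands out as the dominant phase, which is exactly the kind of answer an admin could hand us without any stap instrumentation.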
—
Sincerely,
William