Splitting this conversation into a separate thread...
While I agree that being able to transmit type information with logs is a noble goal, there are many nuances, especially across JSON and XML.
JSON handles a few basic types well, namely string, int, double, and boolean, but will require additional work to support other types, such as datetime. We need to determine whether this is worth addressing. Seeing that the most popular format will probably be JSON over syslog, we will lose the type information if it is not made available.
XML has more flexibility with typing, but only in combination with XML Schema. This means that you either have to define all of the field names a priori in XML Schema, or define a minimal schema that binds type information to predefined type elements.
For example, in order to support this
`<Event><dst_ip>1.2.3.4</dst_ip></Event>`
I need to have a related XML Schema that defines dst_ip as having the type IPv4 address (otherwise it will be treated as a string or duck-typed into an IPv4 address). Alternatively, the type can be carried inline:
`<Event><dst_ip type="ipv4">1.2.3.4</dst_ip></Event>`
This poses similar issues. XML Schema cannot validate the @type attribute based on the dst_ip value (though this is fairly trivial to do with XSLT or similar). You also have the issue of what happens if dst_ip is defined as an xs:int in the schema but @type is "ipv4": which type wins? Also, this approach works well for atomic types, but does not work as well if the value is a structure and contains child elements.
For the best compatibility with XML Schema: `<Event><ipv4 name="dst_ip">1.2.3.4</ipv4></Event>`
This works better for XML Schema validation, but it is not as natural to use as the former examples.
I have no problem with any of the above solutions. After some thought, option #2 might be the best, but we need to figure out how to handle/represent structures and make this representable with XML Schema. As I mention above, this is fairly trivial for atomic types, but I don't know how to do it for structures.
On Wed, Mar 21, 2012 at 4:12 PM, Botond Botyanszki boti@nxlog.org wrote:
On Wed, 21 Mar 2012 14:15:47 -0400 William Heinbockel wheinbockel@gmail.com wrote:
On Wed, Mar 21, 2012 at 2:12 PM, Dmitri Pal dpal@redhat.com wrote:
On 03/20/2012 12:00 PM, david@lang.hm wrote:
On Tue, 20 Mar 2012, Gergely Nagy wrote:
david@lang.hm writes:
I think that we are going to need a type system before long.
Yeah, but not in JSON, where it would be bolted upon.
That's reasonable. It just means we need to support more than just JSON soon :-)
The type system of JSON is good enough. It might be a good compromise between no types and everything having a schema.
I'd call it 'better than nothing'. There are some types lacking, most notably the DateTime type, which is pretty much essential in our case.
+1 While I have nothing against explicit typing, I don't see the need.
-1 If you only think about forwarding and storing text-based logs, there is probably no need for that. But once you need to analyze the data, where you compare and sort values, knowing the type of the value is pretty much required.
I would like to have some way to align the JSON structures with XML representations, though. The only real issue here is the mapping of JSON arrays to a similar XML structure.
I think mapping arrays is pretty straightforward:

JSON: `{ "addr": ["1.2.3.4", "2.3.4.5"] }`

XML: `<event><addr>1.2.3.4</addr><addr>2.3.4.5</addr></event>`

The problem here is mapping the type information, which we discussed earlier and mostly agreed gets a little ugly when squeezed into JSON.
Yep
On Thu, 22 Mar 2012, William Heinbockel wrote:
Splitting this conversation into a separate thread...
While I agree that being able to transmit type information with logs is a noble goal, there are many nuances, especially across JSON and XML.
JSON handles a few basic types well, namely string, int, double, and boolean, but will require additional work to support other types, such as datetime. We need to determine whether this is worth addressing. Seeing that the most popular format will probably be JSON over syslog, we will lose the type information if it is not made available.
XML has more flexibility with typing, but only in combination with XML Schema. This means that you either have to define all of the field names a priori in XML Schema, or define a minimal schema that binds type information to predefined type elements.
At this point I'm less worried about the representation of the types, than the idea that types should exist.
One of the major problems with processing traditional log messages is that since there is no type information, the processing software needs to guess what the field is. Getting clear delineations between fields will help a lot, but there are so many ways to put a date in a string that it's just insane (and not all of them can be identified by context, e.g. m/d/y vs d/m/y).
Even if the type system is not enforced by the protocol (e.g. XML Schema), or even if it can't always be passed in the protocol (e.g. JSON), I think that it's valuable to have the types defined early so that they can be passed in the API call as the log is generated. If nothing else, this gives the library generating the message a chance to sanity-check the formats, even if the type data is thrown away immediately when it's serialized.
I don't think that we need a lot of different types, but I do think that we need more than just string and number
Things I can think of offhand
Strings (ASCII, UTF-8, UTF-16; serialized to UTF-8 over the wire)
Numbers (possibly not even 'int', 'double', 'float', just number; at the transport layer, use the most appropriate detailed type or fall back to 'string containing an arbitrary-precision number')
Boolean (accept the various language-specific values, serialize to a standard representation; e.g. in Perl '', 0, and 'False' would be equivalent)
timestamp
IP (IPv4, IPv6, either)
If there are any common fields that have special character limits, they may be worth adding. For example
Hostname/FQDN
In terms of structures, other than a simple one-dimensional array (which contains elements of a single type), what other requirements are there?
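The minimal set listed above could be modeled as a tagged union in C. This is an illustrative sketch only: the names (`lj_type`, `lj_value`, `lj_parse_bool`) and the boolean normalization rules are assumptions, not any agreed API.

```c
#include <stdbool.h>
#include <stdint.h>
#include <string.h>
#include <strings.h>

/* Hypothetical tagged union covering the minimal types proposed above. */
typedef enum {
    LJ_STRING,    /* UTF-8 on the wire */
    LJ_NUMBER,    /* kept as text to allow arbitrary precision */
    LJ_BOOLEAN,
    LJ_TIMESTAMP, /* e.g. milliseconds since epoch */
    LJ_IPV4,
    LJ_IPV6
} lj_type;

typedef struct {
    lj_type type;
    union {
        const char *string;  /* LJ_STRING, LJ_NUMBER (textual) */
        bool        boolean; /* LJ_BOOLEAN */
        int64_t     ts_ms;   /* LJ_TIMESTAMP */
        uint8_t     ip[16];  /* LJ_IPV4 uses the first 4 bytes */
    } as;
} lj_value;

/* Normalize language-specific boolean spellings, per the Perl example
   above: '', "0", and "False" (any case) all map to false. */
bool lj_parse_bool(const char *s)
{
    if (s == NULL || *s == '\0') return false;
    if (strcmp(s, "0") == 0) return false;
    if (strcasecmp(s, "false") == 0) return false;
    return true;
}
```

The point of the sketch is that the serializer can branch on the tag, so the wire format (JSON, XML, or anything else) can decide per type how much of the type information survives.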
David Lang
On Thu, Mar 22, 2012 at 11:24 AM, david@lang.hm wrote:
On Thu, 22 Mar 2012, William Heinbockel wrote:
Splitting this conversation into a separate thread...
While I agree that being able to transmit type information with logs is a noble goal, there are many nuances, especially across JSON and XML.
JSON handles a few basic types well, namely string, int, double, and boolean, but will require additional work to support other types, such as datetime. We need to determine whether this is worth addressing. Seeing that the most popular format will probably be JSON over syslog, we will lose the type information if it is not made available.
XML has more flexibility with typing, but only in combination with XML Schema. This means that you either have to define all of the field names a priori in XML Schema, or define a minimal schema that binds type information to predefined type elements.
At this point I'm less worried about the representation of the types, than the idea that types should exist.
I am not trying to discuss the representation of types as much as whether types should be explicit or implicit, and what impact explicit types may have on the data structures.
It sounds like we should do our best to support explicit typing
One of the major problems with processing traditional log messages is that since there is no type information, the processing software needs to guess what the field is. Getting clear delineations between fields will help a lot, but there are so many ways to put a date in a string that it's just insane (and not all of them can be identified by context, e.g. m/d/y vs d/m/y).
Even if the type system is not enforced by the protocol (e.g. XML Schema), or even if it can't always be passed in the protocol (e.g. JSON), I think that it's valuable to have the types defined early so that they can be passed in the API call as the log is generated. If nothing else, this gives the library generating the message a chance to sanity-check the formats, even if the type data is thrown away immediately when it's serialized.
I don't think that we need a lot of different types, but I do think that we need more than just string and number
Agreed. We need to start with the minimal set necessary for logs
Things I can think of offhand
Strings (ASCII, UTF-8, UTF-16; serialized to UTF-8 over the wire)
Numbers (possibly not even 'int', 'double', 'float', just number; at the transport layer, use the most appropriate detailed type or fall back to 'string containing an arbitrary-precision number')
Boolean (accept the various language-specific values, serialize to a standard representation; e.g. in Perl '', 0, and 'False' would be equivalent)
timestamp
IP (IPv4, IPv6, either)
If there are any common fields that have special character limits, they may be worth adding. For example
Hostname/FQDN
In terms of structures, other than a simple one-dimensional array (which contains elements of a single type), what other requirements are there?
Just the requirement to support explicit types and identify what types are necessary. Also, whether or not we want to include a (notional) 'structured' type for completeness.
Building off of your list, minimally we need:
* string (utf-8 -- I would prefer not to have to handle all the mess of unicode conversions)
* number (define similarly to JSON?)
* boolean
* ipv4
* ipv6
* fqdn (I say if we have ipv4/6 we should also have hostname/fqdn)
* time (ISO 8601 format?)
Other options include:
* float, double
* (unsigned) int8, int16, int32, int64
* MAC Address
* E-mail
* URI
* struct (defined, but probably rarely used)
On Thu, 22 Mar 2012, William Heinbockel wrote:
On Thu, Mar 22, 2012 at 11:24 AM, david@lang.hm wrote:
On Thu, 22 Mar 2012, William Heinbockel wrote:
Splitting this conversation into a separate thread...
While I agree that being able to transmit type information with logs is a noble goal, there are many nuances, especially across JSON and XML.
JSON handles a few basic types well, namely string, int, double, and boolean, but will require additional work to support other types, such as datetime. We need to determine whether this is worth addressing. Seeing that the most popular format will probably be JSON over syslog, we will lose the type information if it is not made available.
XML has more flexibility with typing, but only in combination with XML Schema. This means that you either have to define all of the field names a priori in XML Schema, or define a minimal schema that binds type information to predefined type elements.
At this point I'm less worried about the representation of the types, than the idea that types should exist.
I am not trying to discuss the representation of types as much as whether types should be explicit or implicit, and what impact explicit types may have on the data structures.
It sounds like we should do our best to support explicit typing
One of the major problems with processing traditional log messages is that since there is no type information, the processing software needs to guess what the field is. Getting clear delineations between fields will help a lot, but there are so many ways to put a date in a string that it's just insane (and not all of them can be identified by context, e.g. m/d/y vs d/m/y).
Even if the type system is not enforced by the protocol (e.g. XML Schema), or even if it can't always be passed in the protocol (e.g. JSON), I think that it's valuable to have the types defined early so that they can be passed in the API call as the log is generated. If nothing else, this gives the library generating the message a chance to sanity-check the formats, even if the type data is thrown away immediately when it's serialized.
I don't think that we need a lot of different types, but I do think that we need more than just string and number
Agreed. We need to start with the minimal set necessary for logs
Things I can think of offhand
Strings (ASCII, UTF-8, UTF-16; serialized to UTF-8 over the wire)
Numbers (possibly not even 'int', 'double', 'float', just number; at the transport layer, use the most appropriate detailed type or fall back to 'string containing an arbitrary-precision number')
Boolean (accept the various language-specific values, serialize to a standard representation; e.g. in Perl '', 0, and 'False' would be equivalent)
timestamp
IP (IPv4, IPv6, either)
If there are any common fields that have special character limits, they may be worth adding. For example
Hostname/FQDN
In terms of structures, other than a simple one-dimensional array (which contains elements of a single type), what other requirements are there?
Just the requirement to support explicit types and identify what types are necessary. Also, whether or not we want to include a (notional) 'structured' type for completeness.
Building off of your list, minimally we need:
- string (utf-8 -- I would prefer not to have to handle all the mess of unicode conversions)
Over the wire I absolutely agree with you. In the software APIs that generate the message, I think it may be worth doing the conversion (at least in some cases); if the language defaults to something other than UTF-8, then conversion is required by the sending library.
- number (define similarly to json?)
- boolean
- ipv4
- ipv6
- fqdn (I say if we have ipv4/6 we should also have hostname/fqdn)
- time (ISO8601 format?)
Other options include:
- float, double
- (unsigned) int8, int16, int32, int64
- MAC Address
- URI
- struct (defined, but probably rarely used)
MAC will be common enough to be valuable; if/when some new network technology comes along, its addressing will need to be added (or the field will need to be treated as text).
E-mail is a good one, but we need to make sure we make the definition complete (including '+' addressing for example)
URI is a good one.
I think that the various numeric types should not be part of the API. I think the API should work on arbitrary-precision numbers, but if it's convenient to serialize them as one of those types, go for it.
David Lang
david@lang.hm writes:
Building off of your list, minimally we need:
- string (utf-8 -- I would prefer not to have to handle all the mess of unicode conversions)
Over the wire I absolutely agree with you. In the software APIs that generate the message, I think it may be worth doing the conversion (at least in some cases); if the language defaults to something other than UTF-8, then conversion is required by the sending library.
+1
- number (define similarly to json?)
- boolean
- ipv4
- ipv6
- fqdn (I say if we have ipv4/6 we should also have hostname/fqdn)
I wouldn't treat this as a separate type, because it's just a string. The purpose should be either obvious from the key name, or documented. I don't see the benefit of adding an fqdn *type*.
With IPv4/IPv6, there is a benefit of more efficient storage and serialization/deserialization. With an FQDN, there's no such benefit, imo.
Other options include:
- float, double
- (unsigned) int8, int16, int32, int64
These are already covered by "number" above, no? Either explicit number types, and then unsigned int8->int64 with float and double, or a generic 'number'.
From a (C) implementation PoV, 'number' sucks, and it's also less efficient than the other option, imo.
E-mail is a good one, but we need to make sure we make the definition complete (including '+' addressing for example)
What advantage would this have over a string type?
- URI
URI is a good one.
Same question here.
Perhaps I'm misunderstanding something, but to me, types describe how the data is stored, how it should be serialized and read back again. Its intent is something different.
We could then add 'facility' and 'priority' types, since they're pretty much part of all messages sent via (legacy) syslog. But to what end?
I think that the various numeric types should not be part of the API. I think the API should work on arbitrary-precision numbers, but if it's convenient to serialize them as one of those types, go for it.
I disagree with this, because from an implementation point of view, it's a nightmare to deal with a generic 'number' type. I'd rather go the other way: provide an API for the different numeric types, and allow higher-level languages that do have a generic numeric type-ish thing to figure out which low-level representation to use.
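What "an API for the different numeric types" might look like in C, as a sketch: one setter per concrete type, with a higher-level binding choosing which to call. The `lj_msg`/`lj_set_*` names and the JSON-ish output are hypothetical, purely for illustration.

```c
#include <inttypes.h>
#include <stdint.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical message builder: one setter per numeric type instead of
   a generic 'number', so the C side never has to guess a representation. */
typedef struct { char buf[256]; } lj_msg;

void lj_msg_init(lj_msg *m) { m->buf[0] = '\0'; }

void lj_set_int32(lj_msg *m, const char *key, int32_t v)
{
    char tmp[64];
    snprintf(tmp, sizeof tmp, "\"%s\":%" PRId32 ",", key, v);
    strncat(m->buf, tmp, sizeof m->buf - strlen(m->buf) - 1);
}

void lj_set_uint64(lj_msg *m, const char *key, uint64_t v)
{
    char tmp[64];
    snprintf(tmp, sizeof tmp, "\"%s\":%" PRIu64 ",", key, v);
    strncat(m->buf, tmp, sizeof m->buf - strlen(m->buf) - 1);
}
```

A Python or Perl binding, which only has a generic number, would inspect the value's range and dispatch to the appropriate setter.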
On 03/23/2012 12:37 PM, Gergely Nagy wrote:
I disagree with this, because from an implementation point of view, it's a nightmare to deal with a generic 'number' type. I'd rather go the other way, and provide API for the different numeric types, and allow higher-level languages that do have a generic numeric type-ish thing to figure out which low-level representation to use.
+1 this is why the collection supports so many different types. I forgot to mention that it also supports float.
-----Original Message-----
From: lumberjack-developers-bounces@lists.fedorahosted.org [mailto:lumberjack-developers-bounces@lists.fedorahosted.org] On Behalf Of Dmitri Pal
Sent: Friday, March 23, 2012 6:15 PM
To: lumberjack-developers@lists.fedorahosted.org
Subject: Re: [lumberjack] Value Types in event logs (Re: syslog-like API for structured messages)
On 03/23/2012 12:37 PM, Gergely Nagy wrote:
I disagree with this, because from an implementation point of view,
it's
a nightmare to deal with a generic 'number' type. I'd rather go the other way, and provide API for the different numeric types, and allow higher-level languages that do have a generic numeric type-ish thing
to
figure out which low-level representation to use.
+1 this is why the collection supports so many different types. I forgot to mention that it also supports float.
+1
Rainer
On Fri, Mar 23, 2012 at 1:26 PM, Rainer Gerhards rgerhards@hq.adiscon.com wrote:
On 03/23/2012 12:37 PM, Gergely Nagy wrote:
I disagree with this, because from an implementation point of view,
it's
a nightmare to deal with a generic 'number' type. I'd rather go the other way, and provide API for the different numeric types, and allow higher-level languages that do have a generic numeric type-ish thing
to
figure out which low-level representation to use.
+1 this is why the collection supports so many different types. I forgot to mention that it also supports float.
+1
There are two separate, but complementary conversations going on here: message types vs. API types
One way to resolve this is to resolve the API requirements first, then map them to message format types. The problem with all of these isn't so much the message creation, either in the API or message format, it is with the parsing, comparison, and usage of encoded message values.
Let's first focus on the API value types; for this we should probably support the following:
* uint8, 16, 32, 64
* int8, 16, 32, 64
* float, double
* string (API can decide/handle the various encoding formats)
* IPv4, IPv6, MAC
* timestamp (millisec since epoch), timezone (seconds offset from UTC)
* boolean
* octet array
Gergely and Botond are correct, most other things will always be represented as a subset of a string:
* email
* URI
* FQDN
Any number that is too large for 64-bit integers or double would have to be sent as a string... Unless we go to support arbitrary-precision numbers, we will always have this problem. Though I think this is probably not a problem we should be concerned with for logging.
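The fall-back-to-string behavior described here could be sketched in C as follows: try to parse the numeric field as a 64-bit integer and keep it textual when it does not fit. The function name is illustrative, and it assumes plain decimal integers and a 64-bit `long long`.

```c
#include <errno.h>
#include <stdbool.h>
#include <stdint.h>
#include <stdlib.h>

/* Sketch: returns true and stores the value if the decimal string fits
   in an int64; returns false (caller keeps it as a string) otherwise. */
bool fits_in_int64(const char *s, int64_t *out)
{
    char *end;
    errno = 0;
    long long v = strtoll(s, &end, 10);
    if (errno == ERANGE || end == s || *end != '\0')
        return false;  /* overflow, empty, or not a plain integer */
    *out = (int64_t)v;
    return true;
}
```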
On 03/23/2012 03:16 PM, William Heinbockel wrote:
On Fri, Mar 23, 2012 at 1:26 PM, Rainer Gerhards rgerhards@hq.adiscon.com wrote:
On 03/23/2012 12:37 PM, Gergely Nagy wrote:
I disagree with this, because from an implementation point of view,
it's
a nightmare to deal with a generic 'number' type. I'd rather go the other way, and provide API for the different numeric types, and allow higher-level languages that do have a generic numeric type-ish thing
to
figure out which low-level representation to use.
+1 this is why the collection supports so many different types. I forgot to mention that it also supports float.
+1
There are two separate, but complementary conversations going on here: message types vs. API types
They should be logically related. If the developer can't easily mentally map one to the other, we lose.
One way to resolve this is to resolve the API requirements first, then map them to message format types. The problem with all of these isn't so much the message creation, either in the API or message format, it is with the parsing, comparison, and usage of encoded message values.
Let's first focus on the API value types; for this we should probably support the following:
- uint8, 16, 32, 64
It is overkill; 32 & 64 should be enough.
- int8, 16, 32, 64
It is overkill; 32 & 64 should be enough.
- float, double
One of those should be enough.
- string (API can decide/handle the various encoding formats)
yes different encodings might affect serialization, this is on my list to investigate
- IPv4, IPv6, MAC
This is a subtype of a string.
- timestamp (millisec since epoch), timezone (seconds offset from UTC)
Those are special attributes that should be prepopulated by the library and can be recognized by name when serialized and processed.
- boolean
Sure
- octet array
You mean binary. Fine. And how do you want it to be encoded in JSON? Currently it is an ASCII hex representation in single quotes in the serialization code I am writing. I do not have a better idea. I can do Base64, but it still needs to be escaped in some way to differentiate it from a normal string in JSON.
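The ASCII-hex representation mentioned here is cheap to produce; a minimal C sketch (function name illustrative, output buffer must hold `2*len + 1` bytes):

```c
#include <stddef.h>

/* Encode a binary octet array as lowercase ASCII hex, NUL-terminated.
   The caller wraps the result in quotes when emitting JSON. */
void lj_hex_encode(const unsigned char *data, size_t len, char *out)
{
    static const char digits[] = "0123456789abcdef";
    for (size_t i = 0; i < len; i++) {
        out[2 * i]     = digits[data[i] >> 4];
        out[2 * i + 1] = digits[data[i] & 0x0f];
    }
    out[2 * len] = '\0';
}
```

The trade-off versus Base64 is size (2x expansion instead of ~1.33x) against simplicity: hex needs no padding and no JSON escaping at all.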
Gergely and Botond are correct, most other things will always be represented as a subset of a string:
- URI
- FQDN
These are subtypes of string; we do not need a special API type for them.
Any number that is too large for 64-bit integers or double would have to be sent as a string... Unless we go to support arbitrary-precision numbers, we will always have this problem. Though I think this is probably not a problem we should be concerned with for logging.
+1
lumberjack-developers mailing list lumberjack-developers@lists.fedorahosted.org https://fedorahosted.org/mailman/listinfo/lumberjack-developers
On Fri, 23 Mar 2012, Dmitri Pal wrote:
On 03/23/2012 03:16 PM, William Heinbockel wrote:
On Fri, Mar 23, 2012 at 1:26 PM, Rainer Gerhards rgerhards@hq.adiscon.com wrote:
On 03/23/2012 12:37 PM, Gergely Nagy wrote:
I disagree with this, because from an implementation point of view,
it's
a nightmare to deal with a generic 'number' type. I'd rather go the other way, and provide API for the different numeric types, and allow higher-level languages that do have a generic numeric type-ish thing
to
figure out which low-level representation to use.
+1 this is why the collection supports so many different types. I forgot to mention that it also supports float.
+1
There are two separate, but complementary conversations going on here: message types vs. API types
agreed.
They should be logically related. If the developer can't easily mentally map one to the other, we lose.
One way to resolve this is to resolve the API requirements first, then map them to message format types. The problem with all of these isn't so much the message creation, either in the API or message format, it is with the parsing, comparison, and usage of encoded message values.
Let's first focus on the API value types; for this we should probably support the following:
The API may support more values than the message format; for example, a C API should support every int variation as a type that can be passed to the logging call. However, there is no need for there to be so many different types in the message (since the types can be promoted to larger types without loss).
David Lang
- uint8, 16, 32, 64
It is overkill; 32 & 64 should be enough.
- int8, 16, 32, 64
It is overkill; 32 & 64 should be enough.
- float, double
One of those should be enough.
- string (API can decide/handle the various encoding formats)
yes different encodings might affect serialization, this is on my list to investigate
- IPv4, IPv6, MAC
This is a subtype of a string.
- timestamp (millisec since epoch), timezone (seconds offset from UTC)
Those are special attributes that should be prepopulated by the library and can be recognized by name when serialized and processed.
- boolean
Sure
- octet array
You mean binary. Fine. And how do you want it to be encoded in JSON? Currently it is an ASCII hex representation in single quotes in the serialization code I am writing. I do not have a better idea. I can do Base64, but it still needs to be escaped in some way to differentiate it from a normal string in JSON.
Gergely and Botond are correct, most other things will always be represented as a subset of a string:
- URI
- FQDN
These are subtypes of string; we do not need a special API type for them.
Any number that is too large for 64-bit integers or double would have to be sent as a string... Unless we go to support arbitrary-precision numbers, we will always have this problem. Though I think this is probably not a problem we should be concerned with for logging.
+1
-- Thank you, Dmitri Pal
Sr. Engineering Manager IPA project, Red Hat Inc.
Looking to carve out IT costs? www.redhat.com/carveoutcosts/
On Fri, 23 Mar 2012, William Heinbockel wrote:
On Fri, Mar 23, 2012 at 1:26 PM, Rainer Gerhards
There are two separate, but complementary conversations going on here: message types vs. API types
One way to resolve this is to resolve the API requirements first, then map them to message format types. The problem with all of these isn't so much the message creation, either in the API or message format, it is with the parsing, comparison, and usage of encoded message values.
Let's first focus on the API value types; for this we should probably support the following:
- uint8, 16, 32, 64
- int8, 16, 32, 64
- float, double
- string (API can decide/handle the various encoding formats)
- IPv4, IPv6, MAC
- timestamp (millisec since epoch), timezone (seconds offset from UTC)
There should also be a time duration option.
- boolean
- octet array
Gergely and Botond are correct, most other things will always be represented as a subset of a string:
- URI
- FQDN
The question for these types is whether they are common enough to define a type for them, so that the API can do validation to decide whether the value is valid.
Any number that is too large for 64-bit integers or double would have to be sent as a string... Unless we go to support arbitrary-precision numbers, we will always have this problem. Though I think this is probably not a problem we should be concerned with for logging.
I think that there should be an arbitrary precision option.
David Lang
On Fri, 23 Mar 2012 14:32:46 -0700 (PDT) david@lang.hm wrote:
Gergely and Botond are correct, most other things will always be represented as a subset of a string:
- URI
- FQDN
The question for these types is whether they are common enough to define a type for them, so that the API can do validation to decide whether the value is valid.
While the validation of these types can happen in the lib, this should not be mandatory for performance reasons. For custom types, the library must be able to pass the type along with the value to the caller, so that the application can do the validation and/or other operations depending on the type. So maybe it is better to do the validation outside and only provide validation functions for these types, which the user can call if he wants to. This could be done via user-defined callbacks as well.
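The opt-in validation idea could be sketched in C like this. The callback signature and the FQDN rules are assumptions for illustration (alphanumerics and hyphens in dot-separated labels of up to 63 characters; real hostname rules are more involved).

```c
#include <ctype.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical per-type validator signature; the library exposes these
   but never calls them itself, so the hot path pays nothing. */
typedef bool (*lj_validator)(const char *value);

/* Simplified FQDN check: non-empty dot-separated labels, each up to
   63 chars of [A-Za-z0-9-], no leading/trailing/doubled dots. */
bool lj_validate_fqdn(const char *value)
{
    size_t label = 0;
    if (value == NULL || *value == '\0') return false;
    for (const char *p = value; *p; p++) {
        if (*p == '.') {
            if (label == 0) return false;       /* empty label */
            label = 0;
        } else if (isalnum((unsigned char)*p) || *p == '-') {
            if (++label > 63) return false;     /* label too long */
        } else {
            return false;                       /* illegal character */
        }
    }
    return label > 0;  /* reject trailing dot in this simplified form */
}
```

An application that cares could register `lj_validate_fqdn` as the callback for its fqdn-typed fields; one that prioritizes throughput simply would not.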
Regards, Botond
On Fri, 23 Mar 2012, Botond Botyanszki wrote:
On Fri, 23 Mar 2012 14:32:46 -0700 (PDT) david@lang.hm wrote:
Gergely and Botond are correct, most other things will always be represented as a subset of a string:
- URI
- FQDN
The question for these types is whether they are common enough to define a type for them, so that the API can do validation to decide whether the value is valid.
While the validation of these types can happen in the lib, this should not be mandatory for performance reasons. For custom types, the library must be able to pass the type along with the value to the caller, so that the application can do the validation and/or other operations depending on the type. So maybe it is better to do the validation outside and only provide validation functions for these types, which the user can call if he wants to. This could be done via user-defined callbacks as well.
It should not be mandatory in the library, but if the type information isn't provided it's not possible to do the validation in the library.
This sort of validation falls under the SHOULD or MAY category in RFC speak.
David Lang
On Fri, 23 Mar 2012 17:37:05 +0100 Gergely Nagy algernon@balabit.hu wrote:
With IPv4/IPv6, there is a benefit of more efficient storage and serialization/deserialization. With an FQDN, there's no such benefit, imo.
The benefit of defining types which are represented as strings, such as FQDN, is that the value can be validated and treated a little differently than just a simple string. For example, there can be a special operation which checks if a domain is a subdomain, e.g. `if $srchost subdomainof 'somehost.com' ...` In addition, one has the option to store these types not only as a simple string in the code, but as a complex structure, which can yield more efficient processing.
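A `subdomainof` check like the one described can be implemented as a label-aware suffix match; a small C sketch (function name hypothetical):

```c
#include <stdbool.h>
#include <string.h>

/* True if host equals domain or ends with ".domain"; the dot check
   prevents "evilsomehost.com" matching "somehost.com". */
bool subdomain_of(const char *host, const char *domain)
{
    size_t hl = strlen(host), dl = strlen(domain);
    if (hl < dl) return false;
    if (strcmp(host + hl - dl, domain) != 0) return false;
    return hl == dl || host[hl - dl - 1] == '.';
}
```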
Other options include:
- float, double
- (unsigned) int8, int16, int32, int64
These are already covered by "number" above, no? Either explicit number types, and then unsigned int8->int64 with float and double, or a generic 'number'.
From a (C) implementation PoV, 'number' sucks, and it's also less efficient than the other option, imo.
On the other hand, having all these signed/unsigned types is overkill. No surprise that many higher-level languages do not support them. As a C coder, I'm against having all these combinations of signedness and bit sizes instead of having 'integer' and 'biginteger' only.
E-mail is a good one, but we need to make sure we make the definition complete (including '+' addressing for example)
What advantage would this have over a string type?
Store the domain/username separately.
- URI
URI is a good one.
Same question here.
URI is a pretty good example. If you have ever had the chance to do URL filtering, then I'm sure you know a couple of use cases where a URL had to be checked against some kind of criteria, as the URI can be broken into several fields.
Perhaps I'm misunderstanding something, but to me, types describe how the data is stored, how it should be serialized and read back again. Its intent is something different.
Not only that. From the above I think this is obvious. In a previous email I already wrote that type information is mostly useful when you need to look at the data and not only serialize/store/send it.
We could then add 'facility' and 'priority' types, since they're pretty much part of all messages sent via (legacy) syslog. But to what end?
Actually yes. But this is less useful, as the only point of having 'priority' over integer is that the value can be validated to be within the range of allowed values.
I disagree with this, because from an implementation point of view, it's a nightmare to deal with a generic 'number' type. I'd rather go the other way, and provide API for the different numeric types, and allow higher-level languages that do have a generic numeric type-ish thing to figure out which low-level representation to use.
You already have to deal with generic 'number' types when coding in C. JSON also has this generic number type, which most C-based JSON libraries represent as an int64_t or int.
Regards, Botond
On Fri, 23 Mar 2012, Gergely Nagy wrote:
david@lang.hm writes:
Building off of your list, minimally we need:
- string (utf-8 -- I would prefer not to have to handle all the mess of unicode conversions)
Over the wire I absolutely agree with you. In the software APIs that generate the message, I think it may be worth doing the conversion (at least in some cases): if the language defaults to something other than UTF-8, then conversion is required by the sending library.
+1
- number (define similarly to json?)
- boolean
- ipv4
- ipv6
- fqdn (I say if we have ipv4/6 we should also have hostname/fqdn)
I wouldn't treat this as a separate type, because it's just a string. The purpose should be either obvious from the key name, or documented. I don't see the benefit of adding an fqdn *type*.
the only justification for fqdn is that there are some characters not allowed in a name.
With IPv4/IPv6, there is a benefit of more efficient storage and serialization/deserialization. With an FQDN, there's no such benefit, imo.
Other options include:
- float, double
- (unsigned) int8, int16, int32, int64
These are already covered by "number" above, no? Either explicit number types, and then unsigned int8->int64 with float and double, or a generic 'number'.
From a (C) implementation PoV, 'number' sucks, and it's also less efficient than the other option, imo.
remember that we are talking about the over-the-wire protocol, not the API call to submit the message.
E-mail is a good one, but we need to make sure we make the definition complete (including '+' addressing for example)
What advantage would this have over a string type?
validation
- URI
URI is a good one.
Same question here.
Perhaps I'm misunderstanding something, but to me, types describe how the data is stored, how it should be serialized and read back again. Its intent is something different.
The type is how the data is stored, but also includes a definition that can be used for validation.
David Lang
We could then add 'facility' and 'priority' types, since they're pretty much part of all messages sent via (legacy) syslog. But to what end?
I think that the various numeric types should not be part of the API. I think the API should work on arbitrary-precision numbers, but if it's convenient to serialize them as one of those types, go for it.
I disagree with this, because from an implementation point of view, it's a nightmare to deal with a generic 'number' type. I'd rather go the other way, and provide API for the different numeric types, and allow higher-level languages that do have a generic numeric type-ish thing to figure out which low-level representation to use.
-- |8]
lumberjack-developers mailing list lumberjack-developers@lists.fedorahosted.org https://fedorahosted.org/mailman/listinfo/lumberjack-developers
Hi,
I think this is a pretty good list that covers all types: http://cee.mitre.org/archive/version0.6/CEE_Common_Log_Syntax-v0.6.html#valu...

Some additions to consider might be email, url, and a timestamp expressed as microseconds since the epoch. With these types 'hardcoded' in log processing code it would be possible to handle the value properly. Other vendor/app-specific types would default to string if the type is not known.

In addition to the above CEE types, CEE would define a list of field names and their default types. If there is no explicit type information in the JSON/XML log, the code can look up the type of the field from the list and use it as a default.

Just my 2 cents.
Regards, Botond
On Thu, 22 Mar 2012 10:46:37 -0400 William Heinbockel wheinbockel@gmail.com wrote:
On Wed, 21 Mar 2012 14:15:47 -0400 William Heinbockel wheinbockel@gmail.com wrote:
On Wed, Mar 21, 2012 at 2:12 PM, Dmitri Pal dpal@redhat.com wrote:
On 03/20/2012 12:00 PM, david@lang.hm wrote:
On Tue, 20 Mar 2012, Gergely Nagy wrote:
david@lang.hm writes:
> I think that we are going to need a type system before long.
Yeah, but not in JSON, where it would be bolted upon.
That's reasonable. It just means we need to support more than just JSON soon :-)
The type system of JSON is good enough. It might be a good compromise between no types and everything having a schema.
I'd call it 'better than nothing'. There are some types lacking, most notably a DateTime type, which is pretty much essential in our case.
+1 While I have nothing against explicit typing, I don't see the need.
-1 If you only think about forwarding and storing text (based logs), probably there is no need for that. But once you need to analyze the data where you compare and sort values, knowing the type of the value is pretty much required.
I would like to have some way to align the JSON structures with XML representations, though. The only real issue here is the mapping of JSON arrays to a similar XML structure.
I think mapping arrays is pretty straightforward:

JSON: `{ "addr": ["1.2.3.4", "2.3.4.5"] }`

XML: `<event><addr>1.2.3.4</addr><addr>2.3.4.5</addr></event>`

The problem here is mapping the type information, which we discussed earlier and mostly agreed that squeezing into JSON gets a little ugly.
Yep
On Thu, 22 Mar 2012, Botond Botyanszki wrote:
Hi,
I think this is a pretty good list that covers all types: http://cee.mitre.org/archive/version0.6/CEE_Common_Log_Syntax-v0.6.html#valu... Some additions to consider might be email, url and timestamp expressed as microseconds since epoch. With these types 'hardcoded' in log processing code it would be possible to handle the value properly.
I disagree with the idea that an Integer must fit in a 64-bit field, and I am not thrilled at float being a separate type (although, given the errors that can creep in with conversion to/from float, it may be the right thing to do).
There will be larger numbers to pass than will fit in 64 bits; forcing the software to fall back to string-typed values to deal with such numbers seems like a bad idea.
Other vendor/app specific types would default to string if the type is not known.
Yes, this is a good fallback, but I think we should really discourage vendor/app specific types.
In addition to the above CEE types, CEE would define a list of field names and their default types. If there is no explicit type information in the JSON/XML log, the code can look up the type of the field from the list and use it as default.
personally, I am doubtful as to the value of the CEE message types. I'm all in favor of structured logging, but I have doubts as to the possibility of standardizing the messages. If it happens, great, but I expect that if they do structured logging at all, many vendors will end up with their own message types.
David Lang
Just my 2 cents.
Regards, Botond
On 03/22/2012 06:36 PM, david@lang.hm wrote:
I do not have a big preference either way. Collection supports: string, binary, signed/unsigned long/normal integers, boolean:

```c
/**
 * @brief Indicates that property is of type "string".
 *
 * For elements of type string the length includes the trailing 0.
 */
#define COL_TYPE_STRING   0x00000001
/** @brief Indicates that property is of type "binary". */
#define COL_TYPE_BINARY   0x00000002
/** @brief Indicates that property is of type "integer". */
#define COL_TYPE_INTEGER  0x00000004
/** @brief Indicates that property is of type "unsigned". */
#define COL_TYPE_UNSIGNED 0x00000008
/** @brief Indicates that property is of type "long". */
#define COL_TYPE_LONG     0x00000010
/** @brief Indicates that property is of type "unsigned long". */
#define COL_TYPE_ULONG    0x00000020
/** @brief Indicates that property is of type "double". */
#define COL_TYPE_DOUBLE   0x00000040
/** @brief Indicates that property is of Boolean type. */
#define COL_TYPE_BOOL     0x00000080
```
I can add other types into it or express them with a special decorator internally. It is an implementation detail.
However, I am concerned about the serialization. When we serialize into JSON, are IP addresses, MAC, email, URI, etc. just strings? If they are, there is no sense in identifying them as special types at the API boundary, as that information will be lost in translation. Instead I suggest a convention: the internal representation is a string, but a prefix in the field name hints at the type. For example, "ipv4" will be treated as a string containing an IPv4 address, so an app can create "ipv4peerhost", "ipv4myhost", "ipv4otherhost" to express different IPs for different hosts.
So the prefixes that we will recognize would be: ipv4, ipv6, mac, email, uri, host, fqdn, stamp
For readability we can suggest the following format of the field names: <prefix>:<name> or <name>:<suffix>
I like the suffix better actually. That would translate into:
```c
new_syslog(...,
           "myip:ipv4", TYPE_STRING, "127.0.0.1",
           "hostname:host", TYPE_STRING, "foo",
           "fqdn:fqdn", TYPE_STRING, "foo.example.com",
           "email:email", TYPE_STRING, "me@example.com",
           ...)
```
We can also be smarter and understand that if the caller passed "email", it means "email:email", and if "mac" was passed, it means "mac:mac".
This approach makes it simple for developer, simple for the library, the type is passed in JSON and the syslog can pass this info on. Meets all our goals nicely.
What do you think?
On Fri, 23 Mar 2012, Dmitri Pal wrote:
I can add other types into it or express them with a special decorator internally. It is an implementation detail.
However, I am concerned about the serialization. When we serialize into JSON, are IP addresses, MAC, email, URI, etc. just strings? If they are, there is no sense in identifying them as special types at the API boundary, as that information will be lost in translation.
Keep in mind that there will be other serialization protocols; XML and BSON are almost certain to be used at some point.
In addition, types allow for validation early in the process (where the app generating the bogus data may see the error message) rather than at processing time (which may be years later)
Instead I suggest a convention: the internal representation is a string, but a prefix in the field name hints at the type. For example, "ipv4" will be treated as a string containing an IPv4 address, so an app can create "ipv4peerhost", "ipv4myhost", "ipv4otherhost" to express different IPs for different hosts.
So the prefixes that we will recognize would be: ipv4, ipv6, mac, email, uri, host, fqdn, stamp
For readability we can suggest the following format of the field names: <prefix>:<name> or <name>:<suffix>
I like the suffix better actually. That would translate into:
```c
new_syslog(...,
           "myip:ipv4", TYPE_STRING, "127.0.0.1",
           "hostname:host", TYPE_STRING, "foo",
           "fqdn:fqdn", TYPE_STRING, "foo.example.com",
           "email:email", TYPE_STRING, "me@example.com",
           ...)
```
We can also be smarter and understand that if caller passed "email" that it is "email:email" and if passed "mac" it is "mac:mac".
This approach makes it simple for developer, simple for the library, the type is passed in JSON and the syslog can pass this info on. Meets all our goals nicely.
What do you think?
it's redundant to have TYPE_* and the type as part of the field name, we should do one or the other, not both.
David Lang
On 03/23/2012 05:44 PM, david@lang.hm wrote:
On Fri, 23 Mar 2012, Dmitri Pal wrote:
it's redundant to have TYPE_* and the type as part of the field name, we should do one or the other, not both.
OK, works for me. But how would it look over the wire in JSON? Should the serialization then add the type suffix to overcome the serialization limitations?
--
Thank you, Dmitri Pal
Sr. Engineering Manager IPA project, Red Hat Inc.
------------------------------- Looking to carve out IT costs? www.redhat.com/carveoutcosts/
On Fri, 23 Mar 2012, Dmitri Pal wrote:
What do you think?
it's redundant to have TYPE_* and the type as part of the field name, we should do one or the other, not both.
OK works for me but how it would look over the wire in JSON? Should the serialization the add the type suffix to overcome the serialization limitations?
I would say that if the protocol (in this case JSON) doesn't support a specific type, the type information is silently lost.
string derived types will devolve to string
number types will devolve to number
David Lang
On Sat, Mar 24, 2012 at 7:18 AM, david@lang.hm wrote:
I would say that if the protocol (in this case JSON) doesn't support a specific type, the type information is silently lost.
string derived types will devolve to string
number types will devolve to number
Agreed.
Whatever approach we take, encoding the data into certain message formats will be lossy. I don't see this as too much of an issue as long as the representation is consistent enough that the values can be ducktyped. For example, an IPv4 address encoded as the JSON string "1.2.3.4" can easily be identified and handled; encoded as an integer value, however, it is not as easily identified and handled.
The real problem is values that have different representations as strings versus binary; integers, IP addresses, and timestamps immediately come to mind. FQDN, email, and URI are always represented as strings -- I would *not* suggest adding types for these, as there will be issues with identifying the proper syntax and supporting it. This becomes especially problematic when you want to log an event where someone tried to send mail to a malformed email address. There are too many issues and problems to handle in the actual event library; just pass it as a string and allow the application to provide/handle it.
I think the best way forward is to identify these values that we need to support that may have different representations. One way is to think about developing a binary message format (or leveraging an existing format: BSON, google protobuf, etc.)
- int32, int64
- uint32, uint64
- double
- bool
- ipv4, ipv6, mac
- timestamp (msec since epoch, tzone offset)
- string
- octet array (base64 or hex encode the value to fit into XML/JSON)
- UUID/GUID
Precision Arithmetic
Also, I do not think we should support arbitrary-precision numbers or arithmetic in logs. If a number does not fit in the native integer/double types, it can be stored as a string or octet array. I cannot see a use case where logs will need to support huge numbers or require precision arithmetic libraries. We need to be able to search/query across logs, so we will need support for integer/float comparisons. I see it as a requirement of the supporting application to support arbitrary precision, not of the native log library.
On 03/24/2012 09:50 AM, William Heinbockel wrote:
On Sat, Mar 24, 2012 at 7:18 AM, david@lang.hm wrote:
On Fri, 23 Mar 2012, Dmitri Pal wrote:
What do you think?
it's redundant to have TYPE_* and the type as part of the field name, we should do one or the other, not both.
OK works for me, but how would it look over the wire in JSON? Should the serialization add the type suffix to overcome the serialization limitations?
I would say that if the protocol (in this case JSON) doesn't support a specific type, the type information is silently lost.
string derived types will devolve to string
number types will devolve to number
Agreed.
Any approach we take, encoding the data into certain message formats will be lossy. I don't see this as too much of an issue as long as the representation is consistent enough that the values can be ducktyped. For example, an IPv4 address encoded as a JSON string as "1.2.3.4" can easily be identified and handled; however, if encoded as an integer value, it is not as easily identified and handled.
The real problem is the values that have different representations in string versus binary form. Values like integers, IP addresses, and timestamps immediately come to mind. FQDN, email, and URI are always represented as strings -- I would *not* suggest adding types to these, as there will be issues with identifying the proper syntax and supporting it. This becomes especially problematic when you want to log an event where someone tried to send an email to a malformed email address. Too many issues and problems to handle in the actual event library; just parse it as a string and allow the application to provide/handle it.
I think the best way forward is to identify these values that we need to support that may have different representations. One way is to think about developing a binary message format (or leveraging an existing format: BSON, google protobuf, etc.)
- int32, int64
- uint32, uint64
- double
- bool
- ipv4, ipv6, mac
- timestamp (msec since epoch, tzone offset)
- string
- octet array (base64 or hex encode the value to fit into XML/JSON)
- UUID/GUID
Precision Arithmetic
Also, I do not think we should support arbitrary-precision numbers or arithmetic in logs. If a number does not fit in the native integer/double types, it can be stored as a string or octet array. I cannot see a use case where logs will need to support huge numbers or require precision arithmetic libraries. We need to be able to search/query across logs, so we will need support for integer/float comparisons. I see it as a requirement of the supporting application to support arbitrary precision, not of the native log library.
Here are the rules I suggest we follow:
1) Identify the types of data we commit to logically recognize at the API level - the list above is a good starting point. These types are known at the API level. Our goal is to preserve these types in all transformations; otherwise, IMO, we did not accomplish our goal. Dropping types in transit that have been clearly identified at the API level, just because we chose a transport that does not support them, is a non-starter.
2) Map them to the C types:
- int32, int64, uint32, uint64, double - map one to one to C
- bool - int or char (implementation detail)
- string - NUL-terminated string, i.e. char * with the length calculated based on the NUL terminator
- octet array - void *, and it needs to keep the length; length is a required argument at the API level, BTW
- timestamp - actually two parts, but I would argue that the timestamp is added at the moment of logging automatically, so it is not exposed at the API level but kept internally as a pair of numeric fields (msec since epoch, tzone offset)
All the rest:
- ipv4, ipv6, mac
- UUID/GUID
- host/fqdn
- email
These are subtypes of string, meaning that at the API level they will be passed to us not as a special type but as char * data.
Anything else that we do not recognize as a subtype has to be expressed with an existing type.
3) If the transport format (JSON, XML, you name it) does not support the type we chose, we follow a convention of expressing subtypes via the name of the attribute.
So in JSON serialization I will add the suffix and in XML it can be a type element or a type attribute as Bill suggested.
I plan to make a first stab at this API over the weekend.
lumberjack-developers mailing list lumberjack-developers@lists.fedorahosted.org https://fedorahosted.org/mailman/listinfo/lumberjack-developers
On Sat, 24 Mar 2012 12:28:25 -0400 Dmitri Pal dpal@redhat.com wrote:
- Map them to the C types:
- int32, int64, uint32, uint64, double - map one to one to C
What's the benefit of having both int32 and int64 and unsigned versions? I still think that this is overkill, instead of using only a single 'number' type, which would be int64 - just like in JSON. For the cases where this is not optimal, custom types can solve the problem.
- The number does not fit into signed 64 bit: create a special type (bigint) and use that.
- The number is a lot smaller (e.g. an enum) and 64 bits is considered a waste for its storage: create a special type (such as severitylevel) where you know that the value can fit in smaller storage (e.g. uint8).
Having this many numeric types also complicates code when you need to implement operations with these.
- If the transport format (JSON, XML, you name it) does not support the
type we chose follow a convention of expressing subtypes via the name of the attribute.
So in JSON serialization I will add the suffix and in XML it can be a type element or a type attribute as Bill suggested.
Since JSON cannot support types without such abuse, I'm also against this suffixing.
Regards, Botond
On 03/24/2012 12:57 PM, Botond Botyanszki wrote:
On Sat, 24 Mar 2012 12:28:25 -0400 Dmitri Pal dpal@redhat.com wrote:
- Map them to the C types:
- int32, int64, uint32, uint64, double - map one to one to C
What's the benefit of having both int32 and int64 and unsigned versions? I still think that this is overkill, instead of using only a single 'number' type, which would be int64 - just like in JSON. For the cases where this is not optimal, custom types can solve the problem.
- The number does not fit into signed 64 bit: create a special type (bigint) and use that.
- The number is a lot smaller (e.g. enum) and 64 bit is considered a waste for its storage: create a special type (such as severitylevel) where you know that the value can fit in a smaller storage (e.g. uint8).
Having this many numeric types also complicates code when you need to implement operations with these.
I think it is not that big of a deal. We can start with one number format and then see if there is a need for more.
- If the transport format (JSON, XML, you name it) does not support the
type we chose follow a convention of expressing subtypes via the name of the attribute.
So in JSON serialization I will add the suffix and in XML it can be a type element or a type attribute as Bill suggested.
Since JSON cannot support types without such abuse, I'm also against this suffixing.
Do you have something better in mind? Again, losing type info is a non-starter.
We either need to:
1) pass types using some kind of identifier (suffix, prefix, escaping - whatever we see reasonable)
2) not support those types in the API and throttle everything through string, leaving it to the recipient to consult standard + vendor dictionaries and deduce the type
3) not use a format that can't express the types, i.e. JSON
4) extend JSON (create JSON2) to support the types
I think I covered all options. I see 1) as the least evil, but maybe I am missing something.
Bill, what is your take?
Regards, Botond
On Sat, 24 Mar 2012 13:14:25 -0400 Dmitri Pal dpal@redhat.com wrote:
On 03/24/2012 12:57 PM, Botond Botyanszki wrote:
On Sat, 24 Mar 2012 12:28:25 -0400 Dmitri Pal dpal@redhat.com wrote:
- Map them to the C types:
- int32, int64, uint32, uint64, double - map one to one to C
What's the benefit of having both int32 and int64 and unsigned versions? I still think that this is overkill, instead of using only a single 'number' type, which would be int64 - just like in JSON. For the cases where this is not optimal, custom types can solve the problem.
- The number does not fit into signed 64 bit: create a special type (bigint) and use that.
- The number is a lot smaller (e.g. enum) and 64 bit is considered a waste for its storage: create a special type (such as severitylevel) where you know that the value can fit in a smaller storage (e.g. uint8).
Having this many numeric types also complicates code when you need to implement operations with these.
I think it is not that big of a deal. We can start with one number format and then see if there is a need for more.
Sure, it is always easier to extend later than to try to unify things later.
So in JSON serialization I will add the suffix and in XML it can be a type element or a type attribute as Bill suggested.
Since JSON cannot support types without such abuse, I'm also against this suffixing.
Do you have something better in mind? Again, losing type info is a non-starter.
Yes, use a format which supports types properly (XML, BSON, homegrown). If you (as a user) want to stick with JSON, you should accept the fact that it cannot transfer (all) types. If keeping type information is a requirement, you should use another format.
We either need:
- to pass types using some kind of identifier (suffix, prefix,
escaping - whatever we see reasonable)
Whether you pass the type in the field name or with the value (the solution Bill suggested) makes no difference. This will only result in broken values or broken field names in software prepared to handle pure JSON only.
- do not support those types in the API and throttle everything through
string leaving it to the recipient to consult standard + vendor dictionaries and deduce type
This should be the case for standard JSON.
- not use a format that can't express the types i.e JSON
JSON should be supported at least for parsing, since a lot of tools/languages produce JSON.
- extend JSON (create JSON2) to support the types
This is an option, but don't call it JSON then. Call it CEE, JSON2, or something else.
I think I covered all options. I see 1) as the least evil, but maybe I am missing something.
To me 1) and 4) are essentially the same. I support this idea as long as it is not called JSON.
Regards, Botond
On 03/24/2012 01:35 PM, Botond Botyanszki wrote:
On Sat, 24 Mar 2012 13:14:25 -0400 Dmitri Pal dpal@redhat.com wrote:
On 03/24/2012 12:57 PM, Botond Botyanszki wrote:
On Sat, 24 Mar 2012 12:28:25 -0400 Dmitri Pal dpal@redhat.com wrote:
- Map them to the C types:
- int32, int64, uint32, uint64, double - map one to one to C
What's the benefit of having both int32 and int64 and unsigned versions? I still think that this is overkill, instead of using only a single 'number' type, which would be int64 - just like in JSON. For the cases where this is not optimal, custom types can solve the problem.
- The number does not fit into signed 64 bit: create a special type (bigint) and use that.
- The number is a lot smaller (e.g. enum) and 64 bit is considered a waste for its storage: create a special type (such as severitylevel) where you know that the value can fit in a smaller storage (e.g. uint8).
Having this many numeric types also complicates code when you need to implement operations with these.
I think it is not that big of a deal. We can start with one number format and then see if there is a need for more.
Sure, it is always easier to extend later than to try to unify things later.
So in JSON serialization I will add the suffix and in XML it can be a type element or a type attribute as Bill suggested.
Since JSON cannot support types without such abuse, I'm also against this suffixing.
Do you have something better in mind? Again, losing type info is a non-starter.
Yes, use a format which supports types properly (XML, BSON, homegrown). If you (as a user) want to stick with JSON, you should accept the fact that it cannot transfer (all) types. If keeping type information is a requirement, you should use another format.
We either need:
- to pass types using some kind of identifier (suffix, prefix,
escaping - whatever we see reasonable)
Whether you pass the type in the field name or with the value (the solution Bill suggested) makes no difference. This will only result in broken values or broken field names in software prepared to handle pure JSON only.
I would agree, but there is IMO no software now that expects log data in JSON, so is there anything really that would be broken by this approach? Aren't we the first to suggest it as a transport between the library and syslog? We are on both sides, and we can decide what it would be. I do not think passing the type in the value is a good idea, as it might affect blind parsing, but if it is appended to the field name, it is just a different field name. Generic JSON parsers would not be broken. I think it should be syntactically correct JSON with added conventions. This is the "added conventions" part that we are discussing here.
- do not support those types in the API and throttle everything through
string leaving it to the recipient to consult standard + vendor dictionaries and deduce type
This should be the case for standard JSON.
- not use a format that can't express the types i.e JSON
JSON should be supported at least for parsing, since a lot of tools/languages produce JSON.
- extend JSON (create JSON2) to support the types
This is an option, but don't call it JSON then. Call it CEE, JSON2, or something else.
I think I covered all options. I see 1) as the least evil, but maybe I am missing something.
To me 1) and 4) are essentially the same. I support this idea as long as it is not called JSON.
Let us call it LOGSON or LSON = syntactically correct JSON, so that it can be parsed correctly by any JSON parser, with logging conventions on top so that the LOGSON/LSON parser can get smarts out of it.
Deal?
Regards, Botond
On Sat, 24 Mar 2012 13:48:39 -0400 Dmitri Pal dpal@redhat.com wrote:
We either need:
- to pass types using some kind of identifier (suffix, prefix,
escaping - whatever we see reasonable)
Whether you pass the type in the field name or with the value (the solution Bill suggested) makes no difference. This will only result in broken values or broken field names in software prepared to handle pure JSON only.
I would agree, but there is IMO no software now that expects log data in JSON, so is there anything really that would be broken by this approach?
Not really. syslog-ng, rsyslog, and nxlog can all now process JSON, and they are not (yet) prepared to handle this new format. There is also GELF, which is JSON internally. There are probably a bunch of others which can handle JSON-formatted logs (e.g. loggly). For this reason, the current standard JSON format will need to stay as well, without supporting types.
Aren't we the first to suggest it as a transport between the library and syslog?
Encapsulated within syslog: probably yes.
Used as a format for structured log messages: no.
We are on both sides, and we can decide what it would be. I do not think passing the type in the value is a good idea, as it might affect blind parsing, but if it is appended to the field name, it is just a different field name. Generic JSON parsers would not be broken. I think it should be syntactically correct JSON with added conventions. This is the "added conventions" part that we are discussing here.
Sure, adding this to the field name is somewhat better, but it still breaks assumptions when trying to parse this as standard JSON.
- extend JSON (create JSON2) to support the types
This is an option, but don't call it JSON then. Call it CEE, JSON2, or something else.
I think I covered all options. I see 1) as the least evil, but maybe I am missing something.
To me 1) and 4) are essentially the same. I support this idea as long as it is not called JSON.
Let us call it LOGSON or LSON = syntactically correct JSON, so that it can be parsed correctly by any JSON parser, with logging conventions on top so that the LOGSON/LSON parser can get smarts out of it.
Deal?
Sounds great to me.
Regards, Botond
On 03/24/2012 02:08 PM, Botond Botyanszki wrote:
On Sat, 24 Mar 2012 13:48:39 -0400 Dmitri Pal dpal@redhat.com wrote:
We either need:
- to pass types using some kind of identifier (suffix, prefix,
escaping - whatever we see reasonable)
Whether you pass the type in the field name or with the value (the solution Bill suggested) makes no difference. This will only result in broken values or broken field names in software prepared to handle pure JSON only.
I would agree, but there is IMO no software now that expects log data in JSON, so is there anything really that would be broken by this approach?
Not really. syslog-ng, rsyslog, and nxlog can all now process JSON, and they are not (yet) prepared to handle this new format. There is also GELF, which is JSON internally. There are probably a bunch of others which can handle JSON-formatted logs (e.g. loggly). For this reason, the current standard JSON format will need to stay as well, without supporting types.
Aren't we the first to suggest it as a transport between the library and syslog?
Encapsulated within syslog: probably yes.
Used as a format for structured log messages: no.
We are on both sides, and we can decide what it would be. I do not think passing the type in the value is a good idea, as it might affect blind parsing, but if it is appended to the field name, it is just a different field name. Generic JSON parsers would not be broken. I think it should be syntactically correct JSON with added conventions. This is the "added conventions" part that we are discussing here.
Sure, adding this to the field name is somewhat better, but it still breaks assumptions when trying to parse this as standard JSON.
I am sorry, I do not get it. What assumptions does it break if it can be parsed by a JSON parser? What it breaks is assumptions about the field names and the logic/heuristics/special processing behind them. I can understand that, but this would mean that some fields would not be recognized, yet they would still be present and parsed correctly, without information being lost. This is exactly the space that we are trying to regulate and standardize: what are the valid assumptions (I call them conventions) and what are not. Some existing conventions might be broken, but aren't they already broken by the fact that we are trying to define a standard that does not exist yet?
However, I can agree that we should be nice, so the compromise would be to have a flag in the serialization that would allow adding suffixes or not - effectively generating output either in pure JSON with the type information lost, or in LSON where types are represented as suffixes.
Might be a good compromise.
- extend JSON (create JSON2) to support the types
This is an option, but don't call it JSON then. Call it CEE, JSON2, or something else.
I think I covered all options. I see 1) as the least evil, but maybe I am missing something.
To me 1) and 4) are essentially the same. I support this idea as long as it is not called JSON.
Let us call it LOGSON or LSON = syntactically correct JSON, so that it can be parsed correctly by any JSON parser, with logging conventions on top so that the LOGSON/LSON parser can get smarts out of it.
Deal?
Sounds great to me.
Regards, Botond
Do you have something better in mind? Again, losing type info is a non-starter.
Yes, use a format which supports types properly (XML, BSON, homegrown). If you (as a user) want to stick with JSON, you should accept the fact that it cannot transfer (all) types. If keeping type information is a requirement, you should use another format.
We either need:
- to pass types using some kind of identifier (suffix, prefix,
escaping - whatever we see reasonable)
Whether you pass the type in the field name or with the value (the solution Bill suggested) makes no difference. This will only result in broken values or broken field names in software prepared to handle pure JSON only.
I agree that this does feel like a hack and will require custom parsers, and will probably lead to interoperability issues later.
- do not support those types in the API and throttle everything through
string leaving it to the recipient to consult standard + vendor dictionaries and deduce type
This should be the case for standard JSON.
This is the current case, and I see no strong case for supporting explicit type declarations: JSON, CSV, most XML (without @type or XML Schemas)... If explicit types are necessary, use XML/binary/etc.
- not use a format that can't express the types i.e JSON
JSON should be supported at least for parsing, since a lot of tools/languages produce JSON.
I don't understand what the big deal is that JSON doesn't support types... existing applications seem to work fine without it.
- extend JSON (create JSON2) to support the types
This is an option, but don't call it JSON then. Call it CEE, JSON2, or something else.
I don't like this option. We will just repeat all of these discussions with a slightly different focus.
I think I covered all options. I see 1) as the least evil, but maybe I am missing something.
To me 1) and 4) are essentially the same. I support this idea as long as it is not called JSON.
Botond makes a good point. Either we create a new spec based upon JSON (which may be backwards compatible) or we don't force explicit types. If the type information is embedded in either the name or the value, there needs to be some indication that this is a specialized form of JSON for it to be parsed and handled correctly.
The whole goal of CEE is to make parsing events easy... creating new formats that require specialized parsers is not easy.
I still don't see a need for explicit type support in all syntaxes -- if we have a limited listing of types and formats, values can easily be ducktyped. The only "loss" is that you will lose some validation -- if you present me an invalid number, IPv4 address, etc., it will be treated as a string, as I cannot raise an invalid type format error.
On 03/26/2012 11:03 AM, William Heinbockel wrote:
Do you have something better in mind? Again, losing type info is a non-starter.
Yes, use a format which supports types properly (XML, BSON, homegrown). If you (as a user) want to stick with JSON, you should accept the fact that it cannot transfer (all) types. If keeping type information is a requirement, you should use another format.
We either need:
- to pass types using some kind of identifier (suffix, prefix,
escaping - whatever we see reasonable)
Whether you pass the type in the field name or with the value (the solution Bill suggested) makes no difference. This will only result in broken values or broken field names in software prepared to handle pure JSON only.
I agree that this does feel like a hack and will require custom parsers, and will probably lead to interoperability issues later.
- do not support those types in the API and throttle everything through
string leaving it to the recipient to consult standard + vendor dictionaries and deduce type
This should be the case for standard JSON.
This is the current case, and I see no strong case for supporting explicit type declarations: JSON, CSV, most XML (without @type or XML Schemas)... If explicit types are necessary, use XML/binary/etc.
- not use a format that can't express the types i.e JSON
JSON should be supported at least for parsing, since a lot of tools/languages produce JSON.
I don't understand what the big deal is that JSON doesn't support types... existing applications seem to work fine without it.
- extend JSON (create JSON2) to support the types
This is an option, but don't call it JSON then. Call it CEE, JSON2, or something else.
I don't like this option. We will just repeat all of these discussions with a slightly different focus.
I think I covered all options. I see 1) as the least evil, but maybe I am missing something.
To me 1) and 4) are essentially the same. I support this idea as long as it is not called JSON.
Botond makes a good point. Either we create a new spec based upon JSON (which may be backwards compatible) or we don't force explicit types. If the type information is embedded in either the name or the value, there needs to be some indication that this is a specialized form of JSON for it to be parsed and handled correctly.
The whole goal of CEE is to make parsing events easy... creating new formats that require specialized parsers is not easy.
I still don't see a need for explicit type support in all syntaxes -- if we have a limited listing of types and formats, values can easily be ducktyped. The only "loss" is that you will lose some validation -- if you present me an invalid number, IPv4 address, etc., it will be treated as a string, as I cannot raise an invalid type format error.
Let me summarize a bit:
1) We have more types that we want to express than C has. This creates the need for subtypes. This means that storage (i.e. libcollection in my case) needs to support subtypes, so it has to be extended. I did not have a chance to do a lot of coding over the weekend, but I had some time to think about it.
2) We have the situation where our serialization format between the library and the consumer (syslog), i.e. JSON, would not support some of those types and subtypes natively. Based on the discussion, there are two options: drop the extended type info that does not fit into JSON (which IMO is wrong), or create some kind of new expression on top of syntactically correct JSON (which I think is right). I really do not understand why we can't do both. Why can't we create a spec and follow it, as people on this project are syslog developers? Maybe there should be an ENV variable or some new call exposed by syslog that would reveal its ability to parse extended type information out of the JSON. I will leave it to Botond and Rainer to define which is the better method. I would prefer something like a syslog_capability() call that would return to the library information about what syslog can do, so that the library would be able to choose the best available serialization format.
3) We have different opinions about the various extra types. Let us start with something that we have time to implement, and add more as we go.
On Mon, 26 Mar 2012 11:54:57 -0400 Dmitri Pal dpal@redhat.com wrote:
- We have the situation when our serialization format between the
library and the consumer (syslog), i.e. JSON, would not support some of those types and subtypes natively. Based on the discussion, there are two options: drop the extended type info that does not fit into JSON (which IMO is wrong), or create some kind of new expression on top of syntactically correct JSON (which I think is right). I really do not understand why we can't do both.
We can do both. What we can't do is have one and claim it does both.
Why can't we create a spec and follow it, as people on this project are syslog developers? Maybe there should be an ENV variable or some new call exposed by syslog that would reveal its ability to parse extended type information out of the JSON.
If we are talking about libumberlog here, the format (XML/JSON/LSON) could be specified via the option parameter passed to openlog(). Not sure whether Gergely wants more complexity in the lib by adding output support for all these. For ELAPI, probably there is an init method as well where you can set the output format. In syslog-ng/rsyslog/nxlog this will be some kind of a config parameter. So I don't see any reason why all these couldn't be supported, or even a binary serialized format (e.g. BSON).
Regards, Botond
On 03/26/2012 12:09 PM, Botond Botyanszki wrote:
On Mon, 26 Mar 2012 11:54:57 -0400 Dmitri Pal dpal@redhat.com wrote:
- We have the situation when our serialization format between the
library and the consumer (syslog), i.e. JSON, would not support some of those types and subtypes natively. Based on the discussion, there are two options: drop the extended type info that does not fit into JSON (which IMO is wrong), or create some kind of new expression on top of syntactically correct JSON (which I think is right). I really do not understand why we can't do both.
We can do both. What we can't do is have one and claim it does both.
Agree.
Why can't we create a spec and follow it, as people on this project are syslog developers? Maybe there should be an ENV variable or some new call exposed by syslog that would reveal its ability to parse extended type information out of the JSON.
If we are talking about libumberlog here, the format (XML/JSON/LSON) could be specified via the option parameter passed to openlog().
I think this is reversed. The library should detect what syslog supports and not try to pass a format that syslog does not support. The library should/would pick the best format that syslog supports, but syslog needs to pass that information to the library first.
So logically I see the following code in the library's init() function:
syslog_capability(&what_is_supported)
<inspect what_is_supported and choose one>
openlog(selected_format)
Something like this.
Not sure whether Gergely wants more complexity in the lib by adding output support for all these. For ELAPI, probably there is an init method as well where you can set the output format. In syslog-ng/rsyslog/nxlog this will be some kind of a config parameter. So I don't see any reason why all these couldn't be supported, or even a binary serialized format (e.g. BSON).
Right. This is all making sense, except that I can't do it myself. I am in the process of bringing another engineer up to speed so that he can take over from me and really produce the code in a timely fashion.
Regards, Botond
Let me summarize a bit:
- We have more types that we want to express than C has. This creates
the need for subtypes. This means that storage (i.e. libcollection in my case) needs to support subtypes, so it has to be extended. I did not have a chance to do a lot of coding over the weekend, but I had some time to think about it.
Regardless of the library and language, there will be type incompatibilities. You can usually assume that boolean, integer, decimal, and string support exists. What I am more concerned about is the actual message format and what types are supported there.
Right now, there are JSON C/Java/Python libraries that support way more types than can be represented in JSON.
I don't see what the issue is. The underlying data store doesn't have to support all of the different types; it is up to the application to make sure things are stored/recalled correctly. If you want to use libcollection as an event data store, you may have to write an intermediary wrapper to add type information to the collection, or (as you suggest) expand the libcollection capabilities to allow user defined types (though this might add other issues). I would probably define my own (type, value) struct to store atomic lumberjack event values in libcollection.
I think this opens the larger question: is libcollection intended to be a generic name-value data store or a lumberjack event-specific data store? I thought it was the former, but your comments now make me think it is the latter.
I don't think we should limit the needs of the log library just to suit the capabilities of libcollection.
- We have the situation when our serialization format between the
library and the consumer (syslog), i.e. JSON, would not support some of those types and subtypes natively. Based on the discussion there are two options: drop the extended type info that does not fit into JSON (which IMO is wrong) or create some kind of new expression on top of syntactically correct JSON (which I think is right). I really do not understand why we can't do both. Why can't we create a spec and follow it, given that the people on this project are syslog developers? Maybe there should be an ENV variable or some new call exposed by syslog that would reveal its ability to parse extended type information out of the JSON. I will leave it to Botond and Rainer to define which is the better method. I would prefer something like a syslog_capability() call that would return to the library information about what syslog can do, so that the library would be able to choose the best available serialization format.
I still don't see any issue here. Whatever happened to dynamic/implicit typing? Most text-based encodings (XML, JSON, YAML) support one or a small handful of types, yet there are plenty of libraries that are able to parse them and make available to the user a wider range of types than are supported.
We have an event structure that is a list of (name, type, value) tuples. JSON parsers do not support explicit typing. If we want explicit typing in a JSON-like format, we need to define it.
- We have different opinions about different extra types. Let us start
with something that we have time to implement and add as we go.
My initial suggestion was to do what Gergely did, just treat every name and value as a UTF-8 string.
An API can always add support for more value types to be encoded/decoded. User applications can provide more intelligence to determine exactly how to treat the values and determine whether that string that has the syntax of an IPv4 address should be treated as such.
On 03/26/2012 01:36 PM, William Heinbockel wrote:
Let me summarize a bit:
- We have more types that we want to express than C has. This creates
the need for subtypes. This means that the storage (i.e. libcollection in my case) needs to support subtypes, so it has to be extended. I did not have a chance to do a lot of coding over the weekend, but I had some time to think about it.
Regardless of the library and language, there will be type incompatibilities. You can usually assume that boolean, integer, decimal, and string support exists. What I am more concerned about is the actual message format and what types are supported there.
Right now, there are JSON C/Java/Python libraries that support way more types than can be represented in JSON.
I don't see what the issue is. The underlying data store doesn't have to support all of the different types; it is up to the application to make sure things are stored/recalled correctly.
libcollection is the storage. The log library, elapi or selog (the new name I started to use in the prototype I am putting together), is the library for logging that wraps libcollection.
If you want to use libcollection as an event data store, you may have to write an intermediary wrapper to add type information to the collection, or (as you suggest) expand the libcollection capabilities to allow user defined types (though this might add other issues).
Yes. This is exactly what I want to do.
I would probably define my own (type, value) struct to store atomic lumberjack event values in libcollection.
I think this is an implementation detail. I would do it differently; I have figured out how.
I think this opens the larger question: is libcollection intended to be a generic name-value data store or a lumberjack event-specific data store? I thought it was the former, but your comments now make me think it is the latter.
It is a KVP store that was created to store event data but found other uses and became more generic.
I don't think we should limit the needs of the log library just to suit the capabilities of libcollection.
No, we/I should add things to libcollection to provide the necessary capabilities for logging, not the other way around. So fitting into libcollection should not be the goal. However, some things are already available and some would take time to add.
- We have the situation when our serialization format between the
library and the consumer (syslog), i.e. JSON, would not support some of those types and subtypes natively. Based on the discussion there are two options: drop the extended type info that does not fit into JSON (which IMO is wrong) or create some kind of new expression on top of syntactically correct JSON (which I think is right). I really do not understand why we can't do both. Why can't we create a spec and follow it, given that the people on this project are syslog developers? Maybe there should be an ENV variable or some new call exposed by syslog that would reveal its ability to parse extended type information out of the JSON. I will leave it to Botond and Rainer to define which is the better method. I would prefer something like a syslog_capability() call that would return to the library information about what syslog can do, so that the library would be able to choose the best available serialization format.
I still don't see any issue here. Whatever happened to dynamic/implicit typing? Most text-based encodings (XML, JSON, YAML) support one or a small handful of types, yet there are plenty of libraries that are able to parse them and make available to the user a wider range of types than are supported.
We have an event structure that is a list of (name, type, value) tuples. JSON parsers do not support explicit typing. If we want explicit typing in a JSON-like format, we need to define it.
I do not think we are talking about the same thing here.
I do not think that we need explicit typing for everything, only for specific values predefined by the spec (i.e. defined by us). If the software does not understand the "subtype suffix", it will treat it as part of the field name, and I do not see that being a big deal. If the parser is smart and taught to parse the subtype out, it would be able to validate better, but nothing would be broken.
Here is the example:
C API will have:
api_call(... , "host", TYPE_IP, "127.0.0.1") <- I use string as an example. We can use binary if you want.
JSON output from the library will have:
"host:ipv4": "127.0.0.1"
It is a valid JSON field. The standard JSON parser will assume that "host:ipv4" is a field and be done with it. A smart parser inside syslog will understand that ":" separates the subtype suffix and would extract it and do validation.
I fail to see what is wrong with this approach.
- We have different opinions about different extra types. Let us start
with something that we have time to implement and add as we go.
My initial suggestion was to do what Gergely did, just treat every name and value as a UTF-8 string.
An API can always add support for more value types to be encoded/decoded. User applications can provide more intelligence to determine exactly how to treat the values and determine whether that string that has the syntax of an IPv4 address should be treated as such.
There is no value in having an advanced library that does not do basic things. It has to be feature-rich enough to be attractive over the ul_syslog variant.
On 03/26/2012 02:38 PM, Dmitri Pal wrote:
I do not think that we need explicit typing for everything, only for specific values predefined by the spec (i.e. defined by us). If the software does not understand the "subtype suffix", it will treat it as part of the field name, and I do not see that being a big deal. If the parser is smart and taught to parse the subtype out, it would be able to validate better, but nothing would be broken.
The spec has this capability but it is implemented slightly differently than as you suggest. There are certain well known keys (time, localHostName, ppid, uid, etc.) which are strongly typed in the XSD. Therefore, if you see some JSON like this...
{ "time": "2001-12-31T12:00:00", "level": "WARN", "localAddr" : "192.168.1.1", "somUserDefinedFieldNotInTheXSD" : "abc123" }
It would be safe to assume that:
- time == Date and time (this is defined in the XSD)
- level == enumeration (this is defined in the XSD)
- localAddr == an ip address (this is defined in the XSD)
- somUserDefinedFieldNotInTheXSD == Could be anything.
Cheers
On Mon, 26 Mar 2012 15:03:15 -0400 Keith Robertson kroberts@redhat.com wrote:
On 03/26/2012 02:38 PM, Dmitri Pal wrote:
I do not think that we need explicit typing for everything, only for specific values predefined by the spec (i.e. defined by us). If the software does not understand the "subtype suffix", it will treat it as part of the field name, and I do not see that being a big deal. If the parser is smart and taught to parse the subtype out, it would be able to validate better, but nothing would be broken.
The spec has this capability but it is implemented slightly differently than as you suggest. There are certain well known keys (time, localHostName, ppid, uid, etc.) which are strongly typed in the XSD. Therefore, if you see some JSON like this...
{ "time": "2001-12-31T12:00:00", "level": "WARN", "localAddr" : "192.168.1.1", "somUserDefinedFieldNotInTheXSD" : "abc123" }
It would be safe to assume that:
- time == Date and time (this is defined in the XSD)
- level == enumeration (this is defined in the XSD)
- localAddr == an ip address (this is defined in the XSD)
- somUserDefinedFieldNotInTheXSD == Could be anything.
This is also what I suggested earlier. In addition, if the type is also specified along with the value, it would override the default and give a hint to the reader/parser if the field is not in the list of well-known fields, i.e.: { ... "somUserDefinedFieldNotInTheXSD:someUserDefinedType" : "abc123" } This is not necessarily JSON/LSON but can be XML with the 'type' attribute missing or present.
Regards, Botond
On 03/26/2012 03:12 PM, Botond Botyanszki wrote:
On Mon, 26 Mar 2012 15:03:15 -0400 Keith Robertsonkroberts@redhat.com wrote:
On 03/26/2012 02:38 PM, Dmitri Pal wrote:
I do not think that we need explicit typing for everything, only for specific values predefined by the spec (i.e. defined by us). If the software does not understand the "subtype suffix", it will treat it as part of the field name, and I do not see that being a big deal. If the parser is smart and taught to parse the subtype out, it would be able to validate better, but nothing would be broken.
The spec has this capability but it is implemented slightly differently than as you suggest. There are certain well known keys (time, localHostName, ppid, uid, etc.) which are strongly typed in the XSD. Therefore, if you see some JSON like this...
{ "time": "2001-12-31T12:00:00", "level": "WARN", "localAddr" : "192.168.1.1", "somUserDefinedFieldNotInTheXSD" : "abc123" }
It would be safe to assume that:
- time == Date and time (this is defined in the XSD)
- level == enumeration (this is defined in the XSD)
- localAddr == an ip address (this is defined in the XSD)
- somUserDefinedFieldNotInTheXSD == Could be anything.
This is also what I suggested earlier.
Must have missed that.
In addition, if the type is also specified along with the value, it would override the default and give a hint to the reader/parser if the field is not in the list of well-known fields, i.e.: { ... "somUserDefinedFieldNotInTheXSD:someUserDefinedType" : "abc123" } This is not necessarily JSON/LSON but can be XML with the 'type' attribute missing or present.
OK. I don't really have a problem with this. Seems like a reasonable way to convey typing for JSON. However, it wouldn't work for XML...
<p:asdf:jkl>look ma no hands</p:asdf:jkl> <-- That element name is invalid. AFAIK.
I guess we should document that this is a legal option for JSON in the specification.
Bill, suggestions on verbiage... if you agree?
Regards, Botond
On Mon, 26 Mar 2012 15:38:54 -0400 Keith Robertson kroberts@redhat.com wrote:
On 03/26/2012 03:12 PM, Botond Botyanszki wrote:
On Mon, 26 Mar 2012 15:03:15 -0400 Keith Robertsonkroberts@redhat.com wrote:
On 03/26/2012 02:38 PM, Dmitri Pal wrote:
I do not think that we need explicit typing for everything, only for specific values predefined by the spec (i.e. defined by us). If the software does not understand the "subtype suffix", it will treat it as part of the field name, and I do not see that being a big deal. If the parser is smart and taught to parse the subtype out, it would be able to validate better, but nothing would be broken.
The spec has this capability but it is implemented slightly differently than as you suggest. There are certain well known keys (time, localHostName, ppid, uid, etc.) which are strongly typed in the XSD. Therefore, if you see some JSON like this...
{ "time": "2001-12-31T12:00:00", "level": "WARN", "localAddr" : "192.168.1.1", "somUserDefinedFieldNotInTheXSD" : "abc123" }
It would be safe to assume that:
- time == Date and time (this is defined in the XSD)
- level == enumeration (this is defined in the XSD)
- localAddr == an ip address (this is defined in the XSD)
- somUserDefinedFieldNotInTheXSD == Could be anything.
This is also what I suggested earlier.
Must have missed that.
In addition, if the type is also specified along with the value, it would override the default and give a hint to the reader/parser if the field is not in the list of well-known fields, i.e.: { ... "somUserDefinedFieldNotInTheXSD:someUserDefinedType" : "abc123" } This is not necessarily JSON/LSON but can be XML with the 'type' attribute missing or present.
OK. I don't really have a problem with this. Seems like a reasonable way to convey typing for JSON. However, it wouldn't work for XML...
<p:asdf:jkl>look ma no hands</p:asdf:jkl> <-- That element name is invalid. AFAIK.
For XML I meant: <someUserDefinedFieldNotInTheXSD type='someUserDefinedType'>abc123</someUserDefinedFieldNotInTheXSD> vs <someUserDefinedFieldNotInTheXSD>abc123</someUserDefinedFieldNotInTheXSD>
Regards, Botond
On 03/26/2012 03:50 PM, Botond Botyanszki wrote:
On Mon, 26 Mar 2012 15:38:54 -0400 Keith Robertsonkroberts@redhat.com wrote:
On 03/26/2012 03:12 PM, Botond Botyanszki wrote:
On Mon, 26 Mar 2012 15:03:15 -0400 Keith Robertsonkroberts@redhat.com wrote:
On 03/26/2012 02:38 PM, Dmitri Pal wrote:
I do not think that we need explicit typing for everything, only for specific values predefined by the spec (i.e. defined by us). If the software does not understand the "subtype suffix", it will treat it as part of the field name, and I do not see that being a big deal. If the parser is smart and taught to parse the subtype out, it would be able to validate better, but nothing would be broken.
The spec has this capability but it is implemented slightly differently than as you suggest. There are certain well known keys (time, localHostName, ppid, uid, etc.) which are strongly typed in the XSD. Therefore, if you see some JSON like this...
{ "time": "2001-12-31T12:00:00", "level": "WARN", "localAddr" : "192.168.1.1", "somUserDefinedFieldNotInTheXSD" : "abc123" }
It would be safe to assume that:
- time == Date and time (this is defined in the XSD)
- level == enumeration (this is defined in the XSD)
- localAddr == an ip address (this is defined in the XSD)
- somUserDefinedFieldNotInTheXSD == Could be anything.
This is also what I suggested earlier.
Must have missed that.
In addition, if the type is also specified along with the value, it would override the default and give a hint to the reader/parser if the field is not in the list of well-known fields, i.e.: { ... "somUserDefinedFieldNotInTheXSD:someUserDefinedType" : "abc123" } This is not necessarily JSON/LSON but can be XML with the 'type' attribute missing or present.
OK. I don't really have a problem with this. Seems like a reasonable way to convey typing for JSON. However, it wouldn't work for XML...
<p:asdf:jkl>look ma no hands</p:asdf:jkl> <-- That element name is invalid. AFAIK.
For XML I meant: <someUserDefinedFieldNotInTheXSD type='someUserDefinedType'>abc123</someUserDefinedFieldNotInTheXSD> vs <someUserDefinedFieldNotInTheXSD>abc123</someUserDefinedFieldNotInTheXSD>
I'm with you.
To summarize, there are 3 ways to convey typing information that need to be documented:
- Predefined types (i.e. uid, gid, localAddr, etc.)
- Extend the schema. Technically this is possible, but I doubt it will be very popular. It's more work.
- Supply typing with the key:
-- JSON way: "myField:myType" : "abc123"
-- XML way: <myField type='myType'>abc123</myField>
Regards, Botond
On Mon, 26 Mar 2012, Keith Robertson wrote:
I'm with you.
To summarize, there are 3 ways to convey typing information that need to be documented:
- Predefined types (i.e. uid, gid, localAddr, etc.)
- Extend the schema. Technically this is possible, but I doubt it will
be very popular. It's more work.
- Supply typing with the key:
-- JSON way: "myField:myType" : "abc123" -- XML way: <myField type='myType'>abc123</myField>
I really don't like this last option. it will make converting from one message type to another problematic, and it also makes it so that the same logical field can show up differently to JSON aware endpoints depending on what type the sender happened to use.
especially with the numeric data, how many different types can be used to contain the number '42'? (remember to include string)
If you make the sender's type be part of the field name string, you are going to run into cases where the receiving software is just JSON, not LSON and it will run into problems by treating the type as part of the name.
I think that not transmitting the type is much better than doing so in a way that can confuse recipients.
David Lang
On 03/26/2012 04:14 PM, david@lang.hm wrote:
On Mon, 26 Mar 2012, Keith Robertson wrote:
I'm with you.
To summarize, there are 3 ways to convey typing information that need to be documented:
- Predefined types (i.e. uid, gid, localAddr, etc.)
- Extend the schema. Technically this is possible, but I doubt it
will be very popular. It's more work.
- Supply typing with the key:
-- JSON way: "myField:myType" : "abc123" -- XML way: <myField type='myType'>abc123</myField>
I really don't like this last option. it will make converting from one message type to another problematic, and it also makes it so that the same logical field can show up differently to JSON aware endpoints depending on what type the sender happened to use.
especially with the numeric data, how many different types can be used to contain the number '42'? (remember to include string)
If you make the sender's type be part of the field name string, you are going to run into cases where the receiving software is just JSON, not LSON and it will run into problems by treating the type as part of the name.
I think that not transmitting the type is much better than doing so in a way that can confuse recipients.
Ok. We are at the core of the argument here!
Why do you think that there will be JSON end points that do not understand it? This is a fundamental difference. The whole point is to emit LSON data to the syslog implementations that indicate that they understand LSON. This is why I suggested the capability function in the syslog.
The same thing should work between syslog and the remote collection server. If the server indicates it supports LSON, syslog will send LSON; if only JSON, then it will lose the types and emit just JSON.
David Lang
On Mon, 26 Mar 2012, Dmitri Pal wrote:
On 03/26/2012 04:14 PM, david@lang.hm wrote:
On Mon, 26 Mar 2012, Keith Robertson wrote:
I'm with you.
To summarize, there are 3 ways to convey typing information that need to be documented:
- Predefined types (i.e. uid, gid, localAddr, etc.)
- Extend the schema. Technically this is possible, but I doubt it
will be very popular. It's more work.
- Supply typing with the key:
-- JSON way: "myField:myType" : "abc123" -- XML way: <myField type='myType'>abc123</myField>
I really don't like this last option. it will make converting from one message type to another problematic, and it also makes it so that the same logical field can show up differently to JSON aware endpoints depending on what type the sender happened to use.
especially with the numeric data, how many different types can be used to contain the number '42'? (remember to include string)
If you make the sender's type be part of the field name string, you are going to run into cases where the receiving software is just JSON, not LSON and it will run into problems by treating the type as part of the name.
I think that not transmitting the type is much better than doing so in a way that can confuse recipients.
Ok. We are at the core of the argument here!
Why do you think that there will be JSON end points that do not understand it? This is a fundamental difference. The whole point is to emit LSON data to the syslog implementations that indicate that they understand LSON. This is why I suggested the capability function in the syslog.
Remember that syslog is just transport and routing; very little processing is done by syslog. As a result, syslog really doesn't care about the types. The output of syslog to the apps that analyze the data is where types can matter, but it is also where deviations from standard JSON are more likely to be a problem. While there will be apps that are written specifically for lumberjack, there will be far more that are written for other purposes and just happen to work with lumberjack because we use a generic enough serialization scheme that they can already understand the data.
Having the syslog daemon re-write the log line into a different format is relatively expensive. In some cases it will be done anyway (database inserts for example) and will not matter. I would expect that the cost of reformatting the message would be significantly more than the savings of receiving the message in BSON instead of JSON for example.
The same thing should work between syslog and the remote collection server. If the server indicates it supports LSON, syslog will send LSON; if only JSON, then it will lose the types and emit just JSON.
As long as we make sure that we are being clear that the version with type info embedded is not JSON, but is instead a new protocol, I have less of a problem (but in that case, why start with JSON instead of something more efficient like BSON?)
David Lang
On 03/26/2012 05:30 PM, david@lang.hm wrote:
On Mon, 26 Mar 2012, Dmitri Pal wrote:
On 03/26/2012 04:14 PM, david@lang.hm wrote:
On Mon, 26 Mar 2012, Keith Robertson wrote:
I'm with you.
To summarize, there are 3 ways to convey typing information that need to be documented:
- Predefined types (i.e. uid, gid, localAddr, etc.)
- Extend the schema. Technically this is possible, but I doubt it
will be very popular. It's more work.
- Supply typing with the key:
-- JSON way: "myField:myType" : "abc123" -- XML way: <myField type='myType'>abc123</myField>
I really don't like this last option. it will make converting from one message type to another problematic, and it also makes it so that the same logical field can show up differently to JSON aware endpoints depending on what type the sender happened to use.
especially with the numeric data, how many different types can be used to contain the number '42'? (remember to include string)
If you make the sender's type be part of the field name string, you are going to run into cases where the receiving software is just JSON, not LSON and it will run into problems by treating the type as part of the name.
I think that not transmitting the type is much better than doing so in a way that can confuse recipients.
Ok. We are at the core of the argument here!
Why do you think that there will be JSON end points that do not understand it? This is a fundamental difference. The whole point is to emit LSON data to the syslog implementations that indicate that they understand LSON. This is why I suggested the capability function in the syslog.
Remember that syslog is just transport and routing; very little processing is done by syslog. As a result, syslog really doesn't care about the types. The output of syslog to the apps that analyze the data is where types can matter, but it is also where deviations from standard JSON are more likely to be a problem. While there will be apps that are written specifically for lumberjack, there will be far more that are written for other purposes and just happen to work with lumberjack because we use a generic enough serialization scheme that they can already understand the data.
Having the syslog daemon re-write the log line into a different format is relatively expensive. In some cases it will be done anyway (database inserts for example) and will not matter. I would expect that the cost of reformatting the message would be significantly more than the savings of receiving the message in BSON instead of JSON for example.
Maybe I am wrong, but based on the conversation with Rainer I got the impression that this is exactly what rsyslog is going to do and that other syslog implementations would follow. We need to talk about this.
The same thing should work between syslog and the remote collection server. If the server indicates it supports LSON, syslog will send LSON; if only JSON, then it will lose the types and emit just JSON.
As long as we make sure that we are being clear that the version with type info embedded is not JSON, but is instead a new protocol, I have less of a problem (but in that case, why start with JSON instead of something more efficient like BSON?)
JSON/LSON is human readable.
If we want binary format I would go with EXI http://en.wikipedia.org/wiki/Efficient_XML_Interchange and http://exip.sourceforge.net/
David Lang
On Mon, 26 Mar 2012, Dmitri Pal wrote:
On 03/26/2012 05:30 PM, david@lang.hm wrote:
On Mon, 26 Mar 2012, Dmitri Pal wrote:
On 03/26/2012 04:14 PM, david@lang.hm wrote:
On Mon, 26 Mar 2012, Keith Robertson wrote:
I'm with you.
To summarize, there are 3 ways to convey typing information that need to be documented:
- Predefined types (i.e. uid, gid, localAddr, etc.)
- Extend the schema. Technically this is possible, but I doubt it
will be very popular. It's more work.
- Supply typing with the key:
-- JSON way: "myField:myType" : "abc123" -- XML way: <myField type='myType'>abc123</myField>
I really don't like this last option. it will make converting from one message type to another problematic, and it also makes it so that the same logical field can show up differently to JSON aware endpoints depending on what type the sender happened to use.
especially with the numeric data, how many different types can be used to contain the number '42'? (remember to include string)
If you make the sender's type be part of the field name string, you are going to run into cases where the receiving software is just JSON, not LSON and it will run into problems by treating the type as part of the name.
I think that not transmitting the type is much better than doing so in a way that can confuse recipients.
Ok. We are at the core of the argument here!
Why do you think that there will be JSON end points that do not understand it? This is a fundamental difference. The whole point is to emit LSON data to the syslog implementations that indicate that they understand LSON. This is why I suggested the capability function in the syslog.
Remember that syslog is just transport and routing; very little processing is done by syslog. As a result, syslog really doesn't care about the types. The output of syslog to the apps that analyze the data is where types can matter, but it is also where deviations from standard JSON are more likely to be a problem. While there will be apps that are written specifically for lumberjack, there will be far more that are written for other purposes and just happen to work with lumberjack because we use a generic enough serialization scheme that they can already understand the data.
Having the syslog daemon re-write the log line into a different format is relatively expensive. In some cases it will be done anyway (database inserts for example) and will not matter. I would expect that the cost of reformatting the message would be significantly more than the savings of receiving the message in BSON instead of JSON for example.
Maybe I am wrong, but based on the conversation with Rainer I got the impression that this is exactly what rsyslog is going to do and that other syslog implementations would follow. We need to talk about this.
I wouldn't be surprised to see people do this initially, but in the common case where the log message is being passed as-is to some other system or being written as-is (with possibly some additional metadata) to disk, completely re-writing the log message will be inefficient.
With my performance hat on, I would expect that the best way to parse these messages would be to make a copy of the incoming message, go through and find the elements, creating a pointer to the start of each chunk of text and putting a null at the end of the chunk (breaking the buffer into many C strings). The pointers can then be arranged in an efficient memory structure (possibly along the lines that we've been talking about in other threads), and decisions can be made based on individual elements, but the raw data itself can be sent as-is to the destination.
The same thing should work between syslog and remote collection server. If the server indicates it supports LSON syslog will send LSON if only JSON then it will loose the types and emit just JSON.
As long as we make sure that we are being clear tha the version with type info embedded is not JSON, but is instead a new protocol, I have less of a problem (but in that case, why start with JSON instead of something more efficient like BSON?)
JSON/LSON is human readable.
If we want binary format I would go with EXI http://en.wikipedia.org/wiki/Efficient_XML_Interchange and http://exip.sourceforge.net/
This doesn't look like either a mature project (self-described as pre-alpha) or a very active one (3 mailing list posts in the last year).
the big advantage that wikipedia calls out for this is the ability to use the schema for compression, and I think we've already talked about not wanting to be dependent on the schema being the same on both ends (in a wide environment like logging, it's going to be impossible to make the schema the same everywhere)
Human readable is good, I'm just not sure if the combination of human readable plus precisely defined types is worth that much. If you are going the route of precisely defined types, I expect that you are doing something where you really care about type enforcement; you will also be caring a lot about efficiency, and will be likely to go with a binary format (which has to provide some type info anyway; extending the list of types is not that big a change at that point)
David Lang
On 03/26/2012 07:45 PM, david@lang.hm wrote:
On Mon, 26 Mar 2012, Dmitri Pal wrote:
On 03/26/2012 05:30 PM, david@lang.hm wrote:
On Mon, 26 Mar 2012, Dmitri Pal wrote:
On 03/26/2012 04:14 PM, david@lang.hm wrote:
On Mon, 26 Mar 2012, Keith Robertson wrote:
I'm with you.
To summarize, there are 3 ways to convey typing information that need to be documented:
- Predefined types (i.e. uid, gid, localAddr, etc.)
- Extend the schema. Technically this is possible, but I doubt it
will be very popular. It's more work.
- Supply typing with the key:
-- JSON way: "myField:myType" : "abc123" -- XML way: <myField type='myType'>abc123</myField>
I really don't like this last option. It will make converting from one message type to another problematic, and it also makes it so that the same logical field can show up differently to JSON-aware endpoints depending on what type the sender happened to use.
especially with the numeric data, how many different types can be used to contain the number '42'? (remember to include string)
If you make the sender's type be part of the field name string, you are going to run into cases where the receiving software is just JSON, not LSON and it will run into problems by treating the type as part of the name.
I think that not transmitting the type is much better than doing so in a way that can confuse recipients.
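The concern above can be made concrete: the type-in-the-key "LSON" form is still syntactically valid JSON, so a plain JSON consumer accepts it without error but sees the type as part of the field name. A small illustration (key names taken from the thread's examples):

```python
import json

# An LSON-style document parsed by a standard JSON library.
lson = '{"dst_ip:ipv4": "1.2.3.4"}'
fields = json.loads(lson)

plain_sees = list(fields)   # ['dst_ip:ipv4'] -- 'dst_ip' no longer exists
                            # as a field name for a plain JSON consumer

# An LSON-aware reader has to split the key back apart:
name, _, ftype = plain_sees[0].partition(":")
```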
Ok. We are at the core of the argument here!
Why do you think that there will be JSON end points that do not understand it? This is a fundamental difference. The whole point is to emit LSON data to the syslog implementations that indicate that they understand LSON. This is why I suggested the capability function in the syslog.
remember that syslog is just the transport and routing; very little processing is done by syslog. As a result, syslog really doesn't care about the types. The output of syslog to the apps that analyze the data is where types can matter, but it is also where deviations from standard JSON are more likely to be a problem. While there will be apps that are written specifically for lumberjack, there will be far more that are written for other purposes and just happen to work with lumberjack because we use a generic enough serialization scheme that they can already understand the data.
Having the syslog daemon re-write the log line into a different format is relatively expensive. In some cases it will be done anyway (database inserts for example) and will not matter. I would expect that the cost of reformatting the message would be significantly more than the savings of receiving the message in BSON instead of JSON for example.
Maybe I am wrong, but based on the conversation with Rainer I got the impression that this is exactly what rsyslog is going to do and that other syslog implementations would follow. We need to talk about this.
I wouldn't be surprised to see people do this initially, but in the common case where the log message is being passed as-is to some other system or being written as-is (with possibly some additional metadata) to disk, completely re-writing the log message will be inefficient.
with my performance hat on, I would expect that the best way to parse these messages would be to make a copy of the incoming message, go through and find the elements, creating a pointer to the start of each chunk of text and putting a null at the end of the chunk (breaking the buffer into many C strings). The pointers can then be arranged in an efficient memory structure (possibly along the lines that we've been talking about in other threads) and decisions can be made based on individual elements, but the raw data itself can be sent as-is to the destination.
This is a very simple paradigm: tell me what you support and I will provide info in the right format. If the syslog implementation plans to parse data but not decompose it as you suggested, it can say so and tell the library not to include subtypes. This is why I want syslog implementations to provide a capability call. If the call is not there, the library would assume that the syslog implementation does not support subtypes and would not send them. If the entry point is there, the library will call it and get the info from the syslog about the format it prefers (JSON/LSON/XML/BSON etc.). Library and syslog implementations will evolve independently, so there should be a way for them to agree on the best format the library is capable of producing and syslog is capable of consuming in a given configuration.
The question I have is what should be the call name and which shared object or objects should the library inspect to see if the syslog supports this or not.
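A hypothetical sketch of this capability probe, in Python for brevity: the library looks for an optional entry point on the syslog side; if it is absent, it falls back to plain JSON without subtypes. The entry-point name `lumberjack_capabilities` and the preference list are invented here for illustration, not a proposal for the actual call name.

```python
# Library's format preference, best first.
PREFERENCE = ["BSON", "LSON", "XML", "JSON"]

def negotiate(syslog_impl):
    # Probe for the optional capability entry point.
    caps = getattr(syslog_impl, "lumberjack_capabilities", None)
    if caps is None:
        return "JSON"          # no entry point: assume no subtype support
    supported = set(caps())
    for fmt in PREFERENCE:
        if fmt in supported:
            return fmt
    return "JSON"

class LegacySyslog:            # predates the capability call
    pass

class ModernSyslog:
    def lumberjack_capabilities(self):
        return ["JSON", "LSON"]
```

In the C setting discussed here, this probe would more likely be a dlsym()-style lookup on a shared object, which is exactly the open question about call and object names.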
The same thing should work between syslog and a remote collection server. If the server indicates it supports LSON, syslog will send LSON; if only JSON, then it will lose the types and emit just JSON.
As long as we make sure that we are being clear that the version with type info embedded is not JSON, but is instead a new protocol, I have less of a problem (but in that case, why start with JSON instead of something more efficient like BSON?)
JSON/LSON is human readable.
If we want binary format I would go with EXI http://en.wikipedia.org/wiki/Efficient_XML_Interchange and http://exip.sourceforge.net/
this doesn't look like either a mature project (self-described as pre-alpha) or a very active project (3 mailing list posts in the last year)
the big advantage that wikipedia calls out for this is the ability to use the schema for compression, and I think we've already talked about not wanting to be dependent on the schema being the same on both ends (in a wide environment like logging, it's going to be impossible to make the schema the same everywhere)
Human readable is good, I'm just not sure if the combination of human readable plus precisely defined types is worth that much. If you are going the route of precisely defined types, I expect that you are doing something where you really care about type enforcement; you will also be caring a lot about efficiency, and will be likely to go with a binary format (which has to provide some type info anyway; extending the list of types is not that big a change at that point)
David Lang
I should have said "something like EXI".
On Mon, 26 Mar 2012, Dmitri Pal wrote:
On 03/26/2012 07:45 PM, david@lang.hm wrote:
On Mon, 26 Mar 2012, Dmitri Pal wrote:
On 03/26/2012 05:30 PM, david@lang.hm wrote:
On Mon, 26 Mar 2012, Dmitri Pal wrote:
Ok. We are at the core of the argument here!
Why do you think that there will be JSON end points that do not understand it? This is a fundamental difference. The whole point is to emit LSON data to the syslog implementations that indicate that they understand LSON. This is why I suggested the capability function in the syslog.
remember that syslog is just the transport and routing; very little processing is done by syslog. As a result, syslog really doesn't care about the types. The output of syslog to the apps that analyze the data is where types can matter, but it is also where deviations from standard JSON are more likely to be a problem. While there will be apps that are written specifically for lumberjack, there will be far more that are written for other purposes and just happen to work with lumberjack because we use a generic enough serialization scheme that they can already understand the data.
Having the syslog daemon re-write the log line into a different format is relatively expensive. In some cases it will be done anyway (database inserts for example) and will not matter. I would expect that the cost of reformatting the message would be significantly more than the savings of receiving the message in BSON instead of JSON for example.
Maybe I am wrong, but based on the conversation with Rainer I got the impression that this is exactly what rsyslog is going to do and that other syslog implementations would follow. We need to talk about this.
I wouldn't be surprised to see people do this initially, but in the common case where the log message is being passed as-is to some other system or being written as-is (with possibly some additional metadata) to disk, completely re-writing the log message will be inefficient.
with my performance hat on, I would expect that the best way to parse these messages would be to make a copy of the incoming message, go through and find the elements, creating a pointer to the start of each chunk of text and putting a null at the end of the chunk (breaking the buffer into many C strings). The pointers can then be arranged in an efficient memory structure (possibly along the lines that we've been talking about in other threads) and decisions can be made based on individual elements, but the raw data itself can be sent as-is to the destination.
This is a very simple paradigm: tell me what you support and I will provide info in the right format. If the syslog implementation plans to parse data but not decompose it as you suggested, it can say so and tell the library not to include subtypes. This is why I want syslog implementations to provide a capability call. If the call is not there, the library would assume that the syslog implementation does not support subtypes and would not send them. If the entry point is there, the library will call it and get the info from the syslog about the format it prefers (JSON/LSON/XML/BSON etc.). Library and syslog implementations will evolve independently, so there should be a way for them to agree on the best format the library is capable of producing and syslog is capable of consuming in a given configuration.
The question I have is what should be the call name and which shared object or objects should the library inspect to see if the syslog supports this or not.
you are right, I am mixing up the different areas again.
to reiterate the layers, and to try and make sure I have it straight myself, we have many different layers of communications
1. Log Generation (program to library)
This is the API for the program to generate a message, different for different libraries, and for different calls within a library
So far we have at least four versions in the works
uberlog syslog drop-in replacement
template based C
memory structure based C
log4j based java
There will be more. For example a Perl logging library should be able to take a hash reference that contains the structure of stuff to be logged.
Value of type info at this layer:
input validation
hints for serialization
ease of accepting existing data that the app is already storing (by giving a pointer to the data and saying what type of data is being pointed at; a big issue in C, a smaller issue in most other languages)
2. Log Transport (library to Dispatcher, talking to a syslog daemon in most cases, relays to other systems, etc.)
This should default to something safe and generic that can be handled by systems that have never heard of lumberjack (although not optimally); JSON without types is a good fit for this default.
In addition to this default, there need to be other options for people who want them. At minimum there needs to be an efficient binary option and an option that fully supports types. There probably needs to be an XML option 'just because XML is standard'.
We need to think about what options should be mandatory, if any.
I think we are going to need to support plain JSON, XML and one efficient binary format
I am leery of defining official 'optional' formats unless we get agreement from the major syslog variants to support them as both input and output. We may need to define some sort of connection version handshake that enables anything other than the default format.
Value of type information at this layer:
critical for binary encoded data types
low value otherwise.
3. Log Delivery (syslog to file/database)
In large part this is going to be internal to the syslog implementation, but there should be standard formats to use for writing to files for consistency across different products.
Beyond this, the format is going to be mostly defined by whatever consumes the data. The major syslog implementations allow this to be customized, but we should try and define some sane standards (although this may be part of the 'phase 3' stuff like database schema definitions).
Value of type info at this layer:
critical for binary encoded data
significantly less valuable for other formats
potentially useful for compatibility checking when inserting into databases or similar
Having written this up, I'm not sure that there is really enough value in type information in the transport or delivery layers for anything other than binary encoded data to make it worth passing the type info along. Others can add to the value of type info at each layer and someone may come up with a good enough reason, but I'm not seeing it right now.
If there is a good match between the Log Transport and Log Delivery protocols, the cost of doing the routing and delivery of messages drops drastically. Given that the central logging systems are having to handle the full log volume of an organization, this is always something to be aware of.
The same thing should work between syslog and a remote collection server. If the server indicates it supports LSON, syslog will send LSON; if only JSON, then it will lose the types and emit just JSON.
As long as we make sure that we are being clear that the version with type info embedded is not JSON, but is instead a new protocol, I have less of a problem (but in that case, why start with JSON instead of something more efficient like BSON?)
JSON/LSON is human readable.
If we want binary format I would go with EXI http://en.wikipedia.org/wiki/Efficient_XML_Interchange and http://exip.sourceforge.net/
this doesn't look like either a mature project (self-described as pre-alpha) or a very active project (3 mailing list posts in the last year)
the big advantage that wikipedia calls out for this is the ability to use the schema for compression, and I think we've already talked about not wanting to be dependent on the schema being the same on both ends (in a wide environment like logging, it's going to be impossible to make the schema the same everywhere)
Human readable is good, I'm just not sure if the combination of human readable plus precisely defined types is worth that much. If you are going the route of precisely defined types, I expect that you are doing something where you really care about type enforcement; you will also be caring a lot about efficiency, and will be likely to go with a binary format (which has to provide some type info anyway; extending the list of types is not that big a change at that point)
David Lang
I should have said "something like EXI".
fair enough (when someone points at links, I tend to go read them and assume that that is exactly what the person was proposing, sorry)
Along the lines of the Schema-enhanced compression, the most efficient way to encode a CEE message would be the binary version of:
CEE message type, [field value], [extrafield type, extrafieldname, extrafield value]
where the CEE message type field indexes into a table of field names, types, and parents so that such data doesn't need to be sent over the wire.
but that depends on the CEE message type definition being identical on both ends, and as such, is not suitable as a storage format; even as a transport format it is fragile unless both sides are in absolute agreement on the CEE message type table (thus the CEE message definition #### in the example above)
All of this is adding to the complexity; we need to try and make this invisible to the application programmer (even having this as configurable options for the programmer is a problem, as the large list of options can scare them off even when they can ignore them)
hmm, the least expensive way I can think of providing options is to have the server send a message to anything that connects to it providing its capabilities, and then the sender sending which format it's using along with the log data. This allows the sysadmin to control the log format from one place (the syslog server) instead of having to configure each app.
something along the lines of a binary version of:
I'm rsyslog version X, I support JSON, LSON, XML schema A, XML schema B, EXI schema C, CEE message definition 1234
and then the sender sends the message
The sender should always be able to send raw JSON without looking at what the receiver says (to make it trivially easy for stuff to write some sort of log message), or send a type field (which should not be valid JSON, to avoid confusion) and then the data in one of the other recognized formats.
David Lang
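A sketch of the greeting David describes, with an invented framing convention: the receiver advertises its formats once, and the sender either emits bare JSON (always legal, no negotiation needed) or a format tag that can never be mistaken for JSON, followed by the payload. The greeting string, the `@` tag syntax, and the format names are all illustrative assumptions.

```python
# Receiver's one-time greeting, advertising supported formats.
GREETING = "rsyslog X; formats=JSON,LSON,XML-A,EXI-C,CEE-1234"

def advertised(greeting):
    # Parse the format list out of the greeting (invented syntax).
    return set(greeting.split("formats=")[1].split(","))

def frame(payload, fmt="JSON"):
    if fmt == "JSON":
        return payload                   # bare JSON needs no tag
    return "@%s %s" % (fmt, payload)     # '@' can never start valid JSON

formats = advertised(GREETING)
msg = frame('{"dst_ip:ipv4":"1.2.3.4"}',
            "LSON" if "LSON" in formats else "JSON")
```

This keeps the sysadmin's control in one place (the receiver's configuration), as argued above, while the default path stays trivially simple for senders.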
On Mon, 26 Mar 2012 13:14:38 -0700 (PDT) david@lang.hm wrote:
On Mon, 26 Mar 2012, Keith Robertson wrote:
I'm with you.
To summarize, there are 3 ways to convey typing information that need to be documented:
- Predefined types (i.e. uid, gid, localAddr, etc.)
- Extend the schema. Technically this is possible, but I doubt it will
be very popular. It's more work.
- Supply typing with the key:
-- JSON way: "myField:myType" : "abc123" -- XML way: <myField type='myType'>abc123</myField>
I really don't like this last option. It will make converting from one message type to another problematic, and it also makes it so that the same logical field can show up differently to JSON-aware endpoints depending on what type the sender happened to use.
especially with the numeric data, how many different types can be used to contain the number '42'? (remember to include string)
This is why I don't see any benefit from having [u]int[32|64] in the logs. So just one: integer
If you make the sender's type be part of the field name string, you are going to run into cases where the receiving software is just JSON, not LSON and it will run into problems by treating the type as part of the name.
For this reason the LSON type should be explicitly specified so that the receiver can correctly parse it.
I think that not transmitting the type is much better than doing so in a way that can confuse recipients.
As long as there is at least one format (XML, BSON etc.) which can transmit types correctly, that's ok. Though implementing support for LSON is pretty trivial, and probably easier for implementations already supporting JSON than adding XML/BSON support.
Regards, Botond
On Mon, 26 Mar 2012, Botond Botyanszki wrote:
On Mon, 26 Mar 2012 13:14:38 -0700 (PDT) david@lang.hm wrote:
On Mon, 26 Mar 2012, Keith Robertson wrote:
I'm with you.
To summarize, there are 3 ways to convey typing information that need to be documented:
- Predefined types (i.e. uid, gid, localAddr, etc.)
- Extend the schema. Technically this is possible, but I doubt it will
be very popular. It's more work.
- Supply typing with the key:
-- JSON way: "myField:myType" : "abc123" -- XML way: <myField type='myType'>abc123</myField>
I really don't like this last option. It will make converting from one message type to another problematic, and it also makes it so that the same logical field can show up differently to JSON-aware endpoints depending on what type the sender happened to use.
especially with the numeric data, how many different types can be used to contain the number '42'? (remember to include string)
This is why I don't see any benefit from having [u]int[32|64] in the logs. So just one: integer
that still leaves three options: 4, 4.0, "4"
It's not that I expect one application using one library to use more than one type; it's that I expect that different applications from different vendors (likely using different libraries and written in different languages) are likely to use different representations of data for log messages that are otherwise the same structure.
David Lang
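The three wire forms above, decoded by a standard JSON parser, make the point concrete: the sender may mean the same value each time, but the receiver gets three different types.

```python
import json

# Same logical value '4' in three wire forms, three decoded types.
decoded = [json.loads(s) for s in ('4', '4.0', '"4"')]
kinds = [type(v).__name__ for v in decoded]   # ['int', 'float', 'str']
```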
Starting a new thread here.
Recently, we've been discussing encoding types into JSON. One proposal is [1]. After thinking about this some more last night, I now think that this is a bad idea. It is bad because it would prevent an easy transformation from JSON to XML. If we want to come up with an *optional* strategy for conveying type info in JSON... fine; however, we should absolutely not do it as recommended in [1].
[1] "myField:myType" : "abc123" ----- Morphs into invalid XML---> myfield:mytypeabc123myfield:mytype/
Cheers, Keith
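Keith's objection can be checked mechanically: an XML element name must be an NCName, which forbids ':' (outside a declared namespace prefix), so a naive JSON-key-to-XML-tag conversion of the typed key produces an ill-formed tag. The regex below is a simplified ASCII approximation of the NCName production.

```python
import re

# Simplified NCName: a letter or underscore, then letters, digits, '.',
# '_' or '-'. Crucially, no ':'.
NCNAME = re.compile(r'^[A-Za-z_][A-Za-z0-9._\-]*$')

def valid_xml_tag(key):
    return bool(NCNAME.match(key))

valid_xml_tag("myField")          # plain JSON key maps cleanly to a tag
valid_xml_tag("myField:myType")   # typed key cannot become a tag name
```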
On Tue, 27 Mar 2012 08:54:35 -0400 Keith Robertson kroberts@redhat.com wrote:
Starting a new thread here.
Recently, we've been discussing encoding types into JSON. One proposal is [1]. After thinking about this some more last night, I now think that this is a bad idea. It is bad because it would prevent an easy transformation from JSON to XML. If we want to come up with an *optional* strategy for conveying type info in JSON... fine; however, we should absolutely not do it as recommended in [1].
[1] "myField:myType" : "abc123" ----- Morphs into invalid XML---> myfield:mytypeabc123myfield:mytype/
The proposal would be to call this LSON and not claim it is JSON. Anybody who blindly converts LSON into XML like that will end up with XML like that. There would be an option to output pure JSON for systems not capable of handling LSON. Other than that, what's your proposal for encoding type information with the field name+value? If we only want to support encoding types in XML, that's fine with me. The problem with this is also that producers will likely omit the type attribute.
Regards, Botond
The proposal would be to call this LSON and not claim it is JSON. Anybody who blindly converts LSON into XML like that will end up with XML like that. There would be an option to output pure JSON for systems not capable of handling LSON. Other than that, what's your proposal for encoding type information with the field name+value? If we only want to support encoding types in XML, that's fine with me. The problem with this is also that producers will likely omit the type attribute.
I don't see this as a problem
Besides, you are always going to have the issue of producers not putting type information in, or putting the incorrect type information, or putting their own custom types in.
All that I care about is that there is an approved, consistent representation syntax:
- Timestamps should be formatted according to ISO8601
- Numbers should be formatted according to the JSON spec
- IPv4 addresses should be in dotted-decimal
- IPv6 addresses should be in colon-hex
The specification must state these requirements, CEE or other test/conformance tools can check to see if they do it correctly.
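A sketch of what such a conformance check could look like; a real CEE tool would be stricter (full ISO8601 grammar, the exact JSON number grammar, and so on), and the `check` function and its type names are invented here.

```python
import re
import ipaddress
from datetime import datetime

def check(value, kind):
    """Return True if `value` matches the approved representation for `kind`."""
    if kind == "timestamp":
        try:
            # One common ISO8601 form; a real checker accepts the full grammar.
            datetime.strptime(value, "%Y-%m-%dT%H:%M:%S")
            return True
        except ValueError:
            return False
    if kind in ("ipv4", "ipv6"):
        try:
            addr = ipaddress.ip_address(value)
        except ValueError:
            return False
        wanted = ipaddress.IPv4Address if kind == "ipv4" else ipaddress.IPv6Address
        return isinstance(addr, wanted)
    if kind == "number":
        # JSON number grammar (no leading zeros, optional fraction/exponent).
        return re.match(r'^-?(0|[1-9]\d*)(\.\d+)?([eE][+-]?\d+)?$', value) is not None
    return False

check("1.2.3.4", "ipv4")
check("2012-03-27T10:10:40", "timestamp")
```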
On Tue, 27 Mar 2012 10:10:40 -0400 William Heinbockel wheinbockel@gmail.com wrote:
The proposal would be to call this LSON and not claim it is JSON. Anybody who blindly converts LSON into XML like that will end up with such an XML. There would be an option to output pure JSON for systems not capable of handling LSON. Other than that, what's your proposal for encoding type information with the field name+value? If we only want to support encoding types in XML only, that's fine with me. The problem with this is also that producers will likely omit the type attribute.
I don't see this as a problem
Besides, you are always going to have the issue of producers not putting type information
With BSON there is always type information because the format requires it. So if the source is requested to produce BSON, the receiver can be sure to get type information. With XML this is not the case, and if type attributes are omitted, the receiver can't do other than guess or look up the type from the field definition list (if there will be such a list).
Or putting the incorrect type information, or putting their own custom types in
Sure. This should be allowed to some extent because log producers would/will abuse the format anyway.
All that I care about is that there is an approved, consistent representation syntax:
- Timestamps should be formatted according to ISO8601
- Numbers should be formatted according to the JSON spec
- IPv4 addresses should be in dotted-decimal
- IPv6 addresses should be in colon-hex
The specification must state these requirements, CEE or other test/conformance tools can check to see if they do it correctly.
I agree.
Regards, Botond
On 03/27/2012 10:21 AM, Botond Botyanszki wrote:
On Tue, 27 Mar 2012 10:10:40 -0400 William Heinbockel wheinbockel@gmail.com wrote:
The proposal would be to call this LSON and not claim it is JSON. Anybody who blindly converts LSON into XML like that will end up with XML like that. There would be an option to output pure JSON for systems not capable of handling LSON. Other than that, what's your proposal for encoding type information with the field name+value? If we only want to support encoding types in XML, that's fine with me. The problem with this is also that producers will likely omit the type attribute.
I don't see this as a problem
Besides, you are always going to have the issue of producers not putting type information
With BSON there is always type information because the format requires it. So if the source is requested to produce BSON, the receiver can be sure to get type information. With XML this is not the case, and if type attributes are omitted, the receiver can't do other than guess or look up the type from the field definition list (if there will be such a list).
Or putting the incorrect type information, or putting their own custom types in
Sure. This should be allowed to some extent because log producers would/will abuse the format anyway.
All that I care about is that there is an approved, consistent representation syntax:
- Timestamps should be formatted according to ISO8601
- Numbers should be formatted according to the JSON spec
- IPv4 addresses should be in dotted-decimal
- IPv6 addresses should be in colon-hex
The specification must state these requirements, CEE or other test/conformance tools can check to see if they do it correctly.
I agree.
Regards, Botond
_______________________________________________
lumberjack-developers mailing list lumberjack-developers@lists.fedorahosted.org https://fedorahosted.org/mailman/listinfo/lumberjack-developers
To summarize:
If we request types from the application developer we have much more freedom in passing the type information on with different formats. I think the biggest takeaway is: "support as many types in the API as possible; you can drop the types in conversion if necessary". So from a C API that collects all type info (I will probably add back different numeric types) we can produce all sorts of different transport formats. This is no more than a serialization task. We will start implementing serialization in the following order:
1. JSON
2. XML
3. XML with types
4. JSON with types (LSON)
5. BSON
We can continue as much as we see the need and value. We can change the order too.
I think we discussed the types enough to understand what needs to be done at the boundary of different levels.
I will post a proposed plan in a separate mail.
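The takeaway above (collect types at the API, drop them in conversion when a format can't carry them) can be sketched in a few lines; the field list, type names, and the key:type LSON convention here follow the thread's examples and are illustrative, not the actual C API.

```python
import json

# Each field carries (name, value, type) internally.
event = [("dst_ip", "1.2.3.4", "ipv4"), ("pid", 4242, "number")]

def to_json(fields):
    # Plain JSON serializer: types are dropped in conversion.
    return json.dumps({name: value for name, value, _ in fields})

def to_lson(fields):
    # Typed JSON variant: type travels inside the key.
    return json.dumps({"%s:%s" % (name, t): value for name, value, t in fields})
```

Each additional format (XML, XML with types, BSON) is then just another serializer over the same typed field list, which is why this is "no more than a serialization task".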
On Tue, 27 Mar 2012 12:45:10 -0400 Dmitri Pal dpal@redhat.com wrote:
So from C API that will collect all type info (I will probably add back different numeric types) we can produce all sort of different transport
What's the benefit of having those numeric types? The problem with this is that these can then only be expressed as a subtype of string, unless we support subtyping other types as well. But again, if this gets too complex, it will just scare people away.
In my opinion what we need to support is the list provided by Bill:
On Mon, 26 Mar 2012 19:04:58 -0400 William Heinbockel wheinbockel@gmail.com wrote:
Extrapolating this for types, you need to support:
- string
- number
- timestamp/datetime
- IPv4/IPv6
- boolean
This, extended by 'binary' and custom types provided by the programmer, should be sufficient IMO.
formats. This is no more than a serialization task. We will start implementing serialization in the following order:
- JSON
- XML
- XML with types
- JSON with types (LSON)
- BSON
Sounds good to me.
Regards, Botond
We will start implementing serialization in the following order:
- JSON
- XML
- XML with types
- JSON with types (LSON)
- BSON
Sounds good to me.
There are too many options here, with no real rationale for them.
We need the following:
1. Compatibility with Syslog
2. Simple (preferably with library support)
3. Explicit typing
I would just narrow that list down to:
1. JSON
2. XML (with XML Schema support)
3?. BSON (this is already close to JSON, has supporting libraries, and would make a nice binary event storage format)
I see no reason for LSON. You want types, use XML or BSON.
On 03/27/2012 01:47 PM, William Heinbockel wrote:
We will start implementing serialization in the following order:
- JSON
- XML
- XML with types
- JSON with types (LSON)
- BSON
Sounds good to me.
There are too many options here, with no real rationale for them.
We need the following:
- Compatibility with Syslog
- Simple (preferably with library support)
- Explicit typing
I would just narrow that list down to:
- JSON
+1
- XML (with XML Schema support)
+1
3?. BSON (this is already close to JSON, has supporting libraries, and would make a nice binary event storage format)
Meh. Let's base this on adoption. If there is a need then yeah, but let's keep it simple.
I see no reason for LSON. You want types, use XML or BSON.
Seems simple enough. Let's start with it.
On 03/27/2012 04:53 PM, Keith Robertson wrote:
On 03/27/2012 01:47 PM, William Heinbockel wrote:
We will start implementing serialization in the following order:
- JSON
- XML
- XML with types
- JSON with types (LSON)
- BSON
Sounds good to me.
There are too many options here, with no real rationale for them.
We need the following:
- Compatibility with Syslog
- Simple (preferably with library support)
- Explicit typing
I would just narrow that list down to:
- JSON
+1
- XML (with XML Schema support)
+1
You mean with explicit types?
3?. BSON (this is already close to JSON, has supporting libraries, and would make a nice binary event storage format)
Meh. Let's base this on adoption. If there is a need then yeah, but let's keep it simple.
I see no reason for LSON. You want types, use XML or BSON.
Seems simple enough. Let's start with it.
On 03/27/2012 07:32 PM, Dmitri Pal wrote:
On 03/27/2012 04:53 PM, Keith Robertson wrote:
On 03/27/2012 01:47 PM, William Heinbockel wrote:
We will start implementing serialization in the following order:
- JSON
- XML
- XML with types
- JSON with types (LSON)
- BSON
Sounds good to me.
There are too many options here, with no real rationale for them.
We need the following:
- Compatibility with Syslog
- Simple (preferably with library support)
- Explicit typing
I would just narrow that list down to:
- JSON
+1
- XML (with XML Schema support)
+1
You mean with explicit types?
The XSD has some "canned" keys/elements (i.e. pid, uid, etc.) which are explicitly typed. Further, users wishing to express events in XML format can provide their own extensions that are either typed or un-typed, as both options are valid in XSD (see [1]).
For JSON, I'm on the fence. The language doesn't really support typing, but if we want to add some sort of optional attribute to convey it then fine. This optional attribute should not prevent a direct translation to XML, IMHO. In short, a JSON event should be translatable to an XML event with minimal effort.
[1]
<complexType name="IPTablesTypeDenyType">
  <complexContent>
    <extension base="Q1:BaseProfileType">
      <sequence>
        <element name="HWAddr" minOccurs="1" maxOccurs="1" type="Q1:macAddressType" />
        <element name="AnUntypedElementThatIsValid" minOccurs="1" maxOccurs="1" />
      </sequence>
    </extension>
  </complexContent>
</complexType>
On 03/28/2012 09:16 AM, Keith Robertson wrote:
The XSD has some "canned" keys/elements (i.e. pid, uid, etc.) which are explicitly typed. Further, users wishing to express events in XML format can provide their own extensions that are either typed or un-typed, as both options are valid in XSD (see [1]).
For JSON, I'm on the fence. The language doesn't really support typing, but if we want to add some sort of optional attribute to convey it then fine. This optional attribute should not prevent a direct translation to XML, IMHO. In short, a JSON event should be translatable to an XML event with minimal effort.
[1]
<complexType name="IPTablesTypeDenyType">
  <complexContent>
    <extension base="Q1:BaseProfileType">
      <sequence>
        <element name="HWAddr" minOccurs="1" maxOccurs="1" type="Q1:macAddressType" />
        <element name="AnUntypedElementThatIsValid" minOccurs="1" maxOccurs="1" />
      </sequence>
    </extension>
  </complexContent>
</complexType>
OK this is what I thought. The types are in the dictionary not in the XML itself but there was a conversation to be able to add type attribute or type element into the XML. I think Bill was bringing it up. How is it related? I am a bit lost there.
On Wed, Mar 28, 2012 at 11:52 AM, Dmitri Pal dpal@redhat.com wrote:
OK this is what I thought. The types are in the dictionary not in the XML itself but there was a conversation to be able to add type attribute or type element into the XML. I think Bill was bringing it up. How is it related? I am a bit lost there.
There are two different (but related) discussions here: field types defined a priori vs. defined at runtime.
The problem with XML Schema is that the element name-type binding is external and must be defined a priori. So you either need to build the bindings directly into your code (via JAXB) or provide some way of resolving the type at runtime.
My thought was that the XML Schema would define the field type and profile structure. This would be able to validate the format and content of the event. The @type attribute would be optional, and provide a way of explicitly defining (or casting) the value type so that it could be processed at runtime.
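As a sketch of that optional cast, the schema could declare the attribute on the element itself. This is a hypothetical fragment, not taken from the published XSD:

```xml
<xs:element name="dst_ip">
  <xs:complexType>
    <xs:simpleContent>
      <xs:extension base="xs:string">
        <!-- Optional runtime cast/override. XML Schema cannot validate
             the value against this attribute; that check would happen
             in application code or XSLT, as noted above. -->
        <xs:attribute name="type" type="xs:string" use="optional"/>
      </xs:extension>
    </xs:simpleContent>
  </xs:complexType>
</xs:element>
```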
I have a schema structure to support some of this. I just haven't found the cycles to consolidate it with Keith's published version.
On 03/27/2012 01:01 PM, Botond Botyanszki wrote:
On Tue, 27 Mar 2012 12:45:10 -0400 Dmitri Pal dpal@redhat.com wrote:
So from the C API that will collect all type info (I will probably add back different numeric types) we can produce all sorts of different transport
What's the benefit of having those numeric types? The problem with this is that these can then only be expressed as a subtype of string, unless we support subtyping other types as well. But again, if this gets too complex, it will just scare away people.
I think we can have those types supported by the API and then decide how we want to pass them on. It will not make the interface more complex but would allow us to decide later if we need it or not. Maybe we do not see something, and once the API gets released we will have a hard time changing it or adding new types.
In my opinion what we need to support is the list provided by Bill: On Mon, 26 Mar 2012 19:04:58 -0400 William Heinbockel wheinbockel@gmail.com wrote:
Extrapolating this for types, you need to support:
- string
- number
- timestamp/datetime
- IPv4/IPv6
- boolean
This, extended by 'binary' and custom types provided by the programmer, should be sufficient IMO.
formats. This is no more than a serialization task. We will start implementing serialization in the following order:
- JSON
- XML
- XML with types
- JSON with types (LSON)
- BSON
Sounds good to me.
Regards, Botond
Dmitri Pal dpal@redhat.com writes:
If we request types from the application developer we have much more freedom in passing the type information on with different formats. I think the biggest takeaway is: "support as many types in the API as possible; you can drop the types in conversion if necessary".
+1
So from the C API that will collect all type info (I will probably add back different numeric types) we can produce all sorts of different transport formats. This is no more than a serialization task. We will start implementing serialization in the following order:
- JSON
- XML
I think these two are enough. JSON with types is, well... awkward, and unnecessary until there's a demand for it (and I don't expect there to be any). If you want custom types, or more than what JSON supports, there are better options - not necessarily text-based options, but still.
On Tue, Mar 27, 2012 at 8:54 AM, Keith Robertson kroberts@redhat.com wrote:
Starting a new thread here.
Recently, we've been discussing encoding types into JSON. One proposal is [1]. After thinking about this some more last night, I now think that this is a bad idea. It is bad because it would prevent an easy transformation from JSON to XML. If we want to come up with an *optional* strategy for conveying type info in JSON... fine; however, we should absolutely not do it as recommended in [1].
[1] "myField:myType" : "abc123" ----- Morphs into invalid XML ---> <myField:myType>abc123</myField:myType>
Exactly! This is why there was a proposal to call that format LSON -- it is syntactically compatible with JSON, but has additional information that must be properly handled.
Frankly, I still don't see a need for LSON or supporting explicit typing in JSON -- you have BSON and XML if you want explicit types.
In addition, if the type is also specified along with the value, it would override the default and give a hint to the reader/parser if the field is not in the list of well-known fields, i.e.: { ... "somUserDefinedFieldNotInTheXSD:someUserDefinedType" : "abc123" } This is not necessarily JSON/LSON but can be XML with the 'type' attribute missing or present.
OK. I don't really have a problem with this. Seems like a reasonable way to convey typing for JSON. However, it wouldn't work for XML...
<p:asdf:jkl>look ma no hands</p:asdf:jkl> <-- That element name is invalid. AFAIK.
You would be correct
I guess we should document that this is a legal option for JSON in the specification.
Bill, suggestions on verbiage... if you agree?
The problem is that "somUserDefinedFieldNotInTheXSD:someUserDefinedType", while it is valid JSON, will not be handled properly by most (READ: all) JSON libraries. Therefore, we must have some way of informing the application to parse it as "LSON" vs. "JSON", and provide specifications for what constitutes valid LSON.
Anything transmitted via LSON MUST be parsed either directly by an LSON parser *or* by a JSON parser and then again by the application in order to separate the type name from the field value.
As you said, "somUserDefinedFieldNotInTheXSD:someUserDefinedType" cannot be translated directly into XML, otherwise you'll have issues with XML namespaces and the field name will not align with that of the XML Schema. Instead, it must be parsed into <somUserDefinedFieldNotInTheXSD type="someUserDefinedType">
I have no problem pursuing this route with the following conditions:
1. We support JSON and LSON
2. LSON is syntax-compatible with JSON (i.e., can be parsed by any JSON parser), but we treat these as two distinct syntaxes
3. There is a way to distinguish LSON and JSON without having to parse the document (e.g., @json vs. @lson)
4. We define the valid types and formats in the LSON specification and do not allow for user-defined value types
On 03/26/2012 03:59 PM, Heinbockel, Bill wrote:
In addition, if the type is also specified along with the value, it would override the default and give a hint to the reader/parser if the field is not in the list of well-known fields, i.e.: { ... "somUserDefinedFieldNotInTheXSD:someUserDefinedType" : "abc123" } This is not necessarily JSON/LSON but can be XML with the 'type' attribute missing or present.
OK. I don't really have a problem with this. Seems like a reasonable way to convey typing for JSON. However, it wouldn't work for XML...
<p:asdf:jkl>look ma no hands</p:asdf:jkl> <-- That element name is invalid. AFAIK.
You would be correct
I guess we should document that this is a legal option for JSON in the specification.
Bill, suggestions on verbiage... if you agree?
The problem is that "somUserDefinedFieldNotInTheXSD:someUserDefinedType", while it is valid JSON, will not be handled properly by most (READ: all) JSON libraries. Therefore, we must have some way of informing the application to parse it as "LSON" vs. "JSON", and provide specifications for what constitutes valid LSON.
Anything transmitted via LSON MUST be parsed either directly by an LSON parser *or* by a JSON parser and then again by the application in order to separate the type name from the field value.
As you said, "somUserDefinedFieldNotInTheXSD:someUserDefinedType" cannot be translated directly into XML, otherwise you'll have issues with XML namespaces and the field name will not align with that of the XML Schema. Instead, it must be parsed into <somUserDefinedFieldNotInTheXSD type="someUserDefinedType">
I have no problem pursuing this route with the following conditions:
- We support JSON and LSON
- LSON is syntax-compatible with JSON (i.e., can be parsed by any JSON parser), but we treat these as two distinct syntaxes
- There is a way to distinguish LSON and JSON without having to parse the document (e.g., @json vs. @lson)
- We define the valid types and formats in the LSON specification and do not allow for user-defined value types
+1
I took a very first stab at the interface. This is just a header, no actual code behind it. See attached. I will need to add more examples but I would rather just build unit tests that act as examples. Kills two birds with one stone.
Questions:
1) Am I conceptually on the right path?
2) Is the naming convention OK?
3) What other capabilities do we want to have in this library? Does it need to write to the console or a file? If so I would need to have something like a selog_init_file() function, for example.
4) What about the other fields? Which of those need to be automatically resolved if not provided in user input and which should not? I plan to control it with flags that are currently TBD, but what is the default set of those that need to be added and those that should be left out? Please comment on each.
ppid (long): The numeric parent process identifier of the process that generated the event record.
uid (long): The numeric user identifier of the process that generated the event record.
gid (long): The numeric group identifier of the process that generated the event record.
I can resolve these; should I do it, and do it once? Maybe add a function like selog_refresh() that would reread those in cases when the process changes privileges?
tid (long) : The numeric thread identifier of the process that generated the event record.
should I resolve it?
proc (string): The name of the process that produced the event record. The process should belong to the application identified by app field. Example: java
How is it different from "app"?
tty (string): The TeleTYpe. Example (pts/14)
IMO should be user supplied?
localHostName (hostnameType): The hostname of the local system that generated the event record.
localAddr (ipAddressType): The IP of the local system that generated the event record.
IMO should be resolved once. Would be refreshable by selog_refresh();
foreignHostName (hostnameType): The hostname of the remote system involved in the event record.
foreignAddr (ipAddressType): The IP of the remote system that generated the event record.
localPort (int): The local port.
foreignPort (int): The remote port.
username (string): The user involved in the event.
classname (string): The class or source file that produced the event.
methodname (string): The method or function that produced the event.
url (anyURI): The URL involved in the event.
IMO user provided.
Do I get it right?
On 03/26/2012 04:15 PM, Dmitri Pal wrote:
Do I get it right?
Here is another revision. I added an IP address with subtypes to illustrate.
On Mon, 26 Mar 2012 16:25:59 -0400 Dmitri Pal dpal@redhat.com wrote:
Do I get it right?
Here is another revision. I added an IP address with subtypes to illustrate.
Here are my observations:
#define SELOG_TYPE_I32 1
#define SELOG_TYPE_U32 2
#define SELOG_TYPE_I64 3
#define SELOG_TYPE_U64 4
I see no use of these if the type is not used in the output:
error = selog_kvp(LOG_INFO, "field2", SELOG_TYPE_I32, -5,
... "field2": -5,
I'd use only SELOG_TYPE_INTEGER. For adding numeric kvps, you can cast any 32-bit integer to int64_t. The only problem here would be uint64_t overflowing. But such a large unsigned integer would present a problem to most JSON parsers anyway, because they all use int64_t or int.
/* Function to modify an event, can be used to remove or update values */
int selog_event_modify(selog_e event, ...);
Can this add a single kvp? How can you iteratively build an event? This is not sufficient for that:
error = selog_kvp(LOG_INFO,
    "field1", SELOG_TYPE_STRING, "my string",
    "field2", SELOG_TYPE_I32, -5,
    "peer_ip", SELOG_TYPE_IPV4STR, "192.168.0.1",
    "subset!a", SELOG_TYPE_BOOLSTR, "yes",
    "subset!b", SELOG_TYPE_BIN, 10, bindata,
    NULL);
Regards, Botond
On 03/26/2012 04:42 PM, Botond Botyanszki wrote:
On Mon, 26 Mar 2012 16:25:59 -0400 Dmitri Pal dpal@redhat.com wrote:
Do I get it right?
Here is another revision. I added an IP address with subtypes to illustrate.
Here are my observations:
#define SELOG_TYPE_I32 1
#define SELOG_TYPE_U32 2
#define SELOG_TYPE_I64 3
#define SELOG_TYPE_U64 4
I see no use of these if the type is not used in the output:
error = selog_kvp(LOG_INFO, "field2", SELOG_TYPE_I32, -5,
... "field2": -5,
I'd use only SELOG_TYPE_INTEGER. For adding numeric kvps, you can cast any 32-bit integer to int64_t. The only problem here would be uint64_t overflowing. But such a large unsigned integer would present a problem to most JSON parsers anyway, because they all use int64_t or int.
OK, removed. Types updated.
/* Function to modify an event, can be used to remove or update values */
int selog_event_modify(selog_e event, ...);
Can this add a single kvp?
Yes.
How can you iteratively build an event?
I added another example at the bottom that shows how the event can be gradually built. New version attached.
Another thing that we talked about is a message attribute. I have not added it yet. I will do it a bit later.
Regards, Botond
On 03/26/2012 05:12 PM, Dmitri Pal wrote:
I added another example at the bottom that shows how the event can be gradually built. New version attached.
Another thing that we talked about is a message attribute. I have not added it yet. I will do it a bit later.
Here is another version:
1) Added the cleanup function to free the event
2) Updated the second example to illustrate the steps. Logic:
a) Initialize
b) Create an event that will store common data that will be reused in some part of the code
c) Create an event by extending a reusable set called "common"
d) Modify the event before logging
e) Log the event together with yet another KVP
f) Free the data
g) Finalize
Makes sense?
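In pseudocode against the interface sketched in this thread (the real header was an attachment, so every name and signature below is a guess), the steps might read:

```
selog_init();                                              /* a) */
selog_event_create(&common,                                /* b) */
    "app", SELOG_TYPE_STRING, "myapp", NULL);
selog_event_create_from(&event, common);                   /* c) */
selog_event_modify(event,                                  /* d) */
    "field2", SELOG_TYPE_INTEGER, -5, NULL);
selog_log(LOG_INFO, event,                                 /* e) */
    "username", SELOG_TYPE_STRING, "alice", NULL);
selog_event_free(event);                                   /* f) */
selog_event_free(common);
selog_fini();                                              /* g) */
```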
On 03/26/2012 05:28 PM, Dmitri Pal wrote:
Makes sense?
I will let this sit on the list for a couple of days for the next round of comments. I will get back to it on Wednesday.
-- Thank you, Dmitri Pal
Sr. Engineering Manager IPA project, Red Hat Inc.
Looking to carve out IT costs? www.redhat.com/carveoutcosts/
On 03/26/2012 01:36 PM, William Heinbockel wrote:
I don't see what the issue is. The underlying data store doesn't have to support all of the different types; it is up to the application to make sure things are stored/recalled correctly.
Agree. This is the responsibility of the marshaller to decide the resulting type of the "key".
JSON isn't strongly typed while XML is (or can be). We shouldn't try to force JSON to convey this meta-data if it wasn't designed to. It would end up as a clunky bolt-on.
Finally, I think that we can and should provide a taxonomy for certain well known keys (e.g. port, hostname, etc.). This will make it easier for log consumers to know what they're actually receiving and make intelligent decisions about how to parse it.
On Sat, 24 Mar 2012, Botond Botyanszki wrote:
On Sat, 24 Mar 2012 12:28:25 -0400 Dmitri Pal dpal@redhat.com wrote:
- Map them to the C types:
- int32, int64, uint32, uint64, double - map one to one to C
What's the benefit of having both int32 and int64 and unsigned versions? I still think that this is overkill instead of using only a single 'number' type, which would be int64. Just like in JSON. For the cases where this is not optimal, the custom types can solve this problem.
- The number does not fit into signed 64 bit: create a special type (bigint) and use that.
- The number is a lot smaller (e.g. enum) and 64 bit is considered a waste for its storage: create a special type (such as severitylevel) where you know that the value can fit in a smaller storage (e.g. uint8).
Having this many numeric types also complicates code when you need to implement operations with these.
the big advantage of supporting all the different types as input to the logging call is the ease of getting data into the log from the programmer's point of view (by eliminating the need to do type conversion first)
David Lang
On Sun, Mar 25, 2012 at 9:11 PM, david@lang.hm wrote:
the big advantage of supporting all the different types as input to the logging call is the ease of getting data into the log from the programmer's point of view (by eliminating the need to do type conversion first)
David Lang
I think that supporting arbitrary precision natively in a log library is overkill and will probably lead to issues later on. Also, I fail to see a good use case where such functionality is necessary.
You also run into issues with conformance with existing implementations. Most JSON and XML parsers have explicit limits on their floating point and integer handling abilities. For instance, you will lose precision for any integer with >52 bit precision because the underlying JSON number type is double in a large percentage of libraries.
I do not mind supporting a single number type (which will be supported in the API as either a char * or a double) or supporting several native int/uint choices.
Here are the rules I suggest we follow:
- Identify the types of data we commit to logically recognize at the
API level - the list above is a good starting point. These types are known at the API level. Our goal is to preserve these types in all transformations otherwise IMO we did not accomplish our goal. Dropping types in transition that have been clearly identified at the API level just because we chose transport that does not support it is a non-starter.
- Map them to the C types:
- int32, int64, uint32, uint64, double - map one to one to C
- bool - int or char (implementation detail)
- string - null-terminated string, i.e. char * with the length calculated based on the NUL terminator
- octet array - void *, and it needs to keep the length; length is a required argument at the API level, BTW
- timestamp - is actually two parts but I would argue that timestamp is added at the moment of logging automatically so it is not exposed at the API level but kept internally as pair of numeric fields (msec since epoch, tzone offset)
This holds only for one timestamp. If I have an event that lasts for a non-trivial period of time, I may want to have start/end timestamps, or I may wish to include something such as the sent time of an email or another external timestamp.
All the rest:
- ipv4, ipv6, mac
- UUID/GUID
- host/fqdn
These are subtypes of string, meaning that at the API level they will be passed to us not as a distinct type but as char * data.
I would imagine that we would want the IPv4/IPv6 handled as binary, same with mac and UUIDs.
Anything else that we do not recognize as a subtype has to be expressed with existing type
Right, which would be char *
- If the transport format (JSON, XML, you name it) does not support the
type we chose follow a convention of expressing subtypes via the name of the attribute.
I vote to drop the type information when expressed in JSON (or at least make it optional). Keep it simple.
So in JSON serialization I will add the suffix and in XML it can be a type element or a type attribute as Bill suggested.
I plan to make a first stab at this API over the weekend.
On 03/26/2012 10:42 AM, William Heinbockel wrote:
I vote to drop the type information when expressed in JSON (or at least make it optional). Keep it simple.
I agree that typing information should be optional. If we want to come up with a way to specify typing in JSON... fine. It really doesn't matter to me how we do it as long as it is *easy* and existing JSON parsers can handle it. I'm testing "net.sf.json" with the suggested format in [1] and will let you know how easy it is to parse.
The most important thing that we should all keep in mind here is *easy*. Syslog is successful because it is easy. log4j is successful because it is easy. If lumberjack/CEE is hard it will be DOA and developers will shun it.
[1] { ... "myKey:myType", "abc123" }
On Mon, 26 Mar 2012, Keith Robertson wrote:
On 03/26/2012 10:42 AM, William Heinbockel wrote:
I vote to drop the type information when expressed in JSON (or at least make it optional). Keep it simple.
I agree that typing information should be optional. If we want to come up with a way to specify typing in JSON... fine. It really doesn't matter to me how we do it as long as it is *easy* and existing JSON parsers can handle it. I'm testing "net.sf.json" with the suggested format in [1] and will let you know how easy it is to parse.
The most important thing that we should all keep in mind here is *easy*. Syslog is successful because it is easy. log4j is successful because it is easy. If lumberjack/CEE is hard it will be DOA and developers will shun it.
+1 is too mild a way of expressing agreement with this.
<soapbox>
This overcomplication is the biggest thing I am worried about. We need to make it as easy to use as possible. The fact that most people don't care about the logs means that we have an uphill fight to start with to get people to change how they log and comply with some standard way of doing things.
Perversely, this is why I think that the memory-structure logging API needs to support so many different types: to make it as trivial as possible for people to take data they already have in variables and hand it off to the logging system in a structured manner. If there were a way for the C function call to autodetect the type of the variable a pointer points at and just 'do the right thing', I would not care much there either.
In many ways, everyone on this list is unqualified to evaluate this, because all of us think that logs are very important and need to be analyzed. As a result, our natural bias is to optimize for the consumption of logs. The thing that we all need to keep in mind is that if we don't make it easy enough for people to comply with our requirements, they just won't and the result will be useless.
I've seen Java apps where the logging becomes useless because the programmers writing a log message fail to comply with the Java standard of indentation in a multi-line log message. I know of at least one case where such a problem existed for several years with nobody caring, but it makes automated handling of the log messages just about impossible, as it breaks the fundamental way of telling where one log entry ends and the next begins.
We won't have this particular problem, but if we aren't careful, people will start making 'structured' logs that have the header info and then one field called "errortext" that contains all the ugly, unstructured stuff that they used to dump into syslog or into their flat logfile. Take a look at what Windows programmers do when sending messages to the eventlog as an example of this problem in practice.
</soapbox>
David Lang
On Mon, Mar 26, 2012 at 5:48 PM, david@lang.hm wrote:
On Mon, 26 Mar 2012, Keith Robertson wrote:
On 03/26/2012 10:42 AM, William Heinbockel wrote:
I vote to drop the type information when expressed in JSON (or at least make it optional). Keep it simple.
I agree that typing information should be optional. If we want to come up with a way to specify typing in JSON... fine. It really doesn't matter to me how we do it as long as it is *easy* and existing JSON parsers can handle it. I'm testing "net.sf.json" with the suggested format in [1] and will let you know how easy it is to parse.
The most important thing that we should all keep in mind here is *easy*. Syslog is successful because it is easy. log4j is successful because it is easy. If lumberjack/CEE is hard it will be DOA and developers will shun it.
+1 is too mild a way of expressing agreement with this.
<soapbox>
This overcomplication is the biggest thing I am worried about. We need to make it as easy to use as possible. The fact that most people don't care about the logs means that we have an uphill fight to start with to get people to change how they log and comply with some standard way of doing things.
The easiest way is to utilize the absolute minimal set of types necessary and existing protocols/parsers. Let the applications handle the type inferencing, etc.
Perversely, this is why I think that the memory-structure logging API needs to support so many different types: to make it as trivial as possible for people to take data they already have in variables and hand it off to the logging system in a structured manner. If there were a way for the C function call to autodetect the type of the variable a pointer points at and just 'do the right thing', I would not care much there either.
I agree that the API needs to be easy to use. That is what I love about Gergely's `lumberlog` -- it is very minimal. No worrying about types, etc. Let the calling application do that.
In many ways, everyone on this list is unqualified to evaluate this, because all of us think that logs are very important and need to be analyzed. As a result, our natural bias is to optimize for the consumption of logs. The thing that we all need to keep in mind is that if we don't make it easy enough for people to comply with our requirements, they just won't and the result will be useless.
I've talked to enough major SIEM/log mgmt vendors to know that they already have parsers and do enough mangling and normalization of data structures that it wouldn't matter how many types you threw in, they would still have the same amount of effort to support them.
The major consuming use cases are:
* text search (strcmp, starts-with, ends-with, regex...)
* numeric comparison (!= > < >= <= ==)
* time ranges
* IPv4/IPv6 CIDR address ranges
* Boolean test (e.g., where user account is locked)
Extrapolating this for types, you need to support:
* string
* number
* timestamp/datetime
* IPv4/IPv6
* boolean
William Heinbockel
On Mon, 26 Mar 2012 19:04:58 -0400 William Heinbockel wheinbockel@gmail.com wrote:
I've talked to enough major SIEM/log mgmt vendors to know that they already have parsers and do enough mangling and normalization of data structures that it wouldn't matter how many types you threw in, they would still have the same amount of effort to support them.
The major consuming use cases are:
- text search (strcmp, starts-with, ends-with, regex...)
- numeric comparison (!= > < >= <= ==)
- time ranges
- IPv4/IPv6 CIDR address ranges
- Boolean test (e.g., where user account is locked)
Extrapolating this for types, you need to support:
- string
- number
- timestamp/datetime
- IPv4/IPv6
- boolean
Great summary. BTW we have the exact same types in nxlog, with the addition of the rarely used 'binary' type.
Regards, Botond
On 03/26/2012 05:48 PM, david@lang.hm wrote:
Adding type is just work at this point. I see both sides of the argument. I would think that we need to have the default int and the other ints expressed as subtypes. But this can be added. I am more concerned whether the interface I suggested is simple enough.
IMO there should be different levels of the interface:
1) Simple calls: open with defaults, close, log data
2) Advanced calls that allow reusing of the sets
3) Really advanced calls that allow fine tuning and optimization
This fits all tastes, and allows applications to evolve if needed.
David Lang
_______________________________________________
lumberjack-developers mailing list
lumberjack-developers@lists.fedorahosted.org
https://fedorahosted.org/mailman/listinfo/lumberjack-developers
On 03/24/2012 07:18 AM, david@lang.hm wrote:
On Fri, 23 Mar 2012, Dmitri Pal wrote:
What do you think?
it's redundant to have TYPE_* and the type as part of the field name, we should do one or the other, not both.
OK, works for me, but how would it look over the wire in JSON? Should the serialization then add the type suffix to overcome the serialization limitations?
I would say that if the protocol (in this case JSON) doesn't support a specific type, the type information is silently lost.
string derived types will devolve to string
number types will devolve to number
David Lang
No I do not like it. It defeats the purpose of what we are doing here.
On Thu, 22 Mar 2012 15:36:39 -0700 (PDT) david@lang.hm wrote:
I disagree with the idea that the integer must fit in a 64-bit field, and am not thrilled at float being a separate type (although with the errors that can creep in with conversion to/from float it may be the right thing to do)
there will be larger numbers to pass than will fit in 64 bits, and forcing the software to handle string-typed values to deal with such numbers seems like a bad idea.
There can be a bigint type in addition to the standard 64 bit. Making everything bigint would impose a performance overhead.
Other vendor/app specific types would default to string if the type is not known.
Yes, this is a good fallback, but I think we should really discourage vendor/app specific types.
Why discourage? There are countless examples of such data types (e.g. zip code, postal address, etc). This gives the vendor a chance to validate and/or treat this kind of data specially in their code, yet elsewhere it is perfectly fine to treat it as a string.
In addition to the above CEE types, CEE would define a list of field names and their default types. If there is no explicit type information in the JSON/XML log, the code can look up the type of the field from the list and use it as default.
personally, I am doubtful as to the value of the CEE message types. I'm all in favor of structured logging, but I have doubts as to the possibility of standardizing the messages. If it happens, great, but I expect that if they do structured logging at all, many vendors will end up with their own message types.
I thought the goal of the CEE standardization effort is to solve this too.
Regards, Botond
On Fri, 23 Mar 2012, Botond Botyanszki wrote:
On Thu, 22 Mar 2012 15:36:39 -0700 (PDT) david@lang.hm wrote:
I disagree with the idea that the integer must fit in a 64-bit field, and am not thrilled at float being a separate type (although with the errors that can creep in with conversion to/from float it may be the right thing to do)
there will be larger numbers to pass than will fit in 64 bits, and forcing the software to handle string-typed values to deal with such numbers seems like a bad idea.
There can be a bigint type in addition to the standard 64 bit. Making everything bigint would impose a performance overhead.
valid point. This is a trade-off between performance overhead of making everything bigint and the complexity of lots of different number types.
Other vendor/app specific types would default to string if the type is not known.
Yes, this is a good fallback, but I think we should really discourage vendor/app specific types.
Why discourage? There are countless examples of such data types (e.g. zip code, postal address, etc). This gives the vendor a chance to validate and/or treat this kind of data specially in their code, yet elsewhere it is perfectly fine to treat it as a string.
The issue is that this tends to lead to vendor lock-in, because the vendor's product only works with the same vendor's message passing protocol and only with the same vendor's log analysis tools
In addition to the above CEE types, CEE would define a list of field names and their default types. If there is no explicit type information in the JSON/XML log, the code can look up the type of the field from the list and use it as default.
personally, I am doubtful as to the value of the CEE message types. I'm all in favor of structured logging, but I have doubts as to the possibility of standardizing the messages. If it happens, great, but I expect that if they do structured logging at all, many vendors will end up with their own message types.
I thought the goal of the CEE standardization effort is to solve this too.
That's the goal, I just am skeptical of the success.
David Lang
On Fri, 23 Mar 2012 14:47:54 -0700 (PDT) david@lang.hm wrote:
Other vendor/app specific types would default to string if the type is not known.
Yes, this is a good fallback, but I think we should really discourage vendor/app specific types.
Why discourage? There are countless examples of such data types (e.g. zip code, postal address, etc). This gives the vendor a chance to validate and/or treat this kind of data specially in their code, yet elsewhere it is perfectly fine to treat it as a string.
The issue is that this tends to lead to vendor lock-in, because the vendor's product only works with the same vendor's message passing protocol and only with the same vendor's log analysis tools
If unknown types default to string, then this should not present a problem. It all depends on how the value is represented as a string. If it is human readable (think of a postal address), then it is still mostly useful to a receiving system that does not know how to handle the value (break it into state, city, street, etc.) other than treating it as a single string. If a vendor wants to avoid this, they will create an encrypted binary hexdump, send that as a string and not publish the format. In my opinion custom types provide more flexibility; vendor lock-in stems from elsewhere.
Regards, Botond