Machine-readable text format for log and config files

Alexander Sauta demosito at gmail.com
Fri Feb 17 08:05:00 UTC 2012


Hello All.

I'm just a Russian GNU/Linux user and I would like to propose
one simple idea. I'll try to be as brief as possible.

A few months ago a proposal for a new binary log system coupled with
systemd was made by Lennart Poettering. It sparked numerous emotional
discussions in the Russian Linux community; a lot of users saw a
severe violation of the Unix way there and thus would not like to
adopt it.

The new log system addresses the issues outlined by Lennart in his
intro at https://docs.google.com/document/pub?id=1IC9yOXj7j6cdLLxWEBAGRL6wl97tFxgjLUEHIX3MSTs
At first I wanted to make a comparison against the proposed approach,
but I don't want this message to be too long. However, one can state
that all the goals mentioned there can be achieved with my method,
combined with some external tools if necessary.

I'd like to propose a possibly less disruptive but still efficient
solution to this problem: machine-readable text logs. The most famous
such format is, without doubt, JSON. I will use it to illustrate my
ideas, though it is by no means the best one for this purpose.

Long story short, here is a possible single log entry for Apache:

{ "date": "2011-11-23 23:25:36.0545 +0400", "pid": 2104,
  "name": "apache2", "severity": 1,
  "sha1": "2a5162d0e83756bd559e13d13a5a7651fbe0d068", ...
  "msg": { "ip": "127.0.0.1", "req": "GET", "file": "index.html",
           "http_version": 1.0, "result": 200 } }

Note that this is a complete JSON object consisting of two parts: the
first is the syslog header, the second is an application-defined
message enclosed in the "msg" object. This allows traditional tools
like grep to still be used for log processing; not much changes here.
And you can still read the log with the naked eye.

What makes the difference is that we have created structured objects
within a simple text file and can now build automated tools to analyze
the log of any program. Something like this:

jparser "SELECT date, msg.file FROM apache.log WHERE msg.result = 200
AND date > 2011-11-22"

So we can query log files directly with a tool similar to XPath for
JSON (JSONPath). But probably a better solution would be to import the
log into a NoSQL database and use all its enormous power and
performance for further log processing. Note that this requires almost
zero effort, since all such tools already exist.
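To make this concrete, here is a minimal sketch of what such a query
tool could do internally, written in Python with the standard json
module. The jparser tool, the log schema, and the lexicographic date
comparison are all assumptions for illustration, not an existing
implementation:

```python
import json

def query_log(path, min_date="2011-11-22"):
    """Scan a JSON-lines log and yield (date, msg.file) for entries
    with msg.result == 200 and date > min_date, mimicking the
    hypothetical jparser SELECT above. Dates compare correctly as
    strings because of the fixed YYYY-MM-DD layout."""
    with open(path) as log:
        for line in log:
            entry = json.loads(line)
            msg = entry.get("msg", {})
            if msg.get("result") == 200 and entry["date"] > min_date:
                yield entry["date"], msg.get("file")
```

Since each line is parsed independently, a truncated final line would
at worst lose one entry, not the whole file.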

Advantages over plain-text log files:
- Easy and unambiguous parsing
- Traditional processing can be used more effectively when combined
with some preprocessing
- Strict and formal structure for the most important fields
- Unified format for logs, config files, utility output, etc.
- Queries can be made directly against the log file
- Applications can easily add their own fields with a mere printf()
without breaking any processing tools
- The log can be directly imported into a powerful NoSQL database
- Binary blobs can be unobtrusively added as base64-encoded fields
- Multiple logs can be processed at once with something similar to the
SQL JOIN operator
- An index can easily be built separately for fast searching
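As a small illustration of how cheap emission would be, here is a
sketch in Python that produces one such single-line entry; the header
field names follow the Apache example above and are purely
illustrative, not a fixed schema:

```python
import json, os, time

def emit_log_entry(name, severity, msg):
    """Build one log record as a single JSON line; the header field
    names mirror the example entry above and are illustrative only."""
    record = {
        "date": time.strftime("%Y-%m-%d %H:%M:%S %z"),
        "pid": os.getpid(),
        "name": name,
        "severity": severity,
        "msg": msg,
    }
    return json.dumps(record)

line = emit_log_entry("apache2", 1,
                      {"ip": "127.0.0.1", "req": "GET", "result": 200})
```

A C program could produce the same line with a single printf(), which
is exactly the point: no logging library is required to stay
machine-readable.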

Advantages over the proposed binary log files:
- Backward compatibility (mostly) with existing tools
- Much simpler to emit
- More reliable, and can be fixed with just vim
- Existing syslogs don't even need to be heavily modified
- Direct import into existing fast and powerful NoSQL databases
(almost no conversion required)
- The format is easily extensible without breaking any compatibility
- Applications can write logs without any external libraries, with a
mere printf()
- More advanced goals like clustering can still be achieved with
external methods without bloating the core subsystem

But we can extend this even further and use such a format not only for
log files: since it can describe arbitrary data structures, it can be
used for configuration files and in command output. I was once looking
through the code of the Linux utilities that take their data from
procfs, and the implementation made me cry: each utility has to write
its own buggy and redundant parser for an arbitrary file format.
Whereas if we had a unified text format for data representation, the
parsing would be something like

int free_mem = jparser_get_num ("/proc/meminfo", "MemFree");
int free_pages = jparser_get_num ("/proc/vmstat", "nr_free_pages");
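The jparser_get_num helper above is hypothetical; in Python, assuming
procfs exported plain JSON objects, the whole of it could be a few
lines with the standard json module:

```python
import json

def jparser_get_num(path, key):
    """Hypothetical counterpart of the C call above: read a
    JSON-formatted file and return the number stored under key
    (assumes procfs files were exported as flat JSON objects)."""
    with open(path) as f:
        return json.load(f)[key]
```

That is the entire "parser": no field offsets, no whitespace rules, no
per-file format knowledge.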

and so on. The same applies to grep'ing the output of numerous
utilities (like ifconfig and others). Current scripts are barely
readable, insecure, and highly fragile: the simple addition of a new
parameter or a change in indentation will break them. Whereas we could
just write

ifconfig --json | jparser "interfaces[*].inet"

to get the addresses of all interfaces regardless of formatting,
indentation, and the presence of other parameters (note that I'm using
a different syntax to show different possibilities).
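A rough Python sketch of that pipeline's consuming side, assuming a
hypothetical `ifconfig --json` schema with an "interfaces" array
(neither the flag nor the schema exists today; both are assumptions):

```python
import json

def interface_addresses(ifconfig_json):
    """Evaluate the path interfaces[*].inet over hypothetical
    `ifconfig --json` output; interfaces without an address are
    simply skipped rather than breaking the caller."""
    data = json.loads(ifconfig_json)
    return [iface["inet"]
            for iface in data["interfaces"] if "inet" in iface]
```

Unlike an awk/sed pipeline over today's ifconfig output, this keeps
working when a new field is added or the indentation changes.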

I understand that some work has to be done anyway. For example, the
format definitely needs to be extended with at least variables for
config files, which will make the parser more complicated. I think all
of this can be resolved, though. And I would like to emphasize once
again that I don't think JSON is the best choice here.

So, what do you think? Is there any chance that something like this
will be introduced in Unix configs, utilities, and log files? And if
not, why?

PS
Another, possibly more readable, lisp-style alternative:
(((:date 2011-11-23 23:25:36.0545 +0400) :pid 2104 :name apache2 ...)
(:msg :ip 127.0.0.1 :req GET :file index.html ...))

--
Alexander Sauta
