systemd journal documentation [was Re: What are reasonable blockers for making journald the default logger in F19?]

Matthew Miller mattdm at fedoraproject.org
Sun Oct 21 03:12:08 UTC 2012


On Sun, Oct 21, 2012 at 01:50:59AM +0200, Lennart Poettering wrote:
> The format is documented now:
> http://www.freedesktop.org/wiki/Software/systemd/journal-files

Cool. I think that will address many people's concerns.

I have some incredibly small non-technical grammar suggestions for the
document, which I've included below as a diff. Some of these are just commas
for clarity in long clauses, but others are minor word-choice issues. I don't
think I've altered meaning anywhere.


----

--- journal-files.html.orig	2012-10-20 22:43:29.000000000 -0400
+++ journal-files.html	2012-10-20 23:07:57.000000000 -0400
@@ -121,9 +121,9 @@
 <h2 id="Basics">Basics</h2>
 <span class="anchor" id="line-27"></span><span class="anchor" id="line-28"></span><ul><li>All offsets, sizes, time values, hashes (and most other numeric values) are 64bit unsigned integers in LE format. <span class="anchor" id="line-29"></span></li><li>Offsets are always relative to the beginning of the file. <span class="anchor" id="line-30"></span></li><li><p class="line862">The 64bit hash function used is <a class="https" href="https://en.wikipedia.org/wiki/Jenkins_hash_function">Jenkins lookup3</a>, more specifically jenkins_hashlittle2() with the first 32bit integer it returns as higher 32bit part of the 64bit value, and the second one uses as lower 32bit part. <span class="anchor" id="line-31"></span></li><li>All structures are aligned to 64bit boundaries and padded to multiples of 64bit <span class="anchor" id="line-32"></span></li><li>The format is designed to be read and written via memory mapping using multiple mapped windows. <span class="anchor" id="line-33"></span></li><li>All time values are stored in usec since the respective epoch. <span class="anchor" id="line-34"></span></li><li>Wall clock time values are relative to the Unix time epoch, i.e. January 1st, 1970. (CLOCK_REALTIME) <span class="anchor" id="line-35"></span></li><li>Monotonic time values are always stored jointly with the kernel boot ID value (i.e. /proc/sys/kernel/random/boot_id) they belong to. They tend to be relative to the begin of the boot, but aren't for containers. (CLOCK_MONOTONIC) <span class="anchor" id="line-36"></span></li><li>Randomized, unique 128bit IDs are used in various locations. These are generally UUID v4 compatible, but this is not a requirement. <span class="anchor" id="line-37"></span><span class="anchor" id="line-38"></span></li></ul><p class="line867">
 <h2 id="General_Rules">General Rules</h2>
-<span class="anchor" id="line-39"></span><span class="anchor" id="line-40"></span><p class="line874">If any kind of corruption is noticed by a writer it should immediately rotate the file and start a new one. No further writes should be attempted to the original file, but it should be left around so that as little data as possible is lost. <span class="anchor" id="line-41"></span><span class="anchor" id="line-42"></span><p class="line874">If any kind of corruption is noticed by a reader it should try hard to handle this gracefully, such as skipping over the corrupted data, but allowing access to as much data around it as possible. <span class="anchor" id="line-43"></span><span class="anchor" id="line-44"></span><p class="line874">A reader should verify all offsets and other data as it reads it. This includes checking for alignment and range of offsets in the file, especially before trying to read it via a memory map. <span class="anchor" id="line-45"></span><span class="anchor" id="line-46"></span><p class="line874">A reader must interleave rotated and corrupted files as good as possible and present them as single stream to the user. <span class="anchor" id="line-47"></span><span class="anchor" id="line-48"></span><p class="line874">All fields marked as "reserved" must be initialized with 0 when writing and be ignored on reading. They are currently not used but might be used later on. <span class="anchor" id="line-49"></span><span class="anchor" id="line-50"></span><p class="line867">
+<span class="anchor" id="line-39"></span><span class="anchor" id="line-40"></span><p class="line874">If any kind of corruption is noticed by a writer it should immediately rotate the file and start a new one. No further writes should be attempted to the original file, but it should be left around so that as little data as possible is lost. <span class="anchor" id="line-41"></span><span class="anchor" id="line-42"></span><p class="line874">If any kind of corruption is noticed by a reader it should try hard to handle this gracefully, such as skipping over the corrupted data, but allowing access to as much data around it as possible. <span class="anchor" id="line-43"></span><span class="anchor" id="line-44"></span><p class="line874">A reader should verify all offsets and other data as it reads it. This includes checking for alignment and range of offsets in the file, especially before trying to read it via a memory map. <span class="anchor" id="line-45"></span><span class="anchor" id="line-46"></span><p class="line874">A reader must interleave rotated and corrupted files as well as possible and present them as single stream to the user. <span class="anchor" id="line-47"></span><span class="anchor" id="line-48"></span><p class="line874">All fields marked as "reserved" must be initialized with 0 when writing and be ignored on reading. They are currently not used but might be used later on. <span class="anchor" id="line-49"></span><span class="anchor" id="line-50"></span><p class="line867">
 <h2 id="Structure">Structure</h2>
-<span class="anchor" id="line-51"></span><span class="anchor" id="line-52"></span><p class="line862">The file format's data structures are declared in <a class="http" href="http://cgit.freedesktop.org/systemd/systemd/tree/src/journal/journal-def.h">journal-def.h</a>. <span class="anchor" id="line-53"></span><span class="anchor" id="line-54"></span><p class="line874">The file format begins with a header structure. After the header structure object structures follow. Objects are appended to the end as time progresses. Most data stored in these objects is not altered anymore after having been written once, with the exception of records necessary for indexing. When new data is appended to a file the writer first writes all new objects to the end of the file, and then links them up at front after that's done. Currently, seven different object types are known: <span class="anchor" id="line-55"></span><span class="anchor" id="line-56"></span><p class="line867"><span class="anchor" id="line-57"></span><span class="anchor" id="line-58"></span><span class="anchor" id="line-59"></span><span class="anchor" id="line-60"></span><span class="anchor" id="line-61"></span><span class="anchor" id="line-62"></span><span class="anchor" id="line-63"></span><span class="anchor" id="line-64"></span><span class="anchor" id="line-65"></span><span class="anchor" id="line-66"></span><span class="anchor" id="line-67"></span><span class="anchor" id="line-68"></span><pre><span class="anchor" id="line-1"></span>enum {
+<span class="anchor" id="line-51"></span><span class="anchor" id="line-52"></span><p class="line862">The file format's data structures are declared in <a class="http" href="http://cgit.freedesktop.org/systemd/systemd/tree/src/journal/journal-def.h">journal-def.h</a>. <span class="anchor" id="line-53"></span><span class="anchor" id="line-54"></span><p class="line874">The file format begins with a header structure. After the header structure object structures follow. Objects are appended to the end as time progresses. Most data stored in these objects is not altered after having been written once, with the exception of records necessary for indexing. When new data is appended to a file the writer first writes all new objects to the end of the file, and then links them up at front after that's done. Currently, seven different object types are known: <span class="anchor" id="line-55"></span><span class="anchor" id="line-56"></span><p class="line867"><span class="anchor" id="line-57"></span><span class="anchor" id="line-58"></span><span class="anchor" id="line-59"></span><span class="anchor" id="line-60"></span><span class="anchor" id="line-61"></span><span class="anchor" id="line-62"></span><span class="anchor" id="line-63"></span><span class="anchor" id="line-64"></span><span class="anchor" id="line-65"></span><span class="anchor" id="line-66"></span><span class="anchor" id="line-67"></span><span class="anchor" id="line-68"></span><pre><span class="anchor" id="line-1"></span>enum {
 <span class="anchor" id="line-2"></span>        OBJECT_UNUSED,
 <span class="anchor" id="line-3"></span>        OBJECT_DATA,               
 <span class="anchor" id="line-4"></span>        OBJECT_FIELD,
@@ -133,7 +133,7 @@
 <span class="anchor" id="line-8"></span>        OBJECT_ENTRY_ARRAY,        
 <span class="anchor" id="line-9"></span>        OBJECT_TAG,
 <span class="anchor" id="line-10"></span>        _OBJECT_TYPE_MAX
-<span class="anchor" id="line-11"></span>};</pre><span class="anchor" id="line-69"></span><span class="anchor" id="line-70"></span><ul><li><p class="line862">A <strong>DATA</strong> object, which encapsulates the contents of one field of an entry, i.e. a string such as "_SYSTEMD_UNIT=avahi-daemon.service", or "MESSAGE=Foobar made a booboo." but possibly including large or binary data, and always prefixed by the field name and "=". <span class="anchor" id="line-71"></span></li><li><p class="line862">A <strong>FIELD</strong> object, which encapsulates a field name, i.e. a string such as "_SYSTEMD_UNIT" or "MESSAGE", without any "=" or even value. <span class="anchor" id="line-72"></span></li><li><p class="line862">An <strong>ENTRY</strong> object, which binds several <strong>DATA</strong> objects together into a log entry. <span class="anchor" id="line-73"></span></li><li><p class="line862">A <strong>DATA_HASH_TABLE</strong> object, which encapsulates a hash table for finding existing <strong>DATA</strong> objects. <span class="anchor" id="line-74"></span></li><li><p class="line862">A <strong>FIELD_HASH_TABLE</strong> object, which encapsulates a hash table for finding existing <strong>FIELD</strong> objects. <span class="anchor" id="line-75"></span></li><li><p class="line862">An <strong>ENTRY_ARRAY</strong> object, which encapsulates a sorted array of offsets to entries, used for seeking by binary search. <span class="anchor" id="line-76"></span></li><li><p class="line862">A <strong>TAG</strong> object, consisting of an FSS sealing tag for all data from the beginning of the file or the last tag written (whichever is later). <span class="anchor" id="line-77"></span><span class="anchor" id="line-78"></span></li></ul><p class="line867">
+<span class="anchor" id="line-11"></span>};</pre><span class="anchor" id="line-69"></span><span class="anchor" id="line-70"></span><ul><li><p class="line862">A <strong>DATA</strong> object, which encapsulates the contents of one field of an entry, i.e. a string such as "_SYSTEMD_UNIT=avahi-daemon.service", or "MESSAGE=Foobar made a booboo." but possibly including large or binary data, and always prefixed by the field name and "=". <span class="anchor" id="line-71"></span></li><li><p class="line862">A <strong>FIELD</strong> object, which encapsulates a field name, i.e. a string such as "_SYSTEMD_UNIT" or "MESSAGE", without "=" or any value. <span class="anchor" id="line-72"></span></li><li><p class="line862">An <strong>ENTRY</strong> object, which binds several <strong>DATA</strong> objects together into a log entry. <span class="anchor" id="line-73"></span></li><li><p class="line862">A <strong>DATA_HASH_TABLE</strong> object, which encapsulates a hash table for finding existing <strong>DATA</strong> objects. <span class="anchor" id="line-74"></span></li><li><p class="line862">A <strong>FIELD_HASH_TABLE</strong> object, which encapsulates a hash table for finding existing <strong>FIELD</strong> objects. <span class="anchor" id="line-75"></span></li><li><p class="line862">An <strong>ENTRY_ARRAY</strong> object, which encapsulates a sorted array of offsets to entries, used for seeking by binary search. <span class="anchor" id="line-76"></span></li><li><p class="line862">A <strong>TAG</strong> object, consisting of an FSS sealing tag for all data from the beginning of the file or the last tag written (whichever is later). <span class="anchor" id="line-77"></span><span class="anchor" id="line-78"></span></li></ul><p class="line867">
 <h2 id="Header">Header</h2>
 <span class="anchor" id="line-79"></span><span class="anchor" id="line-80"></span><p class="line874">The Header struct defines, well, you guessed it, the file header: <span class="anchor" id="line-81"></span><span class="anchor" id="line-82"></span><p class="line867"><span class="anchor" id="line-83"></span><span class="anchor" id="line-84"></span><span class="anchor" id="line-85"></span><span class="anchor" id="line-86"></span><span class="anchor" id="line-87"></span><span class="anchor" id="line-88"></span><span class="anchor" id="line-89"></span><span class="anchor" id="line-90"></span><span class="anchor" id="line-91"></span><span class="anchor" id="line-92"></span><span class="anchor" id="line-93"></span><span class="anchor" id="line-94"></span><span class="anchor" id="line-95"></span><span class="anchor" id="line-96"></span><span class="anchor" id="line-97"></span><span class="anchor" id="line-98"></span><span class="anchor" id="line-99"></span><span class="anchor" id="line-100"></span><span class="anchor" id="line-101"></span><span class="anchor" id="line-102"></span><span class="anchor" id="line-103"></span><span class="anchor" id="line-104"></span><span class="anchor" id="line-105"></span><span class="anchor" id="line-106"></span><span class="anchor" id="line-107"></span><span class="anchor" id="line-108"></span><span class="anchor" id="line-109"></span><span class="anchor" id="line-110"></span><span class="anchor" id="line-111"></span><span class="anchor" id="line-112"></span><span class="anchor" id="line-113"></span><span class="anchor" id="line-114"></span><span class="anchor" id="line-115"></span><pre><span class="anchor" id="line-1-1"></span>_packed_ struct Header {
 <span class="anchor" id="line-2-1"></span>        uint8_t signature[8]; /* "LPKSHHRH" */
@@ -166,9 +166,9 @@
 <span class="anchor" id="line-29"></span>        /* Added in 189 */
 <span class="anchor" id="line-30"></span>        le64_t n_tags;
 <span class="anchor" id="line-31"></span>        le64_t n_entry_arrays;
-<span class="anchor" id="line-32"></span>};</pre><span class="anchor" id="line-116"></span><span class="anchor" id="line-117"></span><p class="line874">The first 8 bytes of Journal files must contain the ASCII characters LPKSHHRH. <span class="anchor" id="line-118"></span><span class="anchor" id="line-119"></span><p class="line862">If a writer finds that the <strong>machine_id</strong> of a file to write to does not match the machine it is running on it should immediately rotate the file and start a new one. <span class="anchor" id="line-120"></span><span class="anchor" id="line-121"></span><p class="line862">When journal file is first created the <strong>file_id</strong> is randomly and uniquely initialized. <span class="anchor" id="line-122"></span><span class="anchor" id="line-123"></span><p class="line862">When a writer opens a file it shall initialize the <strong>boot_id</strong> to the current boot id of the system. <span class="anchor" id="line-124"></span><span class="anchor" id="line-125"></span><p class="line862">The currently used part of the file is the <strong>header_size</strong> plus the <strong>arena_size</strong> field of the header. If a writer needs to write to a file where the actual file size on disk is smaller than the reported value it shall immediately rotate the file and start a new one. If a write is asked to write to a file with a header that is shorter than his own definition of the struct Header, he shall immediately rotate the file and start a new one. <span class="anchor" id="line-126"></span><span class="anchor" id="line-127"></span><p class="line862">The <strong>n_objects</strong> field contains the a counter for objects currently available in this file. As objects are appended to the end of the file this counters is increased. <span class="anchor" id="line-128"></span><span class="anchor" id="line-129"></span><p class="line862">The first object in the file starts immediately after the header. The last object in the file is at the offset <strong>tail_object_offset</strong>, which may be 0 if no object is in the file yet. <span class="anchor" id="line-130"></span><span class="anchor" id="line-131"></span><p class="line862">The <strong>n_entries</strong>, <strong>n_data</strong>, <strong>n_fields</strong>, <strong>n_tags</strong>, <strong>n_entry_arrays</strong> are counters of the objects of the specific types. <span class="anchor" id="line-132"></span><span class="anchor" id="line-133"></span><p class="line867"><strong>tail_entry_seqnum</strong> and <strong>head_entry_seqnum</strong> contain the sequential number (see below) of the last or first entry in the file, respectively, or 0 if no entry has been written yet. <span class="anchor" id="line-134"></span><span class="anchor" id="line-135"></span><p class="line867"><strong>tail_entry_realtime</strong> and <strong>head_entry_realtime</strong> contain the wallclock timestamp of the last or first entry in the file, respectively, or 0 if no entry has been written yet. <span class="anchor" id="line-136"></span><span class="anchor" id="line-137"></span><p class="line867"><strong>tail_entry_monotonic</strong> is the monotonic timestamp of the last entry in the file, referring to monotonic time of the boot identified by <strong>boot_id</strong>. <span class="anchor" id="line-138"></span><span class="anchor" id="line-139"></span><p class="line867">
+<span class="anchor" id="line-32"></span>};</pre><span class="anchor" id="line-116"></span><span class="anchor" id="line-117"></span><p class="line874">The first 8 bytes of Journal files must contain the ASCII characters LPKSHHRH. <span class="anchor" id="line-118"></span><span class="anchor" id="line-119"></span><p class="line862">If a writer finds that the <strong>machine_id</strong> of a file to write to does not match the machine it is running on it should immediately rotate the file and start a new one. <span class="anchor" id="line-120"></span><span class="anchor" id="line-121"></span><p class="line862">When journal file is first created the <strong>file_id</strong> is randomly and uniquely initialized. <span class="anchor" id="line-122"></span><span class="anchor" id="line-123"></span><p class="line862">When a writer opens a file it shall initialize the <strong>boot_id</strong> to the current boot id of the system. <span class="anchor" id="line-124"></span><span class="anchor" id="line-125"></span><p class="line862">The currently used part of the file is the <strong>header_size</strong> plus the <strong>arena_size</strong> field of the header. If a writer needs to write to a file where the actual file size on disk is smaller than the reported value it shall immediately rotate the file and start a new one. If a writer is asked to write to a file with a header that is shorter than its own definition of the struct Header, it shall immediately rotate the file and start a new one. <span class="anchor" id="line-126"></span><span class="anchor" id="line-127"></span><p class="line862">The <strong>n_objects</strong> field contains the a counter for objects currently available in this file. As objects are appended to the end of the file this counters is increased. <span class="anchor" id="line-128"></span><span class="anchor" id="line-129"></span><p class="line862">The first object in the file starts immediately after the header. The last object in the file is at the offset <strong>tail_object_offset</strong>, which may be 0 if no object is in the file yet. <span class="anchor" id="line-130"></span><span class="anchor" id="line-131"></span><p class="line862">The <strong>n_entries</strong>, <strong>n_data</strong>, <strong>n_fields</strong>, <strong>n_tags</strong>, <strong>n_entry_arrays</strong> are counters of the objects of the specific types. <span class="anchor" id="line-132"></span><span class="anchor" id="line-133"></span><p class="line867"><strong>tail_entry_seqnum</strong> and <strong>head_entry_seqnum</strong> contain the sequential number (see below) of the last or first entry in the file, respectively, or 0 if no entry has been written yet. <span class="anchor" id="line-134"></span><span class="anchor" id="line-135"></span><p class="line867"><strong>tail_entry_realtime</strong> and <strong>head_entry_realtime</strong> contain the wallclock timestamp of the last or first entry in the file, respectively, or 0 if no entry has been written yet. <span class="anchor" id="line-136"></span><span class="anchor" id="line-137"></span><p class="line867"><strong>tail_entry_monotonic</strong> is the monotonic timestamp of the last entry in the file, referring to monotonic time of the boot identified by <strong>boot_id</strong>. <span class="anchor" id="line-138"></span><span class="anchor" id="line-139"></span><p class="line867">
 <h2 id="Extensibility">Extensibility</h2>
-<span class="anchor" id="line-140"></span><span class="anchor" id="line-141"></span><p class="line874">The format is supposed to be extensible in order to enable future additions of features. Readers should simply skip objects of unknown types as they read them. If a compatible feature extension is made a new bit is registered in the header's 'compatible_flags' field. If an  feature extension is used that makes the format incompatible a new bit is registered in the header's 'incompatible_flags' field. Readers should check these two bit fields, if they find a flag they don't understand in compatible_flags they should continue to read the file, but if they find one in 'incompatible_flags' they should fail, asking for an update of the software. Writers should refuse writing if there's an unknown bit flag in either of these fields. <span class="anchor" id="line-142"></span><span class="anchor" id="line-143"></span><p class="line874">The file header may be extended as new features are added. The size of the file header is stored in the header. All hader fields up to "n_data" are known to unconditionally exist in all revisions of the file format, all fields starting with "n_data" needs to be explicitly checked for via a size check, since they were additions after the initial release. <span class="anchor" id="line-144"></span><span class="anchor" id="line-145"></span><p class="line874">Currently only two extensions flagged in the flags fields are known: <span class="anchor" id="line-146"></span><span class="anchor" id="line-147"></span><p class="line867"><span class="anchor" id="line-148"></span><span class="anchor" id="line-149"></span><span class="anchor" id="line-150"></span><span class="anchor" id="line-151"></span><span class="anchor" id="line-152"></span><span class="anchor" id="line-153"></span><span class="anchor" id="line-154"></span><span class="anchor" id="line-155"></span><pre><span class="anchor" id="line-1-2"></span>enum {
+<span class="anchor" id="line-140"></span><span class="anchor" id="line-141"></span><p class="line874">The format is supposed to be extensible in order to enable future additions of features. Readers should simply skip objects of unknown types as they read them. If a compatible feature extension is made, a new bit is registered in the header's 'compatible_flags' field. If an  feature extension is used that makes the format incompatible, a new bit is registered in the header's 'incompatible_flags' field. Readers should check these two bit fields, and if they find a flag they don't understand in compatible_flags they should continue to read the file, but if they find one in 'incompatible_flags' they should fail, asking for an update of the software. Writers should refuse writing if there's an unknown bit flag in either of these fields. <span class="anchor" id="line-142"></span><span class="anchor" id="line-143"></span><p class="line874">The file header may be extended as new features are added. The size of the file header is stored in the header. All hader fields up to "n_data" are known to unconditionally exist in all revisions of the file format, all fields starting with "n_data" needs to be explicitly checked for via a size check, since they were additions after the initial release. <span class="anchor" id="line-144"></span><span class="anchor" id="line-145"></span><p class="line874">Currently only two extensions flagged in the flags fields are known: <span class="anchor" id="line-146"></span><span class="anchor" id="line-147"></span><p class="line867"><span class="anchor" id="line-148"></span><span class="anchor" id="line-149"></span><span class="anchor" id="line-150"></span><span class="anchor" id="line-151"></span><span class="anchor" id="line-152"></span><span class="anchor" id="line-153"></span><span class="anchor" id="line-154"></span><span class="anchor" id="line-155"></span><pre><span class="anchor" id="line-1-2"></span>enum {
 <span class="anchor" id="line-2-2"></span>        HEADER_INCOMPATIBLE_COMPRESSED = 1
 <span class="anchor" id="line-3-2"></span>};
 <span class="anchor" id="line-4-2"></span>
@@ -185,7 +185,7 @@
 <h2 id="Sequence_Numbers">Sequence Numbers</h2>
 <span class="anchor" id="line-177"></span><span class="anchor" id="line-178"></span><p class="line862">All entries carry sequence numbers that are monotonically counted up for each entry (starting at 1) and are unique among all files which carry the same <strong>seqnum_id</strong> field. This field is randomly generated when the journal daemon creates its first file. All files generated by the same journal daemon instance should hence carry the same seqnum_id. This should guarantee a monotonic stream of sequential numbers for easy interleaving even if entries are distributed among several files, such as the system journal and many per-user journals. <span class="anchor" id="line-179"></span><span class="anchor" id="line-180"></span><p class="line867">
 <h2 id="Concurrency">Concurrency</h2>
-<span class="anchor" id="line-181"></span><span class="anchor" id="line-182"></span><p class="line862">The file format is designed to be usable in a simultaneous single-writer/multiple-reader scenario. The synchronization model is very weak in order to facilitate storage on the most basic of file systems (well, the most basic ones that provide us with mmap() that is), and allow good performance. No file locking is used. The only time where disk synchronization via fdatasync() should be enforced is after and before changing the <strong>state</strong> field in the file header (see below). It is recommended to execute a memory barrier after appending and initializing new objects at the end of the file, and before linking them up in the earlier objects. <span class="anchor" id="line-183"></span><span class="anchor" id="line-184"></span><p class="line874">This weak synchronization model means that it is crucial that readers verify the structural integrity of the file as they read it and handle invalid structure gracefully. (Checking what you read is a pretty good idea out of security considerations anyway.) This specifically includes checking offset values, and that they point to valid objects, with valid sizes and of the type and hash value expected. All code must be written with the fact in mind that a file with inconsistent structure file might just be inconsistent temporarily, and might become consistent later on. Payload OTOH requires less scrutiny, as it should only be linked up (and hence visible to readers) after it was successfully written to memory (tough not necessarily to disk). On non-local file systems it is a good idea to verify the payload hashes when reading, in order to avoid annoyances with mmap() inconsistencies. <span class="anchor" id="line-185"></span><span class="anchor" id="line-186"></span><p class="line874">Clients intending to show a live view of the journal should use inotify() for this to watch for files changes. Since file writes done via mmap() do not result in inotify() writers shall truncate the file to its current size after writing on or more entries, which results in inotify events being generated. Note that this is not used as transaction scheme (it doesn't protect anything), but merely for triggering wakeups. <span class="anchor" id="line-187"></span><span class="anchor" id="line-188"></span><p class="line874">Note that inotify will not work on network file systems if reader and writer reside on different hosts. Readers which detect they are run on journal files on a non-local file system should hence not rely on inotify for live views but fall back to simple time based polling of the files (maybe recheck every 2s). <span class="anchor" id="line-189"></span><span class="anchor" id="line-190"></span><p class="line867">
+<span class="anchor" id="line-181"></span><span class="anchor" id="line-182"></span><p class="line862">The file format is designed to be usable in a simultaneous single-writer/multiple-reader scenario. The synchronization model is very weak in order to facilitate storage on the most basic of file systems (well, the most basic ones that provide us with mmap(), that is), and allow good performance. No file locking is used. The only time where disk synchronization via fdatasync() should be enforced is after and before changing the <strong>state</strong> field in the file header (see below). It is recommended to execute a memory barrier after appending and initializing new objects at the end of the file, and before linking them up in the earlier objects. <span class="anchor" id="line-183"></span><span class="anchor" id="line-184"></span><p class="line874">This weak synchronization model means that it is crucial that readers verify the structural integrity of the file as they read it and handle invalid structure gracefully. (Checking what you read is a pretty good idea due to security considerations anyway.) This specifically includes checking offset values, and that they point to valid objects, with valid sizes and of the type and hash value expected. All code must be written with the fact in mind that a file with inconsistent structure file might just be inconsistent temporarily, and might become consistent later on. Payload, on the other hand, requires less scrutiny, as it should only be linked up (and hence visible to readers) after it was successfully written to memory (though not necessarily to disk). On non-local file systems it is a good idea to verify the payload hashes when reading, in order to avoid annoyances with mmap() inconsistencies. <span class="anchor" id="line-185"></span><span class="anchor" id="line-186"></span><p class="line874">Clients intending to show a live view of the journal should use inotify() to watch for files changes. Since file writes done via mmap() do not result in inotify events, writers shall truncate the file to its current size after writing on or more entries, which results in inotify events being generated. Note that this is not used as a transaction scheme (it doesn't protect anything), but merely for triggering wakeups. <span class="anchor" id="line-187"></span><span class="anchor" id="line-188"></span><p class="line874">Note that inotify will not work on network file systems if reader and writer reside on different hosts. Readers which detect they are run on journal files on a non-local file system should hence not rely on inotify for live views but fall back to simple time based polling of the files (maybe recheck every 2s). <span class="anchor" id="line-189"></span><span class="anchor" id="line-190"></span><p class="line867">
 <h2 id="Objects">Objects</h2>
 <span class="anchor" id="line-191"></span><span class="anchor" id="line-192"></span><p class="line874">All objects carry a common header: <span class="anchor" id="line-193"></span><span class="anchor" id="line-194"></span><p class="line867"><span class="anchor" id="line-195"></span><span class="anchor" id="line-196"></span><span class="anchor" id="line-197"></span><span class="anchor" id="line-198"></span><span class="anchor" id="line-199"></span><span class="anchor" id="line-200"></span><span class="anchor" id="line-201"></span><span class="anchor" id="line-202"></span><span class="anchor" id="line-203"></span><span class="anchor" id="line-204"></span><span class="anchor" id="line-205"></span><span class="anchor" id="line-206"></span><pre><span class="anchor" id="line-1-4"></span>enum {
 <span class="anchor" id="line-2-4"></span>        OBJECT_COMPRESSED = 1
@@ -231,7 +231,7 @@
 <span class="anchor" id="line-11-3"></span>        sd_id128_t boot_id;
 <span class="anchor" id="line-12-1"></span>        le64_t xor_hash;
 <span class="anchor" id="line-13-1"></span>        EntryItem items[];
-<span class="anchor" id="line-14-1"></span>};</pre><span class="anchor" id="line-269"></span><span class="anchor" id="line-270"></span><p class="line874">An ENTRY object binds several DATA objects together into one log entry, and includes other meta data such as various timestamps. <span class="anchor" id="line-271"></span><span class="anchor" id="line-272"></span><p class="line862">The <strong>seqnum</strong> field contains the sequence number of the entry, <strong>realtime</strong> the realtime timestamp, and <strong>monotonic</strong> the monotonic timestamp for the boot identified by <strong>boot_id</strong>. <span class="anchor" id="line-273"></span><span class="anchor" id="line-274"></span><p class="line862">The <strong>xor_hash</strong> field contains a binary XOR of the hashes of the payload of all DATA objects referenced by this ENTRY. This value is usable to check the contents of the entry, being independent of the order of the DATA objects in the array. <span class="anchor" id="line-275"></span><span class="anchor" id="line-276"></span><p class="line862">The <strong>items[]</strong> array contains references to all DATA objects of this entry, plus their respective hashes. <span class="anchor" id="line-277"></span><span class="anchor" id="line-278"></span><p class="line874">In the file ENTRY objects are written ordered monotonically by sequence number. For continuous parts of the file written during the same boot (i.e. with the same boot_id) the monotonic timestamp is monotonic too. Modulo wallclock time jumps (due to incorrect clocks being corrected) the realtime timestamps are monotonic too. <span class="anchor" id="line-279"></span><span class="anchor" id="line-280"></span><p class="line867">
+<span class="anchor" id="line-14-1"></span>};</pre><span class="anchor" id="line-269"></span><span class="anchor" id="line-270"></span><p class="line874">An ENTRY object binds several DATA objects together into one log entry, and includes other meta data such as various timestamps. <span class="anchor" id="line-271"></span><span class="anchor" id="line-272"></span><p class="line862">The <strong>seqnum</strong> field contains the sequence number of the entry, <strong>realtime</strong> the realtime timestamp, and <strong>monotonic</strong> the monotonic timestamp for the boot identified by <strong>boot_id</strong>. <span class="anchor" id="line-273"></span><span class="anchor" id="line-274"></span><p class="line862">The <strong>xor_hash</strong> field contains a binary XOR of the hashes of the payload of all DATA objects referenced by this ENTRY. This value is usable to check the contents of the entry, being independent of the order of the DATA objects in the array. <span class="anchor" id="line-275"></span><span class="anchor" id="line-276"></span><p class="line862">The <strong>items[]</strong> array contains references to all DATA objects of this entry, plus their respective hashes. <span class="anchor" id="line-277"></span><span class="anchor" id="line-278"></span><p class="line874">In the file ENTRY objects are written ordered monotonically by sequence number. For continuous parts of the file written during the same boot (i.e. with the same boot_id) the monotonic timestamp is monotonic too. Except in the case of wallclock time jumps (due to incorrect clocks being corrected) the realtime timestamps are monotonic too. <span class="anchor" id="line-279"></span><span class="anchor" id="line-280"></span><p class="line867">
 <h2 id="Hash_Table_Objects">Hash Table Objects</h2>
 <span class="anchor" id="line-281"></span><span class="anchor" id="line-282"></span><p class="line867"><span class="anchor" id="line-283"></span><span class="anchor" id="line-284"></span><span class="anchor" id="line-285"></span><span class="anchor" id="line-286"></span><span class="anchor" id="line-287"></span><span class="anchor" id="line-288"></span><span class="anchor" id="line-289"></span><span class="anchor" id="line-290"></span><span class="anchor" id="line-291"></span><span class="anchor" id="line-292"></span><pre><span class="anchor" id="line-1-8"></span>_packed_ struct HashItem {
 <span class="anchor" id="line-2-8"></span>        le64_t head_hash_offset;
@@ -241,13 +241,13 @@
 <span class="anchor" id="line-6-8"></span>_packed_ struct HashTableObject {
 <span class="anchor" id="line-7-7"></span>        ObjectHeader object;
 <span class="anchor" id="line-8-5"></span>        HashItem items[];
-<span class="anchor" id="line-9-5"></span>};</pre><span class="anchor" id="line-293"></span><span class="anchor" id="line-294"></span><p class="line874">The structure of both DATA_HASH_TABLE and FIELD_HASH_TABLE objects are identical. They implement a simple hash table, which each cell containing offsets to the head and tail of the singly linked list of the DATA and FIELD objects, respectively. DATA's and FIELD's next_hash_offset field are used to chain up the objects. Empty cells have both offsets set to 0. <span class="anchor" id="line-295"></span><span class="anchor" id="line-296"></span><p class="line862">Each file contains exactly one DATA_HASH_TABLE and one FIELD_HASH_TABLE objects. Their payload is directly referred to by the file header in the <strong>data_hash_table_offset</strong>, <strong>data_hash_table_size</strong>, <strong>field_hash_table_offset</strong>, <strong>field_hash_table_size</strong> fields. These offsets do <em>not</em> point to the object headers but directly to the payloads. When a new journal file is created the two hash table objects need to be created right away as first two objects in the stream. <span class="anchor" id="line-297"></span><span class="anchor" id="line-298"></span><p class="line862">If the hash table fill level is increasing over a certain fill level (Learning from Java's Hashtable for example: &gt; 75%), the writer should rotate the file and create a new one.  <span class="anchor" id="line-299"></span><span class="anchor" id="line-300"></span><p class="line874">The DATA_HASH_TABLE should be sized taking into account to the maximum size the file is expected to grow, as configured by the administrator or disk space considerations. The FIELD_HASH_TABLE should be sized to a fixed size, as the number of fields should be pretty static it depends only on developers creativity rather than runtime parameters. <span class="anchor" id="line-301"></span><span class="anchor" id="line-302"></span><p class="line867">
+<span class="anchor" id="line-9-5"></span>};</pre><span class="anchor" id="line-293"></span><span class="anchor" id="line-294"></span><p class="line874">The structure of both DATA_HASH_TABLE and FIELD_HASH_TABLE objects are identical. They implement a simple hash table, which each cell containing offsets to the head and tail of the singly linked list of the DATA and FIELD objects, respectively. DATA's and FIELD's next_hash_offset field are used to chain up the objects. Empty cells have both offsets set to 0. <span class="anchor" id="line-295"></span><span class="anchor" id="line-296"></span><p class="line862">Each file contains exactly one DATA_HASH_TABLE and one FIELD_HASH_TABLE objects. Their payload is directly referred to by the file header in the <strong>data_hash_table_offset</strong>, <strong>data_hash_table_size</strong>, <strong>field_hash_table_offset</strong>, <strong>field_hash_table_size</strong> fields. These offsets do <em>not</em> point to the object headers but directly to the payloads. When a new journal file is created the two hash table objects need to be created right away as first two objects in the stream. <span class="anchor" id="line-297"></span><span class="anchor" id="line-298"></span><p class="line862">If the hash table fill level has increased over a certain fill level (Learning from Java's Hashtable for example: &gt; 75%), the writer should rotate the file and create a new one.  <span class="anchor" id="line-299"></span><span class="anchor" id="line-300"></span><p class="line874">The DATA_HASH_TABLE should be sized taking into account to the maximum size the file is expected to grow, as configured by the administrator or disk space considerations. The FIELD_HASH_TABLE should be sized to a fixed size, as the number of fields should be pretty static, depending only on developers' creativity rather than runtime parameters. <span class="anchor" id="line-301"></span><span class="anchor" id="line-302"></span><p class="line867">
 <h2 id="Entry_Array_Objects">Entry Array Objects</h2>
 <span class="anchor" id="line-303"></span><span class="anchor" id="line-304"></span><p class="line867"><span class="anchor" id="line-305"></span><span class="anchor" id="line-306"></span><span class="anchor" id="line-307"></span><span class="anchor" id="line-308"></span><span class="anchor" id="line-309"></span><span class="anchor" id="line-310"></span><pre><span class="anchor" id="line-1-9"></span>_packed_ struct EntryArrayObject {
 <span class="anchor" id="line-2-9"></span>        ObjectHeader object;
 <span class="anchor" id="line-3-9"></span>        le64_t next_entry_array_offset;
 <span class="anchor" id="line-4-9"></span>        le64_t items[];
-<span class="anchor" id="line-5-9"></span>};</pre><span class="anchor" id="line-311"></span><span class="anchor" id="line-312"></span><p class="line874">Entry Arrays are used to store a sorted array of offsets to entries. Entry arrays are strictly sorted by offsets on disk, and hence by their timestamps and sequence numbers (with some restrictions, see above). <span class="anchor" id="line-313"></span><span class="anchor" id="line-314"></span><p class="line862">Entry Arrays are chained up. If one entry array is full another one is allocated and the <strong>next_entry_array_offset</strong> field of the old one pointed to it. An Entry Array with <strong>next_entry_array_offset</strong> set to 0 is the last in the list. To optimize allocation and seeking, as entry arrays are appended to a chain of entry arrays they should increase in size (double). <span class="anchor" id="line-315"></span><span class="anchor" id="line-316"></span><p class="line874">Due to being monotonically ordered entry arrays may be searched with a binary search (bisection). <span class="anchor" id="line-317"></span><span class="anchor" id="line-318"></span><p class="line862">One chain of entry arrays links up all entries written to the journal. The first entry array is referenced in the <strong>entry_array_offset</strong> field of the header. <span class="anchor" id="line-319"></span><span class="anchor" id="line-320"></span><p class="line874">Each DATA object also references an entry array chain listing all entries referencing a specific DATA object. Since many DATA objects are only referenced by a single ENTRY the first offset of the list is stored inside the DATA object itself, an ENTRY_ARRAY object is only needed if it is referenced by more than one ENTRY. <span class="anchor" id="line-321"></span><span class="anchor" id="line-322"></span><p class="line867">
+<span class="anchor" id="line-5-9"></span>};</pre><span class="anchor" id="line-311"></span><span class="anchor" id="line-312"></span><p class="line874">Entry Arrays are used to store a sorted array of offsets to entries. Entry arrays are strictly sorted by offsets on disk, and hence by their timestamps and sequence numbers (with some restrictions, see above). <span class="anchor" id="line-313"></span><span class="anchor" id="line-314"></span><p class="line862">Entry Arrays are chained up. If one entry array is full another one is allocated and the <strong>next_entry_array_offset</strong> field of the old one pointed to it. An Entry Array with <strong>next_entry_array_offset</strong> set to 0 is the last in the list. To optimize allocation and seeking, as entry arrays are appended to a chain of entry arrays they should increase in size (double). <span class="anchor" id="line-315"></span><span class="anchor" id="line-316"></span><p class="line874">Due to being monotonically ordered, entry arrays may be searched with a binary search (bisection). <span class="anchor" id="line-317"></span><span class="anchor" id="line-318"></span><p class="line862">One chain of entry arrays links up all entries written to the journal. The first entry array is referenced in the <strong>entry_array_offset</strong> field of the header. <span class="anchor" id="line-319"></span><span class="anchor" id="line-320"></span><p class="line874">Each DATA object also references an entry array chain listing all entries referencing a specific DATA object. Since many DATA objects are only referenced by a single ENTRY, the first offset of the list is stored inside the DATA object itself, and an ENTRY_ARRAY object is only needed if it is referenced by more than one ENTRY. <span class="anchor" id="line-321"></span><span class="anchor" id="line-322"></span><p class="line867">
 <h2 id="Tag_Object">Tag Object</h2>
 <span class="anchor" id="line-323"></span><span class="anchor" id="line-324"></span><p class="line867"><span class="anchor" id="line-325"></span><span class="anchor" id="line-326"></span><span class="anchor" id="line-327"></span><span class="anchor" id="line-328"></span><span class="anchor" id="line-329"></span><span class="anchor" id="line-330"></span><span class="anchor" id="line-331"></span><span class="anchor" id="line-332"></span><span class="anchor" id="line-333"></span><pre><span class="anchor" id="line-1-10"></span>#define TAG_LENGTH (256/8)
 <span class="anchor" id="line-2-10"></span>
@@ -256,9 +256,9 @@
 <span class="anchor" id="line-5-10"></span>        le64_t seqnum;
 <span class="anchor" id="line-6-9"></span>        le64_t epoch;
 <span class="anchor" id="line-7-8"></span>        uint8_t tag[TAG_LENGTH]; /* SHA-256 HMAC */
-<span class="anchor" id="line-8-6"></span>};</pre><span class="anchor" id="line-334"></span><span class="anchor" id="line-335"></span><p class="line862">Tag objects are used to seal off the journal for alteration. In regular intervals a tag object is appended to the file. The tag object consists of a SHA-256 HMAC tag that is calculated from the objects stored in the file since the last tag was written, or from the beginning if not tag was written yet. The key for the HMAC is calculated via the externally maintained FSPRG logic for the epoch that is written into <strong>epoch</strong>. The sequence number <strong>seqnum</strong> is increased with each tag. When calculating the HMAC of objects header fields that are volatile are excluded (skipped). More specifically all fields that might validly be altered to maintain a consistent file structure (such as offsets to objects added later for the purpose of linked lists and suchlike) after an object has been written are not protected by the tag. This means a verifier has to independently check these fields for consistency of structure. For the fields excluded from the HMAC please consult the source code directly. A verifier should read the file from the beginning to the end, always calculating the HMAC for the objects it reads. Each time a tag object is encountered the HMAC should be verified and restarted. The tag object sequence numbers need to increase strictly monotonically. Tag objects themselves are partially protected by the HMAC (i.e. seqnum and epoch is included, the tag itself not). <span class="anchor" id="line-336"></span><span class="anchor" id="line-337"></span><p class="line867">
+<span class="anchor" id="line-8-6"></span>};</pre><span class="anchor" id="line-334"></span><span class="anchor" id="line-335"></span><p class="line862">Tag objects are used to seal off the journal for alteration. In regular intervals a tag object is appended to the file. The tag object consists of a SHA-256 HMAC tag that is calculated from the objects stored in the file since the last tag was written, or from the beginning if not tag was written yet. The key for the HMAC is calculated via the externally maintained FSPRG logic for the epoch that is written into <strong>epoch</strong>. The sequence number <strong>seqnum</strong> is increased with each tag. When calculating the HMAC of objects header fields that are volatile are excluded (skipped). More specifically, all fields that might validly be altered to maintain a consistent file structure (such as offsets to objects added later for the purpose of linked lists and suchlike) after an object has been written are not protected by the tag. This means a verifier has to independently check these fields for consistency of structure. For the fields excluded from the HMAC please consult the source code directly. A verifier should read the file from the beginning to the end, always calculating the HMAC for the objects it reads. Each time a tag object is encountered the HMAC should be verified and restarted. The tag object sequence numbers need to increase strictly monotonically. Tag objects themselves are partially protected by the HMAC (i.e. seqnum and epoch is included, the tag itself not). <span class="anchor" id="line-336"></span><span class="anchor" id="line-337"></span><p class="line867">
 <h2 id="Algorithms">Algorithms</h2>
-<span class="anchor" id="line-338"></span><span class="anchor" id="line-339"></span><p class="line867"><em>Reading:</em> <span class="anchor" id="line-340"></span><span class="anchor" id="line-341"></span><p class="line874">Given an offset to an entry all data fields are easily found by following the offsets in the data item array of the entry.  <span class="anchor" id="line-342"></span><span class="anchor" id="line-343"></span><p class="line862">Listing entries without filter is done by traversing the list of entry arrays starting with the headers' <strong>entry_array_offset</strong> field. <span class="anchor" id="line-344"></span><span class="anchor" id="line-345"></span><p class="line862">Seeking to an entry by timestamp or sequence number (without any matches) is done via binary search in the entry arrays starting with the header's <strong>entry_array_offset</strong> field. Since these arrays double in size as more are added the time cost of seeking is O(log(n)*log(n)) if n is the number of entries in the file. <span class="anchor" id="line-346"></span><span class="anchor" id="line-347"></span><p class="line874">When seeking or listing with one field match applied the DATA object of the match is first identified, and then its data entry array chain traversed. The time cost is the same as for seeks/listings with no match. <span class="anchor" id="line-348"></span><span class="anchor" id="line-349"></span><p class="line862">If multiple matches are applied, multiple chains of entry arrays should be traversed in parallel. Since they all are strictly monotonically ordered by offset of the entries, advancing in one can be directly applied to the others, until a entry matching all matches is found. In the worst case seeking like this is O(n) where n is the number of matching entries of the "loosest" match, but in the common case should be much more efficient at least for the well-known fields, where the set of possible field values tend to be closely related. Checking whether an entry matches a number of matches is efficient since the item array of the entry contains hashes of all data fields referenced, and the number of data fields of an entry is generally small (&lt; 30). <span class="anchor" id="line-350"></span><span class="anchor" id="line-351"></span><p class="line874">When interleaving multiple journal files seeking tends to be a frequently used operation, but in this case can be effectively suppressed by caching results from previous entries. <span class="anchor" id="line-352"></span><span class="anchor" id="line-353"></span><p class="line874">When listing all possible values a certain field can take it is sufficient to look up the FIELD object and follow the chain of links to all DATA it includes. <span class="anchor" id="line-354"></span><span class="anchor" id="line-355"></span><p class="line867"><em>Writing:</em> <span class="anchor" id="line-356"></span><span class="anchor" id="line-357"></span><p class="line874">When an entry is appended to the journal for each of its data fields the data hash table should be checked. If the data field is not yet existing in the file it should be appended and added to the data hash table. When a field data object is added the field hash table should be checked for the field name of the data field, and a field object be added if necessary. After all data fields (and recursively all field names) of the new entry are appended and linked up in the hashtables the entry object should be appended and linked up too. 
<span class="anchor" id="line-358"></span><span class="anchor" id="line-359"></span><p class="line874">In regular intervals a tag object should be written if sealing is enabled (see above). Before the file is closed a tag should be written too, to seal it off. <span class="anchor" id="line-360"></span><span class="anchor" id="line-361"></span><p class="line874">Before writing an object, time and disk space limits should be checked and rotation triggered if necessary. <span class="anchor" id="line-362"></span><span class="anchor" id="line-363"></span><p class="line867">
+<span class="anchor" id="line-338"></span><span class="anchor" id="line-339"></span><p class="line867"><em>Reading:</em> <span class="anchor" id="line-340"></span><span class="anchor" id="line-341"></span><p class="line874">Given an offset to an entry all data fields are easily found by following the offsets in the data item array of the entry.  <span class="anchor" id="line-342"></span><span class="anchor" id="line-343"></span><p class="line862">Listing entries without filter is done by traversing the list of entry arrays starting with the headers' <strong>entry_array_offset</strong> field. <span class="anchor" id="line-344"></span><span class="anchor" id="line-345"></span><p class="line862">Seeking to an entry by timestamp or sequence number (without any matches) is done via binary search in the entry arrays starting with the header's <strong>entry_array_offset</strong> field. Since these arrays double in size as more are added, the time cost of seeking is O(log(n)*log(n)) if n is the number of entries in the file. <span class="anchor" id="line-346"></span><span class="anchor" id="line-347"></span><p class="line874">When seeking or listing with one field match applied the DATA object of the match is first identified, and then its data entry array chain traversed. The time cost is the same as for seeks/listings with no match. <span class="anchor" id="line-348"></span><span class="anchor" id="line-349"></span><p class="line862">If multiple matches are applied, multiple chains of entry arrays should be traversed in parallel. Since they all are strictly monotonically ordered by offset of the entries, advancing in one can be directly applied to the others, until a entry matching all matches is found. In the worst case seeking like this is O(n) where n is the number of matching entries of the "loosest" match, but in the common case should be much more efficient at least for the well-known fields, where the set of possible field values tend to be closely related. Checking whether an entry matches a number of matches is efficient since the item array of the entry contains hashes of all data fields referenced, and the number of data fields of an entry is generally small (&lt; 30). <span class="anchor" id="line-350"></span><span class="anchor" id="line-351"></span><p class="line874">When interleaving multiple journal files seeking tends to be a frequently used operation, but in this case can be effectively suppressed by caching results from previous entries. <span class="anchor" id="line-352"></span><span class="anchor" id="line-353"></span><p class="line874">When listing all possible values a certain field can take it is sufficient to look up the FIELD object and follow the chain of links to all DATA it includes. <span class="anchor" id="line-354"></span><span class="anchor" id="line-355"></span><p class="line867"><em>Writing:</em> <span class="anchor" id="line-356"></span><span class="anchor" id="line-357"></span><p class="line874">When an entry is appended to the journal for each of its data fields, the data hash table should be checked. If the data field does not yet exist in the file it should be appended and added to the data hash table. When a field data object is added, the field hash table should be checked for the field name of the data field, and a field object be added if necessary. After all data fields (and recursively all field names) of the new entry are appended and linked up in the hashtables, the entry object should be appended and linked up too. 
<span class="anchor" id="line-358"></span><span class="anchor" id="line-359"></span><p class="line874">In regular intervals a tag object should be written if sealing is enabled (see above). Before the file is closed a tag should be written too, to seal it off. <span class="anchor" id="line-360"></span><span class="anchor" id="line-361"></span><p class="line874">Before writing an object, time and disk space limits should be checked and rotation triggered if necessary. <span class="anchor" id="line-362"></span><span class="anchor" id="line-363"></span><p class="line867">
 <h2 id="Optimizing_Disk_IO">Optimizing Disk IO</h2>
 <span class="anchor" id="line-364"></span><span class="anchor" id="line-365"></span><p class="line867"><em>A few general ideas to keep in mind:</em> <span class="anchor" id="line-366"></span><span class="anchor" id="line-367"></span><p class="line874">The hash tables for looking up fields and data should be quickly in the memory cache and not hurt performance. All entries and entry arrays are ordered strictly by time on disk, and hence should expose an OK access pattern on rotating media, when read sequentially (which should be the most common case, given the nature of log data). <span class="anchor" id="line-368"></span><span class="anchor" id="line-369"></span><p class="line874">The disk access patterns of the binary search for entries needed for seeking are problematic on rotating disks. This should not be a major issue though, since seeking should not be a frequent operation. <span class="anchor" id="line-370"></span><span class="anchor" id="line-371"></span><p class="line874">When reading, collecting data fields for presenting entries to the user is problematic on rotating disks. In order to optimize these patterns the item array of entry objects should be sorted by disk offset before writing. Effectively, frequently used data objects should be in the memory cache quickly. Non-frequently used data objects are likely to be located between the previous and current entry when reading and hence should expose an OK access pattern. Problematic are data objects that are neither frequently nor infrequently referenced, which will cost seek time. <span class="anchor" id="line-372"></span><span class="anchor" id="line-373"></span><p class="line874">And that's all there is to it. <span class="anchor" id="line-374"></span><span class="anchor" id="line-375"></span><p class="line874">Thanks for your interest! <span class="anchor" id="line-376"></span><span class="anchor" id="bottom"></span></div><div id="pagebottom"></div>
 </div>






-- 
Matthew Miller  ☁☁☁  Fedora Cloud Architect  ☁☁☁  <mattdm at fedoraproject.org>

