Awk and sort (of text files)

Mon Jun 29 12:05:01 UTC 2015

On 29 June 2015 at 02:38, jd1008 <jd1008 at gmail.com> wrote:
>
>
> On 06/28/2015 07:02 PM, Stephen Davies wrote:
>>
>>
>> On 29/06/15 10:13, jd1008 wrote:
>>>
>>>
>>>
>>> On 06/28/2015 06:38 PM, jd1008 wrote:
>>>>
>>>> Hi,
>>>> I have text files made of paragraphs of text, separated by
>>>> blank lines.
>>>>
>>>> Each "paragraph" is information about a different item.
>>>>
>>>> I need to sort these paragraphs based on the first line
>>>> of each paragraph.
>>>>
>>>> Need some hints how to accomplish this.
>>>>
>>>> Thanx.
>>>
>>> Forgot to say that each paragraph is made of multiple lines,
>>> but a paragraph's lines do not contain a blank line.
>>
>> I would just concatenate lines until the blank is reached then write out
>> the concatenated line.
>> The result can then be sorted.
>>
>> If you want to revert the result to paragraphs, just reverse the process
>> outputting lines of up to N characters ending in a space.
>>
>> HTH,
>> Stephen
>>
> Too much work to break the one line back into multiple lines because the
> lines are of different lengths.
> Too many files also. Also, to keep original lines of a paragraph unmangled,
> I would have to first
> do something like append each line of a paragraph with a delineating
> character to be used by something
> like sed to change that character into a newline.
>

Actually, you'd want to replace every newline except the blank-line
separators with a token, sort the resulting one-line paragraphs, then
change the token back to a newline afterwards. The problem is that, if
you do this as a stream, any token you pick is itself a valid
character that could already appear in the text. To do it cleanly you
need to escape the token (translate newlines into an escaped string or
something similar) so that you don't turn a legitimate character into
a newline by mistake.
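
As a rough sketch of that escaping idea (in Python rather than sed, and
assuming blank lines are truly empty), backslashes are escaped first
and each newline inside a paragraph is then encoded as the two
characters backslash-n. The function names are made up for
illustration.

```python
import re

def escape_paragraph(par):
    # Escape backslashes first so a literal backslash-n in the text stays
    # distinct from an encoded newline.
    return par.replace("\\", "\\\\").replace("\n", "\\n")

def unescape_paragraph(line):
    # Decode left to right: backslash-n becomes a newline, and any other
    # escaped character (here only the backslash) becomes itself.
    return re.sub(r"\\(.)",
                  lambda m: "\n" if m.group(1) == "n" else m.group(1),
                  line)

def sort_paragraphs(text):
    # Assumption: paragraphs are separated by one or more empty lines.
    pars = [p for p in re.split(r"\n{2,}", text.strip("\n")) if p]
    lines = sorted(escape_paragraph(p) for p in pars)
    return "\n\n".join(unescape_paragraph(l) for l in lines) + "\n"
```

Note this sorts on the whole escaped line, the way sort would, so the
first line dominates but the rest of the paragraph breaks ties.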

There are three other things you can do. One is, as Stephen suggests,
to treat paragraphs as single lines and re-wrap them on output (the
fold command can help here), if preserving the exact line breaks
doesn't matter.

Two is to use awk to put a sort key at the start of each line:
something like the paragraph's first word and a line number within the
paragraph, plus possibly a paragraph count to avoid mixing lines from
different paragraphs that share a first word. At each blank line you'd
reset the line number and increment the paragraph count. Your print
would look something like:
print firstword" "paranum" "paraline" "$0
Then you have to undo it at the other end. This is a rough attempt:
awk 'BEGIN { paranum = 1; paraline = 1 }
     /^$/  { paraline = 1; paranum++; next }
     { if (paraline == 1) firstword = $1
       print firstword, paranum, paraline, $0; paraline++ }' somefile |
  sort -k1,1 -k2,2n -k3,3n |
  awk '{ if ($3 == 1 && NR > 1) print ""   # blank line between paragraphs
         for (i = 1; i <= NF - 3; i++) $i = $(i + 3)
         NF = NF - 3; print }'

... but it's limited: you may actually want to sort by the first few
words of each paragraph rather than just the first word, and it will
trash whitespace within the lines. (It's the recombining in the second
awk that does this; you could replace it with something more
sophisticated.)

Option 3. The trouble is that awk (and sed) by default operate on a
per-line basis. Neither is really built for sorting its input, and the
sort command is also per line. Awk gives you enough programming tools
that you could solve this problem if you were willing to get very
complicated, but once you're dealing with multi-line text you need
better data structures. Using perl or python you could quite easily
load paragraphs as arrays of lines, and put those together into a
bigger array that can be sorted according to first lines. It wouldn't
be much of an extension to instead use objects containing the full
paragraph text as a single string together with the line array, and
then sort on the full paragraph text.
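
A minimal Python sketch of that idea, assuming a blank (or
whitespace-only) line ends a paragraph; the function names are
hypothetical and this is only one way to lay it out:

```python
def read_paragraphs(text):
    # Collect each paragraph as a list of its lines; a blank line ends it.
    paragraphs, current = [], []
    for line in text.splitlines():
        if line.strip():
            current.append(line)
        elif current:
            paragraphs.append(current)
            current = []
    if current:
        paragraphs.append(current)
    return paragraphs

def sort_by_first_line(text):
    # sorted() is stable, so paragraphs with identical first lines keep
    # their original relative order. Lines are never split or rejoined,
    # so whitespace inside them is untouched.
    paragraphs = sorted(read_paragraphs(text), key=lambda par: par[0])
    return "\n\n".join("\n".join(par) for par in paragraphs) + "\n"
```

To sort on the full paragraph text instead, the key could be something
like key=lambda par: "\n".join(par).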

-- 
imalone
http://ibmalone.blogspot.co.uk