Awk and sort (of text files)

jd1008 jd1008 at gmail.com
Mon Jun 29 18:57:12 UTC 2015


On 06/29/2015 11:48 AM, Bill Oliver wrote:
> On Mon, 29 Jun 2015, jd1008 wrote:
>
>>
>>
>> On 06/29/2015 03:39 AM, Dario Lesca wrote:
>>>  On Sun, 28/06/2015 at 18.38 -0600, jd1008 wrote:
>>> >  Hi,
>>> >  I have text files made of paragraphs of text, separated by
>>> >  blank lines.
>>> >
>>> >  Each "paragraph" is information about a different item.
>>> >
>>> >  I need to sort these paragraphs based on the first line
>>> >  of each paragraph.
>>> >
>>> >  Need some hints how to accomplish this.
>>> >
>>> >  Thanx.
>>>  An example of your text file can help us to help you.
>>>
>> I described them perfectly.
>> text paragraphs made of a few or several lines.
>>
>> The paragraphs are separated by an empty line.
>>
>>
>
> Try something like this.  It's buggy, but what can you expect for 5
> minutes of work.
>
> This takes a text with lines separated by hard breaks, and an empty line
> between paragraphs, and sorts it.
>
> Here are the obvious problems I haven't bothered to debug:
>
> 1) It counts the empty lines as paragraphs, so you get blank space at the
> top.
>
> 2) I'm doing something wrong with asort (see comment).
>
> 3) It looks like I'm sorting twice -- once with asort, and then to
> reindex.  There should be a smart way to do this.
>
> Here's the awk code:
>
>
> BEGIN{newparagraph=0; numlines=0; paranum=0;}
>
>         {
>         #if the line is blank, it's time to start a new paragraph
>         if ($0==""){
>                 paranum++;
>                 numlines=0;
>                 }
>         #if it's not blank, buffer it
>         else {
>                 numl[paranum]=numlines;
>                 paragraph[paranum][numlines++] = $0;
>                 }
>
>         }
>
>
> END{
>
>         for (i=0;i<=paranum;i++){
>                 firstline[i] = paragraph[i][0]
>
>                 }
>
>         #for a reason I don't understand, "sorted" has one index more than firstline!?
>         #I'm probably making some mistake with starting with 0 vs 1, but I'm not going to fix it.
>         # so, I'll just increment paranum, because I'm lazy
>         asort(firstline,sorted);
>         paranum++
>
>
>
>         #Renumber the indices
>         for (i=0;i<=paranum;i++){
>
>                 found=0;
>                 newindex[i] = 999;
>                 for(j=0;((j<=paranum) && (found==0));j++){
>                         if(sorted[i] == firstline[j]){
>                                 newindex[i]=j;
>                                 found=1;
>                                 }
>                 }
>
>         }
>
>         #print it out
>         for(i=0;i<=paranum;i++){
>
>                 current_paragraph = newindex[i];
>                 new_numlines = numl[current_paragraph];
>
>                 for (j=0;j<=new_numlines;j++){
>                         print (paragraph[newindex[i]][j]);
>                         }
>                 print("");
>                 }
>
>         }
>
Here is the simplest solution and it does what I want without resorting 
to awk:
for i in lists*; do
sed '/./{H;$!d;};x;s/\n/={NL}=/g' "$i" | sort |
    sed '1s/={NL}=//;s/={NL}=/\n/g' > "$i.sorted.txt"
done
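
For anyone wondering how that works, here is the same idea wrapped in a
small function. Treat it as a sketch, not a polished tool. The name
sort_paragraphs is made up for illustration. It assumes GNU sed (for \n in
the replacement), exactly one blank line between paragraphs, and that the
marker ={NL}= never occurs in the data. The LC_ALL=C is my addition to make
the ordering byte-wise and reproducible; drop it if you want locale ordering.
The '$!d' keeps the last paragraph when the file does not end with a blank
line.

```shell
#!/bin/sh
# sort_paragraphs (hypothetical helper): read blank-line-separated
# paragraphs on stdin, write them to stdout ordered by their first line.
#
# Step 1: the first sed collects each paragraph in the hold space and, at a
#         blank line (or at end of input), emits it as ONE line with every
#         line break replaced by the marker ={NL}=.
# Step 2: sort sees one record per line, so a plain sort orders paragraphs.
# Step 3: the second sed drops the leading marker on the first record and
#         turns the remaining markers back into newlines; the leading marker
#         on later records becomes the blank separator line.
sort_paragraphs() {
    sed '/./{H;$!d;};x;s/\n/={NL}=/g' |
        LC_ALL=C sort |
        sed '1s/={NL}=//;s/={NL}=/\n/g'
}
```

Used on a single file (lists.txt is just a placeholder name) it would be:

    sort_paragraphs < lists.txt > lists.txt.sorted.txt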



