On 29 June 2015 at 02:38, jd1008 <jd1008@xxxxxxxxx> wrote:
>
> On 06/28/2015 07:02 PM, Stephen Davies wrote:
>>
>> On 29/06/15 10:13, jd1008 wrote:
>>>
>>> On 06/28/2015 06:38 PM, jd1008 wrote:
>>>>
>>>> Hi,
>>>> I have text files made of paragraphs of text, separated by
>>>> blank lines.
>>>>
>>>> Each "paragraph" is information about a different item.
>>>>
>>>> I need to sort these paragraphs based on the first line
>>>> of each paragraph.
>>>>
>>>> Need some hints how to accomplish this.
>>>>
>>>> Thanx.
>>>
>>> Forgot to say that each paragraph is made of multiple lines,
>>> but a paragraph's lines do not contain a blank line.
>>
>> I would just concatenate lines until the blank is reached then write out
>> the concatenated line.
>> The result can then be sorted.
>>
>> If you want to revert the result to paragraphs, just reverse the process
>> outputting lines of up to N characters ending in a space.
>>
>> HTH,
>> Stephen
>>
> Too much work to break the one line back into multiple lines because the
> lines are of different lengths.
> Too many files also. Also, to keep original lines of a paragraph unmangled,
> I would have to first
> do something like append each line of a paragraph with a delineating
> character to be used by something
> like sed to change that character into a newline.
>

Actually, you'd want to replace every newline inside a paragraph (that is,
all but the blank-line separators) with a token, so you can sort your
paragraphs, then change that token back to a newline afterwards. This gives
you a problem: if you want to do this as a stream, the token has to come
from the same set of characters that can legitimately appear in the text.
So you need to translate it into an escaped string or something similar to
do this cleanly, without risking turning a valid character into a newline
by mistake.

There are three other things you can do. One is, as Stephen suggests, to
treat paragraphs as single lines and re-wrap them on output (the fold
command can help here), if preserving exact line breaks doesn't matter.
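To make the escaped-token idea from the start of this reply concrete, here is a minimal Python sketch. It is an illustration, not something from the thread. The function names (`split_paragraphs`, `escape`, `unescape`, `sort_via_escaping`) are made up for the example, and it assumes that a whitespace-only line counts as blank. Backslashes are escaped first, so a backslash followed by `n` in the one-line form can only ever stand for a real line break.

```python
def split_paragraphs(text):
    """Group consecutive non-blank lines into paragraphs (lists of lines)."""
    paragraphs, current = [], []
    for line in text.splitlines():
        if line.strip():          # assumption: whitespace-only lines are blank
            current.append(line)
        elif current:
            paragraphs.append(current)
            current = []
    if current:
        paragraphs.append(current)
    return paragraphs


def escape(paragraph):
    """Turn a multi-line paragraph into one line, reversibly."""
    # Escape backslashes first, then encode newlines as backslash + 'n'.
    return paragraph.replace("\\", "\\\\").replace("\n", "\\n")


def unescape(line):
    """Undo escape(): backslash + 'n' is a newline, backslash + c is c."""
    out = []
    i = 0
    while i < len(line):
        if line[i] == "\\" and i + 1 < len(line):
            nxt = line[i + 1]
            out.append("\n" if nxt == "n" else nxt)
            i += 2
        else:
            out.append(line[i])
            i += 1
    return "".join(out)


def sort_via_escaping(text):
    """Sort paragraphs by sorting their escaped one-line forms."""
    one_liners = sorted(escape("\n".join(p)) for p in split_paragraphs(text))
    if not one_liners:
        return ""
    return "\n\n".join(unescape(line) for line in one_liners) + "\n"
```

Note that this compares whole escaped paragraphs, as piping the one-line forms through sort would. The result can differ from a strict first-line sort when one first line is a prefix of another.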
Two is using awk: you could get it to output a sort key at the start of
each line. This would be something like the paragraph's first word and a
line number within the paragraph (possibly also a paragraph count, to avoid
mixing lines from different paragraphs with the same first word). You'd
reset the paragraph line number at each blank line and increment the
paragraph count. Your print would look something like:

    print firstword" "paranum" "paraline" "$0

Then you have to undo it at the other end. This is a rough attempt:

    awk 'BEGIN  { paranum = 1; paraline = 1 }
         /^$/   { paraline = 1; paranum++ }     # blank line: next paragraph
         /^.+$/ { if (paraline == 1) firstword = $1
                  print firstword " " paranum " " paraline " " $0
                  paraline++ }' somefile |
    sort -k1,1 -k2,2n -k3,3n |
    awk '{ if ($3 == 1 && NR > 1) print ""      # blank line between paragraphs
           for (i = 1; i < NF - 2; i++) $i = $(i + 3)   # drop the 3 key fields
           NF = NF - 3; print $0 }'

... but it's limited: you may actually want to sort not by the first word
but by the first few words of each paragraph, and it will trash whitespace
within the lines (it's the recombining in the second awk that does this;
you could replace it with something more sophisticated).

Option 3. The trouble is that awk (and sed) really operate on a per-line
basis. Neither really even allows sorting of input, and the sort command is
also per line. Awk gives you sufficient programming tools that you could
solve this problem if you were willing to get very complicated. However,
once you're dealing with multi-line text you need better data structures.
Using perl or python you could quite easily load paragraphs as arrays of
lines, and put those together into a bigger array that can be sorted
according to first lines. It wouldn't be much of an extension either to use
objects containing the full paragraph text as a single string together with
the line array, and then be able to sort on the full paragraph text.
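As a rough illustration of option 3, here is a minimal Python sketch that loads each paragraph as a list of lines and sorts the list of paragraphs on their first lines. It is not a tested tool. The function names `read_paragraphs` and `sort_paragraphs` are invented for the example, and it assumes that whitespace-only lines count as paragraph separators.

```python
def read_paragraphs(text):
    """Return a list of paragraphs, each a list of its original lines."""
    paragraphs, current = [], []
    for line in text.splitlines():
        if line.strip():          # assumption: whitespace-only lines are blank
            current.append(line)
        elif current:             # a blank line closes the current paragraph
            paragraphs.append(current)
            current = []
    if current:                   # the last paragraph may lack a trailing blank
        paragraphs.append(current)
    return paragraphs


def sort_paragraphs(text):
    """Sort paragraphs by first line, leaving the lines inside each intact."""
    paragraphs = read_paragraphs(text)
    # list.sort is stable, so paragraphs with identical first lines
    # keep their original relative order.
    paragraphs.sort(key=lambda para: para[0])
    if not paragraphs:
        return ""
    return "\n\n".join("\n".join(para) for para in paragraphs) + "\n"
```

For a file this could be used as `sort_paragraphs(open(path).read())`, where `path` is whatever file you are processing. Changing the key to `"\n".join(para)` would sort on the full paragraph text instead of the first line.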
-- 
imalone
http://ibmalone.blogspot.co.uk