Re: [PATCH] send-email: Fix Pine address book parsing

Trent Piepho <tpiepho@xxxxxxxxxxxxx> · Tue, 25 Nov 2008 21:59:45 -0800 (PST)

On Tue, 25 Nov 2008, Junio C Hamano wrote:
> Trent Piepho <tpiepho@xxxxxxxxxxxxx> writes:
>> See:  http://www.washington.edu/pine/tech-notes/low-level.html
>>
>> Entries with a fcc or comment field after the address weren't parsed
>> correctly.
>>
>> Continuation lines, identified by leading spaces, were also not handled.
>>
>> Distribution lists which had ( ) around a list of addresses did not have
>> the parenthesis removed.
>
>> +	pine => sub { my $fh = shift; my $f='\t[^\t]*';
>> +	        for (my $x = ''; defined($x); $x = $_) {
>> +			chomp $x;
>> +		        $x .= $1 while(defined($_ = <$fh>) && /^ +(.*)$/);
>> +			$x =~ /^(\S+)$f\t\(?([^\t]+?)\)?(:?$f){0,2}$/ or next;
>
> Hmm, so you chomp each continuation line with /^ +(.*)$/ and concatenate
> that to the hold buffer ($x) as long as you see continuation lines,
> a non-continuation line that you read ahead is given to the next round
> (the third part of for(;;) control), checked if you hit an EOF and then
> chomped.  Which means the complicated regexp about the parentheses is
> applied to a logical single line in $x that does not have any newline in
> it, right?

Yes.  The previous regex would just grab the email address with (\S+)$, but
that's not right.  There can be email addresses with spaces in them, like
"John Doe <jdoe@xxxxxxxx>", and the email address isn't always the last
field.  So each field has to be spelled out in the regex, and \S+ and \s*
have to become [^\t]* and \t to count fields properly.  That's why the regex
got so complex.
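
To make the field counting concrete, here is a rough sketch of the same
pattern.  It is written in Python rather than the Perl of the patch, purely
for illustration; parse_entry and the example.com addresses are made up, and
the trailing group is written as a plain non-capturing group.

```python
import re

# One tab-separated field: a tab followed by any run of non-tab characters.
FIELD = r"\t[^\t]*"

# nickname, fullname, address (optionally wrapped in parentheses),
# then up to two optional trailing fields (fcc and comment).
ENTRY = re.compile(
    r"^(\S+)" + FIELD + r"\t\(?([^\t]+?)\)?(?:" + FIELD + r"){0,2}$"
)

def parse_entry(line):
    """Return (nickname, address field) for one logical line, or None.

    Hypothetical helper: the line is assumed to have no trailing newline
    and to have had its continuation lines joined already.
    """
    m = ENTRY.match(line)
    return (m.group(1), m.group(2)) if m else None

# An address containing spaces, followed by fcc and comment fields.
single = parse_entry("jd\tJohn Doe\tJohn Doe <jdoe@example.com>\tsent-mail\tcolleague")

# A distribution list: the surrounding ( ) are not part of the capture.
group = parse_entry("team\tDev Team\t(a@example.com,b@example.com)")
```

Because every field is matched as [^\t]*, spaces inside the address stay in
the capture, and the optional trailing fields cannot be mistaken for it.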

> I wonder what this does:
>
> 	$x .= $1 while (defined($_ = <$fh>) && /^ +(.*)$/);
>
> when you have "a b" in $x and feed " c\n d\ne\n" to it.  When it leaves
> the loop, you would have "e\n" in $_ for the next round, and "a bcd" (note
> that "bcd" becomes one word) in $x, which I suspect may not be what you
> want.

The tech docs I linked to just say pine continues lines with leading spaces,
but not exactly how many.  From what I can see it usually uses three spaces,
but sometimes one space when wrapping a very long comment field.  It also
appears to split lines only between whitespace and non-whitespace, leaving
the whitespace at the end of the line being continued.  So if "a b c d\n"
were to be wrapped, it would be something like "a b \n   c \n   d\n".
Eating the leading spaces therefore reassembles this as "a b c d".  If I
didn't eat the leading spaces in the continuations, it would be re-assembled
as "a b    c    d".  This might cause an address to become
"John     Doe <jdoe@xxxxxxxx>".
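
Both cases can be checked with a small sketch of the joining step.  This is
Python, not the Perl in the patch, and join_continuations is a made-up name;
it only assumes that a continuation line is any line starting with one or
more spaces.

```python
import re

def join_continuations(lines):
    """Fold lines that start with spaces into the preceding logical line.

    The leading spaces of a continuation are dropped; whatever whitespace
    ends the previous physical line is kept as-is.
    """
    logical = []
    for raw in lines:
        line = raw.rstrip("\n")
        m = re.match(r"^ +(.*)$", line)
        if m and logical:
            # Continuation: append the text after the leading spaces.
            logical[-1] += m.group(1)
        else:
            logical.append(line)
    return logical

# Wrapped the way pine appears to do it: the space stays before the break.
pine_style = join_continuations(["a b \n", "   c \n", "   d\n", "e\n"])

# The input from the question: no trailing space before the break.
no_trailing = join_continuations(["a b\n", " c\n", " d\n", "e\n"])
```

With the trailing space in place the words come back as "a b c d"; without
it they do run together as "a bcd", which is the case raised above.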