
Sample solutions and discussion
Perl Quiz of The Week #1 (20021016)

----------------------------------------------------------------

        Write a function, 'center', whose argument is a list of strings, which
        will be lines of text.  'center' should insert spaces at the beginning
        of the lines of text so that if they were printed, the text would be
        centered, and return the modified lines.

        For example,

                center("This", "is", "a test of the", "center function");

        should return the list:

                "     This", "      is", " a test of the", "center function"

        because if these lines were printed, they would look like:

                     This
                      is
                 a test of the
                center function


I had in mind a solution something like this one:

        sub center {
          my $maxwidth = 0;
          my @s = @_;
          for (@s){
            $maxwidth = length($_) if length($_) > $maxwidth;
          }
          for (@s) {
            $_ = " " x (($maxwidth - length($_))/2) . $_;
          }
          @s;
        }

Most of the solutions posted on the perl-qotw-discuss list did look
very much like this.  Peter Haworth used 'substr' to insert the spaces
at the beginning of the lines.  Using this technique in the code above
would replace the "$_ = ..." line with:

            substr($_, 0, 0) = " " x (($maxwidth - length($_))/2);

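For anyone unfamiliar with it, 'substr' can be used as an lvalue, and
assigning to a zero-length substring inserts text at that position
without disturbing the rest of the string:

```perl
# Assigning to a zero-width substring at position 0 prepends text.
my $line = "is";
substr($line, 0, 0) = " " x 6;    # insert six spaces at the front
# $line is now "      is"
```
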
Other notes:

1. A lot of people seem not to understand what it means to "write a
   function which will return [some value]".  Several people posted
   functions which would *print* centered text; this was not what the
   function was supposed to do.  This always surprises me when I'm
   doing training classes, and it continued to surprise me when it came
   up this week.

   A function which prints out centered text is only about one
   one-hundredth as useful as a function which centers strings and
   returns them.  If the function returns the values without printing
   them, the centered values can be modified and printed later; they
   can be incorporated into some larger data structure; they can be
   printed to STDERR as part of a warning message or to a file; they
   can be sent into some other command via a pipe; they can be sent
   over the network; they can be analyzed by some other part of the
   program and then discarded.  A function which prints out the lines
   cannot be used in any of these ways.

   If I were teaching a class in how to be a professional programmer,
   I would mention this in the first week and then harp on it every
   day for the next two years.  Most of what a programmer does is to
   design functions to be used by other programmers, and one of the
   biggest obstacles to making good software is that many functions
   are designed with broken interfaces so that they can't be used in
   more than one way.  Functions should almost never print *anything*
   out.

2. Several people asked how wide a line the text should be centered
   in.  The problem statement makes no remark about centering the text
   within a fixed-width line.  In particular, the example should make
   clear that the text is not to be centered within an 80-character
   line, since if it were then the sample output would look like this:

        "                                      this", 
        "                                       is", 
        "                                 a test of the", 
        "                                center function", 

   But it doesn't; as I said in the original message, it looks like this:

        "     This", "      is", " a test of the", "center function"

   Assuming that the output device is always 80 characters wide was a
   bad practice even in the 1970s when there was a physical reality
   that underlay it.  Now, when I can have a 93-character-wide
   terminal at the touch of a button, it makes no sense at all.

   If you did decide to center the text within a fixed-width line, you
   should have provided the function with a parameter to determine the
   line width, perhaps something like this:

        sub center {
          my $linewidth = 80;
          if (ref $_[0]) {
            my $args = shift;
            $linewidth = $args->{LINEWIDTH};
          }

          ...
        }

   Then the user has the option of calling 'center(...)' to center the
   text within your default-width line, or 'center({LINEWIDTH => 17}, ...)'
   to choose a different width.
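
   Filled out along those lines, the whole function might look like
   this (a sketch; the hash-argument handling is the convention
   suggested above):

```perl
sub center {
  my $linewidth = 80;                 # default width
  if (ref $_[0]) {                    # optional leading hashref
    my $args = shift;
    $linewidth = $args->{LINEWIDTH};
  }
  my @s = @_;
  for (@s) {
    my $extra = $linewidth - length($_);
    $_ = " " x ($extra / 2) . $_ if $extra > 0;
  }
  @s;
}
```

   Lines longer than the chosen width are simply left alone.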

3. Here's one minor point I found puzzling.  Several people wrote
   expressions like this one:

         ( $maxlen / 2 ) - ( length( $_ ) / 2 )


   When I see something like this, I usually want to rewrite it as

        ( $maxlen - length ( $_ ) ) / 2

   unless for some reason the first one is a lot more perspicuous.
   Here the second one seems simpler to me: How many spaces?  Take the
   length of the line, and subtract that from the width of the line;
   that says how much extra space is on the line; the amount of space
   to the left of the text is exactly half the extra space.

   For some reason this reminds me of a time when my aunt came to me
   to ask me for help with a math problem.  She wanted a method for
   finding the number halfway in between two other numbers.  Her
   method was to find the difference between the two numbers, divide
   it by 2, and then add the result back to the smaller number.  She
   wanted to know if there was an easier way.

   There were two things about this that surprised me.  The first was
   that she did not recognize that the average of two numbers is
   exactly the same as the number that is halfway in between them.
   The other surprising thing is that in spite of years of
   mathematical training in high school and college, she was not able
   to frame her original method algebraically, discover that she was
   computing the expression x + (y-x)/2, and then reduce this to 
   2x/2 + (y-x)/2, to (2x+y-x)/2, and finally to (x+y)/2.  I don't
   know what the point of this story is, except perhaps that something
   similar is going on with the (maxlen/2)-(length/2) expressions.

   A couple of people wrote something like:

         ' ' x ($max - length($_) / 2)

   apparently not realizing that this inserts far too many spaces.
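
   The culprit is precedence: '/' binds more tightly than '-', so this
   computes $max - (length($_) / 2) spaces.  With hypothetical values,
   say $max = 14 and a four-character line:

```perl
my $max  = 14;
my $line = "test";                                      # length 4
my $wrong = length(' ' x ($max - length($line) / 2));   # 14 - 2 = 12
my $right = length(' ' x (($max - length($line)) / 2)); # 10 / 2 = 5
```
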

4. Steve Smoot pointed out something I hadn't considered: The input
   strings themselves might contain newlines.  The problem statement
   seems to preclude this, since it says that the argument strings
   "will be lines of text".  But if you want to do something
   reasonable, it's quite easy:

        sub center {
          my @lines = map { split /\n/ } @_;
          # now adjust @lines...
        }

5. If a line couldn't be exactly centered in the space available, most
   people just shifted it a half-space to the left.  Some people used
   'int' to throw away the remainder after dividing by 2; some just
   took advantage of the implicit behavior of the 'x' operator to do
   the same thing.  One person used 'use integer', which I think might
   be risky.  'integer' doesn't just mean 'integer'; it really means
   'use the underlying C semantics for your operators', and so it
   may also change the behavior of operators like '&' and '%'.
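
   For example, '%' under 'use integer' adopts C's truncating
   semantics for negative operands:

```perl
my $default = -7 % 3;   # Perl's default '%' takes the sign of the
                        # right operand, so this is 2
my $c_style = do {
  use integer;          # lexically scoped to this block
  -7 % 3;               # C-style truncation toward zero: -1
};
```
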

6. Some people were tempted to use the centering in Perl's built-in
   'format' feature.  This seemed to be more trouble than it was
   worth.  One such solution went like this:

        sub center
        {
                eval  ("format STDOUT = \n" .
         ( "@" . ( "|" x (length (join "", @_ )))) .   "\n" . '$_' . "\n.\n\n" );
                for(@_) {write}
        }

   The idea here is to build up a format definition of a format that
   looks like this:

        format STDOUT = 
        @|||||||||||||||||||||||||
        $_
        .

   The '@|||||' tells the 'format' system that you want text centered.

   There are a number of defects here.  One, easy to correct, is that
   it centers the text within a column whose width is equal to the
   *sum* of the lengths of the input.  This means that if you ask it to
   center fifty strings of ten characters each, each string gets
   245 spaces prepended.  Another defect is that the
   centered data is printed, instead of being returned as a list.  A
   more problematic defect is that if there was a 'STDOUT' format
   before, the subroutine has destroyed it.

   A better approach when dealing with dynamic formats is to use the
   'formline' function, which provides access to the same  internal
   formatting functions that Perl uses.  Ron Isaacson did this, basing
   a solution on an 'swrite' function.  'swrite' formats data in the
   same way that Perl's 'format' feature would have, but returns the
   resulting string instead of printing it.

          use Carp;
          sub swrite {
            croak "usage: swrite PICTURE ARGS" unless @_;
            my $format = shift;
            $^A = "";
            formline($format,@_);
            return $^A;
          }

          sub center {
            my @in = @_;

            my $len = (sort {$b <=> $a} map (length, @in))[0];
            my $format = '@' . '|' x $len;

            map { swrite ($format, $_) } @in;
          }

   This works, but it does seem to use a lot of code and obscure
   features in proportion to the amount of work it does.

7. Randy J. Ray wondered if there wasn't any way to get the result
   with a single pass over the argument list instead of two passes.
   If there was, nobody found it.

8. Aaron Crane pointed out that instead of scanning the arguments to
   find the longest one, you could use the List::Util::max function.
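
   With 'max' (List::Util has been bundled with Perl since 5.8), the
   scanning loop in the sample solution collapses to a single line:

```perl
use List::Util 'max';

sub center {
  my @s = @_;
  my $maxwidth = max map length, @s;   # length of the longest line
  $_ = " " x (($maxwidth - length($_)) / 2) . $_ for @s;
  @s;
}
```
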

9. As Tom Phoenix pointed out,
   Thereisneitherbonusnorhonoraccruedforomittingmostofthewhitespace.
   Perl Golf is three doors down on the left; the obfuscated contest
   is at the end of the hall down the stairs, and watch that first
   step; it's a doozy.  I've decided it's too much trouble to decipher
   obfuscated solutions, so there may have been points of interest in
   some of them, but I don't really care.


Thanks to everyone who has subscribed to this list, and to everyone
who participated in the discussion.  I'll send another quiz on
Wednesday.

Sample solutions and discussion
Perl Quiz of The Week #2 (20021023)


        Write a function, days_diff, to compute the time difference, in days,
        between two dates.  The dates will be strings in the format

                Wed Oct 16 2002

        For example:

                days_diff("Wed Oct 16 2002", "Wed Oct 23 2002")

        should return 7.

                days_diff("Wed Oct 16 2002", "Tue Oct 16 2001")

        should return -365.       


I thought this would be an easy problem.  But as I should have
remembered, almost nothing to do with date calculations is easy or
simple.  Some of the varied complications are discussed below.

I had originally imagined two types of solution.  One might use one of
the heavy-duty CPAN date calculation modules, such as Date::Calc or
Date::Manip; the other would use the standard Time::Local module.

The Time::Local solution I produced looked like this:

        use Time::Local 'timegm';

        my @mon = qw(Jan Feb Mar Apr May Jun Jul Aug Sep Oct Nov Dec);
        my %m2n = map {$mon[$_] => $_} 0 .. 11;

        sub days_diff {
          my @times;
          for (0 .. 1) {
            my ($dw, $mn, $dm, $yr) = split /\s+/, $_[$_];
            push @times, timegm(0, 0, 0, $dm, $m2n{$mn}, $yr-1900);
          }
          ($times[1] - $times[0])/86400;
        }

Here I used the 'timegm' function to turn each date into a Unix epoch
time (number of seconds since the start of 1970), subtract the two
epoch times to find the difference in seconds, and then divide the
result by 86400 to get the number of days difference.

1. It might seem as though this could produce a fractional result.
   The question is: Is the interval between consecutive midnights in
   GM (Greenwich Mean) time always exactly 86400 seconds?  

   The answer is no, not exactly, because GMT days are occasionally
   86401 seconds long.  Astronomers throw in an extra 'leap second' at
   the end of June or December to keep the actual solar noon
   synchronized with chronological noon; the extra second is
   occasionally necessary because the Earth's rotation is gradually
   slowing.  (They might in principle subtract a second sometimes
   because of random rotational changes called 'nutations', but it's
   never happened.)  In particular, there were extra seconds at the
   end of June 1997 and December 1998.

   However, as far as I know, no Unix system actually uses GMT, and no
   'gmtime()' function actually calculates Greenwich mean time!
   Instead, they all use UTC (Coordinated Universal Time) which is
   just like GM time, except without the leap seconds; UTC days are
   always exactly 86400 seconds long.  My system even documents this:

       The ctime(), gmtime() and localtime() functions all take an
       argument of data type time_t which represents calendar time.
       When interpreted as an absolute time value, it represents the
       number of seconds elapsed since 00:00:00 on January 1, 1970,
       Coordinated Universal Time (UTC).

   So the function above works even across intervals where the GM
   calendar contains leap seconds, supposing that the 'timegm'
   function is actually producing UTC times.  If not, the final
   difference should simply be rounded off to the nearest integer.

2. A second issue is that in many localities, mostly in Europe and
   North America, there's a practice called 'Daylight Saving Time'
   where the clock is set forward in the spring and backward in the
   autumn.  For example, in Philadelphia, there was no local time
   02:30 Sunday April 7 2002:
   
        plover% perl -le 'print scalar localtime(1018160911)'
        Sun Apr  7 01:28:31 2002

   But an hour later it was:

        plover% perl -le 'print scalar localtime(1018160911 + 3600)'
        Sun Apr  7 03:28:31 2002

   April 7 was only 23 hours long, because 2:00-2:59 was missing;
   similarly, yesterday (Sunday, October 27) was 25 hours long.  GM
   time does not have any such adjustments, so they don't affect the
   sample function above.  However, several people used the
   'timelocal' function instead of 'timegm' in similar solutions,
   leading to incorrect results.  For example, when called like this

        days_diff("Mon Oct 21 2002", "Mon Oct 28 2002"); # Daylight saving

   their functions would return 7.04166666666606 instead of 7, because
   of the extra hour.  Again, rounding off would have solved the
   problem, but the solutions I saw that used timelocal() didn't round
   off either.
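
   Rounding absorbs the stray hour in either direction; sprintf is one
   easy way to do it (a sketch; round_days is a hypothetical helper):

```perl
# $t0 and $t1 are epoch times from timelocal(); a DST transition can
# push their difference up to an hour away from a whole day.
sub round_days {
  my ($t0, $t1) = @_;
  return 0 + sprintf "%.0f", ($t1 - $t0) / 86400;
}
```
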

3. Here's a related issue.  One poster on the -discuss list presented
   a solution that would check the day of the week for validity, a
   reasonable addition:

        my $sttime = timelocal(0,0,1,$stdayno,$months{$stmonth},$styear);
        my @sttime = gmtime($sttime);

        if($stday ne $days[ $sttime[6] ]) { #  (gmtime)[6] is the dayname
           return "start date was invalid\n";
        }

   But there's an inconsistency in the implementation.  See it?  He
   uses timelocal() to convert the argument to epoch time, but then
   gmtime() to convert it back and get the day of the week.  That
   works OK where I live, because 1AM local time is either 5AM or 6AM
   UTC the same day, so the day of the week is the same.  But had this
   poster run his test program in Tokyo, where 1AM local time is 6PM
   UTC the previous day, he would never have had a success!
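
   Using the same time standard in both directions avoids the problem
   no matter where the program runs; with timegm() on the way in,
   gmtime() on the way out is always safe.  A sketch (valid_date is a
   hypothetical helper):

```perl
use Time::Local 'timegm';

my @days   = qw(Sun Mon Tue Wed Thu Fri Sat);
my %months = (Jan=>0, Feb=>1, Mar=>2, Apr=>3, May=>4,  Jun=>5,
              Jul=>6, Aug=>7, Sep=>8, Oct=>9, Nov=>10, Dec=>11);

# Convert with timegm(), convert back with gmtime(); both use UTC,
# so the weekday cannot shift across a midnight boundary.
sub valid_date {
  my ($dw, $mn, $dm, $yr) = split ' ', shift;
  my $t = timegm(0, 0, 1, $dm, $months{$mn}, $yr - 1900);
  return $dw eq $days[ (gmtime $t)[6] ];
}
```
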

4. Here's a puzzling issue.  For various reasons, none of the
   'timelocal()' solutions posted to the -discuss list actually
   works.  The one I excerpted above assumes that the dates are in the
   format 

        Wednesday 16 October 2002

   but the problem specification calls for

        Wed Oct 16 2002

   Two others assume that the dates will be in the format

        Wed 9 16 2002

   which seems rather unlikely.

   I'm curious about why the authors of these functions didn't get
   this right.  Was it carelessness, or a deliberate modification of
   the spec?
 
5. The major defect of the sample solution above is that the timegm()
   function has a limited range.  It returns its result as an integer
   number of seconds since 1970; on machines with 32-bit integers, this
   covers a range of about 136 years, from Fri Dec 13 20:45:52 1901
   through Tue Jan 19 03:14:07 2038.  Outside this range, it throws an
   exception.  This may be acceptable to the accounting department,
   but the limitation should be noted.
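
   If the limited range is acceptable, the function should at least
   fail politely instead of crashing the caller; wrapping the
   conversion in eval turns the exception into a testable failure.  A
   sketch (epoch_or_undef is a hypothetical helper; note that perls
   built with 64-bit integers have a much larger range):

```perl
use Time::Local 'timegm';

# Returns an epoch time, or undef if timegm() refuses the date.
sub epoch_or_undef {
  my ($dm, $mon, $yr) = @_;   # day of month, month number 0-11, year
  my $t = eval { timegm(0, 0, 0, $dm, $mon, $yr - 1900) };
  warn "date out of range for timegm: $@" if $@;
  return $t;
}
```
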

6. Solutions using one of the heavyweight CPAN date calculation
   modules don't have this range limitation.  These modules also do
   all the difficult work.  Here's a solution I produced that uses the
   'Date::Calc' module:

        use Date::Calc 'Delta_Days', 'Decode_Date_US';

        sub days_diff {
          my @d = @_;
          s/^\w+// for @d;      # Remove day of the week
          Delta_Days(map Decode_Date_US($_), @d);
        }

   'Decode_Date_US' attempts to parse and translate a Date in US
   format, where the month precedes the day number.  Unfortunately,
   the days of the week confuse this function, so I have to strip them
   out first.  The function returns a year number, month number (1-12)
   and day number.  The 'Delta_Days' function takes two dates in year,
   month, day format and computes the number of days difference
   between them.

   Steve Smoot produced essentially the same solution:

        sub days_diff {
            return Delta_Days(Decode_Date_US(substr($_[0],4)),Decode_Date_US(substr($_[1],4)));
        }

7. Shawn Carroll benchmarked a Date::Calc solution against a
   Date::Manip solution and found that Date::Calc was about 80 times
   faster.  This is probably because Date::Calc is written in C, while
   Date::Manip is in pure Perl.  The Date::Manip manual contains an
   extensive discussion of this point, and the tradeoffs between
   Date::Calc and Date::Manip.

8. Some people coded the date calculations by hand.  This is tricky to
   get right, but has the benefit that if you do it right, it
   doesn't have the range limitations of Time::Local.  Date
   calculations are really complicated, and tend to end up looking
   like a big stew, so I didn't bother to debug the ones that didn't
   work.  Here's one of the less stewish examples, provided by G. Rommel:

        sub days_diff {
         my ($start, $end) = @_;
         my ($wd1, $mo1, $day1, $yr1) = split ' ',$start;
         my ($wd2, $mo2, $day2, $yr2) = split ' ',$end;
        # Convert the month.
         my %mnum = ('Jan'=>0, 'Feb'=>1, 'Mar'=>2,
           'Apr'=>3, 'May'=>4, 'Jun'=>5,
           'Jul'=>6, 'Aug'=>7, 'Sep'=>8,
           'Oct'=>9, 'Nov'=>10, 'Dec'=>11);
        # Days before this month this year.
         my @db = qw(0 31 59 90 120 151 181 212 243 273 304 334);

         $mn1 = $mnum{$mo1};
         $startday = int(($yr1 - 1601) * 365.2425) + $db[$mn1] + $day1;
         $startday++ if $mn1 > 1 && ($yr1%4==0) && (($yr1%400==0) || ($yr1%100!=0));
         $mn2 = $mnum{$mo2};
         $endday = int(($yr2 - 1601) * 365.2425) + $db[$mn2] + $day2;
         $endday++   if $mn2 > 1 && ($yr2%4==0) && (($yr2%400==0) || ($yr2%100!=0));

         return $endday - $startday;
        }

   Rommel transforms each input date into a count of the number of
   days since the beginning of 1601.  (The 365.2425 is the average
   number of days in the Gregorian calendar year.)  He then subtracts
   the counts to get the difference.  Rommel notes that this approach
   fails to detect invalid dates ("Sep 37 2002" is interpreted as the
   same as "Oct 7 2002", for example) and that the function won't work
   after the year 5881210 because the number of days will no longer
   fit into an integer variable.

9. There was a long discussion on the -discuss list about Julian
   vs. Gregorian dates.  The calendar presently in use in most of the
   world is the Gregorian calendar, first introduced by Pope Gregory
   XIII in 1582.  Prior to this, most European countries used the
   Julian calendar, almost the same but with a different leap day
   schedule.  When you see a date like "Tuesday July 2 1776" there is
   a question about what it means; the same label may be applied by
   the Julian and Gregorian calendars to different days.

   There are some complications on top of this:

        * Not every country switched to the Gregorian calendar at the
          same time.  Most Catholic countries switched immediately;
          other countries held out.  Great Britain and its colonies
          (including what would eventually become the USA) switched
          calendars in 1752.  Russia switched in 1918; this is why the
          October Revolution is now celebrated in November.

        * The switch was accompanied by a one-time modification of the
          calendar, to bring the dates back into line with the
          seasons.  In Spain, for example, October 1582 had only 21
          days.  In Great Britain, September 1752 had only 19 days.
          In Sweden there was a big mix-up too complicated to explain
          here.

   http://www.geocities.com/CapeCanaveral/Lab/7671/gregory.htm
   contains some interesting details about these issues.

   So one might ask: What is days_diff("Fri Sep 1 1752", "Sun Oct 1
   1752")?  In Spain, it's 30, as one would expect.  In England or the
   USA, one might like the answer to be 19---except that in England,
   Sep 1 1752 was a Tuesday, not a Friday.

   Several people tried to take this into account.  All of the
   solutions were locality-specific.  For example, one gentleman wrote
   a version that was accurate in France, taking into account the
   fact that in France, Dec 10 1582 was followed immediately by Dec 20
   1582.

   I feel that this is misguided.  It's interesting, but it's a lot of
   work and the payoff seems small.  The gentleman I mentioned before
   who included the correct adjustment for France had his function
   deliver an error if you asked for Dec 15 1582, which didn't exist
   in France (or, more precisely, there was no date with that name):

        if ($jd[$i] > 15821210 and $jd[$i] < 15821220) {
                print "Not a valid date:\nFrance switched to the Gregorian calendar in 1582\n";
                print "and 10 Dec 1582 was followed immediately by 20 Dec\n";
                exit;
        }

   This person would have had a much easier time if he had lived in
   Israel rather than in France.  The entire function could have been
   replaced with:

        sub days_diff {
          my $date = shift;
          print "There is no date called '$date'\n";
          exit;
        }

   (In the Hebrew calendar, "Mon Oct 28 2002"  is called "Heshvan 22, 5763".)

   If one is going to reject historically inaccurate date names, then why not
   also throw an error for 10 Dec 1793?  There was no date with that
   name either, in France, because the Gregorian calendar was
   abolished for 13 years after the Revolution and was replaced by a
   new calendar, in which 10 Dec 1793 was known instead as

        Decade II, Decadi de Frimaire, de l'Annee 2 de la Revolution

   or more succinctly, "20 Frimaire II", perhaps.

   And this says nothing about the question of whether France will
   still be using the Gregorian calendar 7,000 years from now.

   It's my feeling that if you're really trying to convert historic
   dates, the interface presented by days_diff() is hopelessly
   inadequate.  The sample solutions pretend that the Gregorian
   calendar was in use everywhere at all times, which isn't
   historically accurate, but it's probably the best you can do
   without expending an enormous amount of effort; see the GNU Emacs
   'calendar' package, for example.

   It didn't occur to me when I posed the question that people would
   get worried about this.  But the issue could turn out to be
   important for some applications.  For example, if you're trying to
   compute interest payments for money borrowed before the calendar
   change, it would be unfair to charge a full month's interest for
   September 1752 or October 1582 or whatever, when those months were
   ten or eleven days short.  But I think in such a case, you would
   really have to go back to the people requesting the function and
   ask what they wanted it to do.

10. Last week I observed with some surprise that when some code failed
    in some circumstances, people tended to come up with very
    complicated examples rather than simple ones.  I observed this
    again this week.  To illustrate the potential difficulty in
    handling Julian vs. Gregorian dates, folks brought up the
    September 1752 oddity in the Great Britain calendar.  A simpler
    example would be that

        days_diff("Xxx Feb 28 1700", "Xxx Mar 1 1700")

    should return 1 if the dates are interpreted as Gregorian dates,
    but 2 if they are interpreted as Julian dates, since the Julian
    calendar has Feb 29 1700 and the Gregorian calendar omits it.
                
11. Astronomers use their own modification of the Julian calendar;
    they label the dates with numbers, with day 0 being a certain day
    about 6700 years ago, and increasing by 1 each day afterwards.  If
    you could convert a (presumably Gregorian) date like "Wed Oct 16
    2002" to astronomical form, you could then subtract the day
    numbers of two dates to get the number of days in between.

    Unfortunately, nobody implemented this right.  One programmer who
    chose this path did this:

        sub days_diff {
             my($day,$month,$mday,$year) = split(/\s+/,$_[0]);
             my($day2,$month2,$mday2,$year2) = split(/\s+/,$_[1]);

             my $monthToNum = {
                 Jan => 1,
                 Feb => 2,
                 Mar => 3,
                 Apr => 4,
                 May => 5,
                 Jun => 6,
                 Jul => 7,
                 Aug => 8,
                 Sep => 9,
                 Oct => 10,
                 Nov => 11,
                 Dec => 12,
             };

             my $jd_1 = _jday($year,$monthToNum->{$month},$mday);
             my $jd_2 = _jday($year2,$monthToNum->{$month2},$mday2);

             return $jd_2 - $jd_1;
        }

        sub _jday {
             my($y,$m,$d) = @_;
             my $jd = ( 1461 * ( $y + 4800 + ( $m - 14 ) / 12 ) ) / 4 +
             ( 367 * ( $m - 2 - 12 * ( ( $m - 14 ) / 12 ) ) ) / 12 -
             ( 3 * ( ( $y + 4900 + ( $m - 14 ) / 12 ) / 100 ) ) / 4 +
             $d - 32075;
             return $jd;
        }

   The '_jday' function here is supposed to convert a year, month, and
   day to an astronomical Julian day number.  This programmer said:

        What I did do was Google for 'julian day' and I found
        http://hermetic.magnet.ch/cal_stud/jdn.htm

   That's a good approach in general, but unfortunately, he cribbed
   the code without reading the accompanying discussion on that page:

        Days are integer values in the range 1-31, months are integers
        in the range 1-12, and years are positive or negative
        integers. Division is to be understood as in integer
        arithmetic, with remainders discarded.

   This programmer's '_jday' function doesn't discard remainders, so
   it produces mostly wrong answers.  A correct version:

           sub _jday {
             my($y,$m,$d) = @_;
             my $a = int(($m-14)/12);
             my $b = int(( 1461 * ( $y + 4800 + $a ) ) / 4);
             my $c = int(( 367 * ( $m - 2 - 12 * $a ) ) / 12);
             my $e = int(( $y + 4900 + $a ) / 100 );
             my $f = int(( 3 * $e ) / 4);
             my $jd =  $b + $c - $f + $d - 32075;
             return $jd;
           }

   One other person posted an astronomical Julian day solution to the
   -discuss list, and made the same mistake.  I think the moral here is
   something about how you can't just paste code into your program and
   expect it to work.
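
    As a sanity check, the corrected '_jday' reproduces the examples
    from the problem statement:

```perl
# Corrected _jday: integer division with remainders discarded, as the
# source page specifies.
sub _jday {
  my ($y, $m, $d) = @_;
  my $a = int(($m - 14) / 12);
  my $b = int((1461 * ($y + 4800 + $a)) / 4);
  my $c = int((367 * ($m - 2 - 12 * $a)) / 12);
  my $e = int(($y + 4900 + $a) / 100);
  my $f = int((3 * $e) / 4);
  return $b + $c - $f + $d - 32075;
}

# days_diff("Wed Oct 16 2002", "Wed Oct 23 2002") should be 7:
my $week = _jday(2002, 10, 23) - _jday(2002, 10, 16);
# days_diff("Wed Oct 16 2002", "Tue Oct 16 2001") should be -365:
my $year = _jday(2001, 10, 16) - _jday(2002, 10, 16);
```
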

12. Here's a small test suite:

        use Test;
        BEGIN {plan tests => 7 }
        END {
          ok(days_diff("Wed Oct 16 2002", "Wed Oct 23 2002"), 7);
          ok(days_diff("Wed Oct 16 2002", "Tue Oct 16 2001"), -365);
          ok(days_diff("Mon Oct 21 2002", "Mon Oct 28 2002"), 7); # Daylight saving
          ok(days_diff("Thu Oct 31 2002", "Fri Nov 1 2002"), 1);
          ok(days_diff("Sun Jun 29 1997", "Tue Jul 1 1997"), 2); # Leap second
          ok(days_diff("Wed Dec 30 1998", "Fri Jan 1 1999"), 2); # Last leap second
          ok(days_diff('Wed Jul 4 1776','Tue Jul 4 1976'), 73048);
        }

        1;

   To use this, put it in a file called 'DiffTest.pm'; then add the
   line

        use DiffTest;

   to the top of the file that contains your days_diff() function.

Thanks again to all the subscribers, and to those who participated in
the discussion.  I will send another quiz on Wednesday.

Sample solutions for this week's 'expert' quiz may be slightly delayed,
since I have some other things to attend to this afternoon.

Sample solutions and discussion
Perl Quiz of The Week #3 (20021030)


        Write a program, 'spoilers-ok'.  It will read the
        quiz-of-the-week email message from the standard input,
        extract the date that the message was sent, and print a
        message that says

                It is okay to send spoilers for this quiz

        or

                It is too soon to send spoilers for this quiz.
                You may send spoilers in another 4 hours 37 minutes.

        It becomes okay to send spoilers after 60 hours have elapsed
        from the time the quiz was sent.

        You can be sure that the 'Date' field of the QOTW email
        message will always be in the same format, which is dictated
        by internet mail format standards.  For those unfamiliar with
        this format:

                Date: Wed, 23 Oct 2002 16:10:15 -0400

        The "16:10:15" is the time of day.  "-0400" means that the
        time of day is in a time zone that is 4 hours behind
        Greenwich.

Effective use of modules turned out to be the key to the best
solutions to this quiz.  There are three key items:

  extract the date field from the mail header
  parse the date
  format the output

When I went to write up a sample solution last week, I knew there must
be a module for parsing mail headers, but I just couldn't find it.  I
spent some time looking for it, and then lost patience.  It turned out
to be Mail::Header.

This short and utterly straightforward solution was provided by Craig
Sanders:

        #! /usr/bin/perl -w

        use Mail::Header;
        use Date::Parse; 

        use strict;

        my $head = new Mail::Header [<>], Modify => 0;

        my $date = $head->get("Date");

        my $message_time = str2time($date);

        my $ok_time = $message_time + 3600 * 60;

        my $now = time();

        if ($now >= $ok_time) {
          print "It is okay to send spoilers for this quiz\n" ;
        } else {
          my $diff = $ok_time - $now ;
          my $hours = int($diff / 3600);
          my $minutes = int(($diff - $hours * 3600) / 60);
          print "It is too soon to send spoilers for this quiz.\n" ;
          print "You may send spoilers in another $hours hours $minutes minutes.\n" ;
        }

Some people (including me) wrote more than twice as much code to
accomplish the same thing.


1. People used a variety of date-parsing modules.  In addition to
   Date::Parse, people also used Date::Manip, Time::Local, and
   HTTP::Date.  But if you use Time::Local, you must extract and
   combine the parts of the date yourself, which leaves room for
   mistakes.  One of the submitted solutions that used
   Time::Local made an error in the time zone handling:

        $release_time += ((-$3 * 36) + 216000);  # timezone, 60 hr.delay
    
   The (-$3 * 36) is the time zone adjustment here; $3 contains the
   time zone part of the date field.  This adjustment works for most
   time zones, but not all.  For example, had the quiz-of-the-week
   been sent from India, where the time zone is +0530 (five hours,
   thirty minutes) the calculated adjustment would have been 19080
   seconds, instead of 19800.  This is probably an argument in favor
   of the modules.
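   To see why, consider the arithmetic directly.  Here is a sketch
   (mine, not from any submitted solution; 'zone_offset' is a made-up
   name) of a zone adjustment that handles the minutes part correctly:

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Convert an RFC 2822 zone field like "-0400" or "+0530" into an
# offset in seconds.  Multiplying the whole four-digit field by 36
# only works when the minutes part is zero, because the field is not
# a plain number of hours: "+0530" is 5 hours 30 minutes, not 5.3
# hours.
sub zone_offset {
    my ($zone) = @_;
    my ($sign, $hh, $mm) = $zone =~ /^([+-])(\d\d)(\d\d)$/
        or die "Bad zone '$zone'\n";
    my $seconds = $hh * 3600 + $mm * 60;
    return $sign eq '-' ? -$seconds : $seconds;
}

print zone_offset('+0530'), "\n";   # 19800, where the *36 trick gives 19080
print zone_offset('-0400'), "\n";   # -14400
```

   Date::Parse and the other modules do this bookkeeping for you,
   which is the point of the note above.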

2. Craig's solution has a minor defect: at times, it will generate
   outputs like

        You may send spoilers in another 1 hours 1 minutes.

   This is bad English.  One easy way to take care of it:

        my $Hours   = $hours == 1   ? 'hour'   : 'hours';
        my $Minutes = $minutes == 1 ? 'minute' : 'minutes';

        print "You may send spoilers in another $hours $Hours $minutes $Minutes.\n" ;

   Seth Blumberg used Lingua::EN::Inflect to handle this.

3. Another possible defect in Craig's solution is that if $diff is
   7379 seconds, the output is "... 2 hours 2 minutes"; but really
   it's 2 hours, 2 minutes, and 59 seconds.  There was a brief
   discussion of how to round off times; Kevin Pfeiffer observed:

        For dividing seconds into hours and minutes, I believe that a
        normal rounding operation is wrong. If you have 1.7 hrs, you
        don't want to round up, but rather take the 1 and leave the
        remainder to convert to minutes.

   To handle this in Craig's code, you could use:

          my $hours = int($diff / 3600);
          my $minutes = int(($diff - $hours * 3600) / 60 + .5);

   (The + .5 is the only new thing here.)
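   For example, with the 7379-second case mentioned above, flooring
   the hours and rounding the minutes gives:

```perl
#!/usr/bin/perl
use strict;
use warnings;

my $diff = 7379;    # 2 hours, 2 minutes, 59 seconds

# Floor the hours, as Kevin Pfeiffer suggests, but round the
# leftover minutes to the nearest whole minute.
my $hours   = int($diff / 3600);
my $minutes = int(($diff - $hours * 3600) / 60 + .5);

print "$hours hours $minutes minutes\n";    # "2 hours 3 minutes"
```

   Without the + .5, the same input reports "2 hours 2 minutes".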

   Iain Truskett used Time::Duration to format the output, which takes
   care of the plural and the rounding issues.  It doesn't produce the
   specified output format; it might say "1 day 17 hours" instead of
   "41 hours 17 minutes".  Whether this is a bug or a feature is up to
   you.  

4. The solution I wrote up beforehand seems to me to be clearly
   inferior to Craig's; it's longer and more complicated because it
   does everything manually:

        #!/usr/bin/perl

        use Time::Local 'timegm';

        my $date_field;
        while (<>) {
          chomp;
          last unless /\S/;
          if (s/^Date:\s+//) {
            $date_field = $_;
            while (<>) {                # read continuation lines?
              last unless s/^\s//;
              chomp;
              $date_field .= $_;
            }
            last;
          }
        }

        die "No Date: field found\n" unless defined $date_field;

        # Typical value:
        # Wed, 30 Oct 2002 21:34:54 -0000
        my ($dy, $mo, $yr, $hr, $mn, $sc, $tzd, $tzh, $tzm) =
          $date_field =~ 
                   /\w\w\w,\              #  Day of week
                    ([\d\s]\d)\ (\w\w\w)\ (\d\d\d\d)\ # Day, month, year
                    (\d\d):(\d\d):(\d\d)\ # Time
                    ([+-])(\d\d)(\d\d)/x; # Time zone

        unless (defined $dy) {
          die "Couldn't parse Date: field\n";
        }

        my %mo = qw(jan 0 feb 1 mar 2 apr 3 may 4 jun 5
                    jul 6 aug 7 sep 8 oct 9 nov 10 dec 11);

        die "Unknown month name '$mo'\n" unless exists $mo{lc $mo};
        my $msgtime = timegm($sc, $mn, $hr, $dy, $mo{lc $mo}, $yr-1900);
        my $tz_adjust = ($tzm * 60 + $tzh * 3600);
        $tz_adjust *= -1 if $tzd eq '+';
        $msgtime += $tz_adjust;   # msgtime is now adjusted for time zone

        my $time_left = ($msgtime + 60 * 3600) - time();

        if ($time_left < 0) {
          print "It is okay to send spoilers for this quiz\n";
        } else {
          print "It is too soon to send spoilers for this quiz.\n";
          my $hr = int($time_left / 3600);
          my $min = int(($time_left - 3600*$hr)/60 + 0.5);
          my $hours = ($hr == 1 ? 'hour' : 'hours');
          my $minutes = ($min == 1 ? 'minute' : 'minutes');
          print "You may send spoilers in another $hr $hours $min $minutes.\n";
        }


   And sure enough, it did have a bug:  In the date-parsing regex, I
   originally wrote (\d\d) to match the day of the month, instead of
   ([\d\s]\d); as a result, any message sent in the first 9 days of
   any month would fail to match, and the program would die.


Thanks again for your interest.  I will send another quiz tomorrow; 
it will not contain any date arithmetic.





Sample solutions and discussion
Perl Quiz of The Week #4 (20021106)

        Two words are said to be 'anagrams' if the letters of one word can be
        rearranged to form the other word.  For example, in English, 'ascot'
        and 'tacos' are anagrams; so are 'tacos' and 'coats'.  A set of words
        that are all anagrams of one another is an 'anagram set'.  For
        example,

                ascot tacos coast coats

        is an anagram set.

        Letter case doesn't matter, so, for example, 'liberating' and
        'Gilbertian' are considered to be anagrams.  

        Write a program, make_anagrams, which reads a list of words, one per
        line, and finds all the anagrams in the word list.  It should
        output an anagram listing, as follows:

        * 'Words' that contain digits or punctuation characters should be
          ignored.

        * Anagram sets that contain only one word should be omitted from the output.

        * If an anagram set contains two words, say 'halls' and 'shall', the
          output should contain two lines:

                halls shall
                shall halls

        * If an anagram set contains more than two words, the entire set
          should be listed under the alphabetically first word; the others
          should cross-reference it.  For example:

                headskin nakedish sinkhead
                nakedish (See 'headskin')
                sinkhead (See 'headskin')

        * Finally, the output lines should be in alphabetic order.

        For example, if the input was

                5th
                ascot
                ate
                carrot
                coast
                coats
                cots
                Dorian
                eat
                halls
                headskin
                inroad
                nakedish
                ordain
                Ronald's
                shall
                sinkhead
                tacos
                tea

        then the output should be:

                Dorian inroad ordain
                ascot coast coats tacos
                ate eat tea        
                coast (See 'ascot')
                coats (See 'ascot')
                eat (See 'ate')
                halls shall
                headskin nakedish sinkhead
                inroad (See 'Dorian')
                nakedish (See 'headskin')
                ordain (See 'Dorian')
                shall halls
                sinkhead (See 'headskin')
                tacos (See 'ascot')
                tea (See 'ate')

        If you need a sample input, you may obtain English word lists from 
                http://perl.plover.com/qotw/words/

        If you prefer to do this quiz in a language other than English, please
        substitute whatever conventions are appropriate for that language.


----------------------------------------------------------------

Here's some sample code:

        #!/usr/bin/perl

        while (<>) {
          chomp;
          next if /[^A-Za-z]/;
          my $key = join "", sort(split //, lc $_);
          $sets{$key} .= "$_:";
        }

        for my $wl (values %sets) {
          my @wl = split /:/, $wl;
          if (@wl == 2) {
            push @output, "$wl[0] $wl[1]";
            push @output, "$wl[1] $wl[0]";
          } elsif (@wl > 2) {
            my ($first, @rest) = sort insensitive @wl;
            push @output, "$first @rest";
            for (@rest) {
              push @output, "$_ (See '$first')";
            }
          }
        }

        for my $line (sort insensitive @output) {
          print $line, "\n";
        }

        sub insensitive {lc($a) cmp lc($b)}

The key to solving this problem was figuring out how to decide if two
words were anagrams.  Two words are anagrams if they have the same
letters, possibly in a different order.  The easy way to find out if
this is the case is to take the letters and put them in some canonical
order, for example by sorting them; if the sorted letter lists are the
same, then the words were anagrams.

Nearly everyone who solved this problem did so by splitting the
incoming words into letters, sorting the letters, and joining them
back together to form a hash key.  For example, the hash key for the
word 'Ethiopian' would be 'aehiinopt'.  Then you just list each word
under its own hash key, and you have a hash where the hash values are
lists of anagrammatic words.
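The canonical-form computation is a one-liner.  Pulled out into a
function (the name 'anagram_key' is mine, for illustration):

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Lowercase the word, split it into letters, sort the letters, and
# join them back up.  Anagrams, and only anagrams, share a key.
sub anagram_key {
    my ($word) = @_;
    return join "", sort split //, lc $word;
}

print anagram_key('Ethiopian'), "\n";   # "aehiinopt"
print anagram_key('tacos'), "\n";       # "acost", same as for 'ascot'
```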

The first 'while' loop reads the input, computes the key for each
word, and installs it into the %sets hash under the appropriate key.
At the end of the loop, the %sets values are lists of anagrams,
separated by ':' characters.  It would have been preferable to use a
hash of arrays, but _Learning Perl_ doesn't cover that.  A typical
value: $sets{'einrs'} is "resin:rinse:risen:siren:".

The second loop prepares the output.  It scans over all the anagram
lists ('$wl' stands for 'word list') and decides what the output will
look like for each list.  It appends lines of output to @output; later
these lines will be sorted and printed.

If there are exactly two words in $wl, then two output lines are
appended.  If there are more than two, then they're sorted into order,
the entire word list is listed under the first word, and the rest have
crossreferences.

Finally, the output lines are printed in alphabetical order.  The
utility function 'insensitive', which compares two strings without
regard to case, is used in two places to produce alphabetically sorted
lists.

There was some discussion about what to do when the input contained
two words with the same spelling but different case, as with 'Polish'
(pertaining to Poland) and 'polish' (to make something shiny.)  The
version above includes such words, which means that there are some
silly outputs that inform you that 'polish' is an anagram of
'Polish'.  If you don't like this, add

        next if $seen{lc $_}++;

in the top loop; this discards any word that has been seen before with
any capitalization.  (This is another application of the 'canonical
form' idea; this time the canonical form of a string is the
all-lowercase version.)
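A short demonstration of the %seen filter, on a made-up input that
contains the same word in two capitalizations:

```perl
#!/usr/bin/perl
use strict;
use warnings;

my %seen;
my @kept;
for (qw(Polish polish ascot Ascot tacos)) {
    # Skip the word if its canonical (all-lowercase) form has been
    # seen before with any capitalization.
    next if $seen{lc $_}++;
    push @kept, $_;
}
print "@kept\n";    # "Polish ascot tacos"
```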

The 'expert' quiz postmortem is delayed because the results are so
very interesting.  There were many solutions posted to the -discuss
list and much discussion, and I don't want to leave out anything
interesting.  Expect it tomorrow, along with a pair of new quizzes.

Thanks to everyone who participated, whether or not they sent mail
about it.

Sample solutions and discussion
Perl Quiz of The Week #5 (20021113)

        You will write a function to lay out crossword puzzles.
        If you are unfamiliar with American and British style crossword
        puzzles, an example is at:

                http://perl.plover.com/qotw/misc/r005/puzzle.jpg

        Your function, 'layout_crossword' will get an array argument which
        represents the desired layout of the crossword puzzle.  It will then
        return a display version of the puzzle.  For example, the diagram at
        the URL above would be represented like this:

                @sample_puzzle = qw(
                    ....X.....X....
                    ....X.....X....
                    ....X.....X....
                    .......X.......
                    XXX......X.....
                    .....X......XXX
                    ......X...X....
                    ...X.......X...
                    ....X...X......
                    XXX......X.....
                    .....X......XXX
                    .......X.......
                    ....X.....X....
                    ....X.....X....
                    ....X.....X....
                );

        If given this array as argument, layout_crossword(@sample_puzzle)
        would return an array containing the following 61 strings:

############################################################################
#1   #2   #3   #4   ######5   #6   #7   #8   #9   ######10  #11  #12  #13  #
#    #    #    #    ######    #    #    #    #    ######    #    #    #    #
#    #    #    #    ######    #    #    #    #    ######    #    #    #    #
############################################################################
#14  #    #    #    ######15  #    #    #    #    ######16  #    #    #    #
#    #    #    #    ######    #    #    #    #    ######    #    #    #    #
#    #    #    #    ######    #    #    #    #    ######    #    #    #    #
############################################################################
#17  #    #    #    ######18  #    #    #    #    ######19  #    #    #    #
#    #    #    #    ######    #    #    #    #    ######    #    #    #    #
#    #    #    #    ######    #    #    #    #    ######    #    #    #    #
############################################################################
#20  #    #    #    #21  #    #    ######22  #    #23  #    #    #    #    #
#    #    #    #    #    #    #    ######    #    #    #    #    #    #    #
#    #    #    #    #    #    #    ######    #    #    #    #    #    #    #
############################################################################
################24  #    #    #    #25  #    ######26  #    #    #    #    #
################    #    #    #    #    #    ######    #    #    #    #    #
################    #    #    #    #    #    ######    #    #    #    #    #
############################################################################
#27  #28  #29  #    #    ######30  #    #    #31  #    #    ################
#    #    #    #    #    ######    #    #    #    #    #    ################
#    #    #    #    #    ######    #    #    #    #    #    ################
############################################################################
#32  #    #    #    #    #33  ######34  #    #    ######35  #36  #37  #38  #
#    #    #    #    #    #    ######    #    #    ######    #    #    #    #
#    #    #    #    #    #    ######    #    #    ######    #    #    #    #
############################################################################
#39  #    #    ######40  #    #41  #    #    #    #42  ######43  #    #    #
#    #    #    ######    #    #    #    #    #    #    ######    #    #    #
#    #    #    ######    #    #    #    #    #    #    ######    #    #    #
############################################################################
#44  #    #    #45  ######46  #    #    ######47  #    #48  #    #    #    #
#    #    #    #    ######    #    #    ######    #    #    #    #    #    #
#    #    #    #    ######    #    #    ######    #    #    #    #    #    #
############################################################################
################49  #50  #    #    #    #51  ######52  #    #    #    #    #
################    #    #    #    #    #    ######    #    #    #    #    #
################    #    #    #    #    #    ######    #    #    #    #    #
############################################################################
#53  #54  #55  #    #    ######56  #    #    #57  #    #    ################
#    #    #    #    #    ######    #    #    #    #    #    ################
#    #    #    #    #    ######    #    #    #    #    #    ################
############################################################################
#58  #    #    #    #    #59  #    ######60  #    #    #    #61  #62  #63  #
#    #    #    #    #    #    #    ######    #    #    #    #    #    #    #
#    #    #    #    #    #    #    ######    #    #    #    #    #    #    #
############################################################################
#64  #    #    #    ######65  #    #66  #    #    ######67  #    #    #    #
#    #    #    #    ######    #    #    #    #    ######    #    #    #    #
#    #    #    #    ######    #    #    #    #    ######    #    #    #    #
############################################################################
#68  #    #    #    ######69  #    #    #    #    ######70  #    #    #    #
#    #    #    #    ######    #    #    #    #    ######    #    #    #    #
#    #    #    #    ######    #    #    #    #    ######    #    #    #    #
############################################################################
#71  #    #    #    ######72  #    #    #    #    ######73  #    #    #    #
#    #    #    #    ######    #    #    #    #    ######    #    #    #    #
#    #    #    #    ######    #    #    #    #    ######    #    #    #    #
############################################################################

        (There are 61 lines here, each 76 characters long; the function should
        return a list of 61 strings, one for each line, with each string 76
        characters long.)

        layout_crossword() will make use of two auxiliary arrays that describe
        what empty and full squares should look like.  In the example above,
        these arrays were:

                   @empty_square = qw(######
                                      #....#
                                      #....#
                                      #....#
                                      ######
                                     );

                   @full_square = qw(######
                                     ######
                                     ######
                                     ######
                                     ######
                                    );

        layout_crossword() must scan over the input representation, constructing
        an array of strings, which it will return.  It must repeatedly insert
        the contents of @empty_square or @full_square into the appropriate
        place in the output array, depending on whether the corresponding
        character of the input was '.' (an empty square) or any other
        character (a full square.)

        layout_crossword() must also compute the appropriate numerals to insert
        into the blank squares, and insert them into the right places in the
        output.  A numeral should be placed into the upper-leftmost empty part
        of its square.  If the numeral doesn't fit in the square, the program
        should die.  Again, empty parts of a square are denoted by '.'
        characters.  

        An empty square should receive a numeral if it is the first square in
        a word; words run from left to right and from top to bottom.  See the
        diagram for an example.

        All '.' characters should be translated to real spaces on output, as
        in the example.

        The function should RETURN A LIST OF STRINGS, which might be printed
        later.  It should not print anything.

----------------------------------------------------------------

There was some confusion about the purpose of the @empty_square and
@full_square arrays.  (The 'glyphs'.)  When I thought up the question,
I wanted the function to get three arguments: The crossword template,
and the empty and full glyphs.  But there was no good way to pass
these three arguments to a single function without using references,
and the rules of the game say that the regular quiz must be soluble
using only the techniques explained in _Learning Perl_.  That means no
references.  So I compromised and had the glyphs passed via two
external, global arrays.

Unfortunately, I didn't make my intentions clear enough when I posed
the problem, and a number of people thought that the glyph arrays were
private to the layout_crossword function.  This was a perfectly
reasonable conclusion, since the problem statement erroneously
declared the arrays as 'my' variables.  Oops!  My apologies to anyone
who was confused by this.  When I tested the programs that were sent
to the qotw-discuss list, I hacked them all to use global glyph
arrays.  I also did other minor hacking, where such hacking suggested
itself, to make the programs work with my test harness.

Three programs consistently produced the best-looking output: Ron
Isaacson's, which was by far the best, Alex Lewin's, and mine.  But
mjd1.pl was 43% shorter than lewin.pl and 51% shorter than
isaacson.pl, so that's the one I'll discuss in detail.  (Also, both
isaacson.pl and lewin.pl use references, and lewin.pl is in
object-oriented style, and so outside the scope of _Learning Perl_.)

There are two small utility subroutines:

        sub add_numeral {
          my ($n, @sq) = @_;
          my $space = '\.' x length($n);
          for (@sq) {
            return @sq if s/$space/$n/;
          }
          die "Square was too small for numeral $n\n";
        }

Here $n is a numeral, and @sq is a glyph array.  add_numeral() inserts
the numeral $n into the glyph array and returns the result.  It first
assembles a regular expression that looks for L dots in a row, where L
is the length of $n.  Then it scans the glyph from top to bottom,
looking for the L dots, which represent a space large enough to hold
the numeral.  When it finds one, it replaces the dots with the numeral
and returns the result.  If there is no place to put the numeral, it
dies.

One of the tests I ran used an 'empty' glyph that looked like this:

        ######
        ##  ##
        #    #
        #    #
        ######

A correct program will insert the numeral into the top space like this:

        ######        ######
        ##7 ##        ##17##
        #    #        #    #
        #    #        #    #
        ######        ######

Several of the less-correct programs assumed that the empty glyphs
would be completely empty inside their borders, and produced outputs
like these:

        ######        ######
        #7  ##        #17  #
        #    #        #    #
        #    #        #    #
        ######        ######
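Running add_numeral() from above against that glyph (written with its
raw '.' characters, before they are translated to spaces on output)
shows the correct behavior:

```perl
#!/usr/bin/perl
use strict;
use warnings;

# add_numeral() as given above: scan the glyph from top to bottom for
# a run of dots long enough to hold the numeral, and substitute the
# numeral in.
sub add_numeral {
    my ($n, @sq) = @_;
    my $space = '\.' x length($n);
    for (@sq) {
        return @sq if s/$space/$n/;
    }
    die "Square was too small for numeral $n\n";
}

# The glyph with the partly-filled top row:
my @glyph = ('######', '##..##', '#....#', '#....#', '######');

my @numbered = add_numeral(17, @glyph);
print "$_\n" for @numbered;
# The second row is now '##17##', not '#17..#'.
```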

The other utility subroutine just gets an X and Y coordinate and the
puzzle template, and returns true if the corresponding square is
empty.  

        sub is_blank {
          my ($x, $y, @puzzle) = @_;
          return if $y < 0 || $x < 0 
                 || $y >= @puzzle || $x > length($puzzle[0]);
          return substr($puzzle[$y], $x, 1) eq ".";
        }

The reason this is here is to establish the convention that squares
outside the puzzle are considered to be full, not empty.  This
simplifies the process of determining whether an empty square should
receive a numeral.  The complete rule for deciding whether a square
gets a numeral is that a blank square gets a numeral if the square
above is full, if the square to the left is full, if it is in the top
row, or if it is in the leftmost column.  By adopting the convention
that squares outside the diagram are considered full, we can simplify
the logic for numbering squares: A square gets a numeral if the square
above or to the left is full.


The main function is fairly straightforward.  It loops over the rows
from top to bottom, and over the squares in each row from left to
right.  It sets '@square' to an appropriate glyph for the current
square, copying it from @full_square or @empty_square as appropriate,
and then, if empty, it uses add_numeral() to add a numeral to @square
if the square above or to the left is full.

        use strict;
        our (@empty_square, @full_square);

        sub layout_crossword {
          my @puzzle = @_;
          my $N = 1;
          my @result;
          my ($h, $w) = (scalar(@puzzle), length($puzzle[0]));
          for my $y (0 .. $h-1) {
            my @row;
            for my $x (0 .. $w-1) {
              my @square;
              my $blank = is_blank($x, $y, @puzzle);
              if ($blank) {
                @square = @empty_square;
                unless (is_blank($x-1, $y, @puzzle)
                     && is_blank($x, $y-1, @puzzle)) {
                  @square = add_numeral($N++, @square);
                }
              } else {
                @square = @full_square;
              }


Now there's the interesting question of what to do with the
overlapping parts of adjacent squares.  This program uses an extremely
simple strategy:  It trims off the left-hand edge of the square, so that:

        +----+           ----+   and   ######           #####
        |    |  becomes      |         #    #  becomes      #
        |    |               |         #    #               #
        +----+           ----+         ######           #####

Now the current square can borrow the right-hand edge of the square
to its left.  Squares in the leftmost column have nobody to borrow
from, so the trimming is not performed for those squares:

              # trim off overlap with square to left
              if ($x > 0) {
                s/^.// for @square;
              }

Then we similarly trim off the topmost edge of each square, except for
those in the topmost row of the diagram:

              # Now trim off overlap with square above
              if ($y > 0) {
                shift @square;
              }

Now that the square is complete, we append it to the right-hand end of
the current row of the output:

              # add square to output
              for (0 .. $#square) {
                $row[$_] .= $square[$_];
              }
            } 

When we finish a row, we turn the dots into spaces, as required by the
spec, and insert the row into the return value array.  When we
finish the last row, we return the array:

            for (@row) {
              tr/./ /;
            }
            push @result, @row;
          }
          @result;
        }
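Putting the three pieces together, here is the complete program run
on a tiny, hypothetical 2x2 template with 3x3 glyphs (much smaller
than the quiz's samples), just to see the trimming and numbering in
action:

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical 3x3 glyphs:
our @empty_square = ('###', '#.#', '###');
our @full_square  = ('###', '###', '###');

sub add_numeral {
    my ($n, @sq) = @_;
    my $space = '\.' x length($n);
    for (@sq) {
        return @sq if s/$space/$n/;
    }
    die "Square was too small for numeral $n\n";
}

sub is_blank {
    my ($x, $y, @puzzle) = @_;
    return if $y < 0 || $x < 0
           || $y >= @puzzle || $x > length($puzzle[0]);
    return substr($puzzle[$y], $x, 1) eq ".";
}

sub layout_crossword {
    my @puzzle = @_;
    my $N = 1;
    my @result;
    my ($h, $w) = (scalar(@puzzle), length($puzzle[0]));
    for my $y (0 .. $h-1) {
        my @row;
        for my $x (0 .. $w-1) {
            my @square;
            if (is_blank($x, $y, @puzzle)) {
                @square = @empty_square;
                unless (is_blank($x-1, $y, @puzzle)
                     && is_blank($x, $y-1, @puzzle)) {
                    @square = add_numeral($N++, @square);
                }
            } else {
                @square = @full_square;
            }
            if ($x > 0) { s/^.// for @square }   # borrow left edge
            if ($y > 0) { shift @square }        # borrow top edge
            for (0 .. $#square) { $row[$_] .= $square[$_] }
        }
        for (@row) { tr/./ / }
        push @result, @row;
    }
    @result;
}

# A 2-by-2 template: three empty squares and one full one.
my @out = layout_crossword('..', '.#');
print "$_\n" for @out;
# Prints:
#   #####
#   #1#2#
#   #####
#   #3###
#   #####
```

Each interior wall is shared by two squares, so the 2x2 grid of 3x3
glyphs collapses to a 5x5 output, and all three empty squares receive
numerals because each begins a (one-letter) word.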

----------------------------------------------------------------

Notes:

1. The test data and the test results are at

        http://perl.plover.com/qotw/misc/r005/

   The program above is

        http://perl.plover.com/qotw/misc/r005/mjd1.pl

   The subdirectory 'templates' contains four sample crossword
   templates.  The subdirectory 'glyphs' contains ten pairs of glyphs.

   The test harness, TestXWord.pm, tries the function on each
   combination of templates and glyphs, and deposits the output into
   the 'output' directory.  To use it, say

        perl -MTestXWord yourprogram.pl < /dev/null

   The 'check-results' program checks the outputs against the sample
   results in the 'standard' directory.  Files in the 'standard'
   directory have names of the form

        PUZZLE-glyphset-##.x

   where ## is a number of points and 'x' is an arbitrary letter.  If
   a program's output matches this file, it is awarded that many
   points.  If an output doesn't match any of the 'standard' forms, it
   is copied to the 'mismatches' directory.  I went over 'mismatches'
   repeatedly and copied all the 'mismatches' that actually looked
   good into the 'standards' directory.

2. The technique I used for abutting the squares is very simple, and
   produces good-looking output most of the time.  For some examples,
   it is not so good.  One of the test glyph sets, 'mixed', contains
   mismatched glyphs:

        +----+          ######
        |....|          ######
        |....|          ######
        |....|          ######
        +----+          ######

   With these glyphs, the asymmetry in my algorithm becomes obvious:

        ######----+----+----+#####
        ######1   |2   |3   |#####
        ######    |    |    |#####
        ######    |    |    |#####
        ######----+----+----+#####
        ######4   |    |    |5   |
        ######    |    |    |    |
        ######    |    |    |    |
        ######----+----+----+----+
        |6   |    |#####7   |    |
        |    |    |#####    |    |
        |    |    |#####    |    |
        +----+----+#####----+----+
        |8   |    |9   |    |#####
        |    |    |    |    |#####
        |    |    |    |    |#####
        +----+----+----+----+#####
        ######10  |    |    |#####
        ######    |    |    |#####
        ######    |    |    |#####
        ######----+----+----+#####

   When there's a disagreement between two adjoining cells about what
   their shared property should look like, the cell above or the cell
   to the left always wins.

3. Peter Haworth's program does better with the mixed glyphs, and is
   also very small.  I thought that getting this exactly right would
   be a big pain.  I even wrote code to do it, and then decided to
   leave it out of the question because it was too much code.  Peter
   cuts the Gordian Knot here and uses a very simple method.

   Peter's program starts by filling the entire grid with full-square
   glyphs, and then superimposes the empty-square glyphs on top of
   those.  Empty squares win whenever there is a disagreement about
   the appearance of shared territory.  For the template above, his
   program generates this output:

        #####+----+----+----+#####
        #####|1   |2   |3   |#####
        #####|    |    |    |#####
        #####|    |    |    |#####
        #####+----+----+----+----+
        #####|4   |    |    |5   |
        #####|    |    |    |    |
        #####|    |    |    |    |
        +----+----+----+----+----+
        |6   |    |####|7   |    |
        |    |    |####|    |    |
        |    |    |####|    |    |
        +----+----+----+----+----+
        |8   |    |9   |    |#####
        |    |    |    |    |#####
        |    |    |    |    |#####
        +----+----+----+----+#####
        #####|10  |    |    |#####
        #####|    |    |    |#####
        #####|    |    |    |#####
        #####+----+----+----+#####

   The difference is subtle, but I think it is much handsomer.

4. The algorithm I used to determine whether a square should receive a
   numeral fails in certain cases.  Consider this template:

        .....
        .#.#.
        .....
        .#.#.
        .....

   My program generates this output:    But it should be:
                                                          
        #####################        #####################
        #1  #2  #3  #4  #5  #        #1  #   #2  #   #3  #
        #   #   #   #   #   #        #   #   #   #   #   #
        #####################        #####################
        #6  #####7  #####8  #        #   #####   #####   #
        #   #####   #####   #        #   #####   #####   #
        #####################        #####################
        #9  #10 #   #11 #   #        #4  #   #   #   #   #
        #   #   #   #   #   #        #   #   #   #   #   #
        #####################        #####################
        #12 #####13 #####14 #        #   #####   #####   #
        #   #####   #####   #        #   #####   #####   #
        #####################        #####################
        #15 #16 #   #17 #   #        #5  #   #   #   #   #
        #   #   #   #   #   #        #   #   #   #   #   #
        #####################        #####################

   The problem here is that I generate a numeral for any empty square
   below (or to the right of) a full one, but it's only correct to
   generate a numeral for an empty square below (or to the right of) a
   full one that is not also above (or to the left of) a full one.  If
   there are full squares on both sides, the numeral is inappropriate
   because there is nowhere for the word to go.

   In American-style crossword puzzles, this situation can never occur,
   because there is an express prohibition on exactly this situation.
   Every blank square must be at the intersection of two words, one
   across and one down.  A square that is part of only a single word
   is called an 'unkeyed square' and is strictly forbidden.

   However, in many British-style crossword puzzles, most famously the
   London Times Sunday puzzle, there are unkeyed letters.  I
   specifically mentioned "British style crossword puzzles" in the
   question, so this is a defect.  To fix it, change

        unless (is_blank($x-1, $y, @puzzle) && is_blank($x, $y-1, @puzzle)) {
          @square = add_numeral($N++, @square);
        }

   to

        if (!is_blank($x-1, $y, @puzzle) && is_blank($x+1, $y, @puzzle)
          ||!is_blank($x, $y-1, @puzzle) && is_blank($x, $y+1, @puzzle)) {
          @square = add_numeral($N++, @square);
        }

   I forgot all about this until I looked closely at Peter Haworth's
   contribution, which gets it right.  (Peter, of course, is a Brit.)
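   To make the corrected rule concrete, here is a small self-contained
   sketch.  The is_blank helper is my own hypothetical stand-in (each
   submitted program had its own version); it treats off-grid
   positions and '#' squares as not blank:

```perl
use strict;
use warnings;

# Hypothetical helper: true if ($x,$y) is on the grid and is a white
# ('.') square.  Off-grid positions and '#' squares are not blank.
sub is_blank {
    my ($x, $y, @puzzle) = @_;
    return 0 if $y < 0 || $y > $#puzzle;
    return 0 if $x < 0 || $x >= length $puzzle[$y];
    return substr($puzzle[$y], $x, 1) eq '.';
}

# The corrected British-style rule: a numeral goes in a blank square
# that begins an across word or a down word.
sub wants_numeral {
    my ($x, $y, @puzzle) = @_;
    return 0 unless is_blank($x, $y, @puzzle);
    return 1 if !is_blank($x-1, $y, @puzzle) && is_blank($x+1, $y, @puzzle);
    return 1 if !is_blank($x, $y-1, @puzzle) && is_blank($x, $y+1, @puzzle);
    return 0;
}

my @puzzle = ('.....', '.#.#.', '.....', '.#.#.', '.....');
print wants_numeral(0, 0, @puzzle), "\n";   # 1: starts words both ways
print wants_numeral(2, 1, @puzzle), "\n";   # 0: unkeyed, no numeral
```

   On the 5x5 example above, this gives a numeral to the top-left
   square but withholds one from the unkeyed square trapped between
   two full squares.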

5. ensch2.pl is a peculiar case.  Faced with the problem of how to
   overlap adjacent glyphs, Peter B. Ensch did something interesting.
   His program ensch2.pl inserted backspace characters between the
   glyphs:

        +---+^H+---+^H
        |   |^H|   |^H
        +---+^H+---+^H

   (I have represented the backspaces by '^H').  I didn't realize this
   at first.  When I went to look at the program's output, I used the
   'less' pager program, which interpreted the ^H's as requests to
   overstrike!  So when I used the pager, I got what looked like
   
        +---+---
        |   |   
        +---+---

   but with the middle vertical line in boldface.  When I just printed
   the output to the terminal with 'cat', it looked normal.  But to
   the automatic test suite, the answers looked completely wrong.
   The test suite hated it, but if the backspacing is allowed,
   ensch2.pl would be one of the better performers.

6. One common problem was programs that assumed that the glyphs would
   have a certain appearance, or would be a certain size.  Using the
   glyphs

        ##    and    ..
        ##           ..

   caused problems for many people's programs, which could not figure
   out how to fit the numerals in, or which assumed the presence of a
   border.  The example program above completely botches these
   examples because it insists on overlapping the adjacent glyphs,
   even though that means stripping out 3/4 of each glyph.  For small
   glyphs, overlapping is a mistake.  Ron Isaacson's program was the
   only one posted on the qotw-discuss list that handled this case
   properly at all.  His program overlaps squares only if the empty
   and full square borders match.  This leads to cluttered but
   reasonable behavior in the 'mixed' case above:

        ######+----++----++----+######
        ######|1   ||2   ||3   |######
        ######|    ||    ||    |######
        ######|    ||    ||    |######
        ######+----++----++----+######
        ######+----++----++----++----+
        ######|4   ||    ||    ||5   |
        ######|    ||    ||    ||    |
        ######|    ||    ||    ||    |
        ######+----++----++----++----+
        +----++----+######+----++----+
        |6   ||    |######|7   ||    |
        |    ||    |######|    ||    |
        |    ||    |######|    ||    |
        +----++----+######+----++----+
        +----++----++----++----+######
        |8   ||    ||9   ||    |######
        |    ||    ||    ||    |######
        |    ||    ||    ||    |######
        +----++----++----++----+######
        ######+----++----++----+######
        ######|10  ||    ||    |######
        ######|    ||    ||    |######
        ######|    ||    ||    |######
        ######+----++----++----+######


   and perfect behavior in the very-small-glyph case:

        ##1 2 3 ##
        ##      ##
        ##4     5 
        ##        
        6   ##7   
            ##    
        8   9   ##
                ##
        ##10    ##
        ##      ##

7. There were a few oddities that caused some programs to appear to
   perform more poorly in the tests than was actually warranted.  The
   program sainio.pl used a global variable to store the current clue
   number, and never reset it between calls to layout_crossword().
   Since the test harness called layout_crossword() forty times in a
   row, the numbers grew to be four digits long and then wouldn't fit
   into the boxes any more.  sainio2.pl is a corrected version.
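   The shape of the bug, and of the fix, can be shown in a few lines.
   The names here are illustrative; this is not sainio.pl's actual
   code:

```perl
use strict;
use warnings;

# The bug: a file-scoped counter that is never reset, so clue numbers
# keep growing across calls.
our $N = 1;
sub layout_buggy { return $N++ }

# The fix: make the counter per-call state, initialized at the top of
# the layout function, so every call starts numbering from 1.
sub layout_fixed {
    my $n = 1;
    my @numerals = map { $n++ } 1 .. 3;
    return @numerals;                    # always (1, 2, 3)
}

layout_buggy() for 1 .. 40;              # after 40 calls, $N is 41
print "$N\n";                            # 41
print join(",", layout_fixed()), "\n";   # 1,2,3 -- every time
```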

   schmidt.pl and schmidt2.pl copied the glyphs into two private
   arrays, @UL_empty_square and @UL_full_square.  Unfortunately, they
   did so at compile time, thus foreclosing the possibility that
   anyone could change the glyphs later, and preventing the test
   harness from changing the glyphs.  With this defect repaired, these
   two programs did well in the testing.

   As I mentioned above, a better design for this function would be
   for it to have three arguments instead of using the two global
   glyph arrays.  jones2.pm did do this, and so it failed the tests,
   which expected the specified interface.  I hacked it so that it
   used the specified argument format instead.
   But the output was quite broken!  It didn't use the correct glyphs,
   and it only numbered the 'across' clues!

        ++++++++++++++++
        +##+1 +  +  +##+
        +##+  +  +  +##+
        +##+  +  +  +##+
        ++++++++++++++++
        +##+2 +  +  +  +
        +##+  +  +  +  +
        +##+  +  +  +  +
        ++++++++++++++++
        +3 +  +##+4 +  +
        +  +  +##+  +  +
        +  +  +##+  +  +
        ++++++++++++++++
        +5 +  +  +  +##+
        +  +  +  +  +##+
        +  +  +  +  +##+
        ++++++++++++++++
        +##+6 +  +  +##+
        +##+  +  +  +##+
        +##+  +  +  +##+
        ++++++++++++++++


   wolters.pl produced some reasonable-looking outputs, but did not
   translate the dots to spaces, so the results looked like:

        +----+----+----+----+----+
        |####|1...|2...|3...|####|
        |####|....|....|....|####|
        |####|....|....|....|####|
        +----+----+----+----+----+
        |####|4...|....|....|5...|
        |####|....|....|....|....|
        |####|....|....|....|....|
        +----+----+----+----+----+
        |6...|....|####|7...|....|
        |....|....|####|....|....|
        |....|....|####|....|....|
        +----+----+----+----+----+
        |8...|....|9...|....|####|
        |....|....|....|....|####|
        |....|....|....|....|####|
        +----+----+----+----+----+
        |####|10..|....|....|####|
        |####|....|....|....|####|
        |####|....|....|....|####|
        +----+----+----+----+----+

   I added the line 

        tr/./ / for @puzzle;

   just before the return from the function.  However, it didn't
   correctly handle different-sized glyphs.

8. I got the idea for this problem from _The Art of Computer
   Programming, Vol. 1: Fundamental Algorithms_, by Donald E. Knuth.
   (In the 3rd edition, it is problem 1.3.2.23, and is on page 163.)
   I remembered that there was some issue in the Knuth problem that
   made it more difficult than the problem I was posing, but I didn't
   remember what it was, and I didn't look it up until just now.

   The Knuth version of the puzzle says that black squares at the
   border of the puzzle should be deleted from the output.  His
   example:  If the input was

                #....#
                ..#...
                ....#.
                .#....
                ...#..
                #....#

   Then the output should be

             +++++++++++++++++++++
             +01  +    +02  +03  +
             +    +    +    +    +
        +++++++++++++++++++++++++++++++
        +04  +    ++++++05  +    +06  +
        +    +    ++++++    +    +    +
        +++++++++++++++++++++++++++++++
        +07  +    +08  +    ++++++    +
        +    +    +    +    ++++++    +
        +++++++++++++++++++++++++++++++
        +    ++++++09  +    +10  +    +
        +    ++++++    +    +    +    +
        +++++++++++++++++++++++++++++++
        +11  +12  +    ++++++13  +    +
        +    +    +    ++++++    +    +
        +++++++++++++++++++++++++++++++
             +14  +    +    +    +
             +    +    +    +    +
             +++++++++++++++++++++

   Notice how the corner black squares have vanished.  If there were
   other black squares next to these, they would vanish also.  Knuth
   says: "The diagram... might have long paths of black squares that
   are connected to the outside in strange ways."

   Although my version of the problem was missing this complication,
   it had an additional complication because the full and empty square
   glyphs were variable instead of fixed.  My problem specification
   didn't provide much guidance about how to make the glyphs overlap,
   and in the case where the edges of the two glyphs didn't match, it
   wasn't immediately clear how to overlap them and still make the
   result look good.

9. It was pointed out on the -discuss list that the code I posted
   yields a warning, if warnings are enabled.  Specifically

                  my @empty_square = qw(######
                                        #....#
                                        #....#
                                        #....#
                                        ######
                                       );

   yields the warning "Possible attempt to put comments in qw() list".

   One poster to the list said:   

        Seeing as much of the Perl community have been trying to get
        new Perl programmers to turn on warnings and strict, in an
        effort to highlight problems with their code, I have been
        surprised to see MJD's quiz this week.  In order to use the
        empty_square and full_square arrays, as included in the quiz
        text, you are actually inclining people to turn off strict and
        warnings. Which IMHO is not good.

        http://perl.plover.com/~alias/list.cgi?1:mss:605

   This remark, unfortunately, comes right at the intersection of
   several philosophical stances I hold, and that makes me very
   cranky. 

   First, the warning has nothing whatever to do with 'strict'.
   Throughout this message, the author says "warnings and strict",
   "strict and warnings", as if in one breath.  The code in question
   is completely strict-safe.  (It *shouldn't* be, but that is an
   unrelated matter.)  Why mention 'strict' at all? 

   The Perl community has become increasingly dogmatic in the past few
   years about the use of 'strict'.  It is common to see people ask
   questions in newsgroups, and to post four-line examples, and be
   criticized for failing to use 'strict'.  "Why aren't you using
   'strict'?" people ask.  Well, because it is a four-line example
   posted in a Usenet article, obviously.  'strict' has no value in
   such cases, except perhaps to get people to shut up about it.

   It is true that the Perl community has been trying to get new Perl
   programmers to turn on warnings and strict.  I have no objection
   to this.   What I do object to is that the community seems to be
   trying to get people to turn on warnings and strict without knowing
   why they are doing that, or what they are for.

   It is common to see people ask questions in newsgroups like this:

        I got the error "Global symbol "$z" requires explicit package
        name."  What does that mean?

   This is like someone coming to say that there is a loud bell
   ringing in the hallway, and what should be done about it?  Of
   course, it is the fire alarm.  They were told to always turn the
   fire alarm on, but nobody told them what it would mean if it began
   to ring.

   I believe that one of the biggest problems with programming as a
   profession is that programmers are fearful and superstitious.
   Programming is only about sixty years old.  When chemistry was
   sixty years old, practitioners were trying to turn lead into gold,
   to extract the essence of fire, and so forth.  After a few hundred
   years they learned a little more and began to study phlogiston.  So
   we are in the dark ages of programming, and we live in a dangerous
   world that we understand only poorly.  Many people respond to
   this with superstition: "Always use objects."  "Never use a global
   variable." "Perl is better than Python."  "Always use strict."

   We do not have to give in to this superstition.  We don't have to
   say "To be safe, always use strict.  And to be double safe, throw
   salt over your left shoulder."  We are engineers, and programming
   is empirical.  We should by all means encourage beginners to use
   the best possible engineering practices.  But we should not
   encourage the blind use of certain programming features.

   When a person says "warnings and strict" four times, when talking
   about a piece of code that emits a warning but is strict-safe, what
   is going on?  Clearly this person is not thinking about the meaning
   of what he is saying.  The code is also not ISO 9000 compliant; why
   not mention that also?  I think this is a superstitious effect.
   Instead of looking at the reality of the situation, which is that
   the code raises certain specific warnings, and the warnings have a
   certain specific meaning, which suggests that the code should be
   examined for certain specific problems, the speaker has lumped all
   warnings and diagnostics together in one group and adopted the
   stance that all such warnings should be eliminated.  This shows a
   lack of understanding.  

   I think almost anyone who says "always use strict" is suffering
   from this lack of understanding.  "strict" is not one but three
   features, and none of these three features has anything at all to
   do with the other two.  Saying "always use strict" is like saying
   "always use a hammer, a screwdriver, and a drill."  For some
   projects, perhaps only the hammer and drill are appropriate, and
   the screwdriver is an irrelevant distraction.  So it is too with
   "strict".  People are being encouraged to load up with tools that
   they don't know how to use.

   The results of this are sometimes stunningly silly.  I have many
   examples of programs that start by saying:

        use strict;
        my ($rounds, $round_temp, $squares, $page, $x, $y, $z, $cell,
        $player_move, @available_choices, $computer_move, @choices,
        $round, $winner, $player_move_pretty, $computer_move_pretty);
        my ($round_minus_one);

   The programmer here wants to use global variables; she does not
   understand what lexical variables are for, or why they are
   preferred.  But, at the advice of some well-meaning person, she has
   put 'use strict' at the top of the program, and now global
   variables are forbidden.  So she declares every variable at the top
   of the program, effectively making them all global, and getting
   none of the encapsulation, reuse, or maintenance benefits that
   lexical variables are supposed to provide.  Another example:

           my @ret=eval "layout_tree_$format(\$tree)";

   Why do this?  There is a safer and more efficient method:

           my @ret= "layout_tree_$format"->($tree);

   Perhaps the programmer didn't know about the safer and more
   efficient method.  Or perhaps he avoids it, as many people do,
   because it causes a 'strict refs' failure, while the 'eval' method,
   although inferior in every way, does not.
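   To make the comparison concrete, here is a sketch of the choices.
   The layout_tree_* functions are stand-ins, not the real ones from
   the submitted program, and the dispatch table at the end is a third
   option, not mentioned above, which satisfies 'strict refs' as well:

```perl
use strict;
use warnings;

sub layout_tree_wide { return "wide($_[0])" }
sub layout_tree_tall { return "tall($_[0])" }

my $format = 'wide';

# 1. String eval: recompiles code at run time, and swallows errors.
my @a = eval "layout_tree_$format(42)";

# 2. Symbolic call: safer and faster than eval, but forbidden by
#    'strict refs', so the check must be disabled locally.
my @b = do { no strict 'refs'; "layout_tree_$format"->(42) };

# 3. Dispatch table: strict-safe, and an unknown format is caught
#    explicitly instead of silently calling the wrong thing.
my %layout = (wide => \&layout_tree_wide, tall => \&layout_tree_tall);
my @c = ($layout{$format} || die "unknown format '$format'")->(42);

print "@a @b @c\n";                  # wide(42) wide(42) wide(42)
```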

   I don't think we need to do more to encourage people to
   usewarningsandstrict.  I think we need to do more to encourage them
   to understand the warnings they get and to take appropriate action.
   When I teach programming classes, I am always astonished at how
   little attention the students pay to the error messages they
   receive.  The compiler complains of a syntax error on line 197, and
   the programmer's response is not to look at line 197, but to
   eyeball a random portion of the program in the hope that the error
   is there.  By encouraging people to "always use strict and
   warnings" and to think of diagnostic messages as bad, and something
   to avoid, we are doing the exact wrong thing.  The right thing is
   to encourage people to pay attention to the messages, to try to
   understand them, and then to make considered judgements about what
   they mean.  That is what I think beginners need to learn.

   In this case, the warning is saying "Possible attempt to put
   comments in qw() list".  What does that mean?  It means that perl
   has seen a # sign in a qw(), and it is afraid that I might be
   trying to write something like this:

        my @array = qw( red crimson   # But not scarlet
                        blue azure 
                        green
                      ); 

   Here the thing that looks like a comment is not a comment; instead,
   the @array gets nine elements, including 'But', 'not', 'scarlet',
   and '#'.  It is good that perl warns us about this.

   In my example code, however, this is not the case:

                  my @empty_square = qw(######
                                        #....#
                                        #....#
                                        #....#
                                        ######
                                       );

   I am *not* trying to put a comment into a qw() list.  Perl sees the
   '#' signs, and it is afraid that I *might* be doing that, so it
   warns me.  But it is mistaken; the # signs are doing what I want
   here.  In such a case, it is perfectly appropriate to ignore the
   warning.  The compiler has had its say, and I have listened to it,
   but it is just a machine, and I know better than it does what I
   want.  If you are troubled by the warning message itself, the
   correct approach here is NOT to code around it by writing something
   like this:

                  my @empty_square = ('######',
                                      '#....#',
                                      '#....#',
                                      '#....#',
                                      '######',
                                       );
                

   The correct response is to SHUT OFF THE WARNING:

                  my @empty_square;
                  { no warnings 'qw';
                    @empty_square = qw(######
                                       #....#
                                       #....#
                                       #....#
                                       ######
                                      );
                  }

   (The "no warnings 'qw'" declaration shuts off only those warnings
   that pertain to the qw() operator, and only inside that one block.
   Elsewhere, all warnings will still be issued.  Inside the block,
   all other warnings will still be issued.)

   The thing that really irks me about the 'strict' dogmatism is how
   defective is most of the dialog about it.  Last year I read a
   review of a book about using Perl to write CGI programs.  The
   reviewer harshly criticized the author for not having used
   'strict'.  The reviewer did not say which of the three parts of
   'strict' would have been valuable.  His opinion was apparently that
   all programs should use 'strict', whether it would be valuable or
   not.  I objected, pointing out that none of the programs in the
   book used references, so that 'strict refs' would not be doing
   anything; that none of the example programs were more than twenty
   lines long, so there was no practical difference between global and
   lexical variables, and hence no reason to use 'strict vars' to
   forbid global variables; and that the only value of 'strict subs'
   is to prevent future maintenance problems when someone adds a
   subroutine whose name is the same as what was previously a
   bareword, a feature of small value at best and of less value in
   these tiny example programs.

   But the reviewer did not address any of my specific technical
   points.  Instead, he told an anecdote about a bad programmer, and
   said that we should teach everyone "good programming style" right
   from the start.  That begs the question of what "good programming
   style" is.  I realized then that the reason for our disagreement
   was that my idea of good programming style was motivated by what
   was useful and effective, whereas his was motivated by
   superstition.  Considerations of usefulness did not come into play.

   That is my opinion on "use warnings and strict".  The short version
   is: No, I do not believe there is any inherent value in "keeping
   warnings and strict happy", and I am going to continue to try to do
   the most appropriate thing for the circumstances.  I believe that
   that is the only way to set the best possible example for
   beginners.

   Sorry to go on so long, but this has all been seething inside me
   for a long time.

New quizzes tomorrow.  My grateful thanks to everyone who participated
in the discussion, and also to those who quietly worked the problems
on their own.  

Sample solutions and discussion
Perl Quiz of The Week #6 (20021120)


        Write a function, format_number_list, whose argument is a list of
        integers.  It then returns a string which represents the input list in
        compact, human-readable form.

        For example,

                 format_number_list(1, 2, 4, 5, 6, 7, 9, 13, 24, 25, 26, 27)

        will return

                "1-2, 4-7, 9, 13, 24-27"

        Also write a function, 'expand_number_list', which does the conversion
        in the opposite direction, so that

                expand_number_list("1-2, 4-7, 9, 13, 24-27")

        will return

                (1, 2, 4, 5, 6, 7, 9, 13, 24, 25, 26, 27)


----------------------------------------------------------------

I'll show solutions for format_number_list first.  I'm going to
present two sample solutions this week.  One was contributed by
Andreas Koenig:

        use Set::IntSpan; # CPAN rules :-)
        use strict;
        sub format_number_list {
          my(@n) = @_;
          my $set = Set::IntSpan->new(join ",", @n);
          my $run = $set->run_list;   
          $run =~ s/,/, /g; # give them the spaces
          $run;
        }                       

To summarize this solution: Set::IntSpan is a CPAN module that already
does almost exactly what was requested; Andreas simply wrapped it.
Andreas said: "I surely was amazed that nobody found it till monday."
I was amazed also.

Here's a synthetic but straightforward solution, from James Gray,
slightly modified by me:

        sub format_number_list {
            my @output ;
            while (@_) {
                my $range_start = shift;
                my $range_end = $range_start;

                # check if the numbers go in sequence from here
                $range_end = shift while (@_ && $_[0] == $range_end + 1);

                # ...and add to output accordingly
                if ($range_start == $range_end) { push @output, $range_start; }
                else { push @output, "$range_start-$range_end"; }
            }
            join ", ", @output;
        }

1. This quiz generated a large amount of discussion about what to do
   if the input list was out of order, if the input list contained
   repeated numbers, if the input list contained negative numbers, and
   so on.  Had I thought about the problem more carefully before
   posting, I would have said something about some of these
   situations. 

   The application I originally had in mind was .newsrc files.  Lines
   of a .newsrc file indicate which news articles in a newsgroup have
   already been read.  News articles are numbered starting from 1, so
   the negative-number issue never comes up, and the .newsrc line
   represents a set of integers, not a list, so order and repetition
   are nonissues.  But I didn't say this in the question.
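   For readers who have never seen one, a .newsrc line looks like this
   (the group name and article numbers here are invented):

```
comp.lang.perl.misc: 1-14032,14035,14038-14040
```

   Each line is essentially the compact set notation the quiz asked
   for, minus the spaces after the commas.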

2. Then there was additional discussion about what the function should
   do if given numbers out of order.  For example, given (1,2,3,7,4,5,6),
   should it produce "1-3, 7, 4-6" or "1-7" or something else?  There
   were good arguments in both directions:

        * Maybe the order is significant; then you don't want to alter it.

        * If you sort the input before processing, you foreclose the
          possibility of the function ever treating (1, 3, 2)
          differently from (1, 2, 3).  But if you're careful not to
          change the order, the user who wants you to treat them the
          same can still call 

                format_number_list(sort numerically @nums);

        * If the purpose of the function is to generate
          'human-readable lists' then reordering the numbers for
          maximum compression is more likely to achieve that.

   For the .newsrc case, since the lists are actually sets, reordering
   makes sense.   For other applications, it may not.   I tested these
   separately.  

   There was also some discussion about whether (3, 2, 1) should turn
   into "3-1" or "3, 2, 1", supposing that the input was not to be
   reordered.  
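   It is worth seeing how little code separates the two behaviors.
   This sketch repeats James Gray's format_number_list from above so
   that it runs standalone; the format_number_set wrapper is my own
   illustrative addition:

```perl
use strict;
use warnings;

# James Gray's solution, repeated from above for self-containedness.
sub format_number_list {
    my @output;
    while (@_) {
        my $range_start = shift;
        my $range_end   = $range_start;
        $range_end = shift while @_ && $_[0] == $range_end + 1;
        push @output, $range_start == $range_end
                        ? $range_start
                        : "$range_start-$range_end";
    }
    return join ", ", @output;
}

# Set behavior: sort numerically and drop repeats before formatting.
sub format_number_set {
    my %seen;
    return format_number_list(grep { !$seen{$_}++ } sort { $a <=> $b } @_);
}

print format_number_list(1, 3, 2), "\n";    # "1, 3, 2" -- order kept
print format_number_set(1, 3, 2, 2), "\n";  # "1-3"     -- set semantics
```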

3. If negative numbers are allowed, the required output format is ugly
   and hard to read.  (-3, -2, -1) would turn into "-3--1".  Some
   people opted for "-3..-1" instead.  

4. Some people also modified the output formats in various other ways.
   Then I had to hack on them to get them to work with the test
   harness.  You'd be surprised at how difficult it was to make
   trivial formatting changes in some of these programs.  In several
   cases I had to hunt down and change the same punctuation in several
   places.

5. With all this discussion, I was surprised to see how few of the
   submitted solutions actually worked properly for the
   straightforward cases: Positive integers in ascending order with no
   repeats.  Of the 21 samples I tested, several failed some of the
   basic cases.  (See
   http://perl.plover.com/qotw/misc/r006/RESULTS/format/NOTOK . )

6. Even programs that their authors advertised as handling certain
   special cases often didn't.

7. My conclusion from all this is that maybe people would do well to
   pay more attention to basic correctness in the simple cases before
   worrying a lot about making the functionality as complete as possible.


Now 'expand_number_list'.

Here's Andreas Koenig's Set::IntSpan version:

        use Set::IntSpan; # CPAN rules :-)
        use strict;

        sub expand_number_list {
          my $run = shift;
          my $set = Set::IntSpan->new($run);
          $set->elements;
        }


Here's Tom Varga's, cleaned up a little:

        sub expand_number_list {
            my @result ;
            for my $num (split(/\s* , \s*/x, $_[0])) {
                push(@result, ($num =~ /(\d+) \s* - \s* (\d+)/x) 
                                      ? ($1       ..      $2) 
                                      : $num
                     ) ;
            }
            @result ;
        }

8. Several people seemed to misunderstand that 'expand_number_list'
   was supposed to return a list of numbers, not a string.  

That's all for regular quiz #6.  I'll send a postmortem of the expert
quiz later on today, and new quizzes tomorrow.  Thanks to everyone who
participated.


Sample solutions and discussion
Perl Quiz of The Week #7 (20021127)


        A gentleman on the perl-qotw-discuss list reports:

        > In two different companies that I've worked at, the policy has been
        > that percentages in reports must always add up to 100% (at the cost
        > of munging the actual data).  It seems that otherwise end users
        > report it as a bug.

        This means, for example, that if you survey 300 people and find that
        100 prefer the color red, 100 prefer blue, and 100 prefer black, you
        are not allowed to report

                33.3 % prefer red
                33.3 % prefer blue
                33.3 % prefer black

        Because then the percentages appear to add up to only 99.9%.  Instead,
        you'll fib, by rounding one of the percentages up to 33.4% instead of
        down to 33.3 %:

                33.3 % prefer red
                33.4 % prefer blue
                33.3 % prefer black

        This, of course, is ridiculous, since it suggests that there were
        somehow more 'blue' responses than 'red' or 'black' responses, when
        there were in fact equal numbers of each.  But in the world of
        business the appearance of correctness is sometimes more important
        than actual correctness.

        Similarly, if you survey 70 people and find that 30 prefer red, 30
        prefer blue, and 10 prefer black, you may not say that

                42.9 % prefer red
                42.9 % prefer blue
                14.3 % prefer black

        because the percentages appear to add up to 100.1%.  You must adjust
        one of the percentages down by 0.1%.

        You will write a function, 'fudge_numbers', which takes the real data
        as input and returns the appropriate percentages.

        The first argument to fudge_numbers() will be special: It will be an
        integer, 0 or greater, indicating how many places after the decimal
        point will be retained after rounding.  An argument of 1 will mean
        that the percentages you return would be rounded to the nearest tenth
        of a percent, as in the examples above.  An argument of 0 will mean
        that the percentages should be rounded to the nearest percent; an
        argument of 2 will mean that the percentages should be rounded to the
        nearest hundredth of a percent.

        The remaining arguments to fudge_numbers() will be the actual data,
        which will all be non-negative numbers.

        The return value of fudge_numbers() will be a list of numbers
        indicating relative percentages.  There must be exactly one return
        value for each data argument.  The return values must be rounded off
        as indicated by the rounding argument, and they must total exactly
        100.  (Or as near as possible within the computer's limits of
        precision.)  

        For example, 

                fudge_numbers(1, 100, 100, 100) 

        should return

                (33.4, 33.3, 33.3)
        or
                (33.3, 33.4, 33.3)
        or
                (33.3, 33.3, 33.4)

        (All are equally acceptable.)

        Similarly:

                Arguments                       Return values

                1, 100, 100, 100                33.3, 33.4, 33.3
                0, 100, 100, 100                33, 34, 33
                2, 100, 100, 100                33.33, 33.34, 33.33

                2, 7, 7, 7                      33.33, 33.34, 33.33

                1, 30, 30, 10                   42.9, 42.9, 14.2
                                             or 42.9, 42.8, 14.3
                                             or 42.8, 42.8, 14.3

                1, z                            100
                        (here 'z' is any number)

----------------------------------------------------------------

I got a pleasant surprise while I was testing these.  I didn't solve
the problem myself until very late, because I couldn't think of a
solution, and because I hate floating-point numbers.  But then when
time came to write the report, I finally gave in and did it.  Then I
ran the test suite and started looking at the programs in order from
shortest to longest.  The shortest two didn't pass the tests.  The
third-shortest did, but when I read the code I scratched my head and
said "That can't work, can it?"  And then I added some more tests to
the test suite and found that it *didn't* work.   Then the next
four-shortest also didn't pass the tests, and that left my own late
entry as the shortest version that did pass the tests.  

Of course, it's still possible that someone might see it, scratch
their head, say "That can't work, can it?" and find the test that it
fails.  But until then, here it is:

        # Round the argument to the nearest integer
        sub round { sprintf("%.0f", shift) }

        # Add up the arguments
        sub sum {
          my $s = 0;
          $s += $_ for @_;
          $s;
        }

        sub fudge_numbers {
          my ($prec, @d) = @_;
          my $scale = 10 **  $prec;
          my $sum = sum(@d);

          # Scale data so that all significant digits are
          # *left* of the decimal point
          my @p = map $_*100*$scale/$sum, @d;
          my @r = map round($_), @p;    # rounded versions of @p
          my @e = map $p[$_]-$r[$_], (0 .. $#r); # error

          # This is the number of jots by which the answer is too LOW.
          my $total_error = round(sum(@e));

          if ($total_error) {
            # Sign +1: numbers need to be increased. 
            #      -1: numbers need to be decreased
            my $sign = $total_error < 0 ? -1 : 1;
            $total_error *= $sign;      # absolute value

            # We want total_error equal to zero.
            # To achieve this, we will add a jot to the low numbers, 
            # or subtract a jot from the high numbers, as needful.

            for (0..$#r) {
              next unless $e[$_] * $sign > 0;  # Error goes the wrong way
              $r[$_] += $sign;          # Adjust value
              $total_error--;          
              last if $total_error == 0;
            }
          }

          map $_ / $scale, @r;  # Scale data back to percentages
        }

My background is in systems programming, and I think in my entire life
as a systems programmer I only ever used a floating-point number once.
I *hate* floating-point numbers, and I think it would be fair to
criticize me for avoiding them out of fear and ignorance.  But once
again, avoiding them turned out to be a good strategy.  I deal with
integers throughout.  If the input is

        (3, 50, 50, 50)

then instead of trying to come up with 33.333 / 33.333 / 33.334, and
worrying about the floating-point comparison issues, I try to come up
with 33333 / 33333 / 33334 and then scale the answers back to
percentages at the last moment.   That way I don't have to worry about
the fact that Señor Computadoro Estúpido thinks that
100 - (33.333+33.333+33.333) = 0.00100000000000477.
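The contrast is easy to demonstrate in a few lines (on typical
hardware with IEEE doubles; the exact leftover varies by platform):

```perl
# Floating point: three 33.333's don't leave behind exactly 0.001.
my $fp_left = 100 - (33.333 + 33.333 + 33.333);
printf "%.17g\n", $fp_left;            # close to, but not exactly, 0.001

# Integers: the same computation scaled by 1000 is exact.
my $int_left = 100_000 - (33_333 + 33_333 + 33_333);
print "$int_left\n";                   # exactly 1
```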

Let's consider (1, 2, 3, 5) with a precision of 2 as an example.
The program first computes the percentages, but scaled so that all the
significant figures are to the left of the decimal point. For the
example, the percentage values are

   909.090909090909
  1818.18181818182
  2727.27272727273
  4545.45454545455

representing 9.090909%, 18.181818%, etc.  This is the '@p' array.  Then
the program rounds off the percentages to the specified precision;
this just means rounding them off to the nearest integer, since we
scaled them for that exact purpose.  This is the '@r' array:

   909
  1818
  2727
  4545

The program then computes the difference between the true value (in
@p) and the rounded value (in @r); this is the 'error', stored in @e.
Since the true percentages must add up to 100%, and we want the
rounded values to do the same, we need to adjust the rounded values so
that the total error is 0.  $total_error is the sum of the values in
@e, and we would like it to be 0.  If it *is* 0, we don't need to do
any fudging at all, and we skip most of the rest of the function.
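The walkthrough's arithmetic can be replayed standalone (re-using the
round() helper from the listing above):

```perl
# Compute @p, @r, @e, and the total error for data (1, 2, 3, 5)
# at precision 2, exactly as fudge_numbers does.
sub round { sprintf("%.0f", shift) }

my @d     = (1, 2, 3, 5);
my $scale = 10 ** 2;
my $sum   = 0;  $sum += $_ for @d;

my @p = map { $_ * 100 * $scale / $sum } @d;   # 909.09..., 1818.18..., ...
my @r = map { round($_) } @p;                  # 909, 1818, 2727, 4545
my @e = map { $p[$_] - $r[$_] } 0 .. $#r;      # all positive: rounded low

my $t = 0;  $t += $_ for @e;
my $total_error = round($t);                   # 1: one jot too low overall
```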

The big 'if' block in the middle of the function does the fudging.
First it calculates $sign, which is +1 if the numbers need to be
fudged upward (because the total is too small, as with 33% + 33% +
33%) and -1 if the numbers need to be fudged downward (because the
total is too large, as with 17% + 17% + 17% + 17% + 17% + 17%.)  We'll
choose some of the elements of @r and add $sign to them to make the
total come out right.  Because all the numbers have been scaled so
that the least significant place is just to the left of the decimal
point, we never need to consider a fudge amount other than +1 or -1.

Now we scan over the elements of @r looking for candidates for
fudging.  If a number is already too small, we mustn't fudge it
still further downward, and vice versa; the "next unless $e[$_] * $sign > 0"
line takes care of this check:  the rounding error for this element
(the true value minus the rounded value) must be in the *same*
direction as the fudge we're trying to apply.

When we find a fudging candidate, we fudge it ($r[$_] += $sign) and
then decrement $total_error, which by this point holds the absolute
number of jots still outstanding.  When it reaches zero, no more
adjustments are necessary.

After we've finished any necessary adjustments, we scale the adjusted
elements of @r back to the right size for percentages and return the
results.
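To make the control flow concrete, here is the fudge loop run in
isolation on the walkthrough's values (the @e entries are abbreviated;
only their signs matter to the loop):

```perl
# The rounded values sum to 9999, one jot short of 10000.
my @r = (909, 1818, 2727, 4545);
my @e = (0.09, 0.18, 0.27, 0.45);   # true minus rounded, all positive
my $total_error = 1;                # the rounded total is one jot low

my $sign = $total_error < 0 ? -1 : 1;
$total_error *= $sign;              # absolute value

for (0 .. $#r) {
    next unless $e[$_] * $sign > 0; # only fudge values rounded the right way
    $r[$_] += $sign;
    $total_error--;
    last if $total_error == 0;
}

my @pct = map { $_ / 100 } @r;      # (9.1, 18.18, 27.27, 45.45): sums to 100
```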


*. This time there was no discussion of peculiar edge cases.  Are
   negative numbers allowed?  What if all the numbers are zero?
   Perhaps all the edge-case-fanciers were on vacation.

*. This problem turned out to be quite difficult to get right, much
   harder than I thought it would be.  Of 17 programs posted to the
   -discuss list, only 4 (from 3 authors) passed all the tests!  

   You should consider trying the test suite yourself.  You can obtain
   it from

        http://perl.plover.com/qotw/misc/r007/TestFudge.pm

   Then to use it, run the command

        perl -MTestFudge yourprogram.pl

   and look for 'not ok' in the output.  If your program fails a test,
   debugging it will probably be at least as instructive as doing the
   quiz in the first place.

   Thanks to Andreas Koenig for the tricky test case (#44) that caught
   out one of the submitted programs.

*. A very common error was to compute the fudge factor correctly and
   then to apply it to the wrong elements.  Many people assumed that
   any of the result values could be fudged.  But doing so can lead to
   bizarre results.  Nobody would accept (37, 23, 40) as a valid
   fudging of (33.3, 33.3, 33.3).  Similarly, one person said on the
   -discuss list:

        I do not think the improper result from

        [0 1 1 1 1 1 1]  -> [15 17 17 17 17 17]

        is a bug so much as an issue with the constraints of the
        problem.

    Maybe, but the problem said:

        The return value ... will be a list of numbers indicating
        relative percentages.... [which] must be rounded off as
        indicated by the rounding argument.

    There is no way to interpret '15' as 16 2/3 % (the exact relative
    percentage) rounded off to 0 decimal places.  (The person quoted
    above submitted a revised solution when this language was pointed
    out to him; nevertheless, even if I'd somehow left a loophole in
    the problem specification, what's the point of producing a
    solution that you know is defective just because you can weasel it
    through a loophole in the problem statement?  The Quiz of the Week
    is not mandatory.)

    Anyway, many results were misrounded even by solutions that were
    *not* deliberately ignoring the requirement to round off.  For
    example, test 104 concerned the data (2 2 1 1 1 1) rounded off to
    0 places.  The exact percentages are (25, 25, 12.5, 12.5, 12.5, 12.5).
    There are a lot of reasonable answers here, all of the form (25,
    25, 12, 13, 12, 13).   But what you *cannot* do is alter the 25,
    which is already exact.  There is no interpretation of 'round
    off' in which 25 is 'rounded off' to anything other than 25.

    Nevertheless, among the solutions submitted on the -discuss list,
    the 25 was 'rounded off' to 23, 24, 26, and 27.   (One hapless
    poster got the 25's right and then rounded off 12.5 to 14.)  All
    together, 10 of the 17 posted solutions failed this test.

    Similarly, faced with (2 1 1 1 1), where the exact answer was (33 1/3,
    16 2/3, 16 2/3, 16 2/3, 16 2/3), and the correct result would have
    been something like (33, 16, 17, 17, 17), eight of the 17 programs
    produced (32, 17, 17, 17, 17) instead.

*. Randal Schwartz said:

        I thought about a test harness for easy#7, but when I realized
        that the numbers could come back in any order, I punted. :)

   I had meant to require the percentages to be in the same order as
   the input data.  That is, given data (1, 2, 3, 4), the return
   values MUST be (10, 20, 30, 40), and not some other permutation of
   those.  But it turned out that nobody returned the results in the
   wrong order, so I didn't have to worry about it.
   

*. John Macdonald's third posted solution
   (http://perl.plover.com/qotw/misc/r007/macdonald3.pl) is worth
   study, because it relies on a clever insight:

        Using truncation has some advantages.

        - As you point out, the numbers are within 1 in the last
          decimal place.

        - Fudging will always be incrementing, never decrementing.

        Then a light bulb went on.  If you use the technique of
        applying the fudge to the elements that had the greatest error
        from the truncation, the fudging process will then minimise the
        final error and come to the same result as when you start with
        rounding.

        And, because the fudging is always positive, the code is
        simpler.  I like that.

    This gave me that "Gosh, I wish I had thought of that" feeling.
    Modulo this insight, his program is very similar to mine.
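Here is a small sketch of that idea (my own restatement of the quoted
description, not Macdonald's actual code): truncate every scaled
value, then hand the missing jots to the elements with the greatest
truncation error.  Truncation never overshoots, so the fudging is
always an increment.

```perl
sub fudge_trunc {
    my ($prec, @d) = @_;
    my $scale = 10 ** $prec;
    my $sum = 0;  $sum += $_ for @d;

    my @p = map { $_ * 100 * $scale / $sum } @d;
    my @t = map { int } @p;                 # truncated values
    my $have = 0;  $have += $_ for @t;
    my $jots = 100 * $scale - $have;        # how many jots are missing

    # Largest truncation errors get the jots; 0 .. -1 is empty when
    # no jots are missing, so nothing is adjusted in that case.
    my @order = sort { ($p[$b] - $t[$b]) <=> ($p[$a] - $t[$a]) } 0 .. $#t;
    $t[$_]++ for @order[0 .. $jots - 1];

    map { $_ / $scale } @t;
}
```

For (0, 2, 2, 1, 1, 1, 1) this leaves the exact 25s alone and bumps
two of the 12.5s up to 13, as required.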

*. Since at most one other person produced a correct answer, I thought
   I'd better look at it to see if it was doing anything different.
   It was sent in by Brian King.  The first thing that grabbed my
   attention was:

     my $tolerance = 0.000_000_001;    # 1 one-billionth should suffice.

   What's funny about this is that I had been about to write almost
   exactly the same thing in my own program, but then I got a nagging
   feeling about what would happen if the caller asked for their
   percentages rounded off to the nearest ten-billionth, and I
   couldn't see a way out, so I scrapped the whole idea and went with
   the always-use-integers approach that I showed above.  And sure
   enough, Mr. King's program produces the wrong answer for (10, 1, 1, 1).
   The output is (33.3333333333 33.3333333333 33.3333333333), but it
   should be     (33.3333333334 33.3333333333 33.3333333333).  Darn!

   That said, there were a couple of other things I found interesting
   about Mr. King's program.  It was one of the longer ones
   (second-longest, in fact) and I wondered why.  Mostly it seemed to
   be because of repeated code.  For example:

             if ( $off_by > 0 && ( abs($off_by) > $tolerance ) ) {
                 # if the overall difference is positive & we're still off...
                 if ( $out > ( ( $this / $sum ) * 100 ) ) {
                     # and if we rounded this one up
                     $out -= $precision;
                     $off_by -= $precision;
                     #round it down instead. update how much we're still off
                 }
             } ## end  if ( $off_by > 0 && ( abs($off_by...
             elsif ( $off_by < 0 && ( abs($off_by) > $tolerance ) ) {
                 if ( $out < ( $this / $sum ) * 100 ) {
                     $out    += $precision;
                     $off_by += $precision;
                 }
             }

   Here I would have at least eliminated the repeated

     abs($off_by) > $tolerance 

   test:

        if (abs($off_by) > $tolerance) {
          if ($off_by > 0) {
            ...
          } elsif ($off_by < 0) {
           ...
          }
        }

   but I would have preferred to somehow merge the two blocks into
   one.  And in fact, I *did* merge the two blocks into one; this code
   corresponds closely with the 

              next unless $e[$_] * $sign > 0;  # Error goes the wrong way
              $r[$_] += $sign;          # Adjust value
              $total_error--;          

   section of my example program.  Along similar lines, Mr. King's
   program has

     if ( $precision == 0 ) {
         $precision = 1;
     }
     else {
         $precision = '.' . '0' x ( $precision - 1 ) . '1';
     }

   but it would have been simpler to do

         $precision = 10 ** -$precision;
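   The two forms produce the same number, which a quick check confirms
   (a sketch; equality is tested only to within floating-point
   representation):

```perl
# Compare the string-building form with the exponent form for a few
# precision values.
for my $precision (0 .. 5) {
    my $string_form = $precision == 0
        ? 1
        : '.' . '0' x ($precision - 1) . '1';   # e.g. '.001' for precision 3
    my $exp_form = 10 ** -$precision;
    die "mismatch at precision $precision"
        unless abs($string_form - $exp_form) < 1e-15;
}
print "both forms agree\n";
```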

   Don't take these criticisms too seriously, since I wouldn't even
   have been looking at the code so closely if it hadn't outperformed
   almost all the other submitted programs; the one real defect could
   be fixed (if necessary) by adjusting $tolerance to a more
   appropriate value.

That's all for this week's regular quiz.  I'll send something about
the frost simulators tomorrow, and new quizzes on Wednesday.  My
thanks to everyone who contributed to the discussion, but also
especially to the people who worked the problem on their own.

Sample solutions and discussion
Perl Quiz of The Week #8 (20021211)

        Bill Gosper, a famous programmer, once said that a good way to
        manufacture word puzzles was to look through the dictionary
        for a word that contains a sequence of four letters that does
        not appear in any other word.  Then the puzzle is to guess the
        word, given only the four letters.

        For example, what common English word contains the contiguous
        sequence of the four letters 'acur'?  (Gosper says that you
        see this word every week, but that it will take you a month to
        figure out what it is.)

        Write a Perl program which, given a dictionary, generates two
        output files, 'questions' and 'answers'.  'questions' should
        contain every sequence of four letters that appears in exactly
        one word of the dictionary, one sequence per line.  'answers'
        should contain the corresponding words that contain the
        sequences, in the same order, again one per line.

        For example, given the trivial dictionary containing only

                arrows
                carrots
                give
                me

        The outputs should be:

                'questions'             'answers'

                carr                    carrots
                give                    give
                rots                    carrots
                rows                    arrows
                rrot                    carrots
                rrow                    arrows

        Of course, 'arro' does not appear in the output, since it is
        found in more than one word.



Here's a sample program, provided by Jonathan Scott Duff.  I trimmed
it a little.

        # Well, I see bunches of other people posting their solutions to the
        # regular quiz, so here's mine:
        # 
        #!/usr/bin/perl

        $SEG_LENGTH = 4;

        while (<>) {
           chomp;
           next if /\W/;		
           $w = lc $_;
           %w = map { substr($w,$_,$SEG_LENGTH) => 1 } 
                    0..length($w)-$SEG_LENGTH;

           for $w (keys %w) { 
              $wordmap{$w} = exists $wordmap{$w} ? undef : $_; 
           }
        }

        open(Q,">questions") or die;
        open(A,">answers") or die;
        for (sort keys %wordmap) {
           next unless defined $wordmap{$_};
           print Q "$_\n";
           print A "$wordmap{$_}\n"
        }
        close Q; close A;



The main data structure in the program is the hash %wordmap.  Keys in
%wordmap are strings of length 4.  The value associated with a key $k
is the word in which $k appears, if $k appears in only one word, and
an undefined value if $k appears in more than one word.

The program first converts each input word to all lowercase, and then
uses 'map' to construct a hash, %w, whose keys are the length-4
segments of the word.  For example, if the word is 'phlebotomy', the
hash is

        ('phle' => 1,
         'hleb' => 1,
         'lebo' => 1,
         'ebot' => 1,
         'boto' => 1,
         'otom' => 1,
         'tomy' => 1,
        )

The 1's aren't significant; they're just placeholders.  Using a hash
in this way is a common Perl idiom for representing a set of strings.
The program then loops over the keys, looking up each one in %wordmap.
If a key was already in wordmap, then this is at least the second time
it has been seen, so the program sets the associated value to 'undef',
to indicate that it has appeared more than once.  If the key isn't in
%wordmap yet, then it's inserted into %wordmap, and the associated
value is the single word in which it has appeared.

After generating %wordmap, the program writes out the questions and
answers files, skipping over any elements of %wordmap whose values are
undefined.  
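Re-running that logic on the trivial dictionary from the problem
statement reproduces the expected 'questions' and 'answers' columns:

```perl
# Build %wordmap for the four-word example dictionary.
my %wordmap;
for my $word (qw(arrows carrots give me)) {
    my $w = lc $word;
    my %w = map { substr($w, $_, 4) => 1 } 0 .. length($w) - 4;
    for my $seq (keys %w) {
        $wordmap{$seq} = exists $wordmap{$seq} ? undef : $word;
    }
}
# 'me' is too short to contribute any sequences, and 'arro' appears
# in two words, so its value is undef and it is skipped.
my @questions = grep { defined $wordmap{$_} } sort keys %wordmap;
my @answers   = map  { $wordmap{$_} } @questions;
```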

*. Since the values in the %w hash are never used or examined at all,
   it might seem that we could dispense with them, replacing

           %w = map { substr($w,$_,$SEG_LENGTH) => 1 } ... ; 
                    

           for $w (keys %w) { 
              ...
           }
   with

           @w = map { substr($w,$_,$SEG_LENGTH) } ... ;

           for $w (@w) { 
              ...
           }

   This was a common error in the submitted programs.  The problem it
   causes occurs with words like 'alfalfa' and 'lightweight' which
   contain the same sequence of four letters more than once.  

   The second version of the code sets @w to

        ('alfa', 'lfal', 'falf', 'alfa')

   and then iterates over this list, processing 'alfa' twice.  It then
   erroneously marks 'alfa' in %wordmap as appearing in two words when
   in fact it has appeared twice in only one word.  To avoid this, we
   must be sure to process each sequence of four letters at most once
   per word.  Storing the sequences as keys in the %w hash ensures
   this, because hash keys are unique.  The %w generated for 'alfalfa'
   is

        ('alfa' => 1, 'lfal' => 1, 'falf' => 1)

   and so iterating over the keys processes 'alfa' only once.
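The difference is easy to see for 'alfalfa':

```perl
# 'alfalfa' contains 'alfa' twice; the array keeps both copies, but
# the hash collapses them into a single key.
my $w = 'alfalfa';
my @w = map { substr($w, $_, 4) } 0 .. length($w) - 4;
my %w = map { $_ => 1 } @w;

my $array_count = @w;          # 4: 'alfa' is listed twice
my $hash_count  = keys %w;     # 3: each distinct sequence once
```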

1. A way to fix the problem without introducing another hash appears
   in Ronald Kimball's program.  Ronald's program solves the problem
   more directly:

   for $w (@w) { 
      $wordmap{$w} = exists $wordmap{$w} && $_ ne $wordmap{$w} ? undef : $_; 
   }

   The second and subsequent times that the program sees a particular
   sequence, it throws away the stored word only if it's different
   from the current word.

   As written above, the program generates a huge number of
   'uninitialized value' warnings because the 'undef' values stored in
   the hash to indicate a sequence that has been seen two or more
   times are compared with $_.  Ronald's program uses 1 instead of
   undef, so it doesn't generate any warnings.  Another way to silence
   the code above is to shut off warnings.

2. Mr. Duff's program, as submitted, actually finds unique sequences
   of 'n' letters, where 'n' defaults to 4, the number specified in
   the question.  If it's run as

        duff.pl -n 2 < dictionary

   it finds digraphs (pairs of letters) that occur in only one word
   each.  To make the code simpler, I trimmed this out and replaced
   the $opt{'n'} parameter with $SEG_LENGTH.

3.  As a trivium, here's the output for n=2:

        bg bw dz fc fj fp fw gj hq hv iy jr lj qa sj 
        vk vs vz wz xb xn xq xs xv xx yj yq zd zg zm zp

   Of these, 13 make good puzzles:

        bg bw fc fp fw gj hq lj sj wz xq xs yj

   The rest are either proper nouns (both English and otherwise)

        hv iy qa vs xb xn xv xx yq zd zg zm zp

   or are of visibly foreign origin ('resident aliens'):

        dz fj vk vz

   or are abbreviations:

        jr

   I think my favorite one is probably 'hq'.


4. People sometimes suggest that Perl's '..' operator should construct
   a backwards-counting range if the second operand is smaller than
   the first.  For example, they say that 4..0 should produce the list
   (4, 3, 2, 1, 0).  At present, it produces the empty list.

   This program demonstrates one of the many reasons why this is a bad
   idea.  Consider this part of the code:

           %w = map { substr($w,$_,$SEG_LENGTH) => 1 } 
                    0..length($w)-$SEG_LENGTH;

   Suppose $SEG_LENGTH is 4 and $w is "cat".  The operands of the '..'
   are 0 and -1.  With the existing semantics for '..', the '..'
   generates an empty list for 'map' to iterate over, the hash %w
   becomes empty, and the word is effectively skipped--just the right
   thing.

   With the defective alternative behavior, the '..' would generate
   the list (0, -1), and the 'map' generates the (bizarre) list 
   ('cat', 't').  To get correct behavior, the code would have to be
   adjusted with a special case to check for length($w) < $SEG_LENGTH.

   A similar example concerns this construction:

        @rest = @a[2..$#a];

   Here the intent is to copy the third-through-last elements of @a.
   For example, if @a contains 10 elements, $#a is 9, and @rest gets
   elements 2 through 9.  If @a contains only one element, the '2..1'
   expands to an empty list, and @rest is assigned nothing---which is
   just what was wanted.

   With the alternative semantics, the '2..$#a' expands to (2, 1), and
   @rest is assigned two undefined values.  Again, a special case is
   necessary to guard against precisely the behavior of .. that was
   proposed.  

   (If you do want to count backwards, use something like

        reverse(1..$n) .)
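   All three behaviors are easy to verify:

```perl
# A descending range is empty under the current semantics, which
# quietly does the right thing in both situations described above.
my $w   = "cat";
my @idx = 0 .. length($w) - 4;   # 0 .. -1: empty, so short words are skipped

my @a    = ("only");
my @rest = @a[2 .. $#a];         # 2 .. 0: empty slice, @rest gets nothing

# To really count backwards, reverse an ascending range.
my @countdown = reverse(1 .. 4); # (4, 3, 2, 1)
```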

5. As usual, many people submitted programs that did not adhere to the
   interface I asked for in the question, making various gratuitous
   changes to the input semantics, the output file names, the output
   format, or whatever.  Unlike in the past, I decided not to repair
   these.  

   The changes that puzzled me most were the ones that replaced the
   two output files ('questions' and 'answers') with a single output
   file.  I agree that this is simpler and more natural.  Usually I
   would have specified a single file with two columns.  But in this
   case that format is no good, because when you try to pick out a
   puzzle, you see the answer right next to it, which spoils the fun.

6. One common variation, particularly among the shorter programs, was
   to use a tricky regex to generate the substrings, instead of the
   loop shown above.  For example

        while (/(?=(.{4}))/g) {
          ...
        }

   was a popular trick.  (This iterates the 'while' loop once for each
   four-letter sequence, with $1 set to each sequence in turn.)  
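   Run on the (made-up) input 'carrots', the trick yields every
   overlapping window:

```perl
# The lookahead captures four characters but consumes none, so the
# engine advances one position per match and every overlapping
# window is produced.
my @seqs;
$_ = "carrots";
push @seqs, $1 while /(?=(.{4}))/g;
# @seqs is now ('carr', 'arro', 'rrot', 'rots')
```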

7. When I first tested the programs, I got a surprise.  Everyone's
   programs ran very quickly, except mine, which was by far the
   slowest of the bunch.  I wondered what elementary mistake I must be
   making.  Unfortunately, it turned out to be an error in the test
   apparatus, not a programming mistake in my program. (I had been
   looking forward to discussing it.)  I had forgotten to trim the
   email headers out of the other programs, so mine was the only one
   that wasn't aborting immediately with multiple syntax errors.


Once again, my thanks to everyone who participated.  I will send out a
new quiz on Wednesday.


Sample solutions and discussion
Perl Quiz of The Week #9 (20021218)


        You will write a simple spelling checker program, similar to
        the Unix 'spell' utility.

        The program should be called 'spel'.  It will read a document
        from standard input and print on standard output a list of all
        the misspelled words in the document.  If any command line
        arguments are given, 'spel' should read those files instead of
        the standard input.

        The output words should be in lexicographic order, and no word
        should appear more than once in the output.

        'spel' will be given one or more dictionaries of words that
        are already spelled correctly.  It will always try to read the
        file '/usr/dict/words'.  It will also try to read '.spel'
        files from certain directories.  If the user has set an
        environment variable SPELWORDS, 'spel' should interpret its
        value as a :-separated list of directories to be searched for
        '.spel' files.  If no SPELWORDS variable is set, 'spel' should
        search in the user's home directory and in the current
        directory.

        If you need a sample dictionary, you can obtain one from 

                http://perl.plover.com/qotw/words/


----------------------------------------------------------------

Here's sample code, submitted by Abigail:

#!/usr/bin/perl

#
# The exercise isn't clear what's to be considered a word,
# or how to deal with capitalization.
#
# This program considers words to be substrings consisting of only
# 'alpha' characters. This means that 'words' like "isn't" are
# considered to be two words, 'isn' and 't'.
#
# As for capitalization, words in the text should have the same
# capitalization as in the dictionary. However, since words starting
# a sentence are capitalized, we permit the first letter of a word 
# to be capitalized, even if the dictionary only has the all lower case
# version of the word. No attempt at parsing sentences, trying to detect
# first words, has been made.
#

use strict;
use warnings;

my @std_dicts;    #  The default dictionaries.
my @spel_dirs;    #  Directories to look for .spel files.
my @dicts;        #  List of dictionaries.

my %words;        #  Words found in the dictionaries.
my %mistakes;     #  Mistakes in the file(s).

@std_dicts = ("/usr/dict/words",         # The exercise specifies this file,
              "/usr/share/dict/words");  # but on my system, it's found here.
@dicts = grep {-f} @std_dicts;           # So, we'll do some juggling,
splice @dicts => 1 if @dicts;            # making sure we use at most 1 file.

# Adding the ".spel" files.
@spel_dirs = defined $ENV {SPELWORDS} ? split /:/ => $ENV {SPELWORDS}
                                      : ($ENV {HOME}, ".");
push @dicts => grep {-f} map {"$_/.spel"} @spel_dirs;

#
# Init the dictionaries.
#
{
    local @ARGV = @dicts;
    while (<>) {
        chomp;
        $words {$_} = 1;
        $words {+ucfirst} = 1 unless /[[:upper:]]/;
    }
}

#
# Read the text, record all words not found in a dictionary.
#
while (<>) {$words {$1} or $mistakes {$1} = 1 while /([[:alpha:]]+)/g}

#
# Print the mistakes, sorted.
#
print "$_\n" for sort keys %mistakes;

__END__

----------------------------------------------------------------


Abigail's program has four phases.  First there's an initialization
section, in which it determines which dictionaries to use.  Then it
loads words from the dictionaries into a hash, %words.  Third, the
program loops over the manuscript input, checking each word against
%words.  Words not present in %words are noted in %mistakes.
Finally, the program prints out the words from %mistakes.

The initialization section first decides where the standard dictionary
is; my problem statement said it would be in '/usr/dict/words', but on
many systems (including mine---hmmm) it's in '/usr/share/dict/words'.
Abigail's code prefers the former if it exists, but if not it uses the
latter:

        @std_dicts = ("/usr/dict/words",         
                      "/usr/share/dict/words");  
        @dicts = grep {-f} @std_dicts;           
        splice @dicts => 1 if @dicts;            

Note that the code does not depend on there being exactly two items in
@std_dicts; you can list as many standard dictionaries as you want, in
the order you would prefer to try them, and the program will use the
first one it finds.  But I wonder if it might not have been more
perspicuous to write something like

        @std_dicts     = ("/usr/dict/words",         
                          "/usr/share/dict/words");  
        my ($std_dict) = grep {-f} @std_dicts;           

and then use

        local @ARGV = (@dicts, $std_dict);

later on.  
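That one-element-list assignment is a standard Perl idiom for "first
match"; with made-up data:

```perl
# List assignment is positional, so a one-element target on the left
# captures just the first thing grep passes through.
my @nums = (3, 8, 5, 10);
my ($first_big) = grep { $_ > 4 } @nums;   # 8, not (8, 5, 10)
```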

Initializing the dictionaries uses a handy trick that all Perl
programmers should be aware of.  Everyone knows about the <> operator,
which reads a line of input from the files named on the command line,
or from the standard input if none are named.  What many people aren't
aware of is that you can fool it about what the command-line files
are.  This is what Abigail is doing here:

        {
            local @ARGV = @dicts;
            while (<>) {
                chomp;
                $words {$_} = 1;
                $words {+ucfirst} = 1 unless /[[:upper:]]/;
            }
        }

'local @ARGV' temporarily resets the value of @ARGV, which is what <>
looks at to determine the command-line arguments.  Since the files
named in @ARGV are exactly the names of the dictionaries, the <>
operator reads one line at a time from each dictionary.  At the end of
the block, the effect of the 'local' is undone and @ARGV resumes its
original value.
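Here is a miniature, self-contained version of the trick; the two-word
dictionary and the temporary file are just stand-ins for this sketch
(File::Temp is a core module):

```perl
use File::Temp qw(tempfile);

# Write a tiny stand-in dictionary to a temporary file...
my ($fh, $dict) = tempfile(UNLINK => 1);
print $fh "carrot\nlarry\n";
close $fh;

# ...then point <> at it by temporarily resetting @ARGV.
my %words;
{
    local @ARGV = ($dict);
    while (<>) {
        chomp;
        $words{$_} = 1;
        $words{+ucfirst} = 1 unless /[[:upper:]]/;
    }
}
# %words now contains carrot, Carrot, larry, and Larry.
```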

The 'ucfirst' code here takes care of a detail that several submitters
forgot.  If the dictionary contains the word 'carrot', we would like
to accept both 'carrot' and 'Carrot' as correct.  The 'ucfirst' takes
care of this; if the word 'carrot' appears in the dictionary file,
then both 'carrot' and 'Carrot' are placed in the hash.  For the
dictionaries I supplied, the '/[[:upper:]]/' special case is never
meaningfully exercised.  It would become important if the dictionary
contained a word like 'iMac' which contained uppercase letters but
whose initial letter was not already uppercase.  The guard condition
would prevent 'IMac' from being added to the dictionary.  It's not
clear to me that this is really the right thing to do, however.
(Does 'iMac' get capitalized at the start of a sentence?  I don't know.)

The '+' on the 'ucfirst' prevents Perl from taking 'ucfirst' as a
literal hash key.

The manuscript input is read by another '<>' loop:

        while (<>) {$words {$1} or $mistakes {$1} = 1 while /([[:alpha:]]+)/g}

The order of control flow here may not be clear.  It's equivalent to:

        while (<>) {
          while (/([[:alpha:]]+)/g) {
            unless ($words {$1}) {
               $mistakes {$1} = 1;
            }
          }
        }

(This sort of thing is the reason that some people love statement
modifiers and other people hate them.)  

The interesting feature here is the use of '//g' to locate the words
in each line.  while (/..../g) { ... } in general will search through
$_, repeating the 'while' loop once for each time the pattern matches.
The pattern '[[:alpha:]]+' will match sequences of one or more
alphabetic characters.
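For example, on a line of made-up input:

```perl
# Collect the alphabetic runs from one line, the way the checker does.
my @found;
$_ = "The carrot, the Carrot.";
push @found, $1 while /([[:alpha:]]+)/g;
# Punctuation never gets into the captured words.
```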

Note that this treats the word 'isn't' as two words, 'isn' and 't',
and similarly 'pot-au-feu' as 'pot', 'au', and 'feu'.  To add
apostrophes and hyphens to the list of characters that may appear in a
'word', change this to

          while (/([[:alpha:]'-]+)/g) {

This is fraught with its own dangers; if the input now contains the line:

        to accept both 'carrot' and 'Carrot' as correct.  The 'ucfirst' takes

then "'carrot'" will be recognized as a 'word' and looked up in the
dictionary; but "'carrot'" isn't in the dictionary; only "carrot" is
present.  To properly handle all cases correctly can be rather tricky.
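For example, with the widened character class on a made-up line:

```perl
# With apostrophe and hyphen added to the character class, a quoted
# word keeps its quotes.
my @words;
$_ = "accept both 'carrot' and pot-au-feu";
push @words, $1 while /([[:alpha:]'-]+)/g;
# @words: ('accept', 'both', "'carrot'", 'and', 'pot-au-feu')
```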

Finally, the list of misspellings is printed out with a
straightforward loop:

        print "$_\n" for sort keys %mistakes;

----------------------------------------------------------------

1. Loading the dictionary was a little tricky.  Several submitters
   wrote code like this to load the dictionary:

        while (<F>) { chomp; $WORDS{lc $_} = 1; }

   The words are smashed to all-lowercase before being stored, which
   leads their programs to accept some rather peculiar words.  For
   example, one of the dictionary files I supplied contains the 'word'
   'IEEE', the acronym for the Institute of Electrical and Electronics
   Engineers, which is likely to appear in many technical contexts.
   If the case is smashed, the spell-checker will silently accept the
   word 'ieee', and typically 'IEeE' and 'CArrot' as well.

   Some submitters forgot that 'carrot' in the dictionary indicates
   that 'Carrot' is also acceptable.  Some remembered, but got the
   code wrong.  For example, the solution I wrote before I posed
   the problem loads the words into a hash exactly as they are given,
   and then checks for a word's presence with:

        $bad{$_}++ unless $d{$_} || $d{lcfirst $_} || $d{ucfirst $_};

   This takes care of 'Carrot' properly, even if the dictionary
   contains only 'carrot'.  Unfortunately, it also causes the program
   to silently accept 'larry', even though the dictionary contains
   only 'Larry'.  Whoops!  It also refuses to accept 'CARROT'; I would
   consider this a bug.
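   A lookup that handles all three cases might accept a word if it
   appears in the dictionary verbatim, or if it is the ucfirst or
   all-caps form of a lowercase dictionary entry.  Here is a sketch
   (not from any submitted solution; the dictionary is stored exactly
   as given):

```perl
# Hypothetical sketch: accept 'carrot', 'Carrot', and 'CARROT' when the
# dictionary contains 'carrot', but accept 'Larry' only as written.
my %d = map { $_ => 1 } qw(carrot Larry);

sub word_ok {
    my $w = shift;
    return 1 if $d{$w};          # exact match handles 'Larry'
    my $lc = lc $w;
    return 0 unless $d{$lc};     # otherwise the lowercase form must be listed
    # accept only the two conventional variants of a lowercase word
    return $w eq ucfirst($lc) || $w eq uc($lc);
}
```

   This still rejects 'LARRY', which you might or might not consider a
   bug.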

2. The punctuational issue is one of those problems that gets more and
   more complicated the longer you look at it.  At first it seems that
   it can be solved by just treating hyphen and apostrophe as letters.
   But if you do that, your program fails on words that are quoted by
   being placed between apostrophes, as 'Carrot' is in this sentence.

   A second-order approximation is to trim punctuation from the
   beginning and the ending of each word before checking it, but then
   (as one submitter observed):

        Since I strip off all trailing punctuation, my program as it
        stands will flag 'words' such as 'etc.', 'a.c.', 'a.k.a.' as
        wrong, even if they are in the dictionaries used.
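   A sketch of such a trim, using a character class so that internal
   apostrophes and hyphens survive (the function name is invented):

```perl
# Hypothetical sketch: strip non-letters from both ends of a token,
# leaving internal apostrophes and hyphens ("don't", "pot-au-feu") alone.
sub trim_token {
    my $w = shift;
    $w =~ s/^[^[:alpha:]]+//;    # leading quotes, parentheses, dashes
    $w =~ s/[^[:alpha:]]+$//;    # trailing quotes, commas, periods
    return $w;
}
```

   This turns "'carrot'" into "carrot", but, as the submitter says, it
   also turns "etc." into "etc".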


3. Nearly everyone's programs loaded the dictionary into a hash.  Two
   submissions didn't.  One loaded the dictionary into an array and
   did linear search on the array.  On a 10-word file, this program
   took 9 sec to check the file; Abigail's program took 25 sec; most
   of the extra time was taken up by constructing the hash.  (Some of
   the extra time occurred during the global destruction phase, after
   the program had completed; apparently the dictionary hash was
   destructed one key at a time, which I don't understand.)  But even
   with this extra overhead, the hash approach won for any file that
   wasn't trivially small.  For a 270-word file, the linear search
   program took 123 seconds; Abigail's program still took 25 sec.

   Another submission generated an enormous tree structure with hashes
   as the nodes.  This took a long time to build and search (17
   seconds to load a small dictionary that Abigail's program dealt
   with in less than 2 seconds) and a humongous amount of memory (I
   could not load the Web2 dictionary file.)

4. One submission contained the following code:

        if ($#ARGV == -1) {
            foreach (grep {!exists($words{lc $_})} split /\W+/, <>) {
                print qq("$_"\n);
            }
        } else {
            foreach(@ARGV) {
                open FILE, "< $_" or die "Couldn't open input, $_. $!";
                print "\nMispellings in $_:\n";
                foreach (grep {!exists($words{lc $_})} split /\W+/, <FILE>) {
                    print qq("$_"\n);
                }
                close FILE;
            }
        }

   The repeated 'foreach' block is a red flag; it suggests that the
   programmer should look for a way to merge the two blocks, and then
   see if that makes the code any easier to understand.  In this case,
   it's easy:

       foreach (grep {!exists($words{lc $_})} split /\W+/, <>) {
           print "\nMisspellings in $ARGV:\n" if $ARGV ne $prevARGV;
           print qq("$_"\n);
           $prevARGV = $ARGV;
       }

   Or perhaps

       @ARGV = ('-') unless @ARGV;
       foreach(@ARGV) {
           open FILE, "< $_" or die "Couldn't open input, $_. $!";
           print "\nMisspellings in $_:\n";
           foreach (grep {!exists($words{lc $_})} split /\W+/, <FILE>) {
               print qq("$_"\n);
           }
           close FILE;
       }

5. A number of programs had extra features that I thought substantially
   reduced the usefulness of the program.  For example, several of the
   programs produced extraneous output:

              print "\n$word misspelled at line $.\n";

   or

              print "\nMispellings in $_:\n";

   Extraneous output makes the program more difficult to use as a
   tool.  As specified, the program produces a list of misspelled
   words.  This list could be fed into another program, such as an
   editor, which could provide a convenient interface to correcting
   the misspellings.  The list could be piped into a program which
   might make guesses about what words were intended.  The output
   could be directed to the end of './.spel' and then edited.  Extra
   output makes all of these things more difficult.  At best, it would
   have to be filtered out before the output would be useful for
   anything other than human consumption.    Diagnostic messages, if
   they appear at all, should be printed to STDERR; that is what it is
   for.  

   Here's an example that's particularly egregious:

        print "Do you want to change the default spellcheck settings? (y/n): ";
        chomp (my $choice = <STDIN>);
        if (uc $choice eq 'Y') {
            print "Ignore words all in CAPITALS? (y/n): ";
            chomp (my $ch1 = <STDIN>);
            $ignore_caps = 'Y' if uc $ch1 eq 'Y';
            print "Ignore words with numbers? (y/n): ";
            chomp (my $ch2 = <STDIN>);
            $ignore_nums = 'Y' if uc $ch2 eq 'Y';
        }

   Using this program in any way other than as an interactive
   application is very difficult.  In spite of all the work that went
   into the interface, it remains inflexible; the program will only do
   the things that the programmer imagined.  Options like whether to
   ignore numerals should be specified non-interactively, on the
   command line.
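   With the standard Getopt::Std module, the same two choices become
   command-line flags.  This is a sketch, and the option letters are
   invented:

```perl
use Getopt::Std;

# Hypothetical sketch: -c ignores words in all capitals, -n ignores
# words containing digits.  Non-interactive, so the program can still
# be used in a pipeline.
our ($opt_c, $opt_n);
getopts('cn');

sub skippable {
    my $word = shift;
    return 1 if $opt_c && $word =~ /^[A-Z]+$/;
    return 1 if $opt_n && $word =~ /\d/;
    return 0;
}
```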

6. Robin Szemeti suggested using the Search::Dict module.
   Search::Dict is one of those very clever pieces of software that
   never seems to be useful for anything.  The idea of Search::Dict is
   this: If the input file is in lexicographic order, items in it can
   be found with binary search; this should be quick and also
   memory-efficient, since the whole file needn't be loaded into
   memory at once.  

   I tried out a limited version of this, which looks words up in a
   single dictionary but doesn't actually construct the list of
   misspellings: 

        use Search::Dict;
        open W, "< Web2" or die $!;

        while (<>) {
          for (split /[^a-zA-Z'-]+/) {
            look *W, $_;
          }
        }

   On a 2387 word version of this postmortem file, Abigail's program
   took the usual 27 seconds; the program above took 13.  I was not
   expecting Search::Dict to do so well.  

   I worked up a working spelling checker based on Search::Dict and it
   still took about 13 seconds on the postmortem file.  I was
   surprised; I had expected the file reading overhead to be much
   higher.  On a 39,XXX word file, the hash approach was a big winner.
   Abigail's program still took about 27 seconds; the Search::Dict
   program took 192:

        #!/usr/bin/perl

        use Search::Dict;
        my @std_dicts = ("/usr/dict/words",         
                         "/usr/share/dict/words");  
        my ($std_dict) = grep {-f} @std_dicts;      
        my @spel_dirs = defined $ENV {SPELWORDS} ? split /:/ => $ENV {SPELWORDS}
                                                 : ($ENV {HOME}, ".");
        my @dicts = grep {-f} map {"$_/.spel"} @spel_dirs;

        my @fh;
        for my $dict (@dicts, $std_dict) {
          open my $fh, $dict or die "Couldn't open $dict for reading: $!";
          push @fh, $fh;
        }

        while (<>) {
          my @words = split /[^a-zA-Z'-]+/;
         WORD:
          for (@words) {
            next if $missp{$_};
            for my $fh (@fh) {
              look $fh, $_;
              my $w = <$fh>;
              chomp $w;
              next WORD if $w eq $_;
            }
            $missp{$_} = 1;
          }
        }

        print join "\n", sort(keys %missp), "";

   Search::Dict, by the way, has a lousy interface.  It looks in the
   file to find the first word that is equal to or greater than its
   argument, and leaves the filehandle positioned at the place where
   it found that word; it also returns the position at which the
   handle was left.  But it doesn't return the word itself, or any
   indication of whether it matched the argument or not!  The return
   value, which is the file position, is useless, since you could have
   gotten it by doing tell() on the filehandle.  So my program has to
   reread the word at the current file position and then compare it
   with the argument word, even though Search::Dict has just finished
   doing this.

7. There were a bunch of defects which made me think that programs had
   not been well-tested.  One program wouldn't compile, because it had
   a missing semicolon.  One program did this:

          push(@dictionaries, qw(~/.spel .spel/));

   The author apparently didn't notice that neither '~/.spel' nor
   '.spel/' was ever read.  One program had 

        open DICT,"Web2" || die "Error: $!\n";

   so the error message would never appear; if 'Web2' was not present
   in the current directory when the program was run, it would
   cheerfully report every word as misspelled.
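   The problem is precedence: '||' binds more tightly than the comma,
   so the die attaches to the filename "Web2", which is always true,
   and open's return value is never checked.  Here is the trap in
   isolation (a sketch, using a lexical filehandle and a file that is
   known not to exist):

```perl
my $missing = "no-such-file-$$";

# Buggy: '||' grabs the filename string, which is always true, so the
# die can never fire, and the failure of open goes unnoticed.
my $buggy = open my $fh1, "<", $missing || die "never reached";

# Correct: low-precedence 'or' (or parentheses around open's argument
# list) tests the return value of open itself.
open(my $fh2, "<", $missing) or warn "Couldn't open $missing: $!\n";
```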

----------------------------------------------------------------

Other notes:

* Regular quiz #7 was about taking a list of numbers and then fudging
  the percentages so that they would add up to exactly 100%.  Douglas
  Wilson contributed a good explanation of why someone might really
  want to do this, apart from satisfying a demand from a clueless
  manager.  Suppose your company is in the business of selling
  assemble-it-yourself kits.  You would like to list a price for each
  individual part in the kit; the prices should add up to the cost of
  the whole kit exactly, even when they are rounded off to the nearest
  penny.  Mr. Wilson's message about this is at

        http://perl.plover.com/~alias/list.cgi?1:msp:1183

* Happy new year, everyone!  I will send out Quiz #10 later this evening.



Sample solutions and discussion
Perl Quiz of The Week #11 (20030206)


        Question #1:

        Why does Perl have the 'defined' function?  If you want to see
        if a variable contains an undefined value, why not just use
        something like this?

                if ($var == undef) { ... }

'==' is for comparing numbers.  If its operands aren't numbers to
begin with, they are converted to numbers before being compared.  The
'undef' on the right is always converted to 0, so this test is the
same as comparing for numeric equality with 0.  In particular, the
test returns true when $var is 0, even though it is not undefined.

The test also fails for many strings:

        $var = "oops";
        if ($var == undef) { die }

This dies even though $var is certainly not undefined.
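The right tool is 'defined', which asks exactly the question that
'== undef' cannot:

```perl
my $zero = 0;
my $undef;

# defined() distinguishes undef from values that are merely false:
my $zero_defined  = defined $zero;     # true
my $undef_defined = defined $undef;    # false

# The numeric test cannot; undef numifies to 0, so 0 == undef is true.
my $looks_undef = do { no warnings 'uninitialized'; $zero == undef };
```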

----------------------------------------------------------------

        Question #2:

        What's wrong with this code?

                %hash = ...;

                while (<STDIN>) {
                  chomp;
                  for my $key (keys %hash) {
                    if ($key eq $_) {
                      print "$key: $hash{$key}\n";
                    }
                  }
                }


The 'for' loop scans the hash looking for a particular key.  But the
whole point of a hash is that you *don't* have to scan it to find out
if it contains a certain key or not.  Hashes are organized so that
Perl can look up any given key instantly, without having to examine
each one.

The code here is analogous to searching the telephone book one name at
a time, starting from the first page, even though the telephone book
is carefully organized (in alphabetical order) so that you don't have
to do that.

A better way to write the code would be:


                %hash = ...;

                while (<STDIN>) {
                  chomp;
                  print "$_: $hash{$_}\n";
                }


This error is common in code written by beginning Perl programmers.
Here's some code that one of my interns once wrote:


        foreach $k (keys %in) {

        if ($k eq q1) {
                if ($in{$k} eq agree) {
                        $count{q10} = $count{q10} + 1;
                }
                if ($in{$k} eq disaagree) {
                        $count{q11} = $count{q11} + 1;
                }
        }
        if ($k eq q2) {
                @q2split = split(/\0/, $in{$k});
                foreach (@q2split) {
                        $count{$_} = $count{$_} + 1;
                }
        }
        if ($k eq q3) {
                $count{$in{$k}} = $count{$in{$k}} + 1;
        }
        if ($k eq q4a) {
                if ($in{$k} eq care) {
                        $count{q4a0} = $count{q4a0} + 1;
                }
                if ($in{$k} eq dontcare) {
                        $count{q4a1} = $count{q4a1} + 1;
                }
        }
        if ($k eq q4b) {
                if ($in{$k} eq wish) {
                        $count{q4b0} = $count{q4b0} + 1;
                }
                if ($in{$k} eq dontwish) {
                        $count{q4b1} = $count{q4b1} + 1;
                }
        }
        if ($k eq q5) {
                if ($in{$k} eq yes) {
                        $count{q50} = $count{q50} + 1;
                }
                if ($in{$k} eq "no") {
                        $count{q51} = $count{q51} + 1;
                }
        }
        if ($k eq q6) {
                if ($in{$k} eq yes) {
                        $count{q60} = $count{q60} + 1;
                }
                if ($in{$k} eq "no") {
                        $count{q61} = $count{q61} + 1;
                }
        }
        if ($k eq q7) {
                if ($in{$k} eq "accept") {
                        $count{q70} = $count{q70} + 1;
                }
                if ($in{$k} eq understand) {
                        $count{q71} = $count{q71} + 1;
                }
                if ($in{$k} eq other) {
                        $count{q72} = $count{q72} + 1;
                        $htmlout = comments;
                        open(COMMENTS, ">> /tmp/comments") || die "cant open comments";
                        print COMMENTS "$in{q7a}\n\n";
                        close (COMMENTS);
                }
        }
        if ($k eq q8) {
                if ($in{$k} eq yes) {
                        $count{q80} = $count{q80} + 1;
                }
                if ($in{$k} eq "no") {
                        $count{q81} = $count{q81} + 1;
                }
        }

        }  #end of foreach loop


Larry Wall, the inventor of Perl, has said:

        Doing linear scans over an associative array is like trying to
        club someone to death with a loaded Uzi.
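Rewritten with direct lookups, each branch of the intern's loop
collapses to a line or two.  Here is a sketch of the first question
only (with the 'disaagree' typo corrected):

```perl
# %in maps question names to answers; %count tallies the responses.
# Direct hash lookups replace the scan over keys %in.
my %in = (q1 => 'agree');
my %count;

$count{q10}++ if $in{q1} eq 'agree';
$count{q11}++ if $in{q1} eq 'disagree';
```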


---------------------------------------------------------------- 

       Question #3:


        What's wrong with this code?

                @matching_words = grep search_for($_, $text_file), @words;

                sub search_for {
                  my ($target, $file) = @_;
                  return unless open F, "<", $file;
                  while (<F>) {
                    return 1 if index($_, $target) >= 1;
                  }
                  close F;
                  return;
                }


There are several things wrong with the code.  Probably the biggest
problem is that the search_for function inadvertently destroys the
contents of @words.

Inside a 'grep' loop or a 'foreach' loop with no control variable, the
$_ variable is 'aliased' to the elements of the array.  This means
that you can look at $_ to see the current array element, and also
that you can modify $_ to modify the current array element.  A simpler
example is:

        @n = (1,2,3);
        for (@n) {
          $_ = 'blah';
        }

        print "@n\n";

This prints "blah blah blah".

Since $_ is a global variable, the assignment to $_ inside the
'search_for' function overwrites the aliased values in @words.

Other possible criticisms include:  (a) search_for performs a repeated
search that is probably wasteful; it would be better to convert it
into a hash lookup of some sort, if possible.  (b) If the rest of the
program happened to have a filehandle named 'F',  calling search_for
will close it.  For example, this doesn't work:

        open F, "myfile" or die ...;
        if (search_for("carrot", "otherfile")) { ... }
        my $next = <F>;

because F has been closed by 'search_for'.   This is a violation of
function encapsulation.  If the programmer who had F open is not the
same person as the one who wrote search_for, this is going to create
a bug that will be very difficult to track down.
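Both problems disappear with a lexical loop variable, which leaves the
caller's $_ untouched, and a lexical filehandle, which clobbers no
global handle.  (This sketch also compares index() against 0 rather
than 1, so a match at the start of a line is found):

```perl
sub search_for {
    my ($target, $file) = @_;
    open my $fh, "<", $file or return;     # lexical handle, not global F
    while (my $line = <$fh>) {             # lexical loop variable, not $_
        return 1 if index($line, $target) >= 0;
    }
    return;                                # $fh closes when it goes out of scope
}
```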






Sample solutions and discussion
Perl Quiz of The Week #13 (20030528)

        The 'MH' mail system stores email messages in a 'folder',
        which is just a plain directory.  Messages are files in this
        directory whose names are numerals.  The directory might
        contain other files or subdirectories; these are not messages.

        The 'scan' command summarizes the contents of a folder.
        Here's a typical output:

         1349  03/22 Yitzchak Scott-Th  Re: Hey, is this list alive?<<On Sat, 22 Mar 20
         1350  03/23 To:perl-qotw-disc  Re: Hey, is this list alive?<<On Thu, 20 Mar 20
         1351  03/24 Kieran             <joke> Proposed advanced problem for this week.
         1352  03/25 John_Wunderlich@C  Re: Hey, is this list alive?<<This is a multipa
         1353  04/16 "Sparrow, Dave"    QOTW 10 [SPOILER]<<This message is in MIME form
         1354  05/24 Shlomi Fish        Post Mortem for Perl QOTW #8 (Graham's Function
         1355  05/24 Shlomi Fish        Post Mortem for Expert QOTW #8 (Graham's Functi
         1356  05/24 To:Shlomi Fish     Re: Post Mortem for Perl QOTW #8 (Graham's Func

        The output contains one line for each message.  It includes
        the message number (filename of the message file), date sent,
        sender (or recipient, if the sender is me), the subject, and
        as much of the body as will fit (as in message 1353, above.)
        Messages are always listed in numerical order.

        Implement a 'scan' command.  It gets one required argument,
        which is the path to the folder directory.  (The real 'scan'
        command is a little different, but that's not important
        today.)

----------------------------------------------------------------

There wasn't much to postmortem this time, since nobody posted any
sample solutions on the qotw-discuss list.  Probably mail handling is
too fussy and too tedious for most people to bother with.  So I wrote
a couple of versions myself.

My first version uses a few nonstandard mail-handling modules; my
second version does everything manually.  I wasn't very happy with the
modules.  They're slow, and I found them poorly documented and hard to
use.  I believe they're over-engineered.  So this report will be about
the tradeoffs between the MailTools modules and doing it manually.

Here's the version that uses the modules:

    #!/usr/bin/perl

    use Mail::Internet;
    use Mail::Field;
    use Mail::Address;
    use File::Spec;
    use POSIX 'strftime';

    my $dir = shift || '.';
    opendir D, $dir or die "Couldn't read directory $dir: $!; aborting";
    my @message_files = sort { $a <=> $b } grep {!/\D/ && -f "$dir/$_" } readdir D;
    closedir D;

    my %me;
    { my @addresses = $ENV{ADDRESS} ? 
        split(/,\s+/, $ENV{ADDRESS}) : guess_addresses();
      for (@addresses) { $me{lc $_} = 1; };
    }

There was some question about how to decide if a message had been sent
by the user running the program.  I said on the -discuss list that
anything reasonably reasonable would suffice.  My 'guess_addresses'
function tries to guess the user's address from various system
information, but allows itself to be overridden by the contents of an
ADDRESS environment variable.   Here's 'guess_addresses':

    sub guess_addresses {
      my $username = $ENV{USER} || (getpwuid($<))[0] || return;
      my $host = $ENV{HOST} || do {
        require Config;
        "Config"->import;
        $Config{myhostname} . $Config{mydomain};
      } || return;
      "$username\@$host";
    }

I suppose 'Config' probably isn't in the Llama Book, but it's not an
essential part of the program, so if you don't like it, you can take
it out and replace it with something else.

Now the main loop of the program starts:

    for $msgno (@message_files) {
      local *F;
      my $file = File::Spec->catfile($dir, $msgno);
      unless (open F, "<", $file) {
        warn "Couldn't read message $msgno; skipping.\n"; 
        next;
      }
      my $mo = Mail::Internet->new(\*F);

Problem #1: Mail::Internet::new requires a glob reference argument,
which violates my Llama-features-only rule.  Problem #2:
Mail::Internet is mighty slow.

      my $h = $mo->head;
      { my $datefield = $h->get('date');
        my $time = $datefield ? Mail::Field->new('date', $datefield)->time 
                              : (stat($file))[9];
        $date = strftime("%m/%d", localtime($time));
      }

Mail::Field is part of the MailTools package.  Here it returns a
Mail::Field::date object, which supports a ->time method that converts
the date into an epoch time.  I used strftime() (which may not be
available everywhere) to convert this back to a date.  This seems like
an awful lot of machinery to use just to convert something like "Sun,
15 Jun 2003 01:55:30 +0200" to "06/15".  The possible upside is that
the date in the output shows the date when the message was actually
sent, relative to the user of the program.  For example, the "Sun, 15
Jun 2003 01:55:30 +0200" message was sent late on the evening of 14
June, Philadelphia time, and I live in Philadelphia.  However, that
doesn't seem to me like much of a benefit.

If there isn't a date in the message, we use the date that the message
file was written. "??/??" would also be a reasonable alternative.

Now the program deals with the sender's address:

      { for my $mf (Mail::Address->parse($h->get('from'))) {
          if ($me{lc $mf->address}) {
            $whom = "To:" . $h->get('to');
            last;
          } else {
            $whom = $mf->phrase || $mf->comment || $mf->address;
          }
        }
      }

It took me much longer to figure this out than it should have.  I
tried using Mail::Field again, and discovered that it's a tremendous
pain in the ass.  Unfortunately I don't remember most of the details.
Mail::Address seems to do the job adequately, however.

      $subjcontent = $h->get('subject') . "<<" . join "", @{$mo->body};
      $subjcontent =~ tr/\n//d;

      write;
    }

    format STDOUT =
    @####  @<<<< @<<<<<<<<<<<<<<<<  @<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<
    $msgno,$date,$whom,             $subjcontent
    .


Does the Llama book cover formats?  I hope not.  Of course the format
here is not necessary; I could have used a simple 'print' or something
similar.  But opportunities to use formats come along so rarely that I
like to take advantage of them when they appear.  


Now here's the non-modules version.  The biggest cost is that it's
about 50% longer than the other version.

    #!/usr/bin/perl

    use File::Spec;
    use POSIX 'strftime';

    my $dir = shift || '.';
    opendir D, $dir or die "Couldn't read directory $dir: $!; aborting";
    my @message_files = sort { $a <=> $b } grep {!/\D/ && -f "$dir/$_" } readdir D;
    closedir D;

    my %me;
    { my @addresses = $ENV{ADDRESS} ? 
        split(/,\s+/, $ENV{ADDRESS}) : guess_addresses();
      for (@addresses) { $me{lc $_} = 1 };
    }

    sub guess_addresses {
      my $username = $ENV{USER} || (getpwuid($<))[0] || return;
      my $host = $ENV{HOST} || do {
        require Config;
        "Config"->import;
        $Config{myhostname} . $Config{mydomain};
      } || return;
      "$username\@$host";
    }

So far everything is the same.  But now because I'm not using a Mail::
module to deal with the RFC822-format date, I have to do this:

    my %m2n = (jan => 1, feb =>  2, mar =>  3, apr =>  4,
               may => 5, jun =>  6, jul =>  7, aug =>  8,
               sep => 9, oct => 10, nov => 11, dec => 12, );


    for $msgno (@message_files) {
      local *F;
      my $file = File::Spec->catfile($dir, $msgno);
      unless (open F, "<", $file) {
        warn "Couldn't read message $msgno; skipping.\n"; 
        next;
      }
      my %mo = read_message(\*F);

Instead of using Mail::Internet to read in the email message, I wrote
a replacement function.  Here it is:


    sub read_message {
      my $fh = shift;
      my ($header);
      my %m;
      {
        local $/ = "";
        $header = <$fh>;
        undef $/;
        $m{BODY} = <$fh>;
      }
      my @fields = split /\n(?!\s)/, $header;
      for (@fields) {
        my ($t, $v) = split /:\s+/, $_, 2;
        $m{lc $t} = $v;
      }
      %m;
    }

The split /\n(?!\s)/ is a little tricky, and I suppose (?!\s) is
non-Llama.  \n(?!\s) matches only those newlines which are *not*
followed by whitespace.  When we split the message header on those
newlines, we get an array of fields; each field may contain one or
more physical lines of the header.
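For example, given a header in which the Subject field is folded
across two physical lines (the header itself is made up):

```perl
my $header = "From: alice\@example.com\n"
           . "Subject: a subject\n"
           . "  that was folded\n"
           . "Date: Sun, 15 Jun 2003 01:55:30 +0200\n";

# Split only on newlines that are NOT followed by whitespace; the
# folded continuation line stays attached to its Subject field.
my @fields = split /\n(?!\s)/, $header;
# @fields now holds three fields: From, the two-line Subject, and Date.
```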

The main program then continues; here's my quick and dirty code to
deal with RFC822-format dates:

      { my $datefield = $mo{date};
        if ($datefield && $datefield =~ /(\d+) (\w+)/) {
          $date = sprintf "%02d/%02d", $m2n{lc $2}, $1;
        } else {
          $date = strftime("%m/%d", localtime((stat($file))[9]));
        }
      }

In retrospect, it probably would have been better to avoid strftime()
here, since I could have done something like:

#      { my $datefield = $mo{date};
#        my ($m, $d);
#        if ($datefield && $datefield =~ /(\d+) (\w+)/) {
#          ($m, $d) = ($m2n{lc $2}, $1);
#        } else {
#          ($m, $d) = (localtime((stat($file))[9]))[4,3];
#          $m++;
#        }
#        $date = sprintf "%02d/%02d", $m, $d;
#      }

instead.  The double-list-slice on the 'localtime' line should make
Randal happy.

Next is the section which extracts the addresses from the 'from' and
'to' fields.  It depends on a homemade 'parse_addr' function, which is
the dodgiest part of the program.

      { my ($phrase, $addr, $comment) = parse_addr($mo{from});
        if ($me{lc $addr}) {
          ($phrase, $addr, $comment) = parse_addr($mo{to});
          $whom = "To:" . ($comment || $phrase || $addr);
        } else {
          $whom = $comment || $phrase || $addr;
        }
      }

Here's the rather questionable 'parse_addr':

    sub parse_addr {
      my $x = shift;
      my ($phrase, $addr, $comment) =
        $x =~ /([^<(]*) 
               (?: < ( [^<\s]* ) > )  ? \s* 
               (?: \( ([^\)]*) \))    ?  /x;
       $addr = $phrase unless defined $addr;
    #  warn "$x -> '$phrase', '$addr', '$comment'\n";
       for ($phrase, $addr, $comment) { 
         s/^\s+//;  s/\s+$//;
       }
       return ($phrase, $addr, $comment);
    }

It's questionable because it doesn't always work.  For example, it'll
misparse 

        "Joseph (Joe) Smith" jsmith@example.com

RFC822 address syntax is horrendously complicated and grossly overengineered.
But it works well enough for almost all examples that one encounters
in practice.  (Which is why RFC822 is overengineered.)  

Essentially, the idea is that an address will have this format:

        PHRASE <ADDRESS> (COMMENT)

Where the address and comment parts might be missing.  Addresses might
also have this format:

        ADDRESS (COMMENT)

in which case the address will be mistaken for a phrase; in that case
we use the phrase as the address.  We use a rather ordinary regex to
extract the three parts.

Finally, the rest of the program is simple:

      $subjcontent = $mo{subject} . "<<" . join "", $mo{BODY};
      $subjcontent =~ tr/\n//d;

      write;
    }

    format STDOUT =
    @####  @<<<< @<<<<<<<<<<<<<<<<  @<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<
    $msgno,$date,$whom,             $subjcontent
    .



The costs for the homegrown version are that it's 50% longer, and that
it doesn't work on peculiar addresses.  We also lost the benefit of
displaying the local date on which the message was sent; "Sun, 15 Jun
2003 01:55:30 +0200" is displayed as 06/15 even though the message was
sent on 06/14 Philadelphia time.

The benefits are that the program doesn't depend on a bunch of
nonstandard modules, and that the program runs *twelve* times faster.
Of course, anyone can write a program that runs really fast and
produces the wrong output.  But this program produces the right output
almost all the time, and it's hard to believe that fixing it would
slow it down by a factor of 12.  What went wrong?  I haven't looked
closely, but I suspect that Mail::Internet is way overwritten.  

I've placed complete source code at

        http://perl.plover.com/qotw/misc/r013/

Thanks to everyone who participated quietly and said nothing. 
I can confidently predict that I will post new quizzes on June 11.



Sample solutions and discussion
Perl Quiz of The Week #14 (20030611)

        Write a program which generates an HTML table of contents for
        a directory of files.  It should read the directory, producing
        a list of the files, and write out a file "toc.html" in that
        same directory, in the following format:

                <title>Table of Contents for [Directory Name]</title>

                <h1>Table of Contents for [Directory Name]</h1>
                <a href="file1">file1</a><br>
                <a href="file2">file2</a><br>
                ...
                <a href="lastfile">lastfile</a><br>

        The files should be listed in alphabetic order.  The program
        should accept a command-line argument that tells it what
        directory to index; if the argument is omitted, it should
        index the current directory.

----------------------------------------------------------------

I only saw one solution posted on the perl-qotw-discuss list, from
Pr. Offer Kaye; I also wrote one myself.  Offer's uses a number of
standard modules to take care of escaping HTML and URLs:

    use strict;
    use warnings;
    use Cwd;
    use URI::file;
    use HTML::Entities;

    my $toc_file = "toc.html";
    if ($#ARGV > 0) {die "Incorrect usage. Use \"$0 dir_name\" or just:
    \"$0\"\n"}
    my $dir_name = ($#ARGV == 0) ? $ARGV[0] : cwd();
    opendir(DIR, $dir_name) or die "Can't opendir $dir_name: $!\n";
    my @list = sort {lc($a) cmp lc($b)} readdir(DIR);
    chdir $dir_name or die "Couldn't chdir to $dir_name for some reason...\n";
    open(OUT,">$toc_file") or die "Couldn't open $toc_file for writing: $!\n";
    print OUT "<html><head><title>Table of Contents for ";
    print OUT encode_entities($dir_name),"</title></head><body>\n";
    print OUT "<h1>Table of Contents for ";
    print OUT encode_entities($dir_name),"</h1>\n";
    for (@list){
       next if (($_ eq ".") or ($_ eq ".."));
       my $u = encode_entities(URI::file->new($_));
       print OUT "<div><a href=\"$u\">$u</a></div>\n";
    }
    print OUT "</body></html>";

Most of this seems completely straightforward and I don't have much to
say about it.

I wonder about using 'cwd()' instead of just '.'. I suppose '.' might
not be portable, but if you want it to be portable, it would seem
simpler to use File::Spec->curdir(), which then just returns '.'.

The output from Pr. Kaye's program is a little peculiar if the
filenames are peculiar.  For example, I created a file named
'<blink>ooky' to make sure that things didn't start blinking.  The
encode_entities call turns this into '%3Cblink%3Eooky'.  This is fine
when it appears as a URL, but as HTML text, it's wrong; you actually
see '%3Cblink%3Eooky' on the page, percent signs and all.  It should
have been turned into '&lt;blink&gt;ooky', which would have displayed
as '<blink>ooky'.

Since I expected everyone would use the modules, I thought I'd see if
it was possible to do it without the modules, while still using only
features from the Llama book.  This turns out to be possible, but just
barely.  (The modules are *not* discussed in the Llama book.)    The
main part of the program, of course, is almost the same:

        for my $f (sort readdir D) {
          my $url = escape_url($f);
          my $html = escape_html($f);
          print "<a href=\"$url\">$html</a><br>\n";
        }

The big question is how to implement 'escape_url' and 'escape_html'
with only Llama features.  'escape_html' is quite easy:

        sub escape_html {
          my $t = shift;
          $t =~ s/&/&amp;/g;
          $t =~ s/</&lt;/g;
          $t =~ s/>/&gt;/g;
          return $t;
        }
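
Run on the troublesome filename from earlier, it produces the entity-escaped
form. Note that the '&' substitution must come first; otherwise the '&'
introduced by '&lt;' would itself get re-escaped:

```perl
sub escape_html {
  my $t = shift;
  $t =~ s/&/&amp;/g;   # must run first, or it would mangle the next two
  $t =~ s/</&lt;/g;
  $t =~ s/>/&gt;/g;
  return $t;
}
print escape_html('<blink>ooky'), "\n";   # &lt;blink&gt;ooky
```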

For this project, this is just fine.  'escape_url', however, is much
trickier.  Normally, I would write something like this:

        sub escape_url {
          my $url = shift;
          $url =~ s/([^0-9A-Za-z._-])/sprintf "%%%02x", ord $1/ge;
          return $url;
        }
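
As a quick sanity check (self-contained; the '-' sits at the end of the
character class so it isn't taken as a range, and "%02x" keeps the leading
zero for ordinals below 16):

```perl
sub escape_url {
  my $url = shift;
  # Leave alphanumerics and ._- alone; %-escape everything else.
  $url =~ s/([^0-9A-Za-z._-])/sprintf "%%%02x", ord $1/ge;
  return $url;
}
print escape_url('a b<c'), "\n";   # a%20b%3cc
```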

This uses *three* non-Llama features:

  1. The ord() function

Page 166 says "While constructing and interpreting such a byte string
is fairly straightforward using 'chr' and 'ord' (not presented
here)...".  Since they're not presented, I can't use them.

  2. sprintf "%x"

'printf' and 'sprintf' are discussed, but never the "%x" escape.

  3. s///e

s/// is of course discussed in detail, but never /e.  The book
mentions it in passing at the very very end; there's an example on
page 253.  I almost went ahead and used it.  Then I noticed that
typeglobs are also mentioned on page 253, so that allowing page 253
features would make a mockery of the 'Llama only' restriction.  So I
decided that s///e was forbidden.  (Reminder: The 'Llama only' rule
applies only to me; everyone else can of course use whatever they want
to.)

The big difficulty: how to convert "x" to hexadecimal without using
ord() and sprintf()? 

I briefly considered using something like

        s/%/%25/g;
        s/\./%2e/g;
        s/</%3c/g;
        ...
        [150 more of these]
        ...
        
and decided that it was much too gross to hold up as an example.  I
was going to come up with a plausible, instructive solution, or none
at all.  You will have to be the judge of whether I succeeded.

It turns out that I could build ord(), because it is the same as
unpack("C"), which the book discusses in some detail.

        sub Ord {
          my $chr = shift;
          unpack "C", $chr;
        }
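
For single-byte characters this agrees with the built-in it replaces:

```perl
# Llama-only stand-in for ord(): unpack "C" reads one unsigned byte.
sub Ord {
  my $chr = shift;
  unpack "C", $chr;
}
print Ord('A'), " ", Ord('<'), "\n";   # 65 60
```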

And once I have a number, I can use ordinary arithmetic to convert it
to hex:

              my @hex = (0 .. 9, 'a' .. 'f');
              my $h0 = $ord % 16;
              my $h1 = ($ord - $h0) / 16;
              $result .= "%$hex[$h1]$hex[$h0]";
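
Tracing this for '<' (ordinal 60): the low nibble is 60 % 16 = 12,
written 'c', and the high nibble is (60 - 12)/16 = 3, so the result is
'%3c', exactly what sprintf "%02x" would have produced:

```perl
my @hex = (0 .. 9, 'a' .. 'f');   # digit table for base-16 output
my $ord = 60;                     # unpack "C", '<'
my $h0  = $ord % 16;              # low nibble: 12 -> 'c'
my $h1  = ($ord - $h0) / 16;      # high nibble: 3
print "%$hex[$h1]$hex[$h0]\n";    # %3c
```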

So here's 'escape_url':

        sub escape_url {
          my @chars = split //, $_[0];
          my $result = "";
          for (@chars) {
            my $ord = Ord($_);
            if ($good_char{$_}) {
              $result .= $_;
            } else {
              my $h0 = $ord % 16;
              my $h1 = ($ord - $h0) / 16;
              $result .= "%$hex[$h1]$hex[$h0]";
            }
          }
          return $result;
        }

%good_char just describes the characters that don't need to be
escaped.  The program works fine if you leave it empty, and you can do
that if you like.  I defined it this way, which produces nicer output:

        my %good_char;
        {
          my @good_char = ('-', '.', '_', 0 .. 9, 'A' .. 'Z', 'a' .. 'z');
          for (@good_char) { $good_char{$_} = 1 }
        }

Here's my program in full:

        #!/usr/bin/perl -w

        use strict 'vars';

        sub Ord {
          my $chr = shift;
          unpack "C", $chr;
        }

        my %good_char;
        {
          my @good_char = ('-', '.', '_', 0 .. 9, 'A' .. 'Z', 'a' .. 'z');
          for (@good_char) { $good_char{$_} = 1 }
        }
        my @hex = (0 .. 9, 'a' .. 'f');

        my $dir = shift || '.';
        opendir D, $dir
          or die "Couldn't open directory $dir: $!; aborting";

        my $title = "Table of contents for $dir";

        print "<html><head><title>$title</title></head>

        <body><h1>$title</h1>
        ";

        for my $f (sort readdir D) {
          my $url = escape_url($f);
          my $html = escape_html($f);
          print "<a href=\"$url\">$html</a><br>\n";
        }

        print "</body></html>\n\n";

        ################################################################


        sub escape_url {
          my @chars = split //, $_[0];
          my $result = "";
          for (@chars) {
            my $ord = Ord($_);
            if ($good_char{$_}) {
              $result .= $_;
            } else {
              my $h0 = $ord % 16;
              my $h1 = ($ord - $h0) / 16;
              $result .= "%$hex[$h1]$hex[$h0]";
            }
          }
          return $result;
        }

        sub escape_html {
          my $t = shift;
          $t =~ s/&/&amp;/g;
          $t =~ s/</&lt;/g;
          $t =~ s/>/&gt;/g;
          return $t;
        }


----------------------------------------------------------------

1. Alert readers will notice that my program forgot to escape the
   directory name in the title of the document.  I didn't realize this
   until I saw Pr. Kaye's solution.   Whoops!

2. I violated my own spec, which said "[the program] should write out
   a file "toc.html" in that same directory...".  When time came to
   write the program, I decided the spec was dumb, and opted to have
   it write the output to STDOUT instead.  If you think the specified
   behavior is better, just add

        open STDOUT, ">", "$dir/toc.html" or die ...;

   near the top of the program.   I still think the specified behavior
   is dumb.  The program is more flexible without it.    What was I
   thinking, anyway?

3. Pr. Kaye says:

        Note that my solution will not work for ALL cases- just the
        more common.  Plus, the HTML file created is very simplistic.
        A better ('better' as in more complete/robust) solution would
        perhaps be to determine the proper encoding (either based on
        the filenames or through a command-line switch) and create a
        (valid) XHTML file, properly formatted for that encoding. Or
        perhaps use UTF-8 regardless.  

   This is a huge issue that I never considered at all.  It reminded
   me of the time I asked for some simple calendrical computation and
   the -discuss list was awash with people asking about the French
   revolutionary calendar.  But Pr. Kaye raises a good point.
   Pr. Kaye lives in Israel, and may very well encounter files whose
   names contain Hebrew characters.  I have no idea of the correct way
   to deal with this.  I would not be surprised to learn that my
   program fails miserably when presented with filenames containing
   Hebrew characters.  (I would also not be surprised to learn that it
   works perfectly as long as the output file contains a line
   specifying that the character encoding is UTF-16.  I really can't
   exaggerate my ignorance here.)

   Pr. Kaye continues:   

        Another point is that I haven't got a way to test this on
        different file systems, so I'm not sure it will work
        everywhere- although I've done my best to try to make the
        solution robust in that sense. But I'm pretty sure that the
        script will not currently work across file systems.

4. Robert Spier asks:

        At this point, I'm stumped.  What does Pr. stand for?

   I liked referring to people as "Mr." in previous reports.  But then
   I got worried, because this assumes that all these people are men.
   I do not want to assume that, and really, I don't know.  So I
   decided to invent a new formal title for programmers that would
   evade the issue.  'Pr.' stands for 'Programmer'.  

   Lawyers get to impress people by putting 'Esq.' after their names.
   Now programmers can impress people by using a special title too.

Thanks again to everyone who participated, including Pr. Kaye and also
those people who worked the problem in private and said nothing.  I
will post another quiz tonight.



[[ When I sent out this week's quiz, I forgot to mention that it had
   been contributed by Geoffrey Rommel, who also contributed the
   discussion below.   Thank you, Pr. Rommel! - MJD ]]



        This quiz is phrased for Unix systems. If it makes sense to
        write a solution for Windows or other systems, feel free to do
        so.

        The usual way to look for a character string in files in Unix
        is to use grep. For instance, let's say you want to search for
        the word 'summary' without regard to case in all files in a
        certain directory. You might say:

        grep -i summary *

        But if there is a very large number of files in your
        directory, you will get something like this:

        ksh: /usr/bin/grep: arg list too long

        Now, you could just issue multiple commands, like this:

        grep -i summary [A-B]*
        grep -i summary [C-E]*
        etc.

        ... but that's so tedious. Write a Perl program that allows
        you to search all files in such a directory with one command.

        You're probably wondering:
        - Should I use grep? egrep? fgrep? Perl's regex matching?
        - Should there be an option to make the search case-sensitive or not?
        - Should we traverse all files under all subdirectories?

        You can decide for yourself on these questions. There is one
        other requirement, though: the program must not fail when it
        finds things for which grepping does not make sense
        (e.g. directories or named pipes).

----------------------------------------------------------------


This quiz was suggested to me by a directory on one of my servers where all
of our executable scripts are stored. This directory now has over 4200
scripts and has gotten too big to search.

The solution shown here works for my purposes, but I do not wish to
depreciate the ingenious solutions found on the discussion list. I will try
to evaluate and discuss them in a separate message.

As MJD mentioned, Perl regex matching is clearly superior to the
alternatives. Since the original purpose was to search a directory of
scripts, the search is not case-sensitive; that option could be added
easily enough. We search only files (-f) in the specified directory, not in
lower directories. I also test for "text" files (-T) because my Telnet
client gets hopelessly confused if you start displaying non-ASCII
characters.


#!/usr/bin/perl
# The bin directory is too large to search all at once, so this does
# it in pieces.
($PAT, $DIR) = @ARGV[0,1];
$DIR ||= "";
die "Syntax:  q16 pattern directory\n" unless $PAT;

open(LS, "ls -1 $DIR |") or die "Could not ls: $!";

@list = ();
while (<LS>) {
   chomp;
   push @list , (($DIR eq "") ? $_ : "$DIR/$_");
   if (@list >= 800) {
      greptext($PAT, @list);
      @list = ();
   }
}
greptext($PAT, @list);

close LS;

exit;

sub greptext {
 my ($pattern, @files) = @_;

 foreach $fname (@files) {
    next unless -f $fname && -T _;
    open FI, $fname or next;
    while (<FI>) {
       chomp;
       print "$fname [$.]: $_\n" if m/$pattern/oi;
    }
    close FI;
 }
}
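
One thing worth noting: building the file list by piping from 'ls' will
misbehave if $DIR contains spaces or shell metacharacters.  A hypothetical
variant (the 'chunked_files' name and the callback interface are my
invention, not Pr. Rommel's) gets the same 800-at-a-time batching from
opendir/readdir, with no shell involved:

```perl
use strict;
use warnings;

# Walk one directory, handing the callback batches of at most $chunk
# filenames, the same batching trick as the ls-based version above.
sub chunked_files {
    my ($dir, $chunk, $do) = @_;
    opendir my $dh, $dir or die "Could not opendir $dir: $!";
    my @batch;
    for my $f (sort readdir $dh) {
        next if $f eq '.' or $f eq '..';
        push @batch, "$dir/$f";
        if (@batch >= $chunk) { $do->(@batch); @batch = () }
    }
    $do->(@batch) if @batch;   # flush the final partial batch
    closedir $dh;
}
```

The main program would then call it as
chunked_files($DIR || '.', 800, sub { greptext($PAT, @_) });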





----------------------------------------------------------------

[[ Administrative note:  So far very few people have contributed
   quizzes.  Right now we have one expert and one regular quiz ready
   to go.  We need more, because unless more are contributed, we will
   run out in two weeks.

   This mailing list has 1257 people subscribed to it.  If each person
   contributed just one quiz, we would be all set for the next 24
   years.

   Please send quizzes, or even just quiz ideas, to perl-qotw-submit.

   Thanks,   - MJD ]]

