[Date Prev][Date Next] [Thread Prev][Thread Next] [Date Index][Thread Index][Top&Search][Original]

Regex backreference bug in recursion/thread (status & questions)




Aside:
  This problem has been known and discussed for quite a while, at least
  since Nov 98 on p5p. However, this is all new to me. As I work towards a
  solution, I would appreciate any corrections, suggestions, advice, etc.


Bug Manifestation:
  The value of the $digit variables, a backreference in a successfully
  matched regex, can be overwritten during subsequent recursion calls, or
  by concurrent threads.

  Simple test code for recursion:

      $a = "abc";
      regex_recur ($a);

      sub regex_recur{
        my $b = shift;
        if ($b =~ /(.)(.+)/) {
          regex_recur ($2);
          print "$1\t$2\n";
        }
      }

  The result is (due to bug):
       b       c
       b       c

  The result should be:
       b       c
       a       bc


Possible work-around:

  Recursion:
    Copy values in $digit variables to my() variables before subsequent
    recursion call:

      sub regex_recur{
        my $b = shift;
        my ($first, $second);
        if ($b =~ /(.)(.+)/) {
          $first = $1; $second = $2;
          regex_recur ($second);
          print "$first\t$second\n";
        }
      }

Thread:
  Force blocking on regex (possibly by locking a dummy variable, or by
  using semaphores) and copy $digit values to my() vars before exiting
  block. (Is there such a call as unlock()? Should there be? )

      sub regex_thread{
        my $b = shift;
        my ($first, $second, $dummy);
        $second = "";
        {
          lock ($dummy);
          $b =~ /(.)(.+)/);
          $first = $1; $second = $2;
        }
        yield();
        print "$first\t$second\n";
      }


Description of regex match in perl core:
    [ This is where error corrections would be _greatly_ appreciated! ]
  Every regex is compiled into a single MATCH OPnode in the perl OP tree.
  As the perl NFA engine traverses back and forth in the target string,
  all state information is stored in struct regexp, which is defined in
  regexp.h. The struct data members that we are concerned with are:

    I32  *startp;
    I32  *endp;
    char *subbeg;
    I32   sublen;

  As the match process proceeds (see function Perl_pregexec() or
  Perl_regexec_flags() in file regex.c), the value of the $digit variables
  can be calculated/evaluated based on these data. Once the match has
  successfully completed, the calculated value of the $digit variables are
  valid and useful to the next OPcode. Depending on the user's Perl
  script, the $digit variable might then be pushed onto the stack, etc.


Statement of Problem:
  The memory locations for the regex struct data members that define the
  $digit variable are essentially statically allocated inside the MATCH
  OPcode during the compilation phase of perl in the function
  regcomp.c:Perl_pregcomp() (not C "static" data, but you know what I mean
  - not dynamically allocated). Then, every time a recursive call, or
  thread invocation, execution sequence hits that MATCH OPcode, it is
  using the _exact_ same memory locations to store startp[i], endp[i],
  etc. as any previous invocation. Therefore, the value of the $digit
  variables gets overwritten and lost unless their values have been copied
  out first.


Possible Solution:
  Every recursive invocation of a subroutine, and every thread invocation,
  have their own scratchpad. Therefore, use of a scratchpad array could
  provide dynamically allocated, independent SV instances to hold the
  stuff we care about. This would change the struct regex data members
  from (I32 *), etc to pointers into the _current_ scratchpad. This would
  require allocation/declaration (?) of the pad entries in regcomp.c:
  Perl_pregcomp() instead of the current use of Newz(), etc. Access to
  these data from the regex engine would no longer be though a simple
  indirection into the regex struct, but would have to point into the
  correct location of the _current_ scratchpad.


Questions I have:
  * Is it ok to have invocations/calls to an OPcode from _within_ another
    Opcode node? I think this would be required inside regcomp.c:
    Perl_pregcomp() to allocate scratchpad entries. But it might wreak
    havoc with debugging tools like perl -Dx.

  * Would it make sense to do the actual memory allocation for the pad
    values for startp[], etc at the very beginning of the regex execution
    (I'm not _exactly_ sure where that is _all_ the time!). That way new
    memory allocation would happen at every recursion or thread entry
    point.

  * Would it make sense to use standard scratchpad api calls (padsv,
    padav) to assign/retrieve values of startp[i], etc during the regex?
    I am concerned that this might slow down the regex engine. However,
    if memory locations are touched by explicit indirections into the
    scratchpad, that would require making assumptions about the scratchpad
    infrastructure, which might break if anything in _that_ code gets
    modified.


--
Harry Wolfson
HarryWolfson@LL.MIT.EDU


Follow-Ups from:
Harry Wolfson <HarryWolfson@LL.MIT.EDU>
Gurusamy Sarathy <gsar@ActiveState.com>

[Date Prev][Date Next] [Thread Prev][Thread Next] [Date Index][Thread Index][Top&Search][Original]