findRegexp question

I'm trying to understand regexp. When I do
"foo".findRegexp("(foo|bar)");
why do I get duplicate matches?
-> [ [ 0, foo ], [ 0, foo ] ]

Is this how findRegexp is supposed to work?

Yes, it is how it’s supposed to work.

SC can’t really take responsibility for this. Internally, we call regex_search() from boost/regex.hpp. The search results are 100% determined by this library – we only convert the raw results into an object structure that SC understands.

So boost/regex maintainers are the people you should ask about details.

Parentheses are considered “capturing subgroups” in regexp, and (though I’m no regexp expert) I’m quite sure there are many use cases where a parenthesized subexpression is only one part of a larger, more complex pattern, and you want both the string matched by the whole pattern and the string matched by each subgroup. So standard regexp behavior is to give you one match for the entire expression, plus one match for each parenthesized group inside it.

It happens that the matching thing inside parentheses in your example is very simple, so it appears to be a redundant result. But there are other regexp use cases where you do need the total result and the components, and the regexp library should not stop supporting those use cases because of a counterintuitive result in a simple case.
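
To make the "whole match plus one entry per group" rule concrete, here is a minimal sketch in Python rather than sclang. It uses Python's standard `re` module, not the Boost library SC calls, and `find_regexp_like` is a hypothetical helper that only imitates the shape of findRegexp's result for the first match; it is not SC's implementation.

```python
import re

def find_regexp_like(text, pattern):
    """Return [[offset, matched_text], ...] for the first match only.

    Entry 0 is the whole match; one further entry follows for each
    parenthesized group. (Hypothetical helper; a group that takes no
    part in the match would show up as [-1, None].)
    """
    m = re.search(pattern, text)
    if m is None:
        return []
    # group(0) is the whole match, groups 1..N are the capturing subgroups
    return [[m.start(i), m.group(i)] for i in range(m.re.groups + 1)]

# With parentheses: the whole match and the single group happen to coincide
print(find_regexp_like("foo", "(foo|bar)"))  # [[0, 'foo'], [0, 'foo']]

# Without parentheses: only the whole match is reported
print(find_regexp_like("foo", "foo|bar"))    # [[0, 'foo']]
```

So the second `foo` is not a second hit in the string. It is the content of group 1, which in this pattern spans the same text as the whole match.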

hjh

OK, I see. Then it seems reasonable. Thanks for your explanation.

Concrete example:

// no (), matches the whole expression, no parsing
"abc = 123;".findRegexp("[a-z]+ *= *[0-9]+;")
-> [ [ 0, abc = 123; ] ]

// () for identifier and integer -- 2 submatches, plus the whole match (with ;)
"abc = 123;".findRegexp("([a-z]+) *= *([0-9]+);")
-> [ [ 0, abc = 123; ], [ 0, abc ], [ 6, 123 ] ]

// another layer of () allows the regex to
// a/ confirm that the ; is there, and
// b/ strip the ; in one of the matches
"abc = 123;".findRegexp("(([a-z]+) *= *([0-9]+));")
-> [ [ 0, abc = 123; ], [ 0, abc = 123 ], [ 0, abc ], [ 6, 123 ] ]
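
As a rough illustration of why the submatches are useful for parsing, the same nested pattern can be used to pull the pieces apart. This is again a Python sketch with the standard `re` module, not sclang; `ASSIGNMENT` and `parse_assignment` are names made up for this example.

```python
import re

# Outer group: the statement without the trailing ';'
# Inner groups: the identifier and the integer
ASSIGNMENT = re.compile(r"(([a-z]+) *= *([0-9]+));")

def parse_assignment(text):
    """Return (statement, name, value), or None if the text does not match."""
    m = ASSIGNMENT.search(text)
    if m is None:
        return None  # e.g. the ';' is missing
    statement, name, value = m.group(1), m.group(2), m.group(3)
    return statement, name, int(value)

print(parse_assignment("abc = 123;"))  # ('abc = 123', 'abc', 123)
```

The groups are numbered by the position of their opening parenthesis, which is why the outer group comes before the identifier and the integer, matching the order of the findRegexp results above.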

hjh
