Off-topic: Struggles with LPEG grammar
Hi, I'm sorry for being slightly off-topic here, but this list might still be the best place to resolve lpeg-related questions :) 0.) Disclaimer: the challenge that triggered this curiosity came from Advent of Code 2020. In case you are taking part and you wan't to avoid spoilers, please stop reading here! (You have been warned.) https://adventofcode.com/2020/day/19 1.) My question: I don't understand why I cannot get ^1 to work "as advertised". Isn't this supposed to mean "one or more occurences of the pattern"? If I change "lpeg.P('b')" into "lpeg.P('b')^1" in the example below, the strings that match the initial grammar no longer match the modified grammar. (I would naively imagine that the secord pattern would get more rather than less matches.) 2.) Background: Most definitely the task on that page is supposed to be solved in a different way, but many people use Advent of Code as an opportunity to learn a new programming language, and when I read the task description, I wanted to figure out if I could solve it using the cute little lpeg. My initial attempt worked correctly (at least to solve the first puzzle), but then I realized that I cannot easily change the pattern from "matches a letter b" into "matches any number of b-s", and I fail to figure out why. Any hints would be greatly appreciated. Below is a not-so-minimal example. I can certainly try to reduce it further, but I would first like to ask whether I'm doing something obviously wrong by trying to replace r5 = lpeg.P('b') by r5 = lpeg.P('b')^1 in order to allow more than one occurrences of the letter b? My only explanation would be that perhaps "^1" is so greedy that the rest of the pattern doesn't get found. But I don't want to believe that explanation. local lpeg = require "lpeg" --[[ 0: 4 1 5 1: 2 3 | 3 2 2: 4 4 | 5 5 3: 4 5 | 5 4 4: "a" 5: "b" ]]-- local parser = lpeg.P{ "r0"; r0 = lpeg.V"r4" * lpeg.V"r1" * lpeg.V"r5", r1 = lpeg.V"r2" * lpeg.V"r3" + lpeg.V"r3" * lpeg.V"r2", r2 = lpeg.V"r4" * lpeg.V"r4" + lpeg.V"r5" * lpeg.V"r5", r3 = lpeg.V"r4" * lpeg.V"r5" + lpeg.V"r5" * lpeg.V"r4", r4 = lpeg.P('a'), r5 = lpeg.P('b'), } * -1 local parser1 = lpeg.P{ "r0"; r0 = lpeg.V"r4" * lpeg.V"r1" * lpeg.V"r5", r1 = lpeg.V"r2" * lpeg.V"r3" + lpeg.V"r3" * lpeg.V"r2", r2 = lpeg.V"r4" * lpeg.V"r4" + lpeg.V"r5" * lpeg.V"r5", r3 = lpeg.V"r4" * lpeg.V"r5" + lpeg.V"r5" * lpeg.V"r4", r4 = lpeg.P('a'), r5 = lpeg.P('b')^1, -- modified part that doesn't seem to work } * -1 strings = { "ababbb", "bababa", "abbbab", "aaabbb", "aaaabbb", }; local total = 0 local total1 = 0 for _, s in ipairs(strings) do if lpeg.match(parser, s) then total = total + 1 end if lpeg.match(parser1, s) then total = total + 1 end end print('total:', total, total1) In this example, total=2, total1=0. What I don't understand is why total1 is zero. Thank you, Mojca
On 21 Dec 2020, at 13:16, Mojca Miklavec
wrote: My only explanation would be that perhaps "^1" is so greedy that the rest of the pattern doesn't get found. But I don't want to believe that explanation.
Which (of course) means that that is exactly what happens ;) The ones that match are ababbb (a (ba+bb) b) => r4 r1(r3(r5 r4) r2(r5 r5)) r5 abbbab (a (bb+ba) b) => r4 r1(r2(r5 r5) r3(r5 r4)) r5 With the ^1, in the “bb” cases the first “b” eats all three “b”s: ababbb fails the r5 at the end abbbab fails the first r2 already (since the second r5 therein never happens) Best wishes, Taco
On 21 Dec 2020, at 13:46, Taco Hoekwater
wrote: On 21 Dec 2020, at 13:16, Mojca Miklavec
wrote: My only explanation would be that perhaps "^1" is so greedy that the rest of the pattern doesn't get found. But I don't want to believe that explanation.
Which (of course) means that that is exactly what happens ;)
The ones that match are
ababbb (a (ba+bb) b) => r4 r1(r3(r5 r4) r2(r5 r5)) r5 abbbab (a (bb+ba) b) => r4 r1(r2(r5 r5) r3(r5 r4)) r5
With the ^1, in the “bb” cases the first “b” eats all three “b”s:
ababbb fails the r5 at the end
Sorry, that was wrong, it fails at the second r5 in the r2 as well, for the same reason as below.
abbbab fails the first r2 already (since the second r5 therein never happens)
Best wishes, Taco
Dear Taco, On Mon, 21 Dec 2020 at 13:46, Taco Hoekwater wrote:
On 21 Dec 2020, at 13:16, Mojca Miklavec wrote:
My only explanation would be that perhaps "^1" is so greedy that the rest of the pattern doesn't get found. But I don't want to believe that explanation.
Which (of course) means that that is exactly what happens ;)
The ones that match are
ababbb (a (ba+bb) b) => r4 r1(r3(r5 r4) r2(r5 r5)) r5 abbbab (a (bb+ba) b) => r4 r1(r2(r5 r5) r3(r5 r4)) r5
With the ^1, in the “bb” cases the first “b” eats all three “b”s:
ababbb fails the r5 at the end
abbbab fails the first r2 already (since the second r5 therein never happens)
Is this a deliberate choice, a limitation of the grammar expressiveness, some misuse on my side that could/should/needs to be implemented in a different way, or does it count as a "bug" on the lpeg side? For example, I wouldn't expect a regexp "b+b" to fail on "bbb" just because "b+" would eat all three "b"s at once (the regexp "b+b" in fact finds "bbb", and I would expect a less-than-totally-greedy hit with lpeg as well). Or is my reasoning wrong here? It certainly works if I use lpeg.P('b') + lpeg.P('bb') + lpeg.P('bbb') -- and a couple more (as long as I can predict the maximum length) but that's not really a viable workaround in general. Thank you, Mojca PS: sorry, a tiny bug also crippled into my sample code. The line after matching the 'parser1' should have used 'total1' rather than 'total': if lpeg.match(parser1, s) then total1 = total1 + 1 end
On 21 Dec 2020, at 14:08, Mojca Miklavec
wrote: Dear Taco,
On Mon, 21 Dec 2020 at 13:46, Taco Hoekwater wrote:
On 21 Dec 2020, at 13:16, Mojca Miklavec wrote:
My only explanation would be that perhaps "^1" is so greedy that the rest of the pattern doesn't get found. But I don't want to believe that explanation.
Which (of course) means that that is exactly what happens ;)
The ones that match are
ababbb (a (ba+bb) b) => r4 r1(r3(r5 r4) r2(r5 r5)) r5 abbbab (a (bb+ba) b) => r4 r1(r2(r5 r5) r3(r5 r4)) r5
With the ^1, in the “bb” cases the first “b” eats all three “b”s:
ababbb fails the r5 at the end
abbbab fails the first r2 already (since the second r5 therein never happens)
Is this a deliberate choice, a limitation of the grammar expressiveness, some misuse on my side that could/should/needs to be implemented in a different way, or does it count as a "bug" on the lpeg side?
For example, I wouldn't expect a regexp "b+b" to fail on "bbb" just because "b+" would eat all three "b"s at once (the regexp "b+b" in fact finds "bbb", and I would expect a less-than-totally-greedy hit with lpeg as well). Or is my reasoning wrong here?
PEGs are greedy by design, which is a consequence of the fact that PEGS do not backtrack, which goes back to the underlying assumptive rule of PEGs that there is one (and only one!) ‘correct’ way to parse the input. Allowing backtracking destroys that assumption and by doing so would complicate the system to a level that would make it comparable to PCRE (with all the associated penalties on processing speed and a much greater codebase). Another way of thinking of it (perhaps that helps): PEGs are _declarative_ on the input, whereas REs are _interpretive_. Yet another way of thinking about it: PEGs rigorously define the (programming) language of the input. It takes a bit of rewiring your brain to get comfortable with PEGs when you are used to REs, and unpredictable input is much harder to tackle in PEGs. OTOH, using PEGs are much better if there is an explicit grammar. In your example, the fact that you even considered ^1 means that you were still thinking too much in terms of regular expressions, but I know it is very hard to learn how to do something that is _only a little_ different from something you know well already. Best wishes, Taco
On Mon, Dec 21, 2020 at 2:36 PM Taco Hoekwater
On 21 Dec 2020, at 14:08, Mojca Miklavec
wrote: Dear Taco,
On Mon, 21 Dec 2020 at 13:46, Taco Hoekwater wrote:
On 21 Dec 2020, at 13:16, Mojca Miklavec wrote:
My only explanation would be that perhaps "^1" is so greedy that the rest of the pattern doesn't get found. But I don't want to believe that explanation.
Which (of course) means that that is exactly what happens ;)
The ones that match are
ababbb (a (ba+bb) b) => r4 r1(r3(r5 r4) r2(r5 r5)) r5 abbbab (a (bb+ba) b) => r4 r1(r2(r5 r5) r3(r5 r4)) r5
With the ^1, in the “bb” cases the first “b” eats all three “b”s:
ababbb fails the r5 at the end
abbbab fails the first r2 already (since the second r5 therein never happens)
Is this a deliberate choice, a limitation of the grammar expressiveness, some misuse on my side that could/should/needs to be implemented in a different way, or does it count as a "bug" on the lpeg side?
For example, I wouldn't expect a regexp "b+b" to fail on "bbb" just because "b+" would eat all three "b"s at once (the regexp "b+b" in fact finds "bbb", and I would expect a less-than-totally-greedy hit with lpeg as well). Or is my reasoning wrong here?
PEGs are greedy by design, which is a consequence of the fact that PEGS do not backtrack, which goes back to the underlying assumptive rule of PEGs that there is one (and only one!) ‘correct’ way to parse the input. Allowing backtracking destroys that assumption and by doing so would complicate the system to a level that would make it comparable to PCRE (with all the associated penalties on processing speed and a much greater codebase).
greedy vs non-greedy is one of the things that I always keep in mind when I start with lpeg, and regularly I fail to apply -- because I think in the "perl regex way". Anyway, http://www.gammon.com.au/lpeg has some good lines: e.g. this one (from the lpeg site) find the pattern anywhere in the line: function anywhere (p) return lpeg.P { p + 1 * lpeg.V(1) } end print (lpeg.match (anywhere ("dog"), target)) -- luigi
Hi, Thanks to both Taco and Arthur for clarifying this and pointing out the difference between PEG and PCRE. I'll put some links like https://en.wikipedia.org/wiki/Parsing_expression_grammar https://en.wikipedia.org/wiki/Perl_Compatible_Regular_Expressions on my reading list and try to understand it more thoroughly than I do at the moment. After that I'll probably need to re-read Taco's answer a couple more times, and probably write some parsing grammar as an exercise, but I'll eventually get there :) Just a correction of my previous statement. The following doesn't really work either: lpeg.P('b') + lpeg.P('bb') + lpeg.P('bbb') it just happened to work in one particular case on my minimal example, but it doesn't help in general. That's probably expected based on what Arthur and Taco explained to me.
In your example, the fact that you even considered ^1 means that you were still thinking too much in terms of regular expressions,
In reality it just means that I was trying to add a new rule to solve the second part of the puzzle (hidden on the website until you solve the first part), which read something like 6: 3 | 3 6 which would in theory be translated into something like r6 = lpeg.V"r3" + lpeg.V"r3" * lpeg.V"r6", if PEGs worked the way I imagined they worked, that is. Apparently they don't :) It seemed so obvious, but it would be too easy if it was true ;), so it will force me to solve it "ab initio" (as it was initially meant) anyway. Thanks a lot to everyone for the extra patience to explain me those basic principles & differences, Mojca
Hi, Also good reading are the first sections in Roberto’s paper: http://www.inf.puc-rio.br/~roberto/docs/peg.pdf
In reality it just means that I was trying to add a new rule to solve the second part of the puzzle (hidden on the website until you solve the first part), which read something like 6: 3 | 3 6 which would in theory be translated into something like r6 = lpeg.V"r3" + lpeg.V"r3" * lpeg.V"r6", if PEGs worked the way I imagined they worked, that is. Apparently they don't :)
Try this, blind guess: r6 = lpeg.V"r3" * lpeg.V”r6” + lpeg.V"r3", Best wishes, Taco Taco Hoekwater Elvenkind BV
On 12/21/2020 3:10 PM, Mojca Miklavec wrote:
lpeg.P('b') + lpeg.P('bb') + lpeg.P('bbb')
always put the longest (for overlapping) first, in this case you could do: P('b')^-3 at most 3 ----------------------------------------------------------------- Hans Hagen | PRAGMA ADE Ridderstraat 27 | 8061 GH Hasselt | The Netherlands tel: 038 477 53 69 | www.pragma-ade.nl | www.pragma-pod.nl -----------------------------------------------------------------
On Mon, Dec 21, 2020 at 02:08:11PM +0100, Mojca Miklavec wrote:
Is this a deliberate choice, a limitation of the grammar expressiveness, some misuse on my side that could/should/needs to be implemented in a different way, or does it count as a "bug" on the lpeg side?
That’s definitely the way parsing expression grammars work. Or rather: that’s the way LPeg (the only parsing expression grammar) works :-)
For example, I wouldn't expect a regexp "b+b" to fail on "bbb" just because "b+" would eat all three "b"s at once (the regexp "b+b" in fact finds "bbb", and I would expect a less-than-totally-greedy hit with lpeg as well). Or is my reasoning wrong here?
Your reasoning is correct, but applies to regular expressions, not parsing-expression grammars. Arthur
I'm sorry for being slightly off-topic here, but this list might still be the best place to resolve lpeg-related questions :) spoiler attached ... doesn't deserve a beauty contest price not one for
On 12/21/2020 1:16 PM, Mojca Miklavec wrote: performence (not much different from your btw) but maybe easier for testing all kind of variants (i cannot use context helpers of course) Hans ----------------------------------------------------------------- Hans Hagen | PRAGMA ADE Ridderstraat 27 | 8061 GH Hasselt | The Netherlands tel: 038 477 53 69 | www.pragma-ade.nl | www.pragma-pod.nl -----------------------------------------------------------------
participants (5)
-
Arthur Reutenauer
-
Hans Hagen
-
luigi scarso
-
Mojca Miklavec
-
Taco Hoekwater