I was using treetop to do some parsing the other day and it got me thinking. Treetop is a parsing DSL for ruby based on the idea of a parsing expression grammar (PEG). This could get dangerous.
lex and yacc (flex and bison)
If you open up the ruby source code, you'll probably find a file named parse.y. This is probably used by bison to generate a parser. The parser is (probably) used in conjunction with flex to deal with parsing things. A thing could be a source file of a programming language, an HTTP request, or some other interesting file format.
Using this type of system is great because it's (usually) damn fast, and you can provide decent error reporting, as opposed to something like regular expressions. Not to mention, regular expressions won't work on things that aren't regular, like arbitrarily nested parentheses.
The Process (Simplified)
So why all this? That way of parsing things, with a lexer (generated by flex) and a parser (generated by bison), goes like this:
1. The lexer gets set up with an input.
2. The parser is set up with the lexer as its input.
3. The parser asks the lexer, "what's the next token?"
4. The lexer uses its rules to match a token (typically eating whitespace).
5. The parser starts matching rules with that token.
6. Go back to step 3 until the input is done (or the parser says it's done).
Basically.
It's a bit more complicated than that, but for our purposes, we can leave it there. The point is that the lexer typically ignores whitespace between tokens. Who cares if you have one space or twenty between a type declaration and the name of the variable? The system doesn't care.
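To make that loop concrete, here is a minimal sketch in Python rather than flex and bison. The token rules and the toy "type name;" declaration language are made up for illustration; the only point is that the lexer throws whitespace away before the parser ever sees it.

```python
import re

# Hypothetical token rules for a toy "type name;" declaration language.
TOKEN_RULES = [
    ("TYPE", re.compile(r"(?:int|float)\b")),
    ("NAME", re.compile(r"[A-Za-z_]\w*")),
    ("SEMI", re.compile(r";")),
]
WHITESPACE = re.compile(r"\s+")


def lex(text):
    """Yield (kind, value) tokens, silently eating whitespace between them."""
    pos = 0
    while pos < len(text):
        ws = WHITESPACE.match(text, pos)
        if ws:
            pos = ws.end()  # whitespace never becomes a token
            continue
        for kind, pattern in TOKEN_RULES:
            m = pattern.match(text, pos)
            if m:
                yield kind, m.group()
                pos = m.end()
                break
        else:
            raise SyntaxError(f"unexpected character {text[pos]!r} at {pos}")


def parse_declaration(text):
    """Ask the lexer for one token at a time and match TYPE NAME SEMI."""
    tokens = lex(text)
    values = []
    for want in ("TYPE", "NAME", "SEMI"):
        kind, value = next(tokens, ("EOF", ""))
        if kind != want:
            raise SyntaxError(f"expected {want}, got {kind}")
        values.append(value)
    if next(tokens, None) is not None:
        raise SyntaxError("trailing input after declaration")
    return values[0], values[1]
```

Both parse_declaration("int x;") and the same declaration with twenty spaces in the middle come back as the same (type, name) pair, because the parser only ever sees tokens.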
Python
Now python cares a little bit. Python is partially whitespace-sensitive, which in this case means that it pays attention to your indentation. Python uses indentation to denote blocks. If you make an if statement, and you indent the first line by 4 spaces (let's say), you just indent the next line by 4 spaces too to make that line a part of that if statement. If you don't indent it, it's not part of the if statement. If you indent by some other value (3 spaces), it complains because you aren't being consistent and has no idea what the hell you're talking about. No more curly brace madness with your if statements!
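You can watch python make those three decisions by handing it source strings through the built-in compile(). This is just a quick sketch; the compiles helper and the sample snippets are mine.

```python
def compiles(source):
    """Return True if python accepts the source, False if it rejects it."""
    try:
        compile(source, "<example>", "exec")
        return True
    except SyntaxError:  # IndentationError is a subclass of SyntaxError
        return False


# Both lines indented by 4 spaces: both belong to the if statement.
consistent = "if True:\n    a = 1\n    b = 2\n"

# Second line not indented: valid, but b = 2 is outside the if statement.
dedented = "if True:\n    a = 1\nb = 2\n"

# Second line indented by 3 spaces: matches no enclosing block.
inconsistent = "if True:\n    a = 1\n   b = 2\n"
```

The first two snippets compile; the third is rejected with an IndentationError.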
Let's crank it up to 11
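Since a PEG system is "different" than the system I previously described, we can do different things with it. A PEG parser like treetop has no separate lexer: the grammar works on the raw characters, so nothing eats whitespace for you. You write your PEG to recognize the text exactly as it is, so to recognize an if statement you'd do something like this: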
rule if_start: 'if' space lparen if_body rparen
This would not match an if statement with two spaces between the if token and the left paren (assuming the space rule matches exactly one space character).
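Here is a rough sketch of that rule as a tiny hand-rolled matcher in Python. It is not treetop and not a full PEG implementation: the literal and sequence helpers are my own, and the if_body rule is a made-up placeholder, since the rule above doesn't say what a body looks like.

```python
def literal(s):
    """Build a matcher that consumes exactly the string s, or fails."""
    def match(text, pos):
        return pos + len(s) if text.startswith(s, pos) else None
    return match


def sequence(*parts):
    """Build a matcher that runs each part in order; fails if any part fails."""
    def match(text, pos):
        for part in parts:
            pos = part(text, pos)
            if pos is None:
                return None
        return pos
    return match


def if_body(text, pos):
    """Hypothetical body rule: one or more characters that are not parens or spaces."""
    end = pos
    while end < len(text) and text[end] not in "() ":
        end += 1
    return end if end > pos else None


space = literal(" ")  # exactly one space, no more, no less
if_start = sequence(literal("if"), space, literal("("), if_body, literal(")"))


def matches(text):
    """True only if the whole text is consumed by the if_start rule."""
    return if_start(text, 0) == len(text)
```

With these definitions, "if (x)" matches, while "if  (x)" (two spaces) and "if(x)" (no space) do not.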
Lack of research
I'm not going to lie, I haven't researched this to the four corners of the earth. I don't know whether or not the traditional system could be made to work to the degree I am thinking. But that doesn't really matter; it's just the idea of it.
Let the compiler do it
My 1+1 was: if everybody whines about programming style,[1] and there is this parsing system that requires you to specify the text exactly as it should be, why not combine the two and just enforce programming style in the language grammar itself? If you screw up the style, it doesn't compile!
You can turn the dial a bit, to make some parts of the syntax more flexible than others. As an example, you could enforce single spaces between things, but not enforce indentation rules the way python does.
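A quick sketch of what turning the dial could look like, using a regular expression as a stand-in for a single grammar rule. The "name = number" assignment line is made up for illustration: the rule is strict about spacing around the equals sign and doesn't care about indentation.

```python
import re

# Hypothetical rule for a "name = number" assignment line.
# Dial turned up: exactly one space on each side of '='.
# Dial turned down: any amount of leading spaces or tabs is accepted.
ASSIGNMENT = re.compile(r"[ \t]*[A-Za-z_]\w* = \d+")


def accepts(line):
    """True if the whole line follows the enforced style."""
    return ASSIGNMENT.fullmatch(line) is not None
```

Here "x = 1" and "        x = 1" are both accepted, while "x  = 1" and "x=1" are rejected.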
Pros
The best part is nobody can come in and start throwing extra spaces around, mixing tabs in there, and just generally mucking about. If they don't follow the style that is laid out as a part of the language, it doesn't work. All code looks the same.
Second…well I guess that's about it. It really just helps keep things consistent and under control. That is, however, a pretty big pro in my mind.
Cons
A couple people argued that style is subjective. Well…maybe. That's really not enough of an argument to convince me. Yes, it is, sort of, but I'd much rather see consistent code than be able to write exactly the way I want.
Another argument was that evolving the style would be painful. No more painful than upgrading or changing language features, really. If you change the style of something which results in an upgrade barrier for some (they have to change their code to upgrade to the latest version of the language), how is this any different than ruby 1.9 or python 3? They each had breaking changes in the language spec which required work to upgrade. It's not the end of the world.
Why not?
So why not? Regardless of what parsing system you use, why not enforce the language style in the language itself? It seems like it could be a pretty good (or at least interesting) idea.[2]
[1] Recently I've been having many conversations with friends about programming style problems we've seen.
[2] At a minimum, it looks good on paper, and would be interesting from an academic perspective if it proved to be a bad idea in the real world.