I’d better say it up front, because it will quickly become obvious: I am not a computer science graduate. I have never written a compiler. It was quite a route to get my thinking in line with this particular problem, and I’m sure it will evolve further.
As I said in my post on Friday, one part of my solving the VB/C# problem without making unreadable templates is a preprocessor. I struggled with what to call it because the real pattern is: create the VB templates, run the processor to create the C# templates, then execute the C# or VB templates. So is it really a preprocessor? I am still calling it that because it runs before the C# templates do, so I’m thinking of it as an optional preprocessor.
The result is modified templates: a second set of source code and a second template assembly.
The first decision I faced was how much context I was going to demand for any decision. More context means more sophisticated decisions. You could attempt to build a full syntactic tool that understands the structure of your output code and knows a great deal about what you are accomplishing. This may or may not be possible, and it would certainly require restrictions on what template code is legal, because evaluating multiple paths would be a nightmare and stray strings can produce templates that are legal but don’t evaluate the same way. You may be able to do it; I’m choosing not to tackle that, and I decided on the least possible context.
The absolute minimum of understanding about the template being converted is which of a finite set of states you are in. Possible states are:
- Template logic (the code that runs the template)
- Code blocks
- Conditional blocks within code blocks
- Possibly additional states around declarations, for loops and using statements
My first attempt was line based: faster, easier for recognizing comments, and nearly impossible to ever restructure the line wrapping correctly. Trust me, that route did not go well.
A week ago Friday, nearly in tears, I told my son Ben “Look, I told everybody I could do this, and Carl just posted that show. And I am doing it, except I think the bugs I am facing with end of line issues are not solvable.”
My brilliant son said “Why on earth are you doing it that way – do character by character.”
“What, rewrite the whole thing?” maybe I cried.
The rewrite actually went pretty well, painful as it was to abandon nearly completed code. It was made easier by the fact that I really do not care about performance. This is a template translation. The converted templates will be compiled and blazingly fast. I can take a second or two per template to do the translation. Thus I can skip all that compiler theory that I never learned about managing buffers and lookaheads and all that. A bit of brute force with the simplest possible RegEx.
I’m basically looking at the entire template as a string. I step character by character through the string doing a substring check starting at the current position. I avoid the dumbest of the .NET mistakes such as copying the substrings unnecessarily and I do concatenate via a string builder so performance doesn’t suck too badly. And I do restrict what I’m looking for to what makes sense in context. But I don’t worry that I am looking at the next handful of characters an excessive number of times.
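To make the scanning idea concrete, here is a minimal sketch in Python rather than .NET, purely as an illustration. The function name `translate` and the replacement table are my own assumptions, not the preprocessor's actual code. It walks the string one position at a time, checks whether a known sequence starts at the current position without copying a substring, and collects output in a list that plays the role of a StringBuilder.

```python
def translate(template, replacements):
    """Copy `template` character by character, swapping any sequence
    listed in `replacements` (a hypothetical old -> new table)."""
    out = []   # list + join stands in for a .NET StringBuilder
    pos = 0
    while pos < len(template):
        for old, new in replacements.items():
            # startswith with an offset compares in place: no substring copy
            if template.startswith(old, pos):
                out.append(new)
                pos += len(old)
                break
        else:
            # nothing matched at this position: emit the character unchanged
            out.append(template[pos])
            pos += 1
    return "".join(out)


print(translate("x <> y", {"<>": "!="}))
```

The same few characters get inspected once per candidate sequence, which is wasteful and, for a one-off translation step, perfectly acceptable.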
I start off in the template logic. I output the template logic character by character until I find a character sequence that indicates a new mode. I’m keeping this simple by managing both the modes and the required stack via the call stack. Meaning, when I shift into a new mode, such as the Comment mode, I just call a method called TranslateComment. Comments are easy: just change the start character and read to the end of the line for output. I need comments treated differently because a code block in a comment should not be translated.
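A rough sketch of "the call stack is the mode stack", again in Python as an illustration only. The function names mirror TranslateComment but are otherwise hypothetical, and this version assumes a VB comment starts at any apostrophe, so it ignores apostrophes inside string literals. Each mode is a function that consumes input from a position and returns the position where the caller should resume.

```python
def translate_template_logic(text, pos, out):
    """Template-logic mode: copy characters until another mode starts."""
    while pos < len(text):
        if text[pos] == "'":
            # entering Comment mode is just a method call;
            # returning from it pops back to this mode
            pos = translate_comment(text, pos, out)
        else:
            out.append(text[pos])
            pos += 1
    return pos


def translate_comment(text, pos, out):
    """Comment mode: swap the start character, copy the rest of the line."""
    end = text.find("\n", pos)
    if end == -1:
        end = len(text)
    # nothing inside the comment is translated, not even a code block
    out.append("//" + text[pos + 1:end])
    return end  # the newline itself is left for the caller to emit


def translate(text):
    out = []
    translate_template_logic(text, 0, out)
    return "".join(out)


print(translate("Dim x = 1 ' <code>not translated</code>\nDim y = 2"))
```

Because each mode returns where it stopped, nesting falls out of ordinary calls and returns, with no explicit state variable to keep in sync.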
For now, I’m making the restriction that code blocks (blocks to output) must be exactly <code>stuff</code>. This makes parsing a bit easier than allowing any element name. If I’m in template logic and hit a code block, I know I need to start translating. I start looking for sequences that need conversion: Me as a word, If, For Each, End If, Next, etc. This list is pretty short right now, and I expect the preprocessor to evolve.
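Here is one way the "only translate inside code blocks" rule might look, sketched in Python as an illustration. The keyword table is my own and deliberately tiny (Me comes from the list above, Nothing is an added assumption), and only simple word swaps are shown; structural keywords such as If and End If need more than a swap. The word-boundary regex is what makes "Me as a word" work, so that Me inside Message is left alone.

```python
import re

# Hypothetical VB -> C# word table; the real list would be longer.
KEYWORDS = {"Me": "this", "Nothing": "null"}
WORD = re.compile(r"\b(" + "|".join(map(re.escape, KEYWORDS)) + r")\b")

OPEN, CLOSE = "<code>", "</code>"


def translate_code_blocks(template):
    """Translate whole words inside <code>...</code>; copy the rest as-is."""
    out = []
    pos = 0
    while pos < len(template):
        start = template.find(OPEN, pos)
        if start == -1:
            out.append(template[pos:])      # no more code blocks
            break
        end = template.find(CLOSE, start)
        if end == -1:
            raise ValueError("unterminated <code> block")
        body_start = start + len(OPEN)
        out.append(template[pos:body_start])  # template logic, untouched
        body = template[body_start:end]
        out.append(WORD.sub(lambda m: KEYWORDS[m.group(1)], body))
        out.append(CLOSE)
        pos = end + len(CLOSE)
    return "".join(out)


print(translate_code_blocks("Me.Run(<code>Me.Name = Nothing</code>)"))
```

Note that the Me outside the code block survives, because template logic stays in its original language.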
If I’m in a code block and I hit an embedded expression (<%= ) I switch back to template logic mode. This is not precise, but it’s close enough. Characters are output exactly until I hit another code block because this is template logic, not output code. If you concatenate strings in there, you’re toast, but you can call methods that are in the VB/C# namespaces.
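The switch back to template logic can be sketched like this, in Python and under a simplifying assumption of mine: an embedded expression ends at the first %> and contains no nested code block, which is narrower than the rule described above. Everything between the markers is copied verbatim, and only the surrounding output code has Me swapped for this.

```python
import re

ME = re.compile(r"\bMe\b")  # "Me" as a whole word only


def translate_code_body(body):
    """Translate output code, passing <%= ... %> through untouched."""
    out = []
    pos = 0
    while pos < len(body):
        start = body.find("<%=", pos)
        if start == -1:
            out.append(ME.sub("this", body[pos:]))
            break
        end = body.find("%>", start)
        if end == -1:
            raise ValueError("unterminated embedded expression")
        end += len("%>")
        out.append(ME.sub("this", body[pos:start]))  # output code: translate
        out.append(body[start:end])                  # template logic: verbatim
        pos = end
    return "".join(out)


print(translate_code_body("Me.Id = <%= Me.GetId() %>; Me.Save()"))
```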
There are some special cases around code constructs. I recognize an If block by searching for the Then and taking what’s between as a code expression that needs translation. Wherever I’m translating expressions I just use a simple replacement because it’s really just separate symbols.
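As an illustration of the If handling, here is a small Python sketch. The symbol table and function names are assumptions of mine, and it works on a single statement rather than on the whole template string. It finds Then, takes what lies between If and Then as the condition, and runs plain replacements over it. The one regex handles a lone = (VB equality) without disturbing operators such as != or <=.

```python
import re

# Hypothetical VB -> C# symbol table, applied in order.
SYMBOLS = [("AndAlso", "&&"), ("OrElse", "||"), ("<>", "!=")]


def translate_expression(expr):
    """Swap VB operators for C# ones by simple replacement."""
    for vb, cs in SYMBOLS:
        expr = expr.replace(vb, cs)
    # a lone '=' in a condition is equality; skip <=, >=, != and ==
    return re.sub(r"(?<![<>!=])=(?!=)", "==", expr)


def translate_if(statement):
    """Return the C# opening of a VB If block, or None if it isn't one."""
    text = statement.strip()
    if not text.startswith("If "):
        return None
    then_at = text.find(" Then", len("If"))
    if then_at == -1:
        return None
    condition = text[len("If "):then_at].strip()
    return "if (" + translate_expression(condition) + ") {"


print(translate_if("If x = 1 AndAlso y <> 2 Then"))
```

Plain replacement like this would misfire on an operator name that appears inside a string literal or identifier, which is the price of keeping the required context to a minimum.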
The preprocessor is simple and focused on what’s actually needed, not boiling the ocean. It will evolve as far as it needs to, staying well shy of both the power and usability issues of the CodeDOM – we just don’t need that for business templates in VB and C#.
Whew! I could write tons more on the glitchy little details of this preprocessor, which has really eaten my last couple of weeks. It’s one of the pieces I want to get Open Source early on.