Standard ML idiosyncrasies

I've been reading the book Modern Compiler Implementation in ML lately. It's been helpful to brush up on some concepts while developing Tsonnet (my typed-aspiring Jsonnet flavor) and I hope to learn a ton more. However, I'm growing dissatisfied with some details -- not specifically the book, but the choice of the development environment.
I've been postponing writing this rant due to some life events, but I'm finally leaving my impressions here. Maybe they'll help someone, or maybe they'll only echo in the void. Anyway!
The (S)ML experience
The first thing I'm not happy with is the lack of a useful getting started guide. You know, the basics: how to compile programs, check standard library documentation, etc. The usual stuff you desperately need when you're starting in a new language. These things are available on the website, though you need a few hops to find them, and I don't wanna read a 300ish page tutorial to learn how to do a simple "hello, world". Oh, come on!
Loading the REPL with your code is straightforward with sml myprogram.sml. This is fine, but I miss a way of simply running the file directly and getting its result -- or better yet, compiling it to a binary (more on this frustration later). The REPL will evaluate the program as you load it, but changing the code and loading it again is rather inconvenient.
The SML Environment editor extension was quite useful in improving the feedback loop while coding.
If you wanna load the REPL with more complex code, like multiple modules, you can do so with sml -m sources.cm -- where sources.cm is the manifest listing all your program sources (the cm extension refers to CM, SML/NJ's Compilation Manager):
sml -m sources.cm
Standard ML of New Jersey [Version 110.99.7; 64-bit; December 28, 2024]
[scanning sources.cm]
[parsing (sources.cm):tiger.lex.sml]
[library $SMLNJ-BASIS/basis.cm is stable]
[library $SMLNJ-BASIS/(basis.cm):basis-common.cm is stable]
[loading (sources.cm):tokens.sig]
[loading (sources.cm):tokens.sml]
[loading (sources.cm):errormsg.sml]
[compiling (sources.cm):tiger.lex.sml]
[code: 24261, data: 8005, env: 1319 bytes]
[compiling (sources.cm):driver.sml]
[code: 1266, data: 26, env: 133 bytes]
[loading (sources.cm):main.sml]
[New bindings added.]
-
If you type something wrong, you have to delete characters with backspace. The arrow keys don't work for navigation within the REPL; pressing the left arrow just prints escape sequences:
- ^[[D^[[D^[[D^[[D^[[D^[[D^[[D^[[D
The REPL is not that helpful. I think I'm spoiled by utop, ghci, irb, and others.
Now, if you want to compile your program, you can use ml-build:
ml-build sources.cm Main.main tiger.exe
Standard ML of New Jersey [Version 110.99.7; 64-bit; December 28, 2024]
[scanning sources.cm]
[library $SMLNJ-BASIS/basis.cm is stable]
[library $SMLNJ-BASIS/(basis.cm):basis-common.cm is stable]
[loading (sources.cm):tokens.sig]
[loading (sources.cm):tokens.sml]
[loading (sources.cm):errormsg.sml]
[loading (sources.cm):tiger.lex.sml]
[loading (sources.cm):driver.sml]
[loading (sources.cm):main.sml]
[scanning 73480-export.cm]
[scanning (73480-export.cm):sources.cm]
[parsing (73480-export.cm):73480-export.sml]
[compiling (73480-export.cm):73480-export.sml]
[code: 239, data: 31, env: 39 bytes]
So far, so good. But here's what happens when you try executing it:
chmod +x tiger.exe.amd64-darwin
./tiger.exe.amd64-darwin
zsh: exec format error: ./tiger.exe.amd64-darwin
Uh oh!
Turns out, ml-build doesn't produce a native executable. It exports a heap image that only the SML/NJ runtime can load.
To execute it, we need to feed the heap image to sml like this:
sml @SMLload=tiger.exe test.tig
LET 44
TYPE 49
ID(arrtype) 55
EQ 63
ARRAY 65
OF 71
ID(int) 74
VAR 79
ID(arr1) 83
COLON 87
ID(arrtype) 88
ASSIGN 96
ID(arrtype) 99
LBRACE 107
INT(10) 108
RBRACE 110
OF 112
INT(0) 115
IN 117
ID(arr1) 121
END 126
EOF 129
Nothing absurdly wrong with it. It's just not straightforward.
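For context, the entry point handed to ml-build (Main.main above) is expected to be a function of type string * string list -> OS.Process.status, receiving the program name and the command-line arguments. Here is a minimal sketch of such an entry point. The structure name HelloMain and its behavior are hypothetical, made up for illustration; this is not the book's driver:

```sml
(* Hypothetical entry point for ml-build; not the book's Main structure. *)
structure HelloMain =
struct
  (* ml-build expects a function of type
       string * string list -> OS.Process.status
     progName is the program name, args are the command-line arguments. *)
  fun main (progName : string, args : string list) : OS.Process.status =
    case args of
        [file] => (print ("would lex " ^ file ^ "\n"); OS.Process.success)
      | _ => (print ("usage: " ^ progName ^ " <file>\n"); OS.Process.failure)
end
```

Assuming this structure were listed in sources.cm, it could be exported with the same pattern as above (ml-build sources.cm HelloMain.main hello) and started with sml @SMLload=hello some-file, with some-file arriving in args.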
As an alternative, MLton can compile directly to binary. It's great in theory! But here's the catch: if there's a library targeting SML/NJ, there might be some inconsistencies and code won't always work out of the box. After some attempts, I reluctantly gave up on MLton to be able to reuse the code from the book without constant tweaking.
I'm not happy having to deal with a less-than-ideal environment for learning. But what could I expect? The book is dated now (1998). Still, I had higher expectations regarding basic usage. I wonder whether choosing the C or Java version would have been better -- this is on me: I'm an eternal believer in the functional style and assumed it would be nicer to use (S)ML rather than C or Java.
At this point, I seriously considered: why not switch to OCaml? It's a more polished flavor of ML with plenty of modern tools and libraries. I could figure out the tooling differences myself. However, since I plan to continue with the book's more advanced exercises, I've decided to stick with SML for now; otherwise, I wouldn't be able to use the specific tools from the book. It's a decision I'm already wondering if I'll regret.
Speaking of which, I had to implement a lexer as an exercise, using ML-Lex. I felt dumb through the trial and error process, experimenting with the sample code on the documentation page, which leaves much to be desired to say the least. Good thing that nowadays we have LLMs, and Claude came to the rescue -- it was a relief having an AI pairing pal to guide me on some basic stuff, instead of extending the frustrating moments.
In the book, we build a lexer for a language called Tiger. Here's the lexer I wrote:
type pos = int
type lexresult = Tokens.token
val lineNum = ErrorMsg.lineNum
val linePos = ErrorMsg.linePos
val commentLevel = ref 0;
val charbuf : string list ref = ref [];
val charBegin = ref 0;
fun err(p1,p2) = ErrorMsg.error p1
fun eof() = let val pos = hd(!linePos) in Tokens.EOF(pos,pos) end
fun textToInt str =
  case Int.fromString str of
      SOME n => n
    | NONE => raise Fail ("Cannot convert '" ^ str ^ "' to integer");
%%
%s COMMENT;
%s STRING;
whitespace = [\ \t];
breakline = [\n \r\n];
alpha = [A-Za-z];
digit = [0-9]+;
id = [a-zA-Z_][a-zA-Z_0-9]*;
%%
<INITIAL>"/*" => (commentLevel := 1; YYBEGIN COMMENT; continue());
<COMMENT>"/*" => (commentLevel := !commentLevel + 1; continue());
<COMMENT>"*/" => (commentLevel := !commentLevel - 1;
                  if !commentLevel = 0 then YYBEGIN INITIAL else ();
                  continue());
<COMMENT>{breakline} => (lineNum := !lineNum+1; linePos := yypos :: !linePos; continue());
<COMMENT>. => (continue());
<INITIAL>{breakline} => (lineNum := !lineNum+1; linePos := yypos :: !linePos; continue());
<INITIAL>{whitespace}+ => (continue());
<INITIAL>"," => (Tokens.COMMA(yypos,yypos+1));
<INITIAL>":" => (Tokens.COLON(yypos,yypos+1));
<INITIAL>";" => (Tokens.SEMICOLON(yypos,yypos+1));
<INITIAL>"+" => (Tokens.PLUS(yypos,yypos+1));
<INITIAL>"-" => (Tokens.MINUS(yypos,yypos+1));
<INITIAL>"*" => (Tokens.TIMES(yypos,yypos+1));
<INITIAL>"/" => (Tokens.DIVIDE(yypos,yypos+1));
<INITIAL>"=" => (Tokens.EQ(yypos,yypos+1));
<INITIAL>"<>" => (Tokens.NEQ(yypos,yypos+2));
<INITIAL>"." => (Tokens.DOT(yypos,yypos+1));
<INITIAL>":=" => (Tokens.ASSIGN(yypos,yypos+2));
<INITIAL>">" => (Tokens.GT(yypos,yypos+1));
<INITIAL>">=" => (Tokens.GE(yypos,yypos+2));
<INITIAL>"<" => (Tokens.LT(yypos,yypos+1));
<INITIAL>"<=" => (Tokens.LE(yypos,yypos+2));
<INITIAL>"|" => (Tokens.OR(yypos,yypos+1));
<INITIAL>"&" => (Tokens.AND(yypos,yypos+1));
<INITIAL>"[" => (Tokens.LBRACE(yypos,yypos+1));
<INITIAL>"]" => (Tokens.RBRACE(yypos,yypos+1));
<INITIAL>"{" => (Tokens.LBRACK(yypos,yypos+1));
<INITIAL>"}" => (Tokens.RBRACK(yypos,yypos+1));
<INITIAL>"(" => (Tokens.LPAREN(yypos,yypos+1));
<INITIAL>")" => (Tokens.RPAREN(yypos,yypos+1));
<INITIAL>"type" => (Tokens.TYPE(yypos,yypos+4));
<INITIAL>"var" => (Tokens.VAR(yypos,yypos+3));
<INITIAL>"function" => (Tokens.FUNCTION(yypos,yypos+8));
<INITIAL>"break" => (Tokens.BREAK(yypos,yypos+5));
<INITIAL>"of" => (Tokens.OF(yypos,yypos+2));
<INITIAL>"end" => (Tokens.END(yypos,yypos+3));
<INITIAL>"in" => (Tokens.IN(yypos,yypos+2));
<INITIAL>"nil" => (Tokens.NIL(yypos,yypos+3));
<INITIAL>"let" => (Tokens.LET(yypos,yypos+3));
<INITIAL>"do" => (Tokens.DO(yypos,yypos+2));
<INITIAL>"to" => (Tokens.TO(yypos,yypos+2));
<INITIAL>"for" => (Tokens.FOR(yypos,yypos+3));
<INITIAL>"while" => (Tokens.WHILE(yypos,yypos+5));
<INITIAL>"else" => (Tokens.ELSE(yypos,yypos+4));
<INITIAL>"then" => (Tokens.THEN(yypos,yypos+4));
<INITIAL>"if" => (Tokens.IF(yypos,yypos+2));
<INITIAL>"array" => (Tokens.ARRAY(yypos,yypos+5));
<INITIAL>{digit} => (Tokens.INT(textToInt yytext,yypos,yypos+(size yytext)));
<INITIAL>{id} => (Tokens.ID(yytext,yypos,yypos+(size yytext)));
<INITIAL>\" => (YYBEGIN STRING; charbuf := []; charBegin := yypos; continue());
<STRING>[^\"] => (charbuf := yytext :: !charbuf; continue());
<STRING>\" => (YYBEGIN INITIAL;
               Tokens.STRING(String.concat(List.rev(!charbuf)), !charBegin, yypos+1));
<INITIAL>. => (ErrorMsg.error yypos (" illegal character " ^ yytext); continue());
The ML-Lex documentation is rather cryptic when it comes to start states and how to use them. Luckily, chapter 2 of the book explains how automata are suitable for implementing lexers, and that was enough for me to figure it out and come up with the COMMENT and STRING states. I wonder how many people gave up after trying only the "official" documentation and getting frustrated. I'd probably have given up without the extra context.
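To make the role of the COMMENT state concrete, here is a hand-rolled sketch of the same idea in plain SML. A depth counter plays the part of commentLevel, and depth > 0 stands in for being inside the COMMENT state. The function stripComments is a hypothetical helper for illustration only and is not part of the lexer above:

```sml
(* Hypothetical helper: remove nested /* ... */ comments from a string.
   depth mirrors commentLevel; depth > 0 means "inside COMMENT". *)
fun stripComments (s : string) : string =
  let
    (* acc holds the kept characters in reverse order *)
    fun go ([], _, acc) = String.implode (List.rev acc)
      | go (#"/" :: #"*" :: rest, depth, acc) =
          (* opening delimiter: enter or nest one level deeper *)
          go (rest, depth + 1, acc)
      | go (#"*" :: #"/" :: rest, depth, acc) =
          if depth > 0
          then go (rest, depth - 1, acc)            (* close one level *)
          else go (rest, depth, #"/" :: #"*" :: acc) (* stray "*/": keep it *)
      | go (c :: rest, depth, acc) =
          if depth > 0
          then go (rest, depth, acc)                (* inside a comment: drop *)
          else go (rest, depth, c :: acc)           (* outside: keep *)
  in
    go (String.explode s, 0, [])
  end
```

For example, stripComments "a /* b /* c */ d */ e" evaluates to "a  e": the inner comment only lowers the depth from 2 to 1, so the text after it is still skipped until the outer comment closes.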
Conclusion
I've rambled enough for one post. I may write more about my ML adventures... or maybe not. For now, I have a parser to write. Wish me luck!