Librarian of Alexandria


Extending the Standard Streams

If I write my own shell—which I may very well do at some point—there's a particular process model I'd like to embed in it. To wit: in UNIX right now, each program in a pipeline has a single input stream and two output streams, with files/sockets/&c for other kinds of communication.

A pipeline of functions foo | bar | baz looks kind of like this

keyboard -> stdin\   /stdout -> stdin\   /stdout -> stdin\   /stdout —+
                  foo                 bar                 baz           |
                     \stderr -+          \stderr -+          \stderr -+ |
                              |                   |                   | |
                              v                   v                   v v

Which works pretty well. You can do some pretty nice things with redirecting stderr to here and stdin from there and so forth, and it enables some nice terse shell invocations.

I'd like that basic system to be preserved, but with the ability to easily create other named streams. For example, imagine a hypothetical version of wc which still outputs the relevant data to stdout, but also has three other streams with these names:

                 / newlines
                 | words
stdin -> wc-s -> | bytes
                 | stdout
                 \ stderr

You can always see the normal output of wc on stdout:

gdsh$ wc-s *
       2       3       6 this.txt
      10      20      30 that.c
     100    1000   10000

But you could also extract an individual stream from that invocation using special redirection operators:

gdsh$ wc-s * stdout>/dev/null bytes>&stdout

We could also have multiple input channels. I imagine an fmt command which can interpolate named streams, e.g.

gdsh$ printf "1\n2 3\n" | wc-s | fmt "bytes: {bytes}\n words: {words}\n nl: {newlines}\n"
bytes: 6
words: 3
newlines: 2

We can then have a handful of other utilities and built-in shell operators for manipulating these other streams:

  # the `select` command takes a stream name and outputs it
gdsh$ wc-s * | select words
  # here we redirect stdout to the stream X and pass it to fmt
gdsh$ cat this.txt stdout>&X | fmt "this is {X}\n"
this is 1
2 3
  # the same, using file redirection operators
gdsh$ fmt "this is {X}\n" X<this.txt
this is 1
2 3
  # the same, using a shorthand for setting up a stream by taking
  # the stdout from some command
gdsh$ !X='cat this.txt' fmt "this is {X}\n"
this is 1
2 3
  # the same, using a shorthand for setting up a stream by just
  # reading and outputting a file
gdsh$ @X=this.txt fmt "this is {X}\n"
this is 1
2 3
  # using a shorthand for filling in a stream with a string directly
gdsh$ ^Y=recidivism fmt "Y is {Y}\n"
Y is recidivism
  # redirecting each output stream to a different file
gdsh$ wc-s * words>words.txt bytes>bytes.txt newlines>newlines.txt
  # using a SmallTalk-like quoting mechanism to apply different shell
  # commands to different streams
gdsh$ wc -s * | split words=[sort >sorted-word-count.txt] bytes=[uniq >uniq-bytes.txt]

This could also enable new idioms for programs and utilities. For example, verbose output, rather than being controlled by a flag to the program, could be always output to a (possibly unused) stream called verbose, so the verbose output could be seen by redirecting the verbose stream (or by logging the verbose output while only seeing the typical stderr messages):

  # here we only see stderr
gdsh$ myprog
myprog: config file not found
  # here we ignore stderr and see only the verbose output
gdsh$ myprog stderr>/dev/null verbose>&stderr
Setting up context
Looking in user dir... NOT FOUND
Looking in global dir... NOT FOUND
myprog: file not found
Tearing down context
  # here we see stderr but logg the verbose output
gdsh$ myprof verbose>errmsgs
myporog: config file not found

Or maybe you could have human-readable error messages on stderr and machine-readable error messages on jsonerr:

  # here is a human-readable error message
gdsh$ thatprog
ERROR: no filename given
  # here is a machine-readable error message
gdsh$ thatprog stderr>/dev/null jsonerr>stderr
{"error-type":"fatal","error-code":30,"error-msg":"no filename given"}

Or you could have a program which takes in data on one stream and commands on another:

  # someprog takes in raw data on the stream DATA, and commands
  # on the stream CMDS. Here we take the data from a local file
  # and accept commands from the network:
gdsh$ @DATA=file.dat !CMDS='nc -l 8000' someprog
  # ...and here we have a set of commands we run through locally
  # while taking data from the network:
gdsh$ !DATA='nc -l 8001' @CMDS=cmds.txt someprog

There are other considerations I've glossed over here, but here are a few notes, advantages, and interactions:

  • I glossed over the distinction between input/output streams. In practice, the shell has no trouble disambiguating the two, but a given program may wish to consider the distinction between and words.out; to this end, we could rename the existing streams and std.out and err.out (it being an error to read from in most cases.1)

  • This is obviously not POSIX-compliant, but could be made to work with the existing UNIX process model by e.g. having a standard environment variable for stream-aware programs to look at which maps stream names to file descriptors. That way, programs which don't expect these special streams still use fds 0, 1, and 2 as expected, while programs that do handle these can read STREAM_DESC to find out which 'streams' correspond to which file descriptors. In that case, you can almost use these commands with an existing shell by doing something like

    sh$ echo foo >&3 | STREAM_DESC='' fmt "foo is {foo}\n"

    where STREAM_DESC takes the form of streamname:fd pairs separated by spaces.

  • If we do use an existing UNIX system to write this, then we also should integrate libraries for this, and the API for it is unknown. Presumably a C interface would have int get_stream(char* stream_name) that could return -1 on failure. get_stream would look through the STREAM_DESC environment variable to find the relevant stream and return the fd mentioned there, otherwise failing. This does mean that you have to create your streams before you use them, which I think is reasonable.

  • This would interact really interestingly with a semi-graphical shell2 that could visualize the stream relationships between commands as well as a shell with a higher-level understanding of data types.3

So those are some ideas that have been drifting around in my head for a while. No idea if I'll ever implement any of them, or if they'd even be worth implementing, but I might get around to it at some point. We'll see.

  1. I originally figured that would be a useless stream, but after some thought, I can imagine a use for this. Let's say my programming language of choice, Phosphorus, outputs its error messages in XML format. This is great for an IDE, but now I need to debug my program on a remote server which doesn't have my IDE installed. I could have a program ph-wrapper that passes all streams through unchanged except for, which it parses as XML and then processes to a kind of pretty-printed trace representation and passes it to its own err.out. So

    gdsh$ phosphorus
    Setting up program...
    gdsh$ phosphorus | ph-wrapper
    Setting up program...
    NoSuchIndex exception on line 3:
      x = args[3];

    So yes, I can imagine a class of programs which want to pay attention to

  2. Don't cringe. Look—the input device with the most information density is the keyboard, right? That's why you use the command line at all. However, graphical systems have more information density than pure-text systems. You can take a pure-text system and extend it with position and color to give it more information, and then with charts and graphs to give it more information, and so forth. What I'm proposing is not drag-and-drop, although that might be useful to some users; it's a keyboard-driven system that displays information in a more dense, information-rich style. I keep thinking of building this myself but for the massive herds of yaks I'd have to shave first. 

  3. PowerShell is the usual example given here, but I confess I haven't used it. Effectively, rather than streams of raw text, think streams of well-formed data types like JSON or s-expressions or some other kind of more elaborate information. wc might instead of outputting tab-separated numbers output lists of a fixed size, then. 


Latka: Tags

Another new Latka feature: effectively, I want the ability to have abstract data types, so I'd like to hide the constructor/destructor of a type and only expose a particular interface. The way I'm doing this is with tags, which are a particular variation on runtime types that have a bit in common with data type constructors in most ML variants.

We speak of tags wrapping values. We declare a new tag either at the top level with

tag t

or in a local scope with

let tag t in (* ... *)

Once we have a tag, we can wrap a value with

x as t

and unwrap it by pattern-matching. This point is slightly unintuitive, and has been my only sticking point—assuming we have a tag t in scope, then

case x of
  y : t -> y

will extract a value that's been wrapped by type t. This is not how pattern bindings usually work—this corresponds to a Haskell pattern like

case x of Tagged y t' | t == t' -> y

and not to the naïve Haskell translation

case x of Tagged y t -> y

So in this case, unification must also be aware of the values in (lexical) scope, which is a bit of an extension. I think I like this better than the alternatives, though.

A wrapped value cannot be used as though it were unwrapped, so the following would result in a runtime error:

let tag n in
  add.1.(2 as n)
  (* error: non-numeric argument to add *)

And if the tag is not in scope, then we can effectively encapsulate our data by providing various accessor functions to work with the tagged data, e.g.

<mkMyNum,getMyNum,addMyNum> := let tag myNum in
  < \ x : num . x as myNum
    \ _       . failure."Non-numeric argument to mkMyNum"
  , \ x : myNum . x
    \ _         . failure."Non-MyNum argument to getMyNum"
  , \ (x : myNum) (y : myNum) . (add.x.y) as myNum
    \ _                       . failure."Non-myNum arguments to addMyNum"
puts getMyNum.(addMyNum.(mkMyNum.2).(mkMyNum.3))
(* prints 5 *)
puts getMyNum.(addMyNum.3.(mkMyNum.3))
(* Failure: Non-MyNum argument to addMyNum *)

One final quirk is that creating a new tag effectively creates a new token that's not visible to the user, so two tags with the same name are still two distinct tags. This fixes a problem with the original, naïve implementation of local type declarations in SML where you could write

let f = (let datatype t = X of int  in fn (X n) => n) in
let x = (let datatype t = X of bool in X true) in
  f x

which is of course nonsensical and breaks soundness and everything. In SML, this was solved by not allowing locally defined types to escape, which strikes me as unnecessarily draconian; we just need to be intelligent enough to know that the two t's above are distinct because they were declared in distinct locations. In doing so, SML would admit the sensical analogue to the Latka code above:

(* This doesn't work, but it'd be nice if it did *)
val (mkMyNum,getMyNum,addMyNum) =
  let datatype myNum = MyNum of int in
    ( fn n         => MyNum n
    , fn (MyNum n) => n
    , fn (MyNum n) =>
        fn (MyNum m) => MyNum (n + m)

In contrast, the following Latka code would print "Nope!"

f := let tag t in \ _ : t . "Yup!"
                  \ _     . "Nope!"
x := let tag t in 5 as t
puts f.x

i.e. despite both being wrapped with a tag named t, each declaration of a tag will produce a new, distinct tag. As Latka is disgustingly dynamic1, coming across an expression let tag t in ... will likely increment some hidden local integer that gets used as the tag, but that is of course an implementation detail subject to change. (Also, because of the limited scope of this language's use, it's not like I have to worry about, say, sending data between processes or multicore or anything, which makes these decisions much easier.)

  1. I was describing a Latka feature to a coworker, who suddenly stopped and said, "Wait, if Latka is dynamically typed, what's stopping you from having an expression like 5 | True | (\x.x)?" I told him, "...absolutely nothing." He was disgusted at the prospect. 


Matzo Feature Consideration

I'm still working on the Matzo language, and I need to finish the parser (as I can now evaluate more programs than I can parse, and have code written to an as-yet unimplemented module spec.) Here are a few features I'm considering but haven't implemented yet, and how they might interact:

Dynamic Default Variables

This opens a small can of worms, so to be clear, dynamic variables will reside in a different namespace than normal variables. I don't know yet whether said namespace will be denoted with a sigil (say #x) or whether you might have to pass a symbol to a function (like get.X) to access their values. The idea is that certain functions will have sensible defaults that one might want to override, and rather than explicitly matching on arity, one can instead pass them in dynamically using a particular syntax. In the examples below, I'll use an explicit syntax for lookup of a dynamically scoped variable; in particular, get.x.y will look up a dynamically scoped variable x (typically a symbol), and if it hasn't been supplied, will use the default value y.

foo :=
  let combine := in
bar := foo with Combine := (\ x y . x ";" y)

puts foo
(* prints "ab", as it uses the default combiner `cat` *)
puts bar
(* prints "a;b" *)

This will continue down dynamically, so

s1 := (get.Pn."e") " stares intently."
s2 := "It is clear that " (get.Pn."e") 
        " is interested in what is happening."
sent := se.<s1,s2>
puts sent;
puts sent with fixed Pn := "he" | "she" | "e"

will force Pn to be the chosen value in all subexpressions evaluated underneath the with clause. (Notice that nondeterminism still works in dynamically computed variables, so one must declare them as fixed at the binding site if you want them to be computed exactly once.)

Text Combinators

I've used one of these above: right now, I have a few planned.

puts wd.<"a","b","c">
(* prints "abc" *)

puts nm.<"a","b","c">
(* prints "Abc" *)

puts se.<"a","b","c">
(* prints "A b c" *)

puts pa.<"a","b","c">
(* prints "A. B. C." *)

So effectively, they function as ways of combining strings in a more intelligent way. I plan for them to do some analysis of the strings so that they don't, say, produce extraneous spaces or punctuation. (The names are of course short for word, name, sentence, and paragraph, respectively, and may very well be alises for those longer names.)

Rebindable Syntax

The conjunction of the previous two suggests that certain bits of syntax should be rebindable. A prominent example is concatenation, e.g.

puts "foo" "bar" "baz"
(* prints "foobarbaz" *)

puts "foo" "bar" "baz" with Cat := se
(* prints "Foo bar baz.", as normal string concatenation
 * has been overloaded by `se` *)

puts "foo" "bar" "baz" with Cat := fold.(\ x y . x ";" y)
(* prints "foo;bar;baz", as normal string concatenation
 * has been overloaded by a custom function *)

This could lead to some pretty phenomenal weirdness, though:

f := \ x y . add.x.y
puts f.<1,2> where App := \ f x . fold.(\ g y . g.y).(append.f.x)
(* prints 3, as we have overloaded function application to
 * automatically uncurry functions *) maybe there should be a limit on it.

Error Handling

Still not sure on this one. I don't particularly want to bring monads into this, mostly because I want the language to be a DSL for strings and not a general-purpose programming language, but at the same time, it might be nice to have a simple exception-like mechanism. One idea I was playing with was to implement a backtracking system so that errors (both raised by users and by built-in problems) could simply resume at particular points and retry until some retry limit is reached. For example, you could reject certain unlikely combinations:

x ::= a b c d
wd := let result := x x x x
      in if eq.result."aaaa" then raise Retry else result
puts mark Retry in wd

Here, mark exp in wd corresponds roughly to the following imperative pseudocode:

{ retry_count := 0
; while (retry_count < retry_max)
    { try { return wd; }
      catch (exp) { retry_count ++; }

It's a much more limited form of exception handling, which may or may not be desirable, but does give you some ability to recover from errors so long as at least some execution of your program will be error-less.

All this is heavily open to change, so we'll see.


A Petty And Insignificant Complaint About Haskell Records

Haskell records have lots of problems. Here's one that came up for me today.

You are allowed to export record members without exporting the constructor, for example, if you want to ensure some property is true of the constructed values. In the following example, the field isNeg is effectively a function of the field num:

module Foo(mkRec, num, isNeg) where

data Rec = Rec
  { num   :: Int
  , isNeg :: Bool

mkRec :: Int -> Rec
mkRec n = Rec n (n < 0)

Another module can't use the Rec constructor, but can observe the values using the exported accessors

module Bar where

addRecs :: Rec -> Rec -> Rec
addRecs r1 r2 = mkRec (num r1 + num r2)

Unfortunately, there's a hole here, which is that exporing the accessors allows us to use record update syntax, which means that we can now construct arbitrary values:

constructAnyRec :: Int -> Bool -> Rec
constructAnyRec n b = mkRec 0 { num = n, isNeg = b }

There is a way around this, namely, by rewriting the original module with manual accessors for num and isNeg:

module Foo2(mkRec, num, isNeg) where

data Rec = Rec
  { _num   :: Int
  , _isNeg :: Bool

num :: Rec -> Int
num = _num

isNeg :: Rec -> Bool
isNeg = _isNeg

mkRec :: Int -> Rec
mkRec n = Rec n (n < 0)

However, I'd assert that, morally, the correct thing to do would be to disallow record update at all if the constructor is not in-scope. The purpose of hiding the constructor at all is to ensure that a programmer must perform certain computations in order to construct a valid value, e.g. to enforce invariants on constructed data (as I'm doing here), or to avoid the possibility of pattern-matching on data. If you a programmer hides a constructor but exports its accessors, then generally I'd assert it's because of the former reason, so it would be sensible to prevent record update, as you could always write your own updates, if you so desire.

Of course, pointing out this flaw in light of the other problems with the Haskell record system is like complaining about the in-flight movie on a crashing plane, but still.


Rubbish and Blatherskite

I'm working on a programming language called Rubbish. It is called Rubbish because it is rubbish. Nobody should use it. I will not use it beyond its simple purpose, which is a personal thought experiment.

If this appeals do you, then learn Lisp instead. All dynamic languages are Lisp with a handful of macros and a modicum of discipline, anyway.

In practice, the Rubbish language will have a handful of built-in types—an integer type, a floating point type, a string or rope or something—but, for our purposes, let's assume it has only two types: variants and functions. These two are the only two currently accounted for in the semantics, which I will type up at some point.

Rubbish is weakly typed, so only values have an associated type; variables can take on any type. The primary data type mechanism is the variant, which are analogous to the tagged sums of strongly typed functional languages except without type guarantees. A variant has an obligatory name, which begins either with a capital letter (e.g. Foo), and an optional tuple of expressions, which are evaluated in sequence and stored along with the tag name. Variants are immutable, so there is no mechanism for changing the tag, the values stored with the tag, or the number of values. (If the values themselves are mutable, then they may of course be changed.)

The only meaningful operations on variants are equality and pattern matching; equality may be seen as a deficient form of pattern-matching. Pattern-matching is the only way of extracting information from a variant. For example:

rsh> my_point := Foo(One, Two)
rsh> my_point = Foo
rsh> my_point = Foo(Two, One)
rsh> my_point = Foo(One, Two)
rsh> match my_point { Foo -> Nullary
                      Foo(x) -> Unary(x)
                      Foo(x, y) -> Binary(y, x)
                      Foo(x, y, z) -> Ternary(z, y, x)
                      _ -> Other
Binary(Two, One)

We can use them as a cheap and easy way to define the Peano numbers.

rsh> add := fun(x, y):
       match x {
         Zero -> y
         Succ(x') -> add(x', Succ(y))

I plan on eventually implementing user-definable mix-fix operators, so you could eventually write the following:

rsh> _+_ := add
rsh> Succ(Succ(Zero)) + Succ(Succ(Succ(Zero)))

The other type is functions, which are introduced with the fun keyword and take a tuple of arguments. There is another syntax for functions of one argument, and it looks kind of like a hash literal. The definitions of isZero and isZero' below have identical functionality. (Notice that this means match x cases is basically syntactic sugar for cases(x).)

rsh> isZero  := fun(x) ->
       match x {
         Zero -> True
         _    -> False
rsh> isZero' := { Zero -> True
                 _    -> False
rsh> isZero'(Succ(Zero))
rsh> isZero'(True)

Combine this with the following sprinkling of syntactic sugar:

_._ := fun(f, x): f(x)

And we get the following:

rsh> makePoint := fun(x, y):
       { IsOrigin -> x = 0 and y = 0
         GetX -> x
         GetY -> y
         Add(p) -> makePoint(x + p.GetX, y + p.GetY)
         ToString -> format("Point({0}, {1})", x, y)
         method -> raise MethodNotFound(method)
rsh> a := makePoint(2, 3)
rsh> b := makePoint(4, 5)
rsh> a(IsOrigin)
rsh> a.IsOrigin
rsh> a.ToString
"Point(2, 3)"
rsh> a.Add(b).ToString
"Point(6, 8)"
rsh> a.Add(makePoint(-2,-3)).IsOrigin

While variants can encode data, functions like these can easily encode codata like streams.

rsh> ones := { Head -> 1
               Tail -> nats
rsh> streamMap := fun(f, stream): { Head -> f(stream.Head)
                                    Tail -> streamMap(f, stream.Tail)
rsh> nats := { Head -> 1
               Tail -> streamMap(fun(x): x + 1, nats)
rsh> take := fun(num, stream):
       match num {
         0 -> Nl
         _ -> Cs(stream.Head, take(num - 1, stream.Tail))
rsh> take(5, ones)
Cs(1, Cs(1, Cs(1, Cs(1, Cs(1, Nl)))))
rsh> take(5, nats)
Cs(1, Cs(1, Cs(3, Cs(4, Cs(5, Nl)))))

Finally, because of mixfix operator definition, we are free to do some pretty horrific things. I am omitting precedence declaration here because I haven't decided on a syntax for it yet.

rsh> _->_ := _._
rsh> nats->Tail->Head
rsh> _<-_ := fun(x, f): f(x)
rsh> Head<-Tail<-nats

That's some Rubbish for you. There's one major feature I haven't covered yet, and that's delimited continuations. As I said before, if any of this appeals to you, then you are a horrible human being, but I would recommend a healthy dose of Scheme, ML, or Haskell.


Matzo Prototype

I've had an idea kicking around in the back of my head for a while, and I finally got around to making a working prototype (which I will probably end up deciding is the final version, at some point.) It has to do with generating random words based on grammar rules, with an emphasis on making certain things more likely and generating 'average' words. It came out of early D&D playing, where I wanted to have a consistent set of phonologies for every fake language that D&D had.

It is called Matzo, because I asked a friend for a random name and that's what she said. She also suggested, among other things, Yoko, but I decided against that.

Here's the basic idea: the following is verbatim a Matzo source file, which I have saved as aquan.mtz for testing:

word := syllable . syllable . (6 @ (syllable));
use word;
syllable := 2: vowel
          |    vowel . "'"
          | 4: consonant . vowel
          |    consonant . vowel . "'";
consonant ::= p t k h wh l m n ng r w;
vowel ::= a i u e o;

These statements can be in any order, and whitespace isn't significant except as a separator for certain tokens, so you can format/arrange the lines however you want. There are three kinds of statements:

  • use statements tell you which rule is going to start. If you omit one, it won't run. If you have two, it picks the first one.
  • Normal assignments (represented by :=) take an expression that contains a mixture of disjunction (a | b), concatenation (a . b), weighting (5: b) and repetition (5@(b)) and which can be parenthesized. Expressions also include both literals, which are surrounded by quotation marks, and identifiers, which refer to other rules.
  • Literal assignments (represented by ::=) which differ in that they are assumed to have a space-separated list of literals, possibly with weighting. This was for the common case where you want one rule to contain a simple disjunction of literals, such as consonant and vowel above.

A lot of these things are shorthand for other things—really, all you need is concatenation, disjunction, and literals, and you can do most everything. The syntax 5: foo is shorthand for foo foo foo foo foo, so it is used to make an option more prevalent in a disjunction. For example,

vowel ::= i u;

chooses i and u about as often, whereas

vowel ::= 9:i u;

will choose i nine times out of ten.

The syntax 5@(foo) is another shorthand, somewhat less useful. The statement

word := 3@(syllable);

is equivalent to

word := syllable | syllable . syllable | syllable . syllable . syllable;

and is used in many various circumstances in my grammars, although it's less useful in other circumstances.

There are still problems with my implementation, which I am going to fix and put up on Github, but the grammar shown above (and others) work correctly. This is an idea I've had for ages; I can't believe it's taken me so long to sit down and write it, especially as it took no time at all. (I did the whole thing while an autograder was running for something I was grading.)

Here's an example of running the above file:

[getty@arjuna matzo]$ ./matzo aquan.mtz
[getty@arjuna matzo]$ ./matzo aquan.mtz
[getty@arjuna matzo]$ ./matzo aquan.mtz
[getty@arjuna matzo]$ ./matzo aquan.mtz

It's not necessarily limited to random words, but it's lacking a lot of utility that would make it sufficient for other purposes, which I will eventually add. Still, as an example of what it could also be used for:

gender ::= man woman;
hair-color ::= black brown red blonde pink;
build ::= fat fit skinny;
job := "doctor" | "lawyer" | "janitor" | "systems analyst";
description := "This " . gender . " is a " . job . " with "
             . hair-color . " hair and a " . build . " build.";
use description;

Running this yields:

[getty@arjuna matzo]$ ./matzo description.mtz
This man is a systems analyst with brown hair and a skinny build.
[getty@arjuna matzo]$ ./matzo description.mtz
This woman is a janitor with blonde hair and a fit build.
[getty@arjuna matzo]$ ./matzo description.mtz
This woman is a lawyer with blonde hair and a fat build.
[getty@arjuna matzo]$ ./matzo description.mtz
This woman is a janitor with red hair and a fat build.

In the future: variables (e.g. reusing a single generated value) and predicates (e.g. pronoun(man) := he, pronoun(woman) := she for use in complicated expressions.) Still, I'm proud of how far it is with so little work.