This page briefly presents a select number of features that are either entirely new or have a unique flavor in Arcadia.
From what we've seen so far, one might assume that Arcadia's idiom for I/O is just like C++'s without the extra << and >> operators, but in reality it's closer to what's offered by Boost.Spirit. What this means is that when you specify an expression to the right of some_stream >>, it is interpreted as a parsing expression grammar (PEG) and a recursive descent parser is automatically generated to parse it.
Before getting into more advanced examples, let's take a look at something simple to make sure we're on the same page:
1 auto (mut x, mut y) = (0.0, 0.0);
2
3 console << "Enter a point of the form (x,y): ";
4 console >> '(' x ',' y ')';
5 console << "Your input was x = " x " and y = " y ".\n";
console bundles the standard streams for input, output, error reporting and logging that are normally tied to a terminal. << and >> specify that we want to do output and input respectively. No additional operators are needed because Arcadia supports juxtaposition. Everything after >> in line 4 is interpreted as the right-hand side of a rule with the initial non-terminal of an implicit grammar on the left. The character literals are interpreted as terminals, while the mutable variables are interpreted as non-terminals. The rules producing the latter are searched for in the variables' types.
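Since Arcadia's compiler doesn't exist yet, here is a hand-written Python sketch of roughly what a generated recursive descent parser for the rule '(' x ',' y ')' could look like. The function name and the float syntax accepted are my own choices for illustration, not anything the language specifies.

```python
import re

def parse_point(text):
    """Hand-written stand-in for the parser Arcadia would generate
    for '(' x ',' y ')', where x and y parse as floating-point numbers."""
    pos = 0

    def expect(literal):
        # Terminal: the literal character must appear at the current position.
        nonlocal pos
        if not text.startswith(literal, pos):
            raise SyntaxError(f"expected {literal!r} at position {pos}")
        pos += len(literal)

    def parse_number():
        # Non-terminal: found by looking at the variable's type (here, float).
        nonlocal pos
        m = re.match(r"[+-]?\d+(\.\d+)?", text[pos:])
        if not m:
            raise SyntaxError(f"expected a number at position {pos}")
        pos += m.end()
        return float(m.group())

    expect("(")
    x = parse_number()
    expect(",")
    y = parse_number()
    expect(")")
    return x, y
```

For example, parse_point("(1.5,2)") yields (1.5, 2.0).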
Since PEGs, like EBNF, support the repetition operator we can get a little fancier without involving more rules. Below a%b is a shorthand for a(ba)*:
1 auto mut numbers : Vector(int) = [];
2
3 console >> ws* numbers[$0] % (ws* ',' ws*);
4 sort(numbers);
5 console << numbers[$0] % ", ";
The above will read a non-empty list of comma-separated integers, sort it, and output it again with only one space after each comma. ws stands for whitespace. As we'll see later, the parser won't mistake * for multiplication because ws doesn't have a type that supports multiplication.
You can think of numbers[$0] as the nth element of the vector. Because it's a vector of ints, the parser will search in the latter class for a way to parse them. We can't just use e.g. numbers[0] because that would result in all the numbers being read into the same location. At this point it's not clear whether use of $0 will generalize well with other containers or whether something like where element ∈ container: expression will be needed.
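To make the desugaring concrete, here is a Python sketch of what ws* numbers[$0] % (ws* ',' ws*) expands to by hand: one integer followed by zero or more (separator, integer) pairs, i.e. a(ba)*, with PEG-style backtracking when the separator fails to match. This is my own illustration, not generated code.

```python
import re

def parse_int_list(text):
    """Hand-expansion of ws* int % (ws* ',' ws*): a(ba)*."""
    pos = 0

    def skip_ws():
        nonlocal pos
        while pos < len(text) and text[pos].isspace():
            pos += 1

    def parse_int():
        nonlocal pos
        m = re.match(r"[+-]?\d+", text[pos:])
        if not m:
            raise SyntaxError(f"expected an integer at position {pos}")
        pos += m.end()
        return int(m.group())

    skip_ws()
    numbers = [parse_int()]          # the leading 'a'
    while True:                      # zero or more '(ba)'
        saved = pos
        skip_ws()
        if pos < len(text) and text[pos] == ',':
            pos += 1
            skip_ws()
            numbers.append(parse_int())
        else:
            pos = saved              # PEG-style: back off on failure
            break
    return numbers
```

Sorting and re-joining the result with ", " then reproduces the output half of the example.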
While adding rules explicitly is definitely an option, adding them implicitly by defining the recursive data types that will represent the AST in memory involves way less redundancy! The following will read a limited subset of Lisp lists that only contain numbers and output them again with redundant whitespace removed.
1 +structure LispList
2 auto elements : Vector(double ∪ int ∪ LispList) = []; # private by default
3
4 syntax '(' ws* (elements[$0] % ws+)? ws* ')'; # for both input and output
5 −structure LispList
6
7 auto mut list : LispList; # implicitly-defined default constructor
8 # initializes to an empty list.
9
10 console >> ws* list ws*;
11 console << list;
Lines 1–5 define a new type/class/structure with elements as its only field. ∪, when used with types, creates discriminated unions. When those are read or written, each type is tried in turn until one matches. As expected, + means "one or more" and ? means "optional".
Just as with << and >>, the expression in the syntax declaration in line 4 is interpreted as the right-hand side of a rule. Only this time the implicit non-terminal on the left-hand side is one unique to LispList. Although it will be possible to provide different expressions for reading and writing, or even custom recognizers and generators, here a single expression suffices for both. When doing output, ws* will be ignored and ws+ will result in a single space character, hence the removal of extra whitespace. In theory it should be possible to figure out how many characters are missing to reach a specified width and distribute them among ws* generators, but I haven't gotten to designing that part yet! :)
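A rough Python rendering of both directions may help: a recursive descent parser for '(' ws* (element % ws+)? ws* ')' and a writer that emits ws+ as a single space. The union is approximated by trying to read a nested list first and otherwise picking int or float from the token's shape; names and details are my own, not Arcadia's.

```python
import re

def parse_lisp_list(text, pos=0):
    """Parse '(' ws* (element % ws+)? ws* ')' where an element is an
    int, a float, or a nested list.  Returns (elements, next_position)."""
    def skip_ws(p):
        while p < len(text) and text[p].isspace():
            p += 1
        return p

    pos = skip_ws(pos)
    if pos >= len(text) or text[pos] != '(':
        raise SyntaxError(f"expected '(' at position {pos}")
    pos += 1
    elements = []
    while True:
        pos = skip_ws(pos)
        if pos < len(text) and text[pos] == ')':
            return elements, pos + 1
        if pos < len(text) and text[pos] == '(':
            child, pos = parse_lisp_list(text, pos)   # nested LispList
            elements.append(child)
            continue
        m = re.match(r"[+-]?\d+(\.\d+)?", text[pos:])
        if not m:
            raise SyntaxError(f"unexpected input at position {pos}")
        # Discriminated union: decide which member of the union matched.
        token = m.group()
        elements.append(float(token) if '.' in token else int(token))
        pos += m.end()

def write_lisp_list(elements):
    """Output: ws* vanishes, ws+ becomes exactly one space."""
    return '(' + ' '.join(
        write_lisp_list(e) if isinstance(e, list) else str(e)
        for e in elements) + ')'
```

Reading "( 1  ( 2.5  3 )  4 )" and writing it back produces "(1 (2.5 3) 4)", with the redundant whitespace gone.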
Many functions manipulating arrays impose implicit constraints on the sizes of those arrays. Although one could easily express such constraints in a precondition block, that approach has a few limitations. Consequently, Arcadia's type system will support expressing linear constraints directly in the function's type using size variables. This will be in addition to the arbitrary expressions we saw earlier for specifying the size of a returned array. Let's look at some examples:
1 +function mergesort :
2 (a : int[n]) ⟶ external ref int[n]
3 =code
4 +if n ≤ 1:
5 return a;
6 =else
7 auto m = n/2+1;
8 return merge(mergesort(a[0..m]),
9 mergesort(a[m+1..$]));
10 −if
11 −function mergesort
Here we have an erroneous implementation of merge sort that neglects an element near the middle. Perhaps a novice programmer forgot that .. doesn't include its right operand in the range and added 1 to the start of the second range to compensate.
If we were to compile this, we'd get an error in line 8 complaining that we are trying to assign an array of n−1 elements to one of n. n is a size variable. The respective type of merge used when checking mergesort is:
1 merge : (a : int[n], b : int[m]) ⟶ external ref int[n+m]
By the way, $ by itself inside an array indexing operator refers to the size of the array, like in D. When the indexing operator is used with a range of indices, it returns a slice referring to the corresponding part of the array.
On the other hand, if we were to remove n from the type (and replace it with |a| in line 7), it would have been inferred from line 5... and we'd still get the error! Moreover, a single hidden parameter is used for both a and retval since they are known to have the same size.
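For reference, a corrected version is easy to sketch in Python, whose half-open slices behave like Arcadia's .. ranges: a[:m] has m elements and a[m:] has n−m, so the sizes sum back to n and the merge type int[n] × int[m] ⟶ int[n+m] checks out. This is my own illustration of the intended algorithm, not Arcadia code.

```python
def merge(a, b):
    """Merge two sorted lists; len(result) == len(a) + len(b)."""
    out, i, j = [], 0, 0
    while i < len(a) and j < len(b):
        if a[i] <= b[j]:
            out.append(a[i]); i += 1
        else:
            out.append(b[j]); j += 1
    return out + a[i:] + b[j:]

def mergesort(a):
    """Correct split: a[:m] and a[m:] together cover all n elements,
    so no element near the middle is dropped."""
    if len(a) <= 1:
        return a
    m = len(a) // 2
    return merge(mergesort(a[:m]), mergesort(a[m:]))
```

A size checker would verify here that m + (n − m) = n, which is exactly the constraint the buggy version violates.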
You can find more examples, as well as details on the systems of equations that are created and how their solutions are interpreted here.
Well, not exactly, but they should come close enough! We've already seen contracts in factorial and de Casteljau. The idea is that once you've specified the preconditions, postconditions and invariants of your functions and classes, all you need is some way to automatically generate large amounts of input to exercise them.
Arcadia will come with a standard library of generators for creating random instances of most standard types, plus reflection-based mechanisms for creating generators for new types automatically or writing your own from scratch. Additional features like tweaking the probability with which each instance is generated, generating instances of specific size or generating "simpler" instances from existing ones will also be supported.
Testing stateful code will be supported using techniques inspired by this, although I may not go as far as model-based testing.
An accompanying tool will be responsible for generating and using a configurable number of test cases on the selected software components and reporting on the results.
Unit and property-based testing will supplement each other nicely. While unit tests focus on the concrete output of a handful of inputs where the latter is known in advance, property-based testing will lend itself to testing abstract properties over a larger set of inputs.
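The core loop is simple enough to sketch in Python: a generator produces random instances of the input type, and each instance is run through the function while its postconditions are asserted. The property names and counts below are illustrative, not part of any planned API.

```python
import random

def is_sorted(xs):
    return all(xs[i] <= xs[i + 1] for i in range(len(xs) - 1))

def random_int_list(max_len=20, lo=-100, hi=100):
    """A generator for random instances of 'list of int'."""
    return [random.randint(lo, hi) for _ in range(random.randint(0, max_len))]

def check_sort_properties(sort, trials=200):
    """Exercise a sort function's postconditions on many random inputs."""
    random.seed(0)                      # reproducible runs
    for _ in range(trials):
        xs = random_int_list()
        ys = sort(xs)
        assert is_sorted(ys), f"output not sorted for {xs}"
        assert sorted(ys) == sorted(xs), f"elements changed for {xs}"
    return trials
```

Running check_sort_properties(sorted) exercises the properties "output is ordered" and "output is a permutation of the input" on 200 random lists, without a single hand-picked test case.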
Basically what automated benchmarks will do is apply the methodology used for physics experiments to measure program performance. Although this is not new, I'm not aware of any language or tool that does it automatically.
The idea is relatively simple: suppose you know the computational complexity of your function, either because you proved it or because it's a published algorithm and someone else did. You want to know the constant factors hidden in the Big O notation, so that you know how your implementation fares compared to others, or so that you can extrapolate how it will perform for some input you can't test. Or maybe you just want something more reliable than a measurement at a single data point.
So you provide a set of inputs for which to sample the performance of your function (preferably spanning its entire input space) and you write the formula corresponding to that complexity but without removing the constant factors — the model.
If needed, you supply one more function mapping inputs to their size (e.g. for a sort function the arrays may be mapped to their length).
When you invoke the benchmarks, your function will be called multiple times for each input and the running time (or some other property like memory footprint) logged. Once all the values are collected, a linear regression algorithm will calculate the most probable values for your model's parameters and the corresponding confidence intervals. A plot will be presented with both the collected data and the fitted model so that you can compare theoretical and actual performance, spot outliers, etc.
Theoretically we can use symbolic regression so that the model is not specified at all, but re-implementing Eureqa is outside the scope of this project. :) Besides, linear models are enough for most practical complexity classes (e.g. O(n⋅lg n) still corresponds to a linear model since the unknown is not in the base of the logarithm) and linearization is also an option.
This has multiple applications from comparing different implementations to finding the right threshold to switch between sequential and parallel execution.
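As a concrete illustration of why O(n·lg n) is still a linear model, here is a minimal Python least-squares fit of t = c1·n·lg n + c0. The model is linear in the unknowns c1 and c0 (the nonlinearity is only in the data transformation), so two normal equations suffice; a real tool would also report confidence intervals, which this sketch omits.

```python
import math

def fit_nlogn(sizes, times):
    """Least-squares fit of t = c1 * (n * lg n) + c0.
    Linear in c1, c0: transform each size to x = n*lg(n), then do
    ordinary simple linear regression of t against x."""
    xs = [n * math.log2(n) for n in sizes]
    k = len(xs)
    sx, sy = sum(xs), sum(times)
    sxx = sum(x * x for x in xs)
    sxy = sum(x * y for x, y in zip(xs, times))
    c1 = (k * sxy - sx * sy) / (k * sxx - sx * sx)
    c0 = (sy - c1 * sx) / k
    return c1, c0
```

If the measurements happened to follow t = 2·n·lg n + 5 exactly, the fit would recover c1 ≈ 2 and c0 ≈ 5; with noisy timings it returns the most probable values instead.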
To give you a rough idea about what such a plot could look like, I'm including a simplified version of one I created manually back in 2014.
Here 3 different algorithms for removing duplicates from a vector of integers without altering the order of those integers are compared. The all-pairs method tests all pairs for equality, the hash-table method puts them in an unordered_set container and the indirect-sort method creates a vector of indices to the original and sorts it. All algorithms are implemented in C++.
We can see as expected that testing all pairs is faster than sorting for small arrays, but gets surpassed by indirect-sort when size reaches 147 elements. Perhaps less obvious is the fact that sorting is faster than putting in a hash table for arrays smaller than around 338844 elements. The latter intersection didn't fit into the graph.
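The original experiment was in C++, but the three strategies are easy to sketch in Python to make the comparison concrete. These are my own reconstructions from the descriptions above; the 2014 implementations may have differed in detail.

```python
def dedup_all_pairs(xs):
    """O(n^2): keep an element only if no equal element precedes it."""
    out = []
    for x in xs:
        if x not in out:         # linear scan over the kept elements
            out.append(x)
    return out

def dedup_hash(xs):
    """O(n) expected: a set plays the role of C++'s unordered_set."""
    seen, out = set(), []
    for x in xs:
        if x not in seen:
            seen.add(x)
            out.append(x)
    return out

def dedup_indirect_sort(xs):
    """O(n lg n): sort indices by value, keep the earliest index of
    each run of equal values, then restore the original order."""
    idx = sorted(range(len(xs)), key=lambda i: (xs[i], i))
    keep = [i for k, i in enumerate(idx)
            if k == 0 or xs[idx[k - 1]] != xs[i]]
    return [xs[i] for i in sorted(keep)]
```

All three return the same result, e.g. [3, 1, 2] for [3, 1, 3, 2, 1]; only their constant factors and growth rates differ, which is exactly what the fitted models in the plot expose.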
No design can cover everything, so it's important for a language to provide tools for creating great libraries. Although the ingenious library builder can work miracles towards approaching the target domain with just the basest form of operator overloading and/or definition, more powerful mechanisms may mean fewer projects like Qt and ODB going the pre-processor / compiler way.
Arcadia is being designed from the ground up to provide extensible syntax and semantics. In fact, everything that you've seen so far will be the initial language matching the symbol table contents at compiler startup. From there on, loading/unloading modules and/or defining/undefining symbols of operator type can modify the language.
And yes, you've read correctly: operator status will be a property of the function type, not the function name as is the norm. Operators will also be able to introduce scopes where more operators are defined/undefined.
New operators can have any fixity (prefix, postfix, infix, distfix) and be overloaded by either their fixity or semantics. Operator precedence and associativity as well as operand types will help resolve ambiguities. Juxtaposition will be considered an operator, while chaining will be considered an 'associativity'.
Bringing all the above together should enable all of the following representative applications and many more:
(u⋅v)v/‖v‖² instead of v.mult(u.innerProd(v)).div(v.innerProd(v)) or (/ (* (* u v) v) (* v v)).
v .+ u
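For reference, the expression (u⋅v)v/‖v‖² is the projection of u onto v, and the point of the operator machinery is to let it be written as in mathematics. A plain Python version of the underlying computation, with names of my choosing:

```python
def dot(a, b):
    """Inner product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))

def project(u, v):
    """(u.v) v / ||v||^2: the projection of u onto v."""
    scale = dot(u, v) / dot(v, v)
    return [scale * x for x in v]
```

For example, projecting [1, 0] onto [1, 1] gives [0.5, 0.5]; custom infix operators would let this read as the formula rather than as nested method calls.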
1 # Arcadia code here
2 import compilers.c.*;
3
4 // C code here
5 // Arcadia code won't compile
6
7 #pragma arcadia
8 # Arcadia code works again!
1 import server.pages.*;
2
3 <html>
4 <head>
5 <title>Current Time</title>
6 </head>
7
8 <body>
9 <h2 style="text-align:center">
10 <%= system.time.now() %>
11 </h2>
12 </body>
13 </html>
I'll close with a disclaimer: although I've implemented a few of the above features and have a strategy for implementing the rest, it's possible that I haven't foreseen everything. Making sure everything works smoothly together in a single parser will be part of my job, should this project be funded.