This page briefly presents a select number of features that are either entirely new or have a unique flavor in Arcadia.
From what we've seen so far, one might assume that Arcadia's idiom for I/O is just like C++'s without the extra << and >> operators, but in reality it's closer to what's offered by Boost.Spirit. What this means is that when you specify an expression to the right of some_stream >>, it is interpreted as a parsing expression grammar (PEG) and a recursive descent parser is automatically generated to parse it.
Before getting into more advanced examples, let's take a look at something simple to make sure we're on the same page:
1 auto (mut x, mut y) = (0.0, 0.0);
2
3 console << "Enter a point of the form (x,y): ";
4 console >> '(' x ',' y ')';
5 console << "Your input was x = " x " and y = " y ".\n";
console bundles the standard streams for input, output, error reporting and logging that are normally tied to a terminal. << and >> specify that we want to do output and input respectively. No additional operators are needed because Arcadia supports juxtaposition. Everything after >> in line 4 is interpreted as the right-hand side of a rule with the initial non-terminal of an implicit grammar on the left. The character literals are interpreted as terminals, while the mutable variables are interpreted as non-terminals. The rules producing the latter are searched for in the variables' types.
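Since Arcadia's compiler doesn't exist yet, here is a hand-written Python sketch of roughly what a generated recursive descent parser for the rule '(' x ',' y ')' could look like. The function name and the float syntax accepted are my own choices for illustration, not anything the language specifies.

```python
import re

def parse_point(text):
    """Hand-written stand-in for the parser Arcadia would generate
    for '(' x ',' y ')', where x and y parse as floating-point numbers."""
    pos = 0

    def expect(literal):
        # Terminal: the literal character must appear at the current position.
        nonlocal pos
        if not text.startswith(literal, pos):
            raise SyntaxError(f"expected {literal!r} at position {pos}")
        pos += len(literal)

    def parse_number():
        # Non-terminal: found by looking at the variable's type (here, float).
        nonlocal pos
        m = re.match(r"[+-]?\d+(\.\d+)?", text[pos:])
        if not m:
            raise SyntaxError(f"expected a number at position {pos}")
        pos += m.end()
        return float(m.group())

    expect("(")
    x = parse_number()
    expect(",")
    y = parse_number()
    expect(")")
    return x, y
```

For example, parse_point("(1.5,2)") yields (1.5, 2.0).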
Since PEGs, like EBNF, support the repetition operator we can get a little fancier without involving more rules. Below a%b is a shorthand for a(ba)*:
1 auto mut numbers : Vector(int) = [];
2
3 console >> ws* numbers[$0] % (ws* ',' ws*);
4 sort(numbers);
5 console << numbers[$0] % ", ";
The above will read a non-empty list of comma-separated integers, sort it, and output it again with only one space after each comma. ws stands for whitespace. As we'll see later, the parser won't mistake * for multiplication because ws doesn't have a type that supports multiplication.
You can think of numbers[$0] as the nth element of the vector. Because it's a vector of ints, the parser will search in the latter class for a way to parse them. We can't just use e.g. numbers[0] because that would result in all the numbers being read into the same location. At this point it's not clear whether use of $0 will generalize well with other containers or whether something like where element ∈ container: expression will be needed.
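To make the desugaring concrete, here is a Python sketch of what ws* numbers[$0] % (ws* ',' ws*) expands to by hand: one integer followed by zero or more (separator, integer) pairs, i.e. a(ba)*, with PEG-style backtracking when the separator fails to match. This is my own illustration, not generated code.

```python
import re

def parse_int_list(text):
    """Hand-expansion of ws* int % (ws* ',' ws*): a(ba)*."""
    pos = 0

    def skip_ws():
        nonlocal pos
        while pos < len(text) and text[pos].isspace():
            pos += 1

    def parse_int():
        nonlocal pos
        m = re.match(r"[+-]?\d+", text[pos:])
        if not m:
            raise SyntaxError(f"expected an integer at position {pos}")
        pos += m.end()
        return int(m.group())

    skip_ws()
    numbers = [parse_int()]          # the leading 'a'
    while True:                      # zero or more '(ba)'
        saved = pos
        skip_ws()
        if pos < len(text) and text[pos] == ',':
            pos += 1
            skip_ws()
            numbers.append(parse_int())
        else:
            pos = saved              # PEG-style: back off on failure
            break
    return numbers
```

Sorting and re-joining the result with ", " then reproduces the output half of the example.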
While adding rules explicitly is definitely an option, adding them implicitly by defining the recursive data types that will represent the AST in memory involves way less redundancy! The following will read a limited subset of Lisp lists that only contain numbers and output them again with redundant whitespace removed.
1 +structure LispList
2 auto elements : Vector(double ∪ int ∪ LispList) = []; # private by default
3
4 syntax '(' ws* (elements[$0] % ws+)? ws* ')'; # for both input and output
5 −structure LispList
6
7 auto mut list : LispList; # implicitly-defined default constructor
8 # initializes to an empty list.
9
10 console >> ws* list ws*;
11 console << list;
Lines 1–5 define a new type/class/structure with elements as its only field. ∪, when used with types, creates discriminated unions. When those are read or written, each type is tried in turn until one matches. As expected, + means "one or more" and ? means "optional".
Just as with << and >>, the expression in the syntax declaration in line 4 is interpreted as the right-hand side of a rule. Only this time the implicit non-terminal on the left-hand side is one unique to LispList. Although it will be possible to provide different expressions for reading and writing, or even custom recognizers and generators, here a single expression suffices for both. When doing output, ws* will be ignored and ws+ will result in a single space character, hence the removal of extra whitespace. In theory it should be possible to figure out how many characters are missing to reach a specified width and distribute them among ws* generators, but I haven't gotten to designing that part yet! :)
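A rough Python rendering of both directions may help: a recursive descent parser for '(' ws* (element % ws+)? ws* ')' and a writer that emits ws+ as a single space. The union is approximated by trying to read a nested list first and otherwise picking int or float from the token's shape; names and details are my own, not Arcadia's.

```python
import re

def parse_lisp_list(text, pos=0):
    """Parse '(' ws* (element % ws+)? ws* ')' where an element is an
    int, a float, or a nested list.  Returns (elements, next_position)."""
    def skip_ws(p):
        while p < len(text) and text[p].isspace():
            p += 1
        return p

    pos = skip_ws(pos)
    if pos >= len(text) or text[pos] != '(':
        raise SyntaxError(f"expected '(' at position {pos}")
    pos += 1
    elements = []
    while True:
        pos = skip_ws(pos)
        if pos < len(text) and text[pos] == ')':
            return elements, pos + 1
        if pos < len(text) and text[pos] == '(':
            child, pos = parse_lisp_list(text, pos)   # nested LispList
            elements.append(child)
            continue
        m = re.match(r"[+-]?\d+(\.\d+)?", text[pos:])
        if not m:
            raise SyntaxError(f"unexpected input at position {pos}")
        # Discriminated union: decide which member of the union matched.
        token = m.group()
        elements.append(float(token) if '.' in token else int(token))
        pos += m.end()

def write_lisp_list(elements):
    """Output: ws* vanishes, ws+ becomes exactly one space."""
    return '(' + ' '.join(
        write_lisp_list(e) if isinstance(e, list) else str(e)
        for e in elements) + ')'
```

Reading "( 1  ( 2.5  3 )  4 )" and writing it back produces "(1 (2.5 3) 4)", with the redundant whitespace gone.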
Many functions manipulating arrays impose implicit constraints on the sizes of those arrays. Although one could easily express such constraints in a precondition block, that approach has a few limitations. Consequently, Arcadia's type system will support expressing linear constraints directly in the function's type using size variables. This will be in addition to the arbitrary expressions we saw earlier for specifying the size of a returned array. Let's look at some examples:
1 +function mergesort :
2 (a : int[n]) ⟶ external ref int[n]
3 =code
4 +if n ≤ 1:
5 return a;
6 =else
7 auto m = n/2+1;
8 return merge(mergesort(a[0..m]),
9 mergesort(a[m+1..$]));
10 −if
11 −function mergesort
Here we have an erroneous implementation of merge sort that neglects an element near the middle. Perhaps a novice programmer forgot that .. doesn't include its right operand in the range and added 1 to the start of the second range to compensate.
If we were to compile this, we'd get an error in line 8 complaining that we are trying to assign an array of n−1 elements to one of n. n is a size variable. The respective type of merge used when checking mergesort is:
1 merge : (a : int[n], b : int[m]) ⟶ external ref int[n+m]
By the way, $ by itself inside an array indexing operator refers to the size of the array, like in D. When the indexing operator is used with a range of indices, it returns a slice referring to the corresponding part of the array.
On the other hand, if we were to remove n from the type (and replace it with |a| in line 7), it would have been inferred from line 5... and we'd still get the error! Moreover, a single hidden parameter is used for both a and retval since they are known to have the same size.
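For reference, a corrected version is easy to sketch in Python, whose half-open slices behave like Arcadia's .. ranges: a[:m] has m elements and a[m:] has n−m, so the sizes sum back to n and the merge type int[n] × int[m] ⟶ int[n+m] checks out. This is my own illustration of the intended algorithm, not Arcadia code.

```python
def merge(a, b):
    """Merge two sorted lists; len(result) == len(a) + len(b)."""
    out, i, j = [], 0, 0
    while i < len(a) and j < len(b):
        if a[i] <= b[j]:
            out.append(a[i]); i += 1
        else:
            out.append(b[j]); j += 1
    return out + a[i:] + b[j:]

def mergesort(a):
    """Correct split: a[:m] and a[m:] together cover all n elements,
    so no element near the middle is dropped."""
    if len(a) <= 1:
        return a
    m = len(a) // 2
    return merge(mergesort(a[:m]), mergesort(a[m:]))
```

A size checker would verify here that m + (n − m) = n, which is exactly the constraint the buggy version violates.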
You can find more examples, as well as details on the systems of equations that are created and how their solutions are interpreted here.
Well, not exactly, but they should come close enough! We've already seen contracts in factorial and de Casteljau. The idea is that once you've specified the preconditions, postconditions and invariants of your functions and classes, all you need is some way to automatically generate large amounts of input to exercise them.
Arcadia will come with a standard library of generators for creating random instances of most standard types, plus reflection-based mechanisms for creating generators for new types automatically or writing your own from scratch. Additional features like tweaking the probability with which each instance is generated, generating instances of specific size or generating "simpler" instances from existing ones will also be supported.
Testing stateful code will be supported using techniques inspired by this, although I may not go as far as model-based testing.
An accompanying tool will be responsible for generating and using a configurable number of test cases on the selected software components and reporting on the results.
Unit and property-based testing will supplement each other nicely. While unit tests focus on the concrete output of a handful of inputs where the latter is known in advance, property-based testing will lend itself to testing abstract properties over a larger set of inputs.
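The core loop is simple enough to sketch in Python: a generator produces random instances of the input type, and each instance is run through the function while its postconditions are asserted. The property names and counts below are illustrative, not part of any planned API.

```python
import random

def is_sorted(xs):
    return all(xs[i] <= xs[i + 1] for i in range(len(xs) - 1))

def random_int_list(max_len=20, lo=-100, hi=100):
    """A generator for random instances of 'list of int'."""
    return [random.randint(lo, hi) for _ in range(random.randint(0, max_len))]

def check_sort_properties(sort, trials=200):
    """Exercise a sort function's postconditions on many random inputs."""
    random.seed(0)                      # reproducible runs
    for _ in range(trials):
        xs = random_int_list()
        ys = sort(xs)
        assert is_sorted(ys), f"output not sorted for {xs}"
        assert sorted(ys) == sorted(xs), f"elements changed for {xs}"
    return trials
```

Running check_sort_properties(sorted) exercises the properties "output is ordered" and "output is a permutation of the input" on 200 random lists, without a single hand-picked test case.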
Basically what automated benchmarks will do is apply the methodology used for physics experiments to measure program performance. Although this is not new, I'm not aware of any language or tool that does it automatically.
The idea is relatively simple: suppose you know the computational complexity of your function, either because you proved it or because it's a published algorithm and someone else did. You want to know the constant factors hidden in the Big O notation, so that you know how your implementation fares compared to others, or so that you can extrapolate how it will perform for some input you can't test. Or maybe you just want something more reliable than a measurement at a single data point.
So you provide a set of inputs for which to sample the performance of your function (preferably spanning its entire input space) and you write the formula corresponding to that complexity but without removing the constant factors — the model.
If needed, you supply one more function mapping inputs to their size (e.g. for a sort function the arrays may be mapped to their length).
When you invoke the benchmarks, your function will be called multiple times for each input and the running time (or some other property like memory footprint) logged. Once all the values are collected, a linear regression algorithm will calculate the most probable values for your model's parameters and the corresponding confidence intervals. A plot will be presented with both the collected data and the fitted model so that you can compare theoretical and actual performance, spot outliers, etc.
Theoretically we can use symbolic regression so that the model is not specified at all, but re-implementing Eureqa is outside the scope of this project. :) Besides, linear models are enough for most practical complexity classes (e.g. O(n⋅lg n) still corresponds to a linear model since the unknown is not in the base of the logarithm) and linearization is also an option.
This has multiple applications from comparing different implementations to finding the right threshold to switch between sequential and parallel execution.
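As a concrete illustration of why O(n·lg n) is still a linear model, here is a minimal Python least-squares fit of t = c1·n·lg n + c0. The model is linear in the unknowns c1 and c0 (the nonlinearity is only in the data transformation), so two normal equations suffice; a real tool would also report confidence intervals, which this sketch omits.

```python
import math

def fit_nlogn(sizes, times):
    """Least-squares fit of t = c1 * (n * lg n) + c0.
    Linear in c1, c0: transform each size to x = n*lg(n), then do
    ordinary simple linear regression of t against x."""
    xs = [n * math.log2(n) for n in sizes]
    k = len(xs)
    sx, sy = sum(xs), sum(times)
    sxx = sum(x * x for x in xs)
    sxy = sum(x * y for x, y in zip(xs, times))
    c1 = (k * sxy - sx * sy) / (k * sxx - sx * sx)
    c0 = (sy - c1 * sx) / k
    return c1, c0
```

If the measurements happened to follow t = 2·n·lg n + 5 exactly, the fit would recover c1 ≈ 2 and c0 ≈ 5; with noisy timings it returns the most probable values instead.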
To give you a rough idea about what such a plot could look like, I'm including a simplified version of one I created manually back in 2014.
Here 3 different algorithms for removing duplicates from a vector of integers without altering the order of those integers are compared. The all-pairs method tests all pairs for equality, the hash-table method puts them in an unordered_set container and the indirect-sort method creates a vector of indices to the original and sorts it. All algorithms are implemented in C++.
We can see as expected that testing all pairs is faster than sorting for small arrays, but gets surpassed by indirect-sort when size reaches 147 elements. Perhaps less obvious is the fact that sorting is faster than putting in a hash table for arrays smaller than around 338844 elements. The latter intersection didn't fit into the graph.
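The original experiment was in C++, but the three strategies are easy to sketch in Python to make the comparison concrete. These are my own reconstructions from the descriptions above; the 2014 implementations may have differed in detail.

```python
def dedup_all_pairs(xs):
    """O(n^2): keep an element only if no equal element precedes it."""
    out = []
    for x in xs:
        if x not in out:         # linear scan over the kept elements
            out.append(x)
    return out

def dedup_hash(xs):
    """O(n) expected: a set plays the role of C++'s unordered_set."""
    seen, out = set(), []
    for x in xs:
        if x not in seen:
            seen.add(x)
            out.append(x)
    return out

def dedup_indirect_sort(xs):
    """O(n lg n): sort indices by value, keep the earliest index of
    each run of equal values, then restore the original order."""
    idx = sorted(range(len(xs)), key=lambda i: (xs[i], i))
    keep = [i for k, i in enumerate(idx)
            if k == 0 or xs[idx[k - 1]] != xs[i]]
    return [xs[i] for i in sorted(keep)]
```

All three return the same result, e.g. [3, 1, 2] for [3, 1, 3, 2, 1]; only their constant factors and growth rates differ, which is exactly what the fitted models in the plot expose.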
No design can cover everything, so it's important for a language to provide tools for creating great libraries. Although the ingenious library builder can work miracles towards approaching the target domain with just the basest form of operator overloading and/or definition, more powerful mechanisms may mean fewer projects like Qt and ODB going the pre-processor / compiler way.
Arcadia is being designed from the ground up to provide extensible syntax and semantics. In fact, everything that you've seen so far will be the initial language matching the symbol table contents at compiler startup. From there on, loading/unloading modules and/or defining/undefining symbols of operator type can modify the language.
And yes, you've read correctly: operator status will be a property of the function type, not the function name as is the norm. Operators will also be able to introduce scopes where more operators are defined/undefined.
New operators can have any fixity (prefix, postfix, infix, distfix) and be overloaded by either their fixity or semantics. Operator precedence and associativity as well as operand types will help resolve ambiguities. Juxtaposition will be considered an operator, while chaining will be considered an 'associativity'.
Bringing all the above together should enable all of the following representative applications and many more:
(u⋅v)v/‖v‖² instead of v.mult(u.innerProd(v)).div(v.innerProd(v)) or (/ (* (* u v) v) (* v v)).
v .+ u
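For reference, the expression (u⋅v)v/‖v‖² is the projection of u onto v, and the point of the operator machinery is to let it be written as in mathematics. A plain Python version of the underlying computation, with names of my choosing:

```python
def dot(a, b):
    """Inner product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))

def project(u, v):
    """(u.v) v / ||v||^2: the projection of u onto v."""
    scale = dot(u, v) / dot(v, v)
    return [scale * x for x in v]
```

For example, projecting [1, 0] onto [1, 1] gives [0.5, 0.5]; custom infix operators would let this read as the formula rather than as nested method calls.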
1 # Arcadia code here
2 import compilers.c.*;
3
4 // C code here
5 // Arcadia code won't compile
6
7 #pragma arcadia
8 # Arcadia code works again!
1 import server.pages.*;
2
3 <html>
4 <head>
5 <title>Current Time</title>
6 </head>
7
8 <body>
9 <h2 style="text-align:center">
10 <%= system.time.now() %>
11 </h2>
12 </body>
13 </html>
I'll close with a disclaimer: although I've implemented a few of the above features and have a strategy for implementing the rest, it's possible that I haven't foreseen everything. Making sure everything works smoothly together in a single parser will be part of my job, should this project be funded.