Runtime-sized Types

Runtime-sized Arrays
- They are parameterized by expressions that are evaluated at runtime to compute the array's size.
- There will be ways to specify constraints on the sizes of such arrays, that should be met when arguments are bound to parameters.
- Under some circumstances those constraints will be solvable at compile-time, in others runtime checks will be emitted.
- Use cases
- - string operations (e.g. concatenation)
  - vector operations (e.g. addition)
  - spline operations (e.g. raise or lower degree, evaluate, etc.)
  - OS APIs (e.g. functions that need you to call them once to get the size of an array, allocate the array and call them again with the array)
- Implementation
- - The actual array sizes will be passed as extra hidden arguments.
  - Examples
  - - merge(T a[n], T b[k])⟶T[n+k]
    - }
    - In the above we have 3 different lengths, so 4 hidden arguments will be passed (the 4th is for the returned array's address).
    - op+(double a[n], double b[n])⟶doube[n]
    - }
    - Here we have one hidden argument for the base address of the resulting array and one hidden argument for the common length of the input and output arrays.
    - In reality, two functions are defined: the explicit op+ and an implicit size calculation function (a.size, b.size)⟶retval.size.
  - Alternatively, we could put the lengths in the first word of the array representation, like Java does.
  - - This might be wasteful if not all lengths are needed, but may be faster if they are used frequently.
  - For multidimensional arrays it is preferable to put all dimensions in the beginning than to have arrays of arrays, each with its own length like in Java.
- Size variables
- - There are a number of options when it comes to how they are defined:
  - - Don't use them: use |name_of_array| instead. For short array names it'll be almost as short.
    - Whenever an undeclared identifier appears in an array size expression, assume it names a size variable.
    - Prefix size variable names with a special character, so that they can be searched in their own namespace and they don't get captured when someone defines a normal variable of the same name in the scope. e.g. @n, @k, etc.
    - Require them to be explicitly defined, either as normal parameters or using a special syntax.
- Return Value
- - When arrays are returned by external reference (or by value too?) there are the following possibilities:
  - - An expression is given so that the caller can calculate the size. It must be side-effect free.
    - The size expression is omitted, but the size expression can be inferred. The inferred expression will be made part of the interface of the function. It can't depend on retval.size.
    - An expression is not given and can't be inferred from the function body. The function compiles, but can't be used unless the result is assigned to an object of known size: that size will be passed to the function. A warning about limited usability of the function can be issued.
    - An error is issued.
- Constraints
- - Linear constraints should be solvable automatically at compile-time and be sufficient for 80% of the applications including the ones mentioned above.
  - It may also be possible to combine one-way constraints with linear ones in the same functions.
  - In some cases even non-linear constraints are solvable for some of their variables.
  - Linear
  - - Goals
    - - Identify cases when not all sizes need to be passed as hidden variables, but some are identical or can be computed by the others.
      - e.g. We may decide that sizes that can be computed from a base size using an addition or subtraction need not be passed individually, while others that need a multiplication should.
      - Statically check for correctness instead of emitting runtime checks.
      - Allow some sizes to be left out and inferred from others.
      - Generate the function that will be used to calculate the size of the output as a function of the sizes of the inputs.
    - Notation
    - - @ for internal size variables produced by the system.
      - $ for explicit size variables defined by the programmer.
      - # for program values of integral type that can't be known at compile-time and are used as array sizes. Those known at compile-time appear verbatim.
    - Algorithm
    - - The solver algorithm should be able to treat each function's body individually so that systems of equations don't get too big.
      - When solving the system for an individual function body the following should be considered known:
      - Internal size variables associated with input parameters for this body.
        
        Values of normal expressions, used as array sizes.
      - Everything else should be considered unknown, including internal size variables associated with input parameters of recursive calls.
      - Deterministic expressions that are written in an identical way (save for whitespace) should result in the same # parameter.
      - The compiler might be able to prove that different computations result in the same value and in such a case it should use the same # parameter as well, but this is not guaranteed.
      - An error is issued if the system has zero solutions and a different one if it has infinitely many.
      - If the system has a unique solution then:
      - If it imposes additional constraints to a # parameter, an error is issued.
        
        If it imposes additional constraints to a @ parameter, an informative message can be printed if the corresponding bounds expression was originally omitted.
        
        If a public size depends on a # parameter whose computation depends on private values or involves side-effects, an error is issued.
      - Otherwise the function type checks.
    - Examples
    - - merge : (a : int[na], b : int[nb]) ⟶ external ref int[na+nb]
      - No binding of arrays takes place in the function body so only the header is relevant here.
      - The following system of equations is constructed:
      - @m-0-a = $na
        
        @m-0-b = $nb
        
        @m-0-ret = $na + $nb
      - The system has a unique solution that depends on the 2 known @ parameters:
      - @m-0-ret = @m-0-a + @m-0-b
        
        $na = @m-0-a
        
        $nb = @m-0-b
      - Only the first equation is kept since na and nb are not directly used in the body. It will be used by the called for allocation.
      - @m-0-a and @m-0-b will be passed as hidden variables while @m-0-ret will be computed when needed.
      - +function mergesort :
      - (a : int[n]) ⟶ external ref int[n]
      - The following system of equations is constructed:
      - @ms-0-a = $ms-0-n
        
        @ms-0-ret = $ms-0-n
        
        @ms-0-ret = @ms-0-a
        
        @ms-1-a = #m−0
        
        @ms-1-a = $ms-1-n
        
        @ms-1-ret = $ms-1-n
        
        @ms-2-a = @ms-0-a − #m
        
        @ms-2-a = $ms-2-n
        
        @ms-2-ret = $ms-2-n
        
        @m-0-a = @ms-1-ret
        
        @m-0-b = @ms-2-ret
        
        @m-0-ret = @m-0-a + @m-0-b
        
        @ms-0-ret = @m-0-ret
      - The system has a unique solution that depends on 1 @ parameter and 1 # parameter:
      - @ms-0-ret = @ms-0-a
        
        $ms-0-n = @ms-0-a
        
        @ms-1-a = #m
        
        $ms-1-n = #m
        
        @ms-1-ret = #m
        
        @ms-2-a = @ms-0-a − #m
        
        $ms-2-n = @ms-0-a − #m
        
        @ms-2-ret = @ms-0-a − #m
        
        @m-0-a = #m
        
        @m-0-b = @ms-0-a − #m
        
        @m-0-ret = @ms-0-a
      - No public sizes depend on private # parameters.
      - The following equations will be kept for code generation:
      - @ms-0-ret = @ms-0-a
        
        @ms-1-a = #m
        
        @ms-2-a = $ms-0-a − #m
      - Only @ms-0-a will be passed as hidden variable. @ms-0-ret is identical.
      - +function bad_mergesort :
      - The following system of equations is constructed:
      - @ms-0-a = $ms-0-n
        
        @ms-0-ret = $ms-0-n
        
        @ms-0-ret = @ms-0-a
        
        @ms-1-a = #m−0
        
        @ms-1-a = $ms-1-n
        
        @ms-1-ret = $ms-1-n
        
        @ms-2-a = @ms-0-a − #m − 1
        
        @ms-2-a = $ms-2-n
        
        @ms-2-ret = $ms-2-n
        
        @m-0-a = @ms-1-ret
        
        @m-0-b = @ms-2-ret
        
        @m-0-ret = @m-0-a + @m-0-b
        
        @ms-0-ret = @m-0-ret
      - The system has no solutions:
      - @ms-0-ret = @ms-0-ret − #m − 1 + #m
        
        or
        
        0 = 1
      - +function split :
      - The following system of equations is constructed:
      - @sp-0-a = $sp-0-n + $sp-0-m
        
        @sp-0-ret0 = $sp-0-n
        
        @sp-0-ret1 = $sp-0-m
      - The system has infinitely many solutions:
      - @sp-0-ret0 = @sp-0-a − @sp-0-ret1
        
        $sp-0-n = @sp-0-a − @sp-0-ret1
        
        $sp-0-m = @sp-0-ret1
      - An error should be issued explaining the ambiguity. Alternatively, one could investigate further to see that making @sp-0-ret0 or @sp-0-ret1 known results in a unique solution and only warn about the function's limited applicability. In such a scenario the function result could be assigned to an array of known size without an error, but the function would still likely fail with array-out-of-bounds at run-time.
      - +function zip :
      - (T, S : Type) ⟶ (a : T[n], b : S[n]) ⟶ (T, S)[n]
      - −function zip
      - The following system of equations is constructed:
      - @zip-0-a = $zip-0-n
        
        @zip-0-b = $zip-0-n
        
        @zip-0-ret = $zip-0-n
      - The system has a unique solution that depends on 1 @ parameter:
      - @zip-0-ret = @zip-0-a
        
        $zip-0-n = @zip-0-a
        
        @zip-0-b = @zip-0-a
      - An additional constraint has been imposed on @zip-0-b, but since the array bound was filled, that was probably the intention and no message is printed.
      - The following equations are kept as part of the function's interface:
      - @zip-0-ret = @zip-0-a
        
        @zip-0-b = @zip-0-a
      - Only @zip-0-a is passed as a hidden variable.
      - itoa : (n : uint) ⟶ ubyte[floor(log10(n))+1]
      - @itoa-0-ret = #itoa-0-ret

- - - - +function ascending_digits :
      - (n : uint) ⟶ (ubyte, bool)[]
      - The following system of equations is constructed:
      - @ad-0-dg = @itoa-0-ret
        
        @itoa-0-ret = #itoa-0-ret
        
        @zip-0-a = @ad-0-dg
        
        @zip-0-b = @ad-0-as
        
        @zip-0-ret = @zip-0-a
        
        @zip-0-b = @zip-0-a
        
        @ad-0-ret = @zip-0-ret
      - The system has a unique solution that depends on 1 # parameter:
      - @ad-0-ret = #itoa-0-ret
        
        @ad-0-dg = #itoa-0-ret
        
        @ad-0-as = #itoa-0-ret
        
        @zip-0-ret = #itoa-0-ret
        
        @zip-0-a = #itoa-0-ret
        
        @zip-0-b = #itoa-0-ret
        
        @itoa-0-ret = #itoa-0-ret
      - #itoa-0-ret is computed by the expression "floor(log10(n))+1", which only depends on ascending_digits' parameter and is side-effect free and thus, is suitable for evaluation by an arbitrary caller.
      - The following equations are kept as part of the function's interface:
      - @ad-0-ret = #itoa-0-ret
      - Only @ad-0-ret is passed as a hidden variable.
      - #itoa-0-ret = floor(log10(n))+1 is inferred as the size of both ascending and retval arrays.
      - Additionally the following linear inequalities could have been inferred:
      - @ad-0-as > 0
      - An extension to the system that would flag this as an error could lead us to revise the type of itoa to something that exposes the fact that ∀n.|itoa(n)|>0.
      - If f ∈ T ⟶ (int[n], uint[n+10]) and g ∈ (int[m−10], uint[m]), then g(f(x)) might need to solve the system
      - l1 = n
        
        l2 = n+10
        
        l3 = m−10
        
        l4 = m
        
        n = m−10
        
        n+10 = m
      - assuming l1, l2 and n are known and l3, l4 and m are unknowns. It should conclude that the system has exactly one solution (expressed as a function of one of the knowns), namely l4=m=n+10 and l3 = n.
  - One-Way
  - - The simplest case of a one-way constraint is the expression that computes the size of a returned array from the function arguments.
    - Should one-way constraints between parameters or computing a parameter size from that of the return value be written in the type, or as a preconditions?
Generalized Runtime-sized Types
- Any type whose size is not determined until the program runs.
- - Any structure that has runtime-sized arrays as fields qualifies. Is there any other case?
- In general same rules apply as with runtime-sized arrays.
- Todo
- - Can a structure contain more than one runtime-sized member? Can they appear anywhere, or just at the end?
  - How are they defined and used? In particular how is their size parameter tied to the parameters of sub-objects?