Explicit Signage
Integers seem like a simple concept. Start with natural numbers (ℕ⁺), incorporate 0 (ℕ⁰), then add a sign to create integers (ℤ). In theory, integers range from -∞ to ∞, but computers currently lack infinite memory, so we have to compromise in implementation. And compromise we do! Some high-level scripting languages try pretty hard to make integers behave like the mathematical definition, but at great computational cost. Most systems programming languages instead utilize high-performance CPU features to provide some combination of limited-range 2’s complement signed integers and unsigned integers which provide modular arithmetic semantics (i.e. fixed-width wrap-around behavior).
It’s bad enough that signed integers in languages like C and C++ have undefined underflow/overflow semantics. Adding insult to injury, it’s common for languages to implicitly convert between types of differing bitwidths! This can produce some very surprising results because the implicit conversion doesn’t consistently happen soon enough to maximize the fidelity of intermediate values.
For an interesting example of how implicit conversion can go wrong, see the analysis of Stagefright vulnerabilities in this blog post. The blog post presents an alternative solution to integer conversion, and is worth reading in its entirety.
Hemlock takes a rigid approach to integer types that resolves some historical baggage and removes footguns.
- A broad range of fixed-width unsigned/signed integer types are built in, currently powers of two
from 8 through 512, e.g.
u16
andi256
. In addition nearly-arbitrary-precision (max ~2³² bits)nat
(ℕ⁰) andzint
(ℤ) types are built in. - The default numerical type is 64-bit unsigned integers, known as type
uns
oru64
. Constants like0
and42
are of typeuns
, as are0u
and0u64
. This appears to be a big departure from the status quo, but in my experience as a systems software engineer I’ve found unsigned integers to be overwhelmingly more applicable than signed integers. I’d rather have to carefully craft lower bounds checks to avoid wrap-around than have APIs which imply that negative values make sense. - All other numerical constants must have explicit type suffixes or other distinguishing
characteristics. For example:
-1i
and13i64
are 64-bit signed values.7u8
is an 8-bit unsigned value.-15z
is an arbitrary-precisionzint
.77n
is an arbitrary precisionnat
.0.
,3.2
,4r
, and7r64
are 64-bit reals, and17.3r32
is a 32-bit real (based on IEEE 754).
This might seem overly verbose to C/C++ programmers, but in fact it enables concision in the context of type inference (i.e. no associated explicit type declarations are necessary).
- Modular arithmetic semantics apply to both unsigned and signed fixed-width integer types. In this modern era we can assume that signed integers are implemented via 2s complement. So long, 1960s-era design decision!
- Conversions fall into two categories, e.g.:
- Bitwise:
Uns.bits_to_i64
bit-casts,Uns.trunc_of_i128
truncates,Int.extend_to_u256
sign-extends (unsigned types zero-extend). - Value-preserving:
Uns.like_of_int_opt
succeeds if the input is non-negative,Uns.narrow_of_u128_opt
succeeds if the input is in the range supported byuns
,Int.widen_to_u128_opt
succeeds if the input is non-negative.
- Bitwise:
I have literally spent decades now fighting the good fight against implicit integer conversion while writing low-level C/C++ code. In Hemlock, all conversions are explicit, and I look forward to a rich set of numerical types that produce exactly the results one would expect when reading the code.