Should collection/sequence/array indices start with zero or one? Most current languages choose zero. For *flux,* I am choosing one. This choice is orthogonal, meaning that I can easily change it if it turns out to be wrong. The reason to discuss such a trivial issue is that it is an example of how choices that made sense in the early days of programming need to be reexamined. It also frames some principles of language design: Abstract Datatypes, and Conservatism.

My goal is to make programming easier, so I want to reduce cognitive effort wherever possible. Human culture, language, and even mathematics have a deeply entrenched convention that you count things starting with one. We say something is the “first”, not the “zeroth”. Mathematicians describe sets as {x_{i}}_{i=1..n} not {x_{i}}_{i=0..n-1}. Novice programmers find zero-based indexing surprising at first. Using different conventions in different contexts is not just a one-time learning cost, but a continual cognitive cost. I always have to remember to subtract one from the length to get the last element. These little bits of wasted attention are small, but add up, and we can’t afford to waste any.

I take it as a principal of programming language design that entrenched conventions should be contradicted only for compelling reasons. Call this principle Conservatism. Assembly Language and C have a good reason: zero-based indexing is convenient for pointer arithmetic. But there is no pointer arithmetic in a sane high-level language. So why is zero-based indexing so prevalent?

Dijkstra argues for zero-based indexing. His argument is in two parts. First, he argues that intervals of integers should be specified as: *first *≤ i < *last*+1. He then argues that the indices of an array are “nicer” as 0 ≤ i < *length* rather than 1 ≤ i < *length+1*.

Ladies and gentlemen of the jury, Dijkstra is committing **representation exposure**! To be fair, we all did that back then. He is assuming an integer interval *is *a pair of integers. That may be its internal representation, but that is irrelevant to language design. We need to consider what should be the abstract interface to the interval datatype. An interval should have at least the properties *first*, *last*, and *length*. We can construct an interval by supplying any two of these properties, and there should be literal syntax like *first*..*last *and *first*++*length*. Maybe there is also a *next *property one greater than *last*, and maybe the literal syntax should be *first*..*next * as Dijkstra wants. Regardless, once we have properly abstract intervals, the choice of initial index becomes arbitrary. The range of indices of an array of length *n *can equally well be either 0++*n *or 1++*n*.

The point here is that by properly abstracting intervals, we can be precise about the distinction between indices and lengths, and encapsulate the arithmetic to relate them. All these +1 and -1 operations are really hard-wired conversions that should be abstracted away. I conjecture that the lack of such abstraction is why fencepost errors are so common.

Thus, without any compelling reason to choose zero-based indexing, the principle of Conservatism tells us to use one-based indexing. Unless we value similarity to existing programming languages more than to natural languages, in which case the same principle tells us to do the opposite. But in either case it is an arbitrary convention. It is high time for us to get over Assembly Language. I would go even further – I think some of the enthusiasm for zero-based indexing stems from intellectual vanity. “Look at me: I’m so smart I count differently from normal humans.” Such self-defeating cleverness is a plague upon our profession.

I always thought zero based indexing was better because in a base 10 system you need the zero to get 10 single-digit numbers. I’m picturing the rolling over of numbers:

00

01

02

03

04

05

06

07

08

09

10 <- 11!

The rollover point is actually the 11th number. This is more intuitive for binary of course:

00 = 0

01 = 1

10 = 2

11 = 3 <- 4th!

But shouldn’t this be the same principle? Is it more or less confusing to treat base 2 differently than base 10? Isn’t it more aesthetically pleasing to be 0 based? It may be true that it’s more confusing for people when they start programming but perhaps that’s because the traditional way it’s taught is more incorrect?

Machine arithmetic is always modular arithmetic. Whether it is computing modulo length of the array or modulo word size, the numbers roll over to the beginning. It seems unnatural to me that “mod n” operator should ever return “n” instead of “0”.

Mathematicians also use 0-based notation when enumerating elements of finite fields Z_p. It just makes more sense since 0 denotes the same identity thing no matter which p is chosen.

Kuat – surely you are joking? Unchecked integer overflow when calculating lengths and indices? That is as unthinkable as not having array bounds checking.

Of course we can have modular arithmetic, and use 0 whenever it is appropriate mathematically. The smallest length is 0. But it is natural that the smallest ordinal position be 1, and I don’t see any real reason why it should be 0. Many mathematicians still say that 0 is not a Natural number.

This is silly. For 95% of the cases, one should be using higher-order functions to transform collections, and thus would not need index-based access to begin with. About the only time I’ve used index-based access in recent years is when doing high-performance operations on rectangular arrays, which are far simpler to compute when using 0-based indices.

Yes, it is almost a nit. But string operations use indices and lengths, so we can’t completely ignore the issue. Could you give an example of those array operations where 0-based indices are an advantage?

The SML standard library has a type of “substrings” with many useful operations that don’t involve explicit indices or lengths:

http://sml.sourceforge.net/Basis/substring.html

The standard higher-order functions like map, fold, and so on are also exposed for strings in other modules. I rarely use indices for any purpose in SML programs, even those with strings everywhere.

I wonder if it would be possible to completely eliminate indices without losing functionality? But I also need them for another reason: sequence equality. My data model is a labelled tree. I need some labels for the sequence element edges that gives the correct notion of equality. Ordinal numbers solve that problem.

Not entirely convinced. When programming in Eiffel (which uses 1-based indexing), I found myself adding one to expressions that calculated indices more often than I subtract one from length calculations in languages with zero-based indices. But that’s because I rarely calculate the length except in low-level languages.

In a decent collections API you’re not going to need to calculate the end index anyway because the collections have convenient operations for getting the last element, or indexing from the end of the collection, or traversing the elements without caring how many there are.

OK, that’s another data point. Of course these things depend on the whole API design. I believe that properly distinguishing between indices and lengths throughout the API should eliminate ever needing to hardwire a +1 or -1.

In Mathematica I’ve found the 1-based indexing slightly more “nice” than 0-based in other languages. I can’t even quantify the preference though, but it seems to harmonize with the general design of the language; the 0 index addresses the “head” of an expression, the higher indices its arguments. But in Mathematica having/wanting to use indices is rare anyway.

So anyway, I feel it’s really an arbitrary choice, and as you say, by convention it could be either way. Maybe there are other elements of your language that point to one of the choices?

I think it is probably arbitrary, and so the conventional choice should be taken. Or maybe I’ll hold a vote. What I find surprising, and what triggered this post, is that there is a widespread belief that 0-based indexing is better, but I can find no solid justification for that belief except for pointer arithmetic.

I think this is a problem thta is rooted in encoding, and therefore will vary according to the problem domain.

For example, one of the most baffling APIs I ever worked with was the Java Date API, which is zero-based. If I recall correctly, passing a string “0/0/00” would return back to you January 1st. According to my friend Jeff who is a C programmer, he says this is the right way because that is what system’s programmers expect. I find this viewpoint to be folly. Date’s are about problem domain abstraction. What is this Date supposed to represent. Encoding zero-based indexing into a date definition simply inverts the problem domain and ruins the point of abstraction. If anything, the Date constructor should be bifurcated into two factories, and these factories should conceal the construction strategies, even though they each pass different string encodings into their factory constructors.

So you are wrong! 😉

Also, I remember Joshua Kereivsky giving an example of Ward Cunningham writing a Date API wrapper for some Java code, and him saying he was absolutely stunned by how simple it was.

Date(December.15, 2000) was the constructor. He basically created a bunch of static fields on a class called December that represented December 15th, in any month. And so on for every day of the year. Ward pretty much “spelled out” the whole solution rather than relying on algorithmics to decode the string. Wonderful! In this regard, Ward obviates the encoding problem altogether by forcing an explicit representation that can also be checked by the type system. The only problem remaining is an invariant between the December.15 superclass MonthDay and Year.

Oops, forgot this is Java code. The constructor was probably Date(December.FIFTEEN) — you get the idea.

0/0/00 sounds like another example of conflating ordinals with cardinals, or positions with distances. A point in time should be a different type than an interval of time. Adding an interval to a point gives you another point. Adding two intervals gives you another interval. You can’t add two points together, but subtracting them gives you an interval.

I’m a start-at-zero kind of guy, but if you really want to do away with any intellectual vanity, then copy the approach of visual basic or even javascript when it comes to arrays since they let you have it either way. In VB you declare an array of say size 10 and it actually allocates 11 slots and lets you use either 0-9, 1-10, or even 0-10 if you really feel the need, or you can specify the index range you want. VB will tell you that the length of the array is 10. JavaScript arrays are superficially 0-based, but they behave much more like hash tables, so they let you use any indices (even negative) and never give index-out-of-bounds errors, they just return the ‘undefined’ value when reading an index that hasn’t been assigned.

Both approaches tend to stop fencepost errors, and given that these are the 2 languages that have had more people learn (or at least write some code for) than any others, you could argue strongly that these approaches are easy for people to learn.

Those are some laid-back languages. But even they can’t escape the question: how do you access the first character of a string?

Good. I like 1-based indexing. I’ve used it in the early days of my career in Delphi. They actually had a reason for it. They stored the length at element 0 of any array. This dictated that a string could only every be 255 chars long as that was the maximum number that could be stored in the first element.

One pitfall I saw back then was when we started using COM interop. The COM objects we were using used 0 based indexing. This leaded to confusion more then it did using 0 based indexing overall. It would be nice if you could also abstract external 0 based indexing to look like 1 based indexing.

Indexes point most naturally to the gaps between a sequence objects rather than the objects themselves. So, for instance, imagine a sequence of five objects:

A B C D E

Now let’s draw bars to represent the gaps, including the two outside:

| A | B | C | D | E |

Those six bars naturally correspond to the six numbers 0 1 2 3 4 5. When we come to index the objects themselves, it’s up to us whether to chose the bar to the left or the bar to the right. For myself, I’ve come to a slight preference to picking the bar to the left, i.e. zero-based counting, but I think ultimately it’s arbitrary.

Don’t be too sure that what happens to be done commonly is always the most natural or sensible; bear in mind the Roman “inclusive” method of counting, by which for instance the day after tomorrow was called “the third day” rather than “two days from now”, and the musical interval from C to C is known as an octave.

The 1950s called. They want their argument back.

As you describe, one-based indexing is the standard in “counting” and in programming languages that are closely tied to mathematical conventions. In addition to Mathematica, I’d cite R, matlab/octave, and computer algebra systems.

Just to raise a counterpoint, though, lets consider the case of laying out an m x n matrix in column-major order with one-based indexing. To access the (i,j) element we need to think a tiny bit: i+(j-1)*m, and pulling out subarrays gets more confusing. Eventually it’s confusing in either convention though, so I don’t pretend that this example is conclusive. A person can readily get in the habit of thinking in either system when the need arises.

OFFtopic, but interesting: This article is about automatic refactoring scenarios. In itself, the whole article is a great argument against 1)textual code representation and 2) side effects, with which even a simple refactoring case (eliminating a single variable) brings up undecidable questions.

http://blogs.msdn.com/b/ericlippert/archive/2011/01/13/not-as-easy-as-it-looks.aspx

Just to re-assure Jon: you are on the right track!

Why not let the programmer choose the most appropriate index range for his purpose.

In Ada arrays can be declared with arbitrary lower and higher indices.

Indices can also be of an enumerated type.

Examples from the Ada 95 reference manual :

Grid : array(1 .. 80, 1 .. 100) of Boolean;

Mix : array(Color range Red .. Green) of Boolean;

VHDL goes further by allowing ascending or descending index ranges :

type Byte is array(7 downto 0) of bit;

type Vector is array(0 to 255) of Byte;

VB6 has similar functionality, See: http://msdn.microsoft.com/en-us/library/23s0z8fy%28VS.71%29.aspx

VB6 also had Option Base {0 | 1} to override the default of Option Base 0 (zero-indexing).

Wow, I wonder what VHDL’s rationale is for that? But in any case we still need to choose a default range to use for strings.

Hardware mentality, I would guess. Bit7 is the highest order bit in a byte, but is also left-most when written.

You can always do what Iverson did in APL – supply an “Index Origin” operator (page 340 of http://www.microapl.co.uk/apl/APLXLangRef.pdf ) and procrastinate until you feel strongly enough about one or the other choice.

From a usability standpoint you are correct. 1 based indexing is far more intuitive and is what all novices expect. An important question is: “is your language for novices”? We still use Qwerty keyboards even though the Dvorak design is more efficient. Using the less efficient, common design is still easier than remembering both.

Good question Lach. Yes and no. I am targeting future generations of programmers, who will all start out as novices, but won’t stay that way for long. I am ignoring issues of compatibility with existing languages. But I am very concerned with what it takes to scale to large teams of professionals building complex user-facing applications.

Well, a novice with one language in hand is one thing. As soon as they leave this wonderful 1-indexed world they will be right back to the same continual cognitive dreariness you seek to have them avoid. In the end it won’t matter. Novices will wonder why you use 0-based indexing if you do, and then swear at the next language because it does not OR they will have learned to get along with 0-based indexing and wonder why you are trying to be different? IMO, I’ve seldom found that a 1-index would be ‘easier’ and aesthetics mean little in the midst of 20000 lines of code. Consistency counts for much much more.

While i < length is easy to write grabbing the last character usually has a builtin for it.

In the end people will wonder why you did it different or swear because it messed up their consistency somewhere. Damned if you do…. blah blah

I suspect that whichever way you go will also depend on other things like code validation apps, compilers, etc.

Well… 0 or 1 based index should be consistent with the use:

– when enumerating, ordinals are used and start at 1

– when using size, cardinals are used and start at 0

When looping on an array’s elements, most of the time, we are enumerating so 1-based indexes make sense. You should note that mixing the first element and the size of the set is a kind of a non-sense (between 0 and length-1). Maybe differencing type AND units for variables might be a good idea (even if I don’t know if that would fit the language requirements): simple variable and array are different type in most langages…

For all of theses reasons, 1-based indexing seem to be the best.

However, a language has to be used for day-to-day work by day-to-day programmers… that are mostly used to use 0-day indexing. So using a different approach might be really confusing and you should be carefull to provide not only “out of bound” check but also “missing last index” warning for them.

I think that the idea of having defined range indexing by programmer (like Ada, see above) with a well designed API using iterators to abstract counting explicitly might fit the bill. Just think about some common use-case like: do something for even index and something else for odd index… so you need to have some kind of filter for indexes.

And why not using integer-hashmap… to allow “holes” in sequences if necessary ? The programmer might decide where to start, where to end and what to enumerate…

Just my 2 cents…

Niklaus Wirth went from definable lower array bounds to the fixed lower bound 0 when designing Oberon, the successor language to Pascal and Modula-2.

One of the reasons for this was that it makes index checks more efficient.

My favourite solution? Number the fenceposts, not the elements.

Given a fence like this:

`|=|=|=|=|=|=|`

I agree that neither of these is drastically better than the other:

`|0|1|2|3|4|5|`

|1|2|3|4|5|6|

But it’s hard to argue with this scheme:

`0=1=2=3=4=5=6`

Then using numbers as insertion positions is natural, and instead of numbering the elements, you can think of the elements after post n (which works out like zero-based) or before post n (which works out like one-based).

Of course, the far more interesting thing to look at is ranges, and what happens with the obvious attempts to split them. These are all, of course, different ranges: [a,b] [a,b) (a,b] (a,b)

The problem with [a, b] is that it splits as [a, k] and [k, b], and both ranges contain k, something that’s rarely desired in algorithms. The is the natural kind of range for 1-based numbering, since the canonical loop for (int i = 1; i <= n; ++i) is over [1, n].

The opposite, (a, b) splits into (a, k) and (k, b), which never processes k. This is exactly what quicksort wants to do after partitioning "at" k, interestingly. Lookup in a binary search tree can also be thought of this way, until you want one with duplicates, where a recurse is one of the two following forms. Similarly, binary search in an array seems like it would want this, but that implies stopping when you find it, which takes more comparisons on average than just always recursing to the end. I've also never heard anyone that thinks (-1, n) or (0, n+1) is a natural way of thinking about a range of numbers 🙂

The half-open/-closed ranges are the only ones with no holes or overlap, which is the usual situation in divide-and-conquer, and especially important if you're using reals or other types where it's

impossibleto just apply a +/-1 correction factor.Both (0,n] and [0,n) are arguably just as good in theory at representing the range, and naturally imply 1- and 0-based numbering respectively.

So I suppose it all comes down in the end to whether you prefer the get_next code before the condition test or after the loop body:

`for (int i = 0; ++i < n;)`

vs.

`for (int i = 0; i < n; ++i)`

Ever been tempted to have the for loop in your language work differently?