
# Arrays and Memory, Part 2

19 Apr 2012

Not too long ago we talked about why arrays start at 0. Repeating what I said in that post, those three lines really are full of learning goodness. Since I obviously feel strongly about it, let's expand on what we've seen so far. This won't be nearly as fundamental, but I still find it quite interesting.

Here's the C code that we're going to learn from:

```
char *the = malloc(4 * sizeof(char));
char *cat = malloc(4 * sizeof(char));

the[0] = 't';
the[1] = 'h';
the[2] = 'e';
the[3] = '\0';

cat[0] = 'c';
cat[1] = 'a';
cat[2] = 't';
cat[3] = '\0';

printf("%s", &the[4]);
```

I want to draw your attention to the fact that `the` only has enough space allocated to hold 4 characters, which we fill with `'t'`, `'h'`, `'e'` and `'\0'`. (`\0` is just how C knows that it has reached the end of the string; I'll blog about that some other time.) However, notice that we're actually trying to print a string starting outside the bounds of `the`.

When you try to go beyond an array's boundaries, modern languages typically handle it one of two ways. They'll either return a default value, like Ruby, where `['a','b','c'][100]` returns `nil`, or they'll generate an error, like .NET throwing an `IndexOutOfRangeException`.

I'm not sure how this is accomplished, but I can take a guess. In these languages, an array is a combination of the space allocated for its values and its length. Whenever a position in the array is accessed (for writing or reading), the code makes sure that `i < length`.
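To make that guess concrete, here's a minimal sketch of what such a checked array could look like if we built it ourselves in C. The `checked_array` type and its functions are hypothetical names I made up for illustration; this isn't how Ruby or .NET are actually implemented, just the idea of storing the length next to the data and testing `i < length` on every access.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical checked array: the storage plus the length it was created with. */
typedef struct {
    char *items;
    size_t length;
} checked_array;

/* Allocates zeroed room for `length` chars (assumes length > 0).
   Returns false if the allocation fails. */
bool checked_array_init(checked_array *arr, size_t length) {
    arr->items = calloc(length, sizeof(char));
    arr->length = (arr->items != NULL) ? length : 0;
    return arr->items != NULL;
}

/* Reads position i into *out. Returns false, without touching memory,
   when i is out of range. size_t is unsigned, so i can't be negative. */
bool checked_array_get(const checked_array *arr, size_t i, char *out) {
    if (i >= arr->length) {
        return false;
    }
    *out = arr->items[i];
    return true;
}

/* Writes value at position i, with the same range check as the getter. */
bool checked_array_set(checked_array *arr, size_t i, char value) {
    if (i >= arr->length) {
        return false;
    }
    arr->items[i] = value;
    return true;
}

/* Releases the storage and resets the length. */
void checked_array_free(checked_array *arr) {
    free(arr->items);
    arr->items = NULL;
    arr->length = 0;
}
```

With a 4-element array, `checked_array_get(&arr, 4, &c)` returns false rather than reading whatever happens to sit after the allocation. The price is an extra `size_t` per array and a comparison on every access.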

In C, however, an array like the ones above really is nothing more than a pointer to a contiguous chunk of memory. The initial size of the array isn't something that's kept around. The benefit is that arrays have little memory overhead (a single pointer) and little performance overhead (no extra checks, just simple pointer arithmetic). That might seem extreme, but remember that 386s ran at 16MHz, and that was fast!
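Here's a small sketch of what "simple pointer arithmetic" means in practice. The function names `char_at` and `int_byte_offset` are mine, purely for illustration. The first shows that indexing is just an addition followed by a dereference; the second shows that the compiler scales the index by the element size for you.

```c
#include <stddef.h>

/* Equivalent to base[i]: C defines indexing as pointer arithmetic
   followed by a dereference. Nothing here knows how long the array is. */
char char_at(const char *base, size_t i) {
    return *(base + i);
}

/* Distance in bytes between element i and the start of an int array.
   The caller must keep i within the array (or one past its end). */
size_t int_byte_offset(const int *base, size_t i) {
    return (size_t)((const char *)&base[i] - (const char *)base);
}
```

For an `int` array, `int_byte_offset(nums, 3)` comes out as `3 * sizeof(int)`. For `char`, where `sizeof(char)` is 1, the index and the byte offset are the same number.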

With that out of the way, what will the above code print?

The answer is that there's no way of knowing; as far as C is concerned, the behavior is undefined. The only thing we know is that we'll be accessing memory located at `the + (4 * sizeof(char))`. That memory could be protected by the OS, or not even exist, and end up causing a segmentation fault. That specific location could happen to hold your banking password, and we'll end up printing it out. It's even possible that our two calls to `malloc` allocated memory side-by-side, which means that the output will be `cat`.
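We can't force `malloc` to place two allocations next to each other, but we can simulate that lucky case safely. This is a sketch under an assumption of my own: instead of two allocations, it uses a single 8-byte buffer holding both strings back to back, so reading at offset 4 stays inside memory we own. The function name `string_after_the` is made up for the example.

```c
#include <string.h>

/* Lays out "the\0" and "cat\0" back to back in ONE 8-byte buffer,
   then returns a pointer to offset 4, the byte right after "the\0". */
const char *string_after_the(char buffer[8]) {
    memcpy(buffer, "the", 4);      /* bytes 0..3: 't' 'h' 'e' '\0' */
    memcpy(buffer + 4, "cat", 4);  /* bytes 4..7: 'c' 'a' 't' '\0' */
    return &buffer[4];
}
```

Passing the result to `printf("%s", ...)` prints `cat`, because `%s` simply reads from the given address until it hits a `\0`. With two real `malloc` calls there's no such guarantee about what sits at that address.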

This Wild Wild West approach to memory access is the cause of many security vulnerabilities, normally in the shape of a buffer overflow. Consider the following code:

```
char *name = malloc(10 * sizeof(char));
gets(name);
printf("you entered: %s", name);
```

The `gets` function doesn't do any bounds checking, which means it'll read from STDIN as much as we type. If we type 15 characters, it'll store all 15 plus the terminating `\0`, meaning 6 bytes will overwrite memory which wasn't meant for `name`. What if the memory you are able to overwrite is critical? Data stored in memory isn't just for reading and writing. It's often executed, say in the form of the address of a function to call.
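Actually overflowing a `malloc`'d buffer is undefined behavior, so here's a simulation instead. It's a sketch under my own assumptions: one 16-byte region where bytes 0 to 9 stand in for `name` and bytes 10 to 15 stand in for whatever happens to live next to it. `NAME_SIZE`, `REGION_SIZE` and `unchecked_store` are hypothetical names, and the copy is clamped to the region only so the demo itself stays in bounds.

```c
#include <stddef.h>
#include <string.h>

#define NAME_SIZE 10    /* bytes 0..9 play the role of `name` */
#define REGION_SIZE 16  /* bytes 10..15 play the role of neighboring data */

/* Stores input the way gets would: it never compares against NAME_SIZE.
   The copy is clamped to REGION_SIZE purely to keep this simulation
   inside memory it owns; when clamped, no terminator is written. */
void unchecked_store(char region[REGION_SIZE], const char *input) {
    size_t n = strlen(input) + 1;  /* the characters plus the '\0' */
    if (n > REGION_SIZE) {
        n = REGION_SIZE;
    }
    memcpy(region, input, n);
}
```

If the neighboring bytes start out holding `SECRET` and we store the 15 characters `123456789012345`, the 16 bytes written (15 characters plus `\0`) spill 6 bytes past `name`, and the neighbor now holds `12345` followed by `\0`.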

The solution is to use something like `fgets` rather than `gets`, which lets us specify a size:

```
char *name = malloc(10 * sizeof(char));