Data Races and Memory Safety

One of the biggest selling points for Rust is that it is a "memory safe" systems programming language. Rust ensures memory safety through a variety of language features, including definite assignment, ownership semantics, and prevention of mutable aliasing. The last feature, prevention of mutable aliasing, is particularly interesting. It guarantees the absence of so-called "data races," where two threads race on shared mutable state. This is an important part of Rust's memory safety and an important deficit of languages like C.

Consider a simple C program with two shared variables, a pointer and a length, that may be written and read on multiple threads.

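#include <stdbool.h>
#include <stdio.h>

typedef struct buffer
{
    char* data;
    int len;
} buffer;

// Reads one line from file into buf->data, writing at most buf->len bytes.
bool read_file(buffer* buf, FILE* file)
{
    return fgets(buf->data, buf->len, file) != NULL;
}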

Suppose that the buf parameter to read_file is aliased and another thread is realloc'ing the data field. In that case, the actual size of the allocation pointed to by data and the value of the len field may be out of sync if read_file executes between the two field writes. This could produce an out-of-bounds memory access and potentially a security issue. Rust prevents this situation from occurring by guaranteeing that, while read_file is using buf, no other code holds a writable reference to it.
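To make the interleaving concrete, here is a minimal single-threaded sketch that replays the two writes of a shrinking realloc one step at a time. The tracked_buffer wrapper, its capacity field, and the three helper functions are my own illustrative names, not part of the program above; capacity exists only so the skew can be observed.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdlib.h>

typedef struct buffer {
    char* data;
    int len;          /* what readers believe the capacity is */
} buffer;

/* Hypothetical wrapper: records the true allocation size on the side,
   purely so the mismatch with len can be checked. */
typedef struct tracked_buffer {
    buffer buf;
    int capacity;
} tracked_buffer;

/* Step 1 of a shrink: swap in a smaller allocation. len is now stale. */
bool shrink_step1_realloc(tracked_buffer* t, int new_len) {
    assert(new_len > 0);
    char* p = realloc(t->buf.data, (size_t)new_len);
    if (p == NULL) return false;   /* on failure nothing has changed */
    t->buf.data = p;
    t->capacity = new_len;
    return true;
}

/* Step 2 of a shrink: publish the new length. */
void shrink_step2_set_len(tracked_buffer* t, int new_len) {
    t->buf.len = new_len;
}

/* What a reader arriving right now would find: can len be trusted? */
bool len_matches_allocation(const tracked_buffer* t) {
    return t->buf.len <= t->capacity;
}
```

Calling shrink_step1_realloc on a 64-byte buffer with a new length of 16 leaves len at 64 while only 16 bytes are allocated, so len_matches_allocation returns false until shrink_step2_set_len runs. A read_file call that lands in that window would let fgets write past the end of data.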

Given the above, you might assume that data-race-safety is necessary for memory safety. However, that's not true. C# is memory safe, but is not data race safe.

To explain, I'll first provide my definition of memory safety: a memory safe program is one where only memory owned by the program may be accessed[^1]. In other words, all memory accessed by the program must have been allocated by that program, and must still be considered valid at all points of access. This definition might be weaker than some other definitions. For example, this definition doesn't prohibit variables holding invalid values. However, this definition does cover almost all types of CVEs classified as "memory safety" issues by CWE.

Now, we can demonstrate that C# meets these requirements. C# provides the following ways to access memory:

  1. Directly, through a primitive operation on a primitive type (e.g., + on an int)
  2. Through a field access on either a struct or a class
  3. Through an array access
  4. Through a "by-ref" variable or a ref struct type
  5. Through a Span<T> or ReadOnlySpan<T> type

C# is memory safe if these accesses may only happen to data owned by the program, while that data is valid.

For (1), this is mostly straightforward. C# primitive types consist of numeric types and strings. The numeric types are structs, whose memory is managed directly by the C# runtime -- it is not possible to have a variable of such a type where the lifetime of the memory is shorter than the lifetime of the variable. For strings, the only non-struct primitive type, the analysis is similar to arrays, which will be covered separately.

For (2), the field access is either on a struct or a class. For all types, the runtime ensures that the field access is a legal one for the type, meaning that only fields declared on the type, which have reserved memory, are legal targets. For a struct, the memory for each field is embedded directly within the struct, so the field access is valid whenever the containing variable is still alive. For local variables and parameters, this memory is managed directly by the runtime, and again the variable's scope cannot exceed its lifetime.

Fields of reference types are slightly different. Unlike structs, they live behind a pointer to dynamically allocated memory. This pointer can be shared between threads, so the same memory location can be seen and accessed by multiple threads. Here memory safety is still provided by the runtime, this time through the garbage collector (GC). The GC ensures that the memory pointed to by any class variable remains valid while any reference to it exists, so a class reference cannot outlive the memory it points to. Moreover, the runtime guarantees that every class variable contains either a reference to a valid memory location of appropriate size, managed by the GC, or the null pointer. This guarantee holds even in multi-threaded execution: a class reference will never observably contain a pointer to invalid or unowned memory.

For (3), an array access occurs on a variable of array type. Such a variable holds either null or a reference to a GC-tracked object whose size and element type are fixed when it is dynamically allocated. The runtime guarantees that the following will always hold true for variables of array type:

  a. The memory pointed to by the array variable will always have a lifetime equal to or greater than the scope of all references to the memory.
  b. Accesses to any element of the array are bounds-checked -- accesses before the start of the array or after the end of the allocated memory are caught and handled by the runtime through memory-safe exceptions. It is not possible to observe an element outside the memory bounds of the referenced array. Accesses to null arrays also generate memory-safe exceptions.
  c. Modifications to array variables are guaranteed to be performed atomically. That is, if a variable holding an array reference is modified to contain a different array reference, the underlying reference to memory and the information about the array's bounds are guaranteed to be updated together, with no interleaving reads or writes.

The above array rules ensure that the memory backing an array lives at least as long as all of its references and that no access outside its bounds can be performed without a memory-safe exception occurring. Strings follow the above rules, but have additional restrictions on mutability of elements. These additional rules don't affect memory safety.
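One way to picture how rules (b) and (c) fit together is a small model in C. This is only a sketch under my own assumptions, not how the C# runtime is actually implemented: the int_array and array_ref types and the array_* functions are hypothetical names, a false return stands in for a managed exception, and rule (a) is not modelled at all because there is no GC here and the caller frees memory by hand.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdlib.h>

/* Model of an array object: the length lives in the same allocation as
   the elements, so a single pointer identifies both. */
typedef struct int_array {
    size_t length;
    int items[];                  /* C99 flexible array member */
} int_array;

/* Model of an array variable: one pointer-sized word, accessed atomically. */
typedef _Atomic(int_array*) array_ref;

/* Allocates a zero-filled array object, or returns NULL on failure. */
int_array* array_new(size_t length) {
    int_array* a = calloc(1, sizeof *a + length * sizeof a->items[0]);
    if (a != NULL) a->length = length;
    return a;
}

/* Rule (b): null- and bounds-checked read. Returns false instead of
   touching memory outside the array. */
bool array_get(array_ref* ref, size_t index, int* out) {
    assert(ref != NULL && out != NULL);
    int_array* a = atomic_load(ref);   /* one snapshot covers elements and length */
    if (a == NULL || index >= a->length) return false;
    *out = a->items[index];
    return true;
}

/* Rule (c): retargeting the variable is a single atomic pointer store. */
void array_assign(array_ref* ref, int_array* replacement) {
    atomic_store(ref, replacement);
}
```

Because the bounds travel with the object rather than with the variable, a reader that loads the pointer once always checks its index against the length of the same array it goes on to read, whichever array another thread has assigned in the meantime.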

For (4), "by-ref" variables are references to other variables in the program. The C# language ensures, through so-called "ref-safety" rules, that all ref variables are memory safe and that the variable pointed to by a "by-ref" variable must always have a lifetime at least as long as the variable itself. ref structs are a combination of the other types of variables already discussed, or "by-ref" variables, which have already been established as safe. Using the same rules as (2), we can consider ref structs safe through induction. There are two ref structs which are special, Span<T> and ReadOnlySpan<T>, which deserve special consideration and will be handled separately.

For (5), Span<T> and ReadOnlySpan<T> are ref structs and should be considered memory safe by the same "ref-safety" rules from the previous section. This is true, but it's worth noting exactly which rule helps in this case. Span<T> initially looks like an array, but it has only two of the three array guarantees: the referenced memory will always outlive the reference, and element accesses are bounds-checked. It does not guarantee that modifications are performed atomically. However, the ref-safety rules prohibit by-ref variables and ref structs from appearing on the heap. This means that multiple threads cannot access the same Span<T> variable. Therefore, the lack of atomic update cannot produce an observable skew between the memory reference information and the memory bounds information.
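The contrast with arrays can also be sketched in C. Here the reference and the bounds are two separate words, so copying the pair is not atomic; the model stays consistent only because each function receives its own private copy by value, which loosely mirrors a span being confined to a single thread's stack. The int_span type and the span_* functions are hypothetical names of mine, and C cannot enforce the no-heap restriction that the C# ref-safety rules provide.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Model of a span: a pointer and a length stored as two separate words. */
typedef struct int_span {
    int* ptr;
    size_t len;
} int_span;

/* Bounds-checked read. s is a private by-value copy, so nothing else can
   change ptr or len between the check and the access. */
bool span_get(int_span s, size_t index, int* out) {
    assert(out != NULL);
    if (index >= s.len) return false;
    *out = s.ptr[index];
    return true;
}

/* Bounds-checked sub-span of count elements starting at start. The check is
   written so that start + count cannot overflow. */
bool span_slice(int_span s, size_t start, size_t count, int_span* out) {
    assert(out != NULL);
    if (start > s.len || count > s.len - start) return false;
    out->ptr = s.ptr + start;
    out->len = count;
    return true;
}
```

If an int_span were instead stored in memory shared between threads and rewritten without synchronization, a reader could pair the pointer of one span with the length of another, which is the skew that the heap prohibition rules out.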

Given the above rules, that all other types are composites of the above, and that all of the composites are decomposed by field access, we can assume by induction that all C# types are memory safe.

[^1] Note that almost all languages also have an "unsafe" subset that is explicitly not designed to be memory safe. C# does as well, and I am only considering the safe subset in this analysis.
