Nov 25

Remove works correctly

Formalizing the correctness of remove is just organizing all the existing arguments we gave on the previous page. While developing remove, we started with the remove procedure itself and then investigated what was needed to make it work correctly. To organize all these ideas into a concise argument, we start with what we know we will eventually need and work in the reverse order to how we worked on day36.

Definition A red-black-doubleblack tree is a binary search tree where in addition each node is colored red, black, or double-black. There are no adjacent red nodes (the no-red-edge property). Let the black-height of a root to leaf path be the sum of the colors on the path, where red counts \(0\), black counts \(1\) and double-black counts \(2\). In a red-black-doubleblack tree, each root to leaf path has the same black-height (called the black-height property).

Fixing lemmas

Lemma 1 Let \(v\) is the root of a red-black-doubleblack tree where \(v.left\) is black, \(v.right\) is double-black, and there are no other double-black nodes. Let \(v' = fix\_right\_doubleblack(v)\). Then \(v'\) is the root of a red-black-doubleblack tree with the same black-height as before. In addition, there is at most one double-black node and if a double-black node exists it is \(v\) itself.

Proof The proof is the same as the argument already given during the remove discussion and can just be copied here.

Lemma 2 Let \(v\) is the root of a red-black-doubleblack tree where \(v.left\) is double-black, \(v.right\) is black, and there are no other double-black nodes. Let \(v' = fix\_left\_doubleblack(v)\). Then \(v'\) is the root of a red-black-doubleblack tree with the same black-height as before. In addition, there is at most one double-black node and if a double-black node exists it is \(v\) itself.

Proof The proof has already been written as part of the remove discussion; it just needs to be copied here.

Lemma 3 Let \(v\) be the root of a left-leaning red-black-doubleblack tree where there is at most one double-black node and if it exists it is either \(v.left\) or \(v.right\). Let \(v' = fix\_doubleblack(v)\). Then \(v'\) is the root of a left-leaning red-black-doubleblack tree with the same black-height as before. In addition, there is at most one double-black node and if a double-black node exists it is \(v\) itself.

Proof Again, the proof has already been written as part of the remove discussion; it just needs to be copied here. The argument to copy is the part about why there are only three cases of the double-black child and its sibling color. When \(v.left\) is red and \(v.right\) is double-black, a flip_right followed by Lemma 1 above fixes the problem. The remaining cases with a black and double-black child just refer to either Lemma 1 or Lemma 2 above. The left-leaning property can be shown to be maintained by referring to the lemma from insert which fixed a left-leaning violation by flipping left.

Extract min

Lemma 4 Let \(u\) be the root of a left-leaning red-black tree and let \(v\) be any tree node. Let \(u' = extract\_min(u, v)\). Then \(u'\) is the root of a left-leaning red-black-doubleblack tree with the same black-height as before. In addition, the tree rooted at \(u'\) consists of all elements from the tree rooted at \(u\) except the element with smallest value, and there is at most one double-black node and if such a double-black node exists it is \(u'\) itself. Finally, after \(extract\_min\) returns, the value of the smallest element from among the subtree rooted at \(u\) is copied to be stored at \(v\).

Proof Again, we have mostly already described the proof on the previous pages. The proof works by induction on the size of the tree rooted at \(u\).

Remove

Lemma 5 Let \(v\) be the root of a left-leaning red-black tree and let \(x\) be any element. Let \(v' = remove\_helper(v, x)\). Then \(v'\) is the root of a left-leaning red-black-doubleblack tree consisting of all elements from the tree rooted at \(v\) except the element \(x\). In addition, the black-heights of \(v\) and \(v'\) are the same and there is at most one double-black node and if such a double-black node exists it is \(v\) itself.

Proof The proof is by induction on the size of the tree rooted at \(v\).

Optimized Red-Black Tree

The pseudocode and discussion of red-black trees focused on correctness and big-O execution time. How is something like red-black trees translated into an actual implementation? For example, red-black trees are used in the Linux kernel and you can see their implementation in rbtree.h and rbtree.c with some documentation in rbtree.txt.

If you look closely in rbtree.c, you can see insert and remove (called erase) that are performing most of the same operations. For example, push_black during insert starts on line 120. You can also see the flips. But there are differences as well: the use of parent pointers, a loop instead of recursion, and if you look in rbtree.h the struct definition has a weird way of storing colors. (Also, the implementation is not left-leaning and just fixes left and right in symmetric way.)

Despite these differences between our pseudocode which is recursive and this actual implementation, it is still good design to have gone through the recursive psuedocode first. Recursion matches almost exactly with the inductive proof of correctness making our task of verifying correctness easier. Then once we are certain we have the correct psuedocode with acceptable big-O execution time, we can think about making it as efficient as possible. Throughout this optimization process, we then verify that each optimization we carry out still causes the code to act the same as our pseudocode.

Tail-Recursion and parent pointers

Whenever we have a recursive procedure that we would like to optimize, our first question is if it can become tail-recursive. If it can be made tail-recursive, we can either manually convert it to an efficient loop or rely on the compiler performing the optimization. The task of making our insert and remove tail-recursive is to eliminate everything that occurs after the recursive call. But the actions the pseudocode performs after the recursion returns must execute for the insert procedure to work properly so how can they be removed? We don't remove them; instead we move these actions so that they are performed by the callee. So instead of recursing and then performing some actions, we will recurse and make our callee take these actions before returning, leaving us with nothing to do after the recursive call returns. Here is an example (actually we have already seen this example back when we talked about treap remove).

In insert, we recurse using a call such as

g.left = insert_helper(g.left, x)

Setting g.left to contain the return value is something that occurs after the recursion returns. So instead, we are going to make the call to insert_helper directly update g.left to hold the new parent before it returns. We even get some additional efficiency because we can only update the value if it actually changes. To do so, we can make the parameter to insert_helper a C++ reference since C++ references allow the callee to directly edit the value. So for example, we could write the following declaration for insert_helper.

void insert_helper(shared_ptr<Node> &g, Tkey key, Tval val) {
    // Code here that can set g to a new value if needed
}

The same optimization can be made for remove. What about the other actions performed after recursion returns? For insert, it was code such as the following fragment:

if is_red(g.left) and is_red(g.right):
    assert(g.color == black)
    push_black(g)

We want to change the code so that the callee instead performs these actions. Therefore, a node should check its parent and grandparent to see if there is a red-edge and if so fix it. The tests and fixes are the same but just from the perspective of the grandchild. To allow the grandchild to perform these tests, we need to have each node store a pointer to its parent. (We can't pass the parents as parameters since there might be arbitrarily many parents needed during the fix (as many as the height of the tree) and at that point we might as well just use the call-stack and not implement a tail-recursive function.) Once we add parent pointers, we can change the code to test for and fix the red-edge during insert exactly at the point once we add the new node. A similar transformation can occur for remove. After doing this so that directly after adding a new node or removing a node we fix all problems using parent pointers, we end up with the loops presented in the book.

The double-black node

Another optimization is that we don't need to store the double-black color in the tree itself. Note that the double-black node exists only during remove and is always the root of the subtree or one of the children. When it is a child, we immediately either fix it or move to be the root of the subtree. We could therefore just use a boolean variable to store if the current subtree root is double-black or not and don't need to store the double-black color in the tree node itself. Even better, once we fix the double-black node, we could just return since there is nothing more to fix. Thus we don't even need a variable, we just maintain that as long as the loop which is fixing the double-black node continues to execute it means the node we are currently located at is double-black.

Storing the colors

Data structures in all modern CPUs must be aligned, meaning that the memory holding a struct must start on a memory address which is a multiple of four (on a 32-bit x86 CPU) and a multiple of 8 (on a 64-bit x86_64 CPU). A pointer to a node struct consists of a memory address and since it all node structs start at multiples of 4, any pointer will have the bottom two bits zero. In other words, it is impossible for a node pointer to have anything besides zeroes in the bottom two bits. If we are thinking about saving memory, we can use these two bits to store the color. Therefore, inside the parent pointer we will store the color of the current node in the bottom bit. Of course, care must be taken to always remove the color from the parent pointer before trying to actually use it as a pointer. For example, from the linux kernel rbtree.h:

struct rb_node {
    unsigned long  __rb_parent_color;
    struct rb_node *rb_right;
    struct rb_node *rb_left;
}

Note how the __rb_parent_color property is unsigned long (guaranteed to be the same size as a pointer). The code will never directly access this property, instead to get the parent pointer it uses the following macro also in rbtree.h which clears the bottom two bits and casts the result to a pointer.

#define rb_parent(r)   ((struct rb_node *)((r)->__rb_parent_color & ~3))

Similarly, there are macros which test the bottom bits of __rb_parent_color to obtain the node's color. With this, the code in rbtree.c should be understandable (there is some WRITE_ONCE macros which is a memory barrier to protect multiple CPUs executing this code at the same time. If you are interested, the full details are documented here. Another good resource is this series of talks by Herb Sutter, one of the designers of C++).

Exercise

In red-black-with-remove.cpp I started implementing remove. I implemented the removal but did not implement any of the code to fix the violated red-black properties. To see what I changed, here is a diff red-black-remove.diff which shows just the changes made to implement remove. Your task is to fill in the fixing code: