This isn't in the book, but the following methods are useful additions to the USet ADT. Remember that we can think of the elements as a pair of a key and a value.
union(S1, S2)
: union takes as input two USets S1 and S2 and returns a new USet S3 containing every element that appears in either S1 or S2. The input USets S1 and S2 are not modified. The only question is what to do when an element appears in both S1 and S2. In that case, we decide that the element from S1 wins, which we call a left-biased union. This choice is not universal, so whenever you use a union operation you should check the documentation for how duplicate elements are handled. Interpreting the union operation when the elements are key-value pairs, union returns a new USet containing the union of all keys from S1 and S2. If a key appears in both S1 and S2, the value from S1 is used; otherwise, the value attached to the key in its only appearance is used.
difference(S1, S2)
: difference takes as input two USets S1 and S2 and returns a new USet S3 containing all elements of S1 which are not in S2. Neither S1 nor S2 is modified. Interpreting the difference operation when the elements are key-value pairs, difference returns a new USet consisting of all keys from S1 which do not appear as a key in S2; the values from S1 are used.
intersection(S1, S2)
: intersection takes as input two USets S1 and S2 and returns a new USet S3 containing only elements which appear in both S1 and S2; the element from S1 is the one placed in S3. Neither S1 nor S2 is modified. Again, interpreting the intersection operation when the elements are key-value pairs, only keys which appear in both S1 and S2 are added to S3, and the value from S1 is used.
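To make the left-biased convention concrete, here is a small C++ sketch using `std::map` (the helper name `union_left_biased` is made up for this example, and `std::map` is not a persistent USet; it only illustrates which value survives when a key appears in both inputs):

    #include <iostream>
    #include <map>
    #include <string>

    // Left-biased union: start from a copy of s1, then insert the entries of s2.
    // std::map::insert never overwrites an existing key, so when a key appears in
    // both maps the value from s1 wins.
    std::map<std::string, int> union_left_biased(const std::map<std::string, int>& s1,
                                                 const std::map<std::string, int>& s2) {
        std::map<std::string, int> s3 = s1;   // s1 and s2 are not modified
        s3.insert(s2.begin(), s2.end());
        return s3;
    }

    int main() {
        std::map<std::string, int> s1 = {{"a", 1}, {"b", 2}};
        std::map<std::string, int> s2 = {{"b", 99}, {"c", 3}};
        for (const auto& [k, v] : union_left_biased(s1, s2))
            std::cout << k << " -> " << v << "\n";   // a -> 1, b -> 2, c -> 3
    }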
A persistent data structure is a data structure that is never modified in place; instead, each operation returns a new data structure. All of the above operations are persistent operations since they do not modify their inputs. At first glance this seems inefficient: every operation creates a new data structure, which would seem to require copying all the data.
But there is a neat optimization available if we know that all operations on a data structure are persistent: we don't need to copy all the data, and can instead reuse most of the nodes across multiple data structures.
Consider this figure from Wikipedia. Originally, there was a black binary search tree which was referred to by `xs`. Think of `xs` as a pointer of type `shared_ptr<Node>`. The picture shows the result of adding `e` into this binary search tree. First, starting from the black tree, we move down the tree going right, left, left. We then want to add `e` as the new left child of `f`. But in a persistent data structure we never edit any node, so we need to create a copy `f'` of `f` and set `e` as the new left child of `f'`. We then want to set the left child of `g` to be this new node `f'`, but of course we cannot edit `g`, so instead we create a new node `g'`. The left child of `g'` is set to `f'`. But next, notice how the right child of `g'` is set to equal the right child of `g`. That is, the node `h` is shared between both the old tree and the new tree! Similarly, we then create a clone `d'` of the node `d` and reuse the left child of `d`. We return a new value `ys` of type `shared_ptr<Node>` pointing to this new node `d'`.
Notice how we now have two data structures, one pointed to by `xs` and one pointed to by `ys`. The result of adding `e` did not change `xs` at all. We could still call `find` on `xs` and it would find all the same elements as before. Sharing the nodes between `xs` and `ys` is safe because we know that we never edit any node. For example, no operation performed on `ys` will ever modify any node shared with `xs`, so we can conceptually consider `xs` and `ys` two completely separate data structures.
Finally, notice how using shared pointers correctly deallocates the nodes. Suppose `xs` is deallocated. This decrements the reference count of `d` to zero, so `d` will be deallocated. When `d` is deallocated, the reference count of `b` is decremented, but only from two down to one, since `d'` still holds a `shared_ptr` to `b`. In general, each node's reference count is the number of trees it still belongs to, and a node will only be deallocated when no tree is using it anymore (consider the general picture when there are many trees all sharing different nodes).
The above discussion is for an unbalanced binary search tree, where things are simple; it gets more complicated for balanced structures. It is only in the past 10 years or so that fast, efficient balanced persistent data structures have appeared (the data structures were first defined in the 90s, but it wasn't until the 2000s that practical, fast implementations were written). For that reason, languages designed before this do not have persistent data structures; for example, the map in the C++ standard library is not persistent. Inertia plays a large role as well, because existing programming languages and existing programs largely use non-persistent data structures.
Recently, there has been a push to switch to persistent data structures. For example, Facebook is heavily pushing immutable-js, a JavaScript library with persistent implementations of data structures. In fact, that page has a good "Case for Immutability" section.
So I am now going to work towards implementing union, difference, and intersection on treaps. But before implementing these, we need a couple of utility methods. The `split` method takes as input a value \(x\) and returns two new treaps \(t_1\) and \(t_2\) and a node \(a\): \(t_1\) will contain all the elements of the original treap that are strictly less than \(x\), \(t_2\) will contain all the elements of the original treap that are strictly larger than \(x\), and \(a\) will be the node whose value equals \(x\) (if such a node exists).
Split works recursively:
    split(v, x):
        # Split returns a triple (t1, t2, a) where t1 is the root node of a treap with all elements
        # smaller than x, t2 is the root of a treap with all elements larger than x, and a is a node
        # equal to x.
        if v == nil:
            return (nil, nil, nil)
        elif v.x == x:
            # Found the node equal to x; return its left and right subtrees as the two sides of the split
            return (v.left, v.right, v)
        elif v.x > x:
            # x is in the left subtree of v, so recursively split
            (r1, r2, a) = split(v.left, x)
            # Create a new clone of v (we never edit any node).
            # vclone will be the root of the tree consisting of all elements larger than x.
            vclone = new node
            vclone.x = v.x
            vclone.p = v.p
            # The new left child of vclone is r2, all the elements from the left subtree of v
            # which were larger than x.
            vclone.left = r2
            # The right subtree of vclone is the same as the right subtree of v.
            vclone.right = v.right
            # Return r1 as the tree smaller than x, vclone as the tree larger than x, and a as the node
            # equal to x.
            return (r1, vclone, a)
        else:
            # x is in the right subtree of v
            (r1, r2, a) = split(v.right, x)
            # vclone will be the root of a new treap consisting of elements smaller than x
            vclone = new node
            vclone.x = v.x
            vclone.p = v.p
            vclone.left = v.left
            vclone.right = r1  # r1 is all the elements from the right subtree smaller than x
            return (vclone, r2, a)
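To see what the node sharing looks like with real pointers, here is a hedged C++ sketch of the same split on a persistent treap. The names `Node`, `Tree`, and `SplitResult` are made up for this example and keys are `int` for concreteness; as in the pseudocode, the original treap is never modified and every clone shares its untouched subtree with the original.

    #include <memory>

    struct Node;
    using Tree = std::shared_ptr<const Node>;

    struct Node {
        int x;             // the element (BST key)
        double p;          // the random heap priority
        Tree left, right;
    };

    struct SplitResult {
        Tree smaller;      // treap of all elements < x
        Tree larger;       // treap of all elements > x
        Tree equal;        // the node whose value is x, or nullptr if there is none
    };

    SplitResult split(const Tree& v, int x) {
        if (!v)
            return {nullptr, nullptr, nullptr};
        if (v->x == x)
            return {v->left, v->right, v};
        if (v->x > x) {
            // x lies in the left subtree; a clone of v roots the "larger" treap
            SplitResult r = split(v->left, x);
            Tree vclone = std::make_shared<const Node>(Node{v->x, v->p, r.larger, v->right});
            return {r.smaller, vclone, r.equal};
        } else {
            // x lies in the right subtree; a clone of v roots the "smaller" treap
            SplitResult r = split(v->right, x);
            Tree vclone = std::make_shared<const Node>(Node{v->x, v->p, v->left, r.smaller});
            return {vclone, r.larger, r.equal};
        }
    }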
Lemma: Let \(v\) be the root of a treap and let \(x\) be any element. Also, let \((t_1, t_2, a) = split(v, x)\). Then \(t_1\) and \(t_2\) are roots of treaps, where \(t_1\) contains all elements of \(v\) which are smaller than \(x\), \(t_2\) is the root of a treap which contains all elements of \(v\) which are larger than \(x\), and \(a\) is a node with \(a.x = x\) or \(a\) is nil if no such node exists.
Proof: The proof is by induction on the size of the treap rooted at \(v\).
Base case: If \(v\) is nil, returning \(t_1\), \(t_2\), and \(a\) all nil is correct.
Base case: \(v.x = x\). If \(v.x\) is equal to \(x\), then by the properties of treaps we know that \(v.left\) is the root of a treap consisting of all elements smaller than \(v.x\) and \(v.right\) is the root of a treap consisting of all elements larger than \(x\). Therefore, returning \((v.left, v.right, v)\) is correct.
Inductive case when \(v.x > x\). Everything to the right of \(v\) is larger than \(v.x\), which is larger than \(x\), so everything in the right subtree is larger than \(x\). From the elements to the left of \(v\), some will be smaller than \(x\) and some will be larger. We therefore first split the left subtree around \(x\), getting \((r_1, r_2, a) = split(v.left, x)\). By induction, \(r_1\) is the root of a treap consisting of all elements smaller than \(x\) from the left subtree of \(v\). Since no other elements are smaller than \(x\), returning \(t_1 = r_1\) is correct. Also by induction, \(r_2\) is the root of a treap consisting of all elements from the left subtree of \(v\) which are larger than \(x\). We need to combine \(r_2\) with \(v\) itself as well as the subtree \(v.right\) to form a treap of all elements larger than \(x\). Since \(v\) was originally the root of a treap, everything below it has larger priority than \(v\); therefore \(v\) will be the root of this new treap of elements larger than \(x\). But since we want a persistent data structure, we do not edit \(v\); instead, we create a clone \(vclone\) of \(v\). First, \(vclone\) gets the same \(x\) and \(p\) values as \(v\). Then we set \(vclone.left = r_2\): since \(r_2\) is a subset of elements from the left subtree of \(v\), all elements in \(r_2\) have smaller \(x\) value than \(v\), so they belong in the left subtree of \(vclone\). Finally, \(v.right\) is the subtree of all elements larger than \(v.x\), so it becomes the right subtree of \(vclone\). Assembling the elements in this way makes \(vclone\) the root of a treap, so returning \(t_2 = vclone\) is correct.
Inductive case when \(v.x < x\). This is very similar to the previous case except we recurse to the right. Now \(r_2\) is the root of a treap of all elements larger than \(x\) so it is correct to return \(t_2 = r_2\). We create a new clone \(vclone\) to be the root of a treap of all elements smaller than \(x\), consisting of the original left child of \(v\) plus whatever the recursive split returned as the elements from the right smaller than \(x\).
Split runs in time \(O(height)\) since it recurses into the left or right child and each recursive call does \(O(1)\) work; therefore, split runs in expected time \(O(\log n)\). In addition, the space usage of split is expected \(O(\log n)\), since each call stack frame uses \(O(1)\) space. It is possible to use the reference trick, similar to remove, to make split tail-recursive: before recursing, we create the new node, and we pass references to \(t_1\) and \(t_2\) so the callee can set them directly instead of returning them (a sketch of this is given below). If you are interested, page 3 of this paper has the C++ code for such a tail-recursive split.
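Here is one way that reference trick could look, reusing the `Node` and `Tree` types from the split sketch above (again my own hedged rendering, not the code from the paper). Each step allocates the clone first and then recurses directly into the clone's empty child slot, so the recursive call is in tail position; the clone is only mutated while it is being built, before `split` returns.

    // Tail-recursive split: the caller passes in the locations that should
    // receive the two treaps and the equal node.
    void split(const Tree& v, int x, Tree& smaller, Tree& larger, Tree& equal) {
        if (!v) {
            smaller = nullptr; larger = nullptr; equal = nullptr;
            return;
        }
        if (v->x == x) {
            smaller = v->left; larger = v->right; equal = v;
            return;
        }
        if (v->x > x) {
            // clone v now; its left child will be filled in by the recursive call
            auto vclone = std::make_shared<Node>(Node{v->x, v->p, nullptr, v->right});
            larger = vclone;
            split(v->left, x, smaller, vclone->left, equal);    // tail call
        } else {
            auto vclone = std::make_shared<Node>(Node{v->x, v->p, v->left, nullptr});
            smaller = vclone;
            split(v->right, x, vclone->right, larger, equal);   // tail call
        }
    }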
    union(v1, v2):
        # v1 and v2 are roots of treaps.
        # The return value is the root of a treap which is the left-biased union of v1 and v2.
        # If either v1 or v2 is nil, the other is the result of the union.
        if v1 == nil:
            return v2
        if v2 == nil:
            return v1
        if v1.p < v2.p:
            # We know v1.p < v2.p, so v1 will become the new root of the union.
            # First, split v2 around v1.x, since the elements from v2 must be put on
            # the left and right of v1 once v1 becomes the new root.
            (t1, t2, a) = split(v2, v1.x)
            # Create a new root to hold the result of the union, which is a clone of v1
            w = new node
            w.x = v1.x
            w.p = v1.p
            # t1 is all elements from v2 which are smaller than v1.x, so these must be
            # unioned into v1.left.
            w.left = union(v1.left, t1)
            # t2 is all elements from v2 which are larger than v1.x, so these must be
            # unioned into v1.right.
            w.right = union(v1.right, t2)
            return w
        else:
            # We know v1.p >= v2.p, so v2 will become the new root of the union.
            # Split v1 around v2.x so the elements from v1 can be merged into v2.
            (t1, t2, a) = split(v1, v2.x)
            # A new clone of v2
            w = new node
            w.p = v2.p
            # If a is not nil, we want to use the x value from v1 because our union is left-biased
            if a != nil:
                w.x = a.x
            else:
                # Otherwise, v2.x does not exist in v1 so use v2.x
                w.x = v2.x
            w.left = union(t1, v2.left)
            w.right = union(t2, v2.right)
            return w
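Continuing the hedged C++ sketch (reusing `Node`, `Tree`, and `split` from above; `union` is a keyword in C++, so the function is named `union_` here), the same algorithm looks like this:

    // Left-biased union of two persistent treaps; neither input is modified.
    Tree union_(const Tree& v1, const Tree& v2) {
        if (!v1) return v2;
        if (!v2) return v1;
        if (v1->p < v2->p) {
            // v1 becomes the root; split v2 around v1->x and union the two halves
            // into v1's subtrees. Any node of v2 equal to v1->x is dropped (left bias).
            SplitResult s = split(v2, v1->x);
            return std::make_shared<const Node>(Node{
                v1->x, v1->p, union_(v1->left, s.smaller), union_(v1->right, s.larger)});
        } else {
            // v2 becomes the root; if v1 also contains v2->x, the element from v1 is
            // kept (with plain int keys the bias is invisible, but with key-value
            // pairs this is where v1's value would win).
            SplitResult s = split(v1, v2->x);
            int key = s.equal ? s.equal->x : v2->x;
            return std::make_shared<const Node>(Node{
                key, v2->p, union_(s.smaller, v2->left), union_(s.larger, v2->right)});
        }
    }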
Let \(u_1\) have size \(n\) and \(u_2\) have size \(m\), with \(m \leq n\). The expected running time of \(union(u_1, u_2)\) is \(O(m \log(\frac{n}{m}))\). The analysis is complicated (the full details are in this paper), but heuristically we can argue as follows. Each instance of `union` runs in expected time \(O(\log n)\), since the call to split will process one of \(u_1\) or \(u_2\) and split runs in expected time logarithmic in the size of the tree. We therefore want to estimate the number of times `union` will execute. Note that each recursive instance of `union` processes a single node `w` from either \(u_1\) or \(u_2\) and then recurses on all other nodes below `w`. In other words, we can think of each instance of `union` as taking a single unprocessed node `w`, adding it into the result tree, and recursing on the remaining unprocessed nodes. Due to the randomness in the choice of \(p\), we expect it to be equally likely for union to select `w` from \(u_1\) or from \(u_2\). Since \(u_2\) is smaller and we expect to select something from \(u_2\) roughly half the time, we will likely have taken everything from \(u_2\) after \(O(|u_2|) = O(m)\) instances of `union`. Once everything from \(u_2\) has been taken, we are at a base case where either `v1` or `v2` is nil and the recursion stops. Therefore, we expect \(O(m)\) total calls to union, and each call takes expected \(O(\log n)\) time, for a total of \(O(m \log n)\). The place where we were a little too loose in the analysis is that as we proceed through the union calls, the trees passed to `split` get smaller and smaller, so not every call to `split` will take \(O(\log n)\) time.
We can insert a new element into a treap by creating a singleton treap consisting of just the new element and then unioning it with the existing treap.
    insert(v, x):
        # v is the root of a treap and x is a new element.
        # Returns a new root of a treap with x inserted.
        w = new node
        w.x = x
        w.p = random number
        return union(w, v)
The running time is expected \(O(\log n)\), since the size of the smaller treap passed to union is \(m = 1\).
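As a last piece of the hedged C++ sketch, insert on top of `union_` might look like this (with `std::mt19937` standing in for "random number"):

    #include <random>

    // Build a one-node treap with a random priority and take the left-biased
    // union with the existing treap. Passing w first means a re-inserted key
    // keeps the new element.
    Tree insert(const Tree& v, int x) {
        static std::mt19937 gen{std::random_device{}()};
        static std::uniform_real_distribution<double> dist(0.0, 1.0);
        Tree w = std::make_shared<const Node>(Node{x, dist(gen), nullptr, nullptr});
        return union_(w, v);
    }

    // Usage: t1 is not changed by the second insert.
    // Tree t1 = insert(nullptr, 3);
    // Tree t2 = insert(t1, 5);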
Write a proof that union works correctly. In particular, prove that the return value of union is a treap, i.e. the return value satisfies both the BST property and the heap property. Also, argue why the operation is persistent and does not edit the existing treaps.