Nov 2

Sum type

A sum type (also called a tagged union, discriminated union, or variant) is a data structure which can hold a value that can take on several different, but fixed, types. Only one of the types of data can be stored at a time, and the data structure contains a tag property which specifies which type of data is currently stored. For more background, some examples of why you would use sum types, and a discussion of various possible implementations, see like this article.

Unions

struct Foo {
    int a;
    float b;
};

union Bar {
    int a;
    float b;
}

Say we are on a 32-bit CPU where int and float are each 32-bits. Then the Foo struct tells C++ that we want to group memory into a 64-bit chunk. The first 32 bits will store an int and the second 32-bits will store a float. When we access the properties a and b, C++ accesses the respective memory locations which are distinct.

In contrast, Bar is a union which causes C++ to use the same memory for all properties. What this means is that Bar will only take 32 bits in memory. If we access the a property, C++ will interpret those 32 bits as an int. If we access the b property, C++ will interpret those 32 bits as a float. Obviously this is very dangerous! If you write to one property and read from the other, your program will have undefined behavior. At any point in time in your program, you should only be reading and writing to one of the properties a and b, not both!

The reference states "The union is only as big as necessary to hold its largest data member. The other data members are allocated in the same bytes as part of that largest member. The details of that allocation are implementation-defined, and it's undefined behavior to read from the member of the union that wasn't most recently written".

Tagged Unions

Because only one property of a union should be accessed at a time, the C++ guidelines on unions says to never use a naked union (a union such as Bar). Instead, the union should be embedded into a struct which stores a tag which specifies which property of the union is active. For example,

struct TaggedUnion {
    enum {INT_TAG, FLOAT_TAG} tag;
    union {
        int a;
        float b;
    };

    TaggedUnion(int intVal) : tag(INT_TAG), a(intVal) {};
    TaggedUnion(float floatVal) : tag(FLOAT_TAG), b(floatVal) {};
}

(Side note, the union above is called an anonymous union since it is not given a name. The C++ guidelines recommend anonymous unions for tagged unions, although the details in the guideline aren't written yet.)

The struct TaggedUnion will consist of enough memory to hold the tag (amount depends on the compiler) and then 32 bits which can be interpreted either as an int or a float. We then have two constructors, one constructor that initializes the TaggedUnion to an integer and one which initializes the TaggedUnion to a float. Each constructor initializes the tag and then the appropriate property of the union.

Note that we use a special syntax to initialize the properties. After the parameter list to the constructor, we have a colon and then a list of initializers. Each initializer is a property name, then an open paren, then the value that property should be initialized to, and then a close paren. For example, tag(INT_TAG). It sort of looks like function calls but it is not, it is how we can tell C++ the initial values of properties. (This syntax works for any struct and any properties, not just unions.)

Now that we have a tagged union, we can write code that first checks the tag, and depending on the tag accesses either the a or b property.

void print_tagged(TaggedUnion &u) {
    switch (u.tag) {
        case TaggedUnion::INT_TAG:
            cout << "Has an int: " << u.a << endl;
            break;
        case TaggedUnion::FLOAT_TAG:
            cout << "Has a float: " << u.b << endl;
            break;
    };
}

int main() {
    TaggedUnion u1(53); // call the int constructor which sets tag to INT_TAG
    TaggedUnion u2((float)3.7); //call the float constructor which sets tag to FLOAT_TAG
    print_tagged(u1);
    print_tagged(u2);
}

A function such as print_tagged is the main benefit of a tagged union: we can write functions which do different things depending on which value is stored inside the union.

Destructors

Before C++ 11, the properties of a union could not have destructors because C++ would not know which destructor should be called. For example, the above code is OK in older versions of C++ since the members of the union are int and float. But if you tried to change float to string, C++ will give you an error if you compile with -std=c++98. This is because the memory inside the union can either be semantically a float or a string, and C++ does not know which destructor to call.

As part of C++11, the C++ committee updated the standard to allow members of a union to have destructors. There are two complexities: you must manually call the proper destructor and you cannot change which member of a union is active without manually calling the destructor and constructor. It is straightforward to call the correct destructor just using a switch on the tag, similar to print_tagged above. Changing which member of a union is active requires a feature called "placement new". Since in this course we will never be changing the active member inside a tagged union, I won't explain it. If you are interested, Google is your friend.

#include <iostream>
using namespace std;

struct Foo {
    int fooIntVal;
    Foo(int aVal) : fooIntVal(aVal) {
        cout << "Foo cstr " << fooIntVal << endl;
    };
    ~Foo() {
        cout << "Foo destructor " << fooIntVal << endl;
    }
};

struct Bar {
    float barFloatVal;
    Bar(float bVal) : barFloatVal(bVal) {
        cout << "Bar cstr " << barFloatVal << endl;
    };
    ~Bar() {
        cout << "Bar destructor " << barFloatVal << endl;
    };
};

struct FooOrBar {
    enum {FOO_TAG, BAR_TAG} tag;
    union {
        Foo f;
        Bar b;
    };

    FooOrBar(int iVal) : tag(FOO_TAG), f(Foo(iVal)) {};
    FooOrBar(float fVal) : tag(BAR_TAG), b(Bar(fVal)) {};
    ~FooOrBar() {
        //call the proper destructor based on the tag
        switch (tag) {
            case FOO_TAG:
                f.~Foo();
                break;
            case BAR_TAG:
                b.~Bar();
                break;
        };
    };
};

void print_fooOrBar(FooOrBar &x) {
    switch (x.tag) {
        case FooOrBar::FOO_TAG:
            cout << "Foo with int " << x.f.fooIntVal << endl;
            break;
        case FooOrBar::BAR_TAG:
            cout << "Bar with float " << x.b.barFloatVal << endl;
            break;
    };
};

int main() {
    FooOrBar f1(12);
    FooOrBar f2((float)5.7);

    print_fooOrBar(f1);
    print_fooOrBar(f2);
}

Try running the above code and note how the destructor for Foo or Bar is called at the time that main returns.

Sum Types

While the above code works, there is a lot of chances for bugs to creep in. If the code forgets to access the tag and accesses the wrong property of the union, we have a bug. If the code switches the active property of the union and forgets to update the tag or doesn't call the destructor, there is a bug. A sum type is a safe tagged union: it handles internally the tag and the union and makes it so users of the tagged union can't make these mistakes. In C++, the main implementation of a sum type is boost's variant. This was designed before C++11 existed, and in C++11 some new features were added to the language which make using sum types easier so there exist alternative variants, for example egg-cpp/variant.

Exercise

json.cpp contains a tagged union for a JSON value. First, read the code and make sure you understand what it is doing and how the tagged union works. The reference for vector and map are helpful.

The exercise is for you to fill in the contents of the print_json function. It should use a switch on the tag and print to cout the appropriate contents of the json value. I have given you some helper functions to print strings, arrays, and maps, since I want you to concentrate on the tagged union part.