Tech Musings and Assorted Thoughts: Getting Started with Boost

Ten nifty tricks to incorporate into your armoury in your first week of using Boost

Boost has a wide array of libraries, and they often differ in complexity, expanse and sometimes power, but not in usefulness. Obviously a utility to convert strings to integers will not have the same complexity or power as a library for functional compositions. But both are useful in their own right. Today, Boost has close to 40 libraries (may be more) and this large number, along with the wide variety therein, sometimes serves to overawe the new learner. This article has a specific purpose - to get you started with a few, relatively easy, nifty tricks that you can employ in many general cases to add value and usability to your code, without having to be a Boost expert. Such head-starts serve to increase our motivation to use a tool, exposing us to the first hand benefits of a tool at the expense of a little effort. So without further ado, let me cut to the chase.

1. Time measurements and thermometer bars

The following piece of code illustrates how to use the Boost.Timer library to measure elapsed time between two points in code. Using this facility, you can easily measure the time taken to execute a certain section of code. To view the sorce code with numbered lines, click "Show line numbers".

Show line numbers

    1 #include <iostream>
    2 #include <sstream>
    3 #include <boost/progress.hpp>
    4 #include <windows.h>
    5 using boost::progress_timer;
    6 using std::cout;
    7 using std::stringstream;
    8 using std::endl;
    9
   10 int main()
   11 {
   12     stringstream os;
   13     double elapsed_time = 0.0;
   14     {
   15         progress_timer t(os);
   16         int arr[] = {0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15};
   17         const int arr_elems = sizeof(arr)/sizeof(int);
   18
   19         for ( int i=0; i<arr_elems; i++ ) {
   20             Sleep(3000);
   21         }
   22     }
   23
   24     if ( os >> elapsed_time ) {
   25         cout << "Net elapsed time: " << elapsed_time << endl;
   26     } else {
   27         cout << "Net elapsed time: " << os.str() << endl;
   28     }
   29
   30     return 0;
   31 }

The Boost header we used for our purpose is progress.hpp. We create an array of 16 integers and simulate some processing on this array by iterating through the elements of the array and pausing for 3 seconds for each element to simulate a period of processing (line 20). We intend to find the time taken to set up the array and then perform this dummy operation. To do this, we enclose the relevant part of the code (lines 14 through 22) in braces to define a scope. At the start of the scope, we declare an object of boost::progress_timer and initialize it with a stream object which can be written to. At the end of the scope, the progress timer object will write the total time elapsed to the stream object. Thereafter we either extract the double precision floating point value of the time elapsed in seconds from the stream (line 24) or directly write the stream's contents to the standard output (line 27).

2. Tokenizing strings and seamless value conversions

Suppose you are working on a banking back-office system and are scrubbing the summary data from your bank's daily closing report. Now this report is in some delimited format which your report rendering software understands, but which is not XML or something. Your requirement is simple - you want to extract all the data from the report and do some processing on it. In other words, you need to parse the report data. Second, there are numeric items in the list as well and you want to do quick comparisons of some of those - like the age of the account holder and the net balance in her account. All these quantities will be read as strings from the input so we will also need an effective and consistent way to convert these to numeric values. Parsing data of even moderate complexity can be a tough job and while the standard library containers and string classes provide adequate tools for doing this job, the code you'd end up with often tends to be long and not very readable. And converting string to numeric types isn't always straightforward either. On the other hand you don't want to spend more than half-hour in writing the code. What do you do. Well, you use the Boost.Tokenizer and Boost.Conversion libraries and you are through.

Suppose the data comes to you in the form of several records delimited inside braces ( { and } ). Within each record, several fields are delimited by semi-colons (;) and each field is a name-value pair of the form name=value or name:value. You don't want to be bothered about verifying the correctness of the grammar - you could assume that the document is well-formed and you then want to extract the data. Here is some sample data.

{name:Chaim Rabinowitz;acc:5629311372;bank:Citibank N.A.;branch:Tel-Aviv;age:69;balance:100000.00}{name:Mordechai Silberman;acc:46268675539;bank:Carmel Bank;branch:Haifa;age:43;balance:20000.00}{name:Amir Belleli;acc:5538639473;bank:Carmel Bank;branch:Nahariyya;age:54;balance:64000.00}{name:Liat Gershgoren;acc:5538639473;bank:The Hebrew Bank;branch:K'far Etzion;age:27;balance:28000.00}

And here is the code.

Show line numbers

    1 #include <iostream>
    2 #include <boost/tokenizer.hpp>
    3 #include <boost/algorithm/string.hpp>
    4 #include <boost/lexical_cast.hpp>
    5 #include <string>
    6 #include <vector>
    7 using std::cout;
    8 using std::endl;
    9 using std::string;
   10 using std::vector;
   11 using boost::char_separator;
   12 using boost::split;
   13 using boost::lexical_cast;
   14
   15 struct AccountSummary {
   16     string name;
   17     int age;
   18     string bank;
   19     string branch;
   20     string account_number;
   21     double balance;
   22
   23     AccountSummary() : age(0), balance(0.0) {
   24     }
   25 };
   26
   27 int main()
   28 {
   29     string str = "{name:Chaim Rabinowitz;acc:5629311372;bank:Citibank N.A.;branch:Tel-Aviv;age:69;balance:100000.00}"
   30                  "{name:Mordechai Silberman;acc:46268675539;bank:Carmel Bank;branch:Haifa;age:43;balance:20000.00}";
   31
   32     typedef boost::tokenizer<char_separator<char> > tokenizer;
   33     char_separator<char> sep("{}");
   34     tokenizer records(str, sep);
   35     vector<AccountSummary> account_data;
   36     bool bad_data = false;
   37
   38     for (tokenizer::iterator tok_iter = records.begin(); tok_iter != records.end(); ++tok_iter) {
   39         string next_record = *tok_iter;
   40         AccountSummary next_acc_data;
   41         char_separator<char> sep2(";");
   42         tokenizer fields(next_record, sep2);
   43
   44         try {
   45             for (tokenizer::iterator fld_iter = fields.begin(); fld_iter != fields.end(); ++fld_iter) {
   46                 string next_field = *fld_iter;
   47                 vector<string> key_val;
   48                 split(key_val, next_field, boost::is_any_of(":="));
   49
   50                 if ( key_val.size() >= 2 ) {
   51                     const string& key = key_val[0];
   52                     const string& value = key_val[1];
   53
   54                     if (key == "name") {
   55                         next_acc_data.name = value;
   56                     } else if (key == "acc") {
   57                         next_acc_data.account_number = value;
   58                     } else if (key == "bank") {
   59                         next_acc_data.bank = value;
   60                     } else if (key == "branch") {
   61                         next_acc_data.branch = value;
   62                     } else if (key == "age") {
   63                         next_acc_data.age = lexical_cast<int>(value);
   64                     } else if (key == "balance") {
   65                         next_acc_data.balance = lexical_cast<double>(value);
   66                     } else { // Do nothing
   67                     }
   68                 }
   69             }
   70         }  catch(boost::bad_lexical_cast& be) {
   71             bad_data = true;
   72         }
   73
   74         if ( bad_data ) {
   75             bad_data = false;
   76             continue;
   77         }
   78         account_data.push_back(next_acc_data);
   79     }
   80
   81     return EXIT_SUCCESS;
   82 }

Obviously, this code does a few things with the data, but certainly does not print it. For a nice way to print this data in a suitable format, look at section 4 below.

3. Expressive code with zero resource-leaks

In an ideal world, all libraries and APIs you work with will have perfect semantics and you will never need to make compromises of any kind in your code. In such a world, you would only need to work with first class C++ objects which take care of their own initialization and clean-up in the C++ tradition of construction and destruction. In the real world however, you would be writing C++ applications to talk to legacy systems using a C API that is less than ideal. In that real world, you will have strdup-style C APIs that return dynamically allocated storage to you and expect you to take care of the life cycle of the pointer, remembering to call free on the pointer once you are done. As a real life programmer you would surely face such dilemmas several times, in several forms - and that's where the family of Smart pointers comes into the picture.

Smart pointers are simple classes which can "wrap" pointers to objects of different kinds, and in some cases arrays of objects. The primary advantage in the use of such wrappers is the seamless life-cycle management of these pointers (through the RAII idiom) and exception-safety. In fact the Smart Pointer collection from Boost is so good that you no more have an excuse to write C++ code that has memory leaks. I am not about to start a discussion on either the RAII-idiom or exception-safety - primarily because of lack of space in this column, and sites that handle the topics with more focus. Here we will see how the use of Boost's smart pointers provide power, simplicity and correct, reliable behaviour to your code.

Consider the following piece of code - let me tell you, it does not compile.

Show line numbers

    1 #include <iostream>
    2 #include <string>
    3 #include <cctype>
    4
    5 using std::cout;
    6 using std::endl;
    7 using std::string;
    8
    9 extern "C" int strucase(char *str);
   10
   11 int strucase(char *str) {
   12     int count = 0;
   13     if ( str ) {
   14         do {
   15             (islower(*str)) ? (*str=toupper(*str), count++) : 0;
   16         } while(*str++);
   17     }
   18
   19     return count;
   20 }
   21
   22 string new_api() {
   23     return "Eddie Cochran";
   24 }
   25
   26 int main()
   28 {
   29     string str = new_api();
   30     strucase(str.c_str());
   31
   32     return 0;
   33 }

Given a C++ string, if you want to pass it's contents as a char* to a function that takes a (non-const) char* rather than a const char* (like strucase above), you have to make a copy of the string's underlying character array into a new array, and pass the base address of the array instead. It's a bit of a fib when I say you have to - what good is const_cast for after all? But then, what guarantee do you have that this strucase function that takes a non-const char* does not write to the string you passed. In fact, in this case strucase does write to the string, converting it in place to an all upper-case ASCII string. So the call at line 30 above fails (click Show line numbers to see line numbers). The string returned by new_api can be of arbitrary length (not exceeding platform specific limits, still) so you cannot create an array, with a size known at compile time, to hold the copy of the string. In other words, you have to allocate memory dynamically, and whenever you do that there is the hassle of correctly deallocating the allocated memory once every part of your code is done with using it. The following code addresses the issue with a smart array from Boost.

Show line numbers

    1 #include <iostream>
    2 #include <string>
    3 #include <cstring>
    4 #include <boost/scoped_array.hpp>
    5 #include <cctype>
    6
    7 using std::cout;
    8 using std::endl;
    9 using std::string;
   10 using boost::scoped_array;
   11
   12 extern "C" int strucase(char *str);
   13
   14 int strucase(char *str) {
   15     int count = 0;
   16     if ( str ) {
   17         do {
   18             (islower(*str)) ? (*str=toupper(*str), count++) : 0;
   19         } while(*str++);
   20     }
   21
   22     return count;
   23 }
   24
   25 string new_api() {
   26     return "Eddie Cochran";
   27 }
   28
   29 int main() {
   30     string str = new_api();
   31     scoped_array sa( new char[str.length()+1] );
   32     char *copied_str = sa.get();
   33     strncpy(copied_str, str.c_str(), str.length());
   34     copied_str[str.length()] = 0;
   35     strucase(copied_str);
   36
   37     return 0;
   38 }

The first good thing about the above code is that it ensures there is no memory leak - the array allocated on line 31 is wrapped inside the scoped array object sa and at the end of the enclosing scope (in this case the main function), the destructor of scoped array kicks in and deallocates the entire array by calling delete[] on it.

Now imagine that instead of all this code being written inside the main() function, it was a different function which carried out these operations returned the allocated character array. Something like the following will obviously crash your code:

Show line numbers

    1 char* my_new_string() {
    2     string str = new_api();
    3     scoped_array sa( new char[str.length()+1] );
    4     char *copied_str = sa.get();
    5     strncpy(copied_str, str.c_str(), str.length());
    6     copied_str[str.length()] = 0;
    7     strucase(copied_str);
    8
    9     return copied_str;
   10 }
   11
   12 int main() {
   13     char * new_str = my_new_string();
   14     cout << new_str << endl;
   15     char c='a';
   16     new_str[0] = c;
   17 }

Lines 14 and 16 are recipes for disaster because new_str is a copy of the pointer copied_str that the my_new_string() function passes back to you. And both these are aliases or copies of the underlying pointer in the scoped_array in my_new_string(). As soon as my_new_string() returns, scoped_array makes sure that the memory pointed to by its underlying pointer is deallocated. So in the main() function, you are dealing with an invalid memory reference - new_str.

What do we do? Pass back a copy of the scoped_array object itself? For one, scoped_array cannot be copied. Even if it could be, whether that would work or not would depend on whether or not scoped_array detects copying and such replication operations and keeps a count of the places in the code that have valid references to the object. This is called reference counting and scoped_array does not do it. Boost shared_array does. All you need to change in the above code is this:

Show line numbers

    1 shared_array<char> my_new_string() {
    2     string str = new_api();
    3     shared_array sa( new char[str.length()+1] );
    4     char *copied_str = sa.get();
    5     strncpy(copied_str, str.c_str(), str.length());
    6     copied_str[str.length()] = 0;
    7     strucase(copied_str);
    8
    9     return sa;
   10 }
   11
   12 int main() {
   13     shared_array<char> sa = my_new_string();
   14     char *new_str = sa.get();
   15     cout << new_str << endl;
   16     char c='a';
   17     new_str[0] = c;
   18 }

Now, whenever you need to pass around a dynamically allocated array, just pass around the shared_array wrapper for it. It takes care of everything else.

There are more goodies in the Boost Smart Pointer library, including its work-horse - shared_ptr. In fact having already shown scoped_array and shared_array, it is not much work now to get up and running with scoped_ptr and shared_ptr. Both scoped_ptr and shared_ptr are capable of encapsulating a pointer to a single object on dynamically allocated storage. scoped_ptr is, of course, not copiable and not reference counted - suitable for operations limited in a single scope. shared_ptr on the other hand is a remarkable mobile smart pointer - you can freely copy and pass it around, it is fully reference counted. These classes overload the -> and * operators and therefore working with them is a breeze. Here is a short example to wrap up this section.

Show line numbers

    1 #include <iostream>
    2 #include <string>
    3 #include <boost/shared_ptr.hpp>
    4
    5 using std::cout;
    6 using std::endl;
    7 using std::string;
    8 using boost::shared_ptr;
    9
   10 struct Foo {
   11     void bar() {}  
   12 };
   13
   14 typedef shared_ptr<Foo> FooPtr;
   15
   16 FooPtr createFoo() {
   17     FooPtr sp(new Foo);
   18     return sp;
   19 }
   20
   21 int main() {
   22     FooPtr sp = createFoo();
   23     sp->bar();
   24 }

Imagine how you would have handled the case where you could need to create arbitrary objects on the free-store with new and pass them around in the program, across scopes, sharing between multiple owning objects and scopes. Without the shared_ptr, such brevity was impossible.

In reality, shared_ptr models a "shared ownership" idiom that concerns managing the life cycle of the underlying heap-object. Sometimes, this is precisely what we want to avoid - ownership. In such cases, shared_ptr fails to prevent memory leaks and we need other alternatives. For more details, see weak_ptr's.

4. Using reference wrappers to write efficient code

Suppose you are dealing with objects of a class which cannot be copied, and perhaps not be assigned either. May be you are modelling humans in a social group and one of the assumptions / constraints is that humans cannot be replicated or cloned. So a Human object would essentially be a non-assignable, non-copiable object. And yet, for the purposes of effectively managing large groups of humans, you have to use different kinds of containers. May be you are modelling the behaviour of humans in a queue and you choose an std::list here - because the Humans have the liberty to leave the queue at any point. A list perhaps also makes sense, because a Human might opportunistically join a queue at the middle rather than at the end where he is actually supposed to join - to reduce his personal wait time. Whatever the social engineering issue here, we have a bigger computer programming issue to beat. Humans are not copiable or assignable - but an std::list container requires its elements to be both. You have no choice - either make the Humans copiable (an untenable model) or roll out your own containers which can store object references - or so you sulk. Enter Boost reference wrappers - a feature that should actually make it to the next release of the C++ Language and Standard Library (C++0x) - and you have reason to cheer.

The idea behind reference wrappers is simple. Create a trivial wrapper around a reference to an object and make this wrapper copiable and assignable. This wrapper object stores a reference or pointer to the real object. The wrapper object can be easily copied and assigned - without the underlying object getting replicated. Finally, the wrapper object should transparently degrade to the underlying type in contexts where an object of the underlying type is needed. This makes a reference wrapped object very easy to store in STL containers and easy to use in STL algorithms. The boost::reference_wrapper<T> template conforms to all these requirements and does a great job of it. Consider the following code.

Show line numbers

    1 #include <iostream>
    2 #include <vector>
    3 #include <algorithm>
    4 #include <boost/ref.hpp>
    5 using std::cout;
    6 using std::endl;
    7
    8 struct Human {
    9     Human() : id_(++count) {
   10     }
   11
   12     const int id() const {
   13         return id_;
   14     }
   15
   16 private:
   17     const int id_;
   18     Human(const Human& srt);
   19     Human& operator = (const Human& srt);
   20     static int count;
   21 };
   22
   23 int Human::count = 0;
   24
   25 template<class T>
   26 struct PrintId {
   27     int operator()(T t) {
   28         cout << t.id() << endl;
   29         return 0;
   30     }
   31 };
   32
   33
   34 int main() {
   35     using std::vector;
   36     Human h1, h2, h3, h4;
   37
   38     vector< boost::reference_wrapper<Human const> > people;
   39     people.reserve(4);
   40     people.push_back(boost::cref(h1));
   41     people.push_back(boost::cref(h2));
   42     people.push_back(boost::cref(h3));
   43     people.push_back(boost::cref(h4));
   44
   45     std::for_each(people.begin(), people.end(), PrintId<Human const&>());
   46 }

The above code defines the Human class, and each object of the class gets a unique id at the time it is created - which distinguishes it as a unique individual, distinct from others of the same kind (Humans).

Four humans are created and they are stored in a fixed order in a vector of reference wrapped Humans. This is necessary because we could not have had a vector<Human> since Human is not a copiable class. Since boost::reference_wrapper<T> is copiable, we can create a vector<boost::reference_wrapper<T> >.

Finally, we print the unique id of each Human in the people vector. If you notice, we actually reference wrapped and stored const Human, and we created such reference wrapped objects from regular objects using the call boost::cref(...). You could have used boost::ref(...) instead to create non-const references. Most importantly, the functor PrintId does not need to know anything about reference wrappers. It's overloaded operator() takes a normal reference to a const Human, and when the for_each algorithm is used, the reference wrapper elements from the vector gracefully degrade to const Human&. This is huge convenience if you consider how brief the code is.

But if you thought this is the limit of reference_wrapper's applicability, you have been deceived. The reference_wrapper class is not only an idiom-enabling construct (allowing you to store references in STL containers), it is also an optimization-enabling construct and in many cases, changing existing code to use reference_wrappers can actually reap tremendous performance benefits. Consider the following piece of STL code:

Show line numbers

    1 #include <iostream>
    2 #include <vector>
    3 #include <algorithm>
    4 using std::cout;
    5 using std::endl;
    6
    7 struct Foo {
    8     Foo() : id_(++count) {
    9         cout << "Object constructed: id = " << id_ << endl;
   10     }
   11
   12     const int id() const {
   13         return id_;
   14     }
   15
   16     Foo(const Foo& srt) : id_(++count) {
   17         cout << "Object constructed: id = " << id_
   18              << ", copied from Object[ id = " << srt.id() << " ]" << endl;
   19     }
   20
   21     Foo& operator = (const Foo& srt) {
   22         return *this;
   23     }
   24
   25 private:
   26     const int id_;
   27     static int count;
   28 };
   29
   30 int Foo::count = 0;
   31
   32 template<class T>
   33 struct Bar {
   34     int operator()(T t) {
   35         cout << "Object[ id = " << t.id() << " ]" << endl;
   36         return 0;
   37     }
   38 };
   39
   40
   41 int main() {
   42     using std::vector;
   43     Foo f1, f2, f3, f4;
   44
   45     vector< Foo > foos;
   46     foos.reserve(4);
   47     foos.push_back(f1);
   48     foos.push_back(f2);
   49     foos.push_back(f3);
   50     foos.push_back(f4);
   51
   52     std::for_each(foos.begin(), foos.end(), Bar<Foo const&>());
   53 }

This shows a regular class Foo, which retains a unique id (from the Human example) being stored in a vector< Foo > and then the functor Bar<Foo const&> being invoked on each element of the vector. The Foo object is copiable and assignable, although on copy-construction the new object gets its own unique id, and even on assignment, this unique id is retained and not overwritten. Whenever a new object is created through the default constructor, a message of the form:

Object constructed: id = <object_id>

is printed. Whenever a new object is created through the copy constructor, a similar message of the form:

Object constructed: id = <object_id>, copied from Object[ id = <source_object_id> ]

is printed. On my system, this prints the following messages.

Object constructed: id = 1
Object constructed: id = 2
Object constructed: id = 3
Object constructed: id = 4
Object constructed: id = 5, copied from Object[ id = 1 ]
Object constructed: id = 6, copied from Object[ id = 2 ]
Object constructed: id = 7, copied from Object[ id = 3 ]
Object constructed: id = 8, copied from Object[ id = 4 ]
Object[ id = 5 ]
Object[ id = 6 ]
Object[ id = 7 ]
Object[ id = 8 ]

If I remove the call to the reserve method of the vector at line 46, you might see a few more copies of the Foo objects being done as the vector initially expands upon each push_back. On my system (MSVC 9), with line 46 commented, the output looks like:

Object constructed: id = 1
Object constructed: id = 2
Object constructed: id = 3
Object constructed: id = 4
Object constructed: id = 5, copied from Object[ id = 1 ]
Object constructed: id = 6, copied from Object[ id = 5 ]
Object constructed: id = 7, copied from Object[ id = 2 ]
Object constructed: id = 8, copied from Object[ id = 6 ]
Object constructed: id = 9, copied from Object[ id = 7 ]
Object constructed: id = 10, copied from Object[ id = 3 ]
Object constructed: id = 11, copied from Object[ id = 8 ]
Object constructed: id = 12, copied from Object[ id = 9 ]
Object constructed: id = 13, copied from Object[ id = 10 ]
Object constructed: id = 14, copied from Object[ id = 4 ]
Object[ id = 11 ]
Object[ id = 12 ]
Object[ id = 13 ]
Object[ id = 14 ]

Things look only marginally better with gcc 4.1. This copying is at its peak initially when the vector starts to grow because the vector's storage is repeatedly reallocated. It becomes much less pronounced when the vector becomes larger. This should be a good enough reason to use the reserve method to optimize on unnecessary reallocation and copying.

If you now notice, in this piece of code, we are not retaining using the vector after any of the Foo objects f1, f2, f3 or f4 go out of scope. So, if we could store references to the Foo objects in the vector, it would have been a lot more efficient - with almost no copying needed. Here is the code for it. You can compare its output with that of the code above.

Show line numbers

    1 #include <iostream>
    2 #include <vector>
    3 #include <algorithm>
    4 #include <boost/ref.hpp>
    5 using std::cout;
    6 using std::endl;
    7
    8 struct Foo {
    9     Foo() : id_(++count) {
   10         cout << "Object constructed: id = " << id_ << endl;
   11     }
   12
   13     const int id() const {
   14         return id_;
   15     }
   16
   17     Foo(const Foo& srt) : id_(++count) {
   18         cout << "Object constructed: id = " << id_
   19              << ", copied from Object[ id = " << srt.id() << " ]" << endl;
   20     }
   21
   22     Foo& operator = (const Foo& srt) {
   23         return *this;
   24     }
   25
   26 private:
   27     const int id_;
   28     static int count;
   29 };
   30
   31 int Foo::count = 0;
   32
   33 template<class T>
   34 struct Bar {
   35     int operator()(T t) {
   36         cout << "Object[ id = " << t.id() << " ]" << endl;
   37         return 0;
   38     }
   39 };
   40
   41
   42 int main() {
   43     using std::vector;
   44     Foo f1, f2, f3, f4;
   45
   46     vector< boost::reference_wrapper<Foo const> > foos;
   47     foos.reserve(4);
   48     foos.push_back(boost::cref(f1));
   49     foos.push_back(boost::cref(f2));
   50     foos.push_back(boost::cref(f3));
   51     foos.push_back(boost::cref(f4));
   52
   53     std::for_each(foos.begin(), foos.end(), Bar<Foo const&>());
   54 }

You'd see the following output - indicating that absolutely no copies needed to be made.

Object constructed: id = 1
Object constructed: id = 2
Object constructed: id = 3
Object constructed: id = 4
Object[ id = 1 ]
Object[ id = 2 ]
Object[ id = 3 ]
Object[ id = 4 ]

This is an important optimization - the kind there is a lot of scope for, and so you should always look out in your code for an opportunity to do such optimizations, using boost::ref and boost::cref.

5. Expressive use of STL algorithms with lambda functions

This short section is the least trivial of all the "week 1" techniques we are going to discuss here, because here I will whet your appetite for the enormous amounts of functional programming that Boost supports. We will take a look at the Lambda library - a library for creating unnamed functions, based on expression templates.

The beauty and power of the Lambda library lies in the fact that one can create specialized functions (or rather functors or function objects) on the fly and invoke them by various means. These functions don't need a name and have an implicit prototype that can easily change.

In the example in the section 2 (on string algorithms) above, we extracted the AccountSummary data for each individual from the delimited text stream. The data is now present in the form of a vector. What if we want to write this data in a suitable format to the standard output. Say, we want to output each record in a block, like:

Chaim Rabinowitz
-----------------
age: 69
bank: Citibank N.A.
branch: Tel-Aviv
account_number: 5629311372
balance: 100000
-----------------

Mordechai Silberman
-----------------
age: 43
bank: Carmel Bank
branch: Haifa
account_number: 46268675539
balance: 20000
-----------------
...

We could write a function object and pass it to the std::for_each algorithm as the applicative functor to do the job for us - for_each iterates over each element in the vector and prints it in the format supported by the functor. But following this, you discover that you also need to put this data in a different delimited format like:

Chaim Rabinowitz|age: 69|bank: Citibank N.A.|branch: Tel-Aviv|account_number: 5629311372|balance: 100000
Mordechai Silberman|age: 43|bank: Carmel Bank|branch: Haifa|account_number: 46268675539|balance: 20000

What do you do - fire up your C++ editor and add another piece of functor. The number of formats you want your data to be printed in is really open-ended - may be you just want to summarize the name and age of the individuals - yet another functor? You must be thinking by now that you want an easier way to do this. After all you might be defining all these functors in some other source or header file, while you might be using them somewhere completely removed from the place they are defined in. Your operative call would probably just look like:

for_each(account_data.begin(), account_data.end(), PrintBlocks());
for_each(account_data.begin(), account_data.end(), PrintPipeDelimited());
for_each(account_data.begin(), account_data.end(), PrintSummary());

There is nothing wrong with such code provided we have reasonable names for the functor classes like PrintBlocks, PrintPipeDelimited or PrintSummary (all of which are assumed to have a trivial default constructor here). But can you look beyond the PrintBlocks name and figure out the format in which the data is actually printed by it - or figure out what constitutes a summary in PrintSummary. You cannot - and you don't want to depend on an aspect of programming that depends on appropriateness of naming identifiers. Besides, such formatting detail is not worth encapsulating in functors and storing away - how much reusable are these functors really going to be. This is where Boost Lambda helps you in being expressive. Below is the slightly curious looking rewrite of the above code using Boost Lambda, and with all the formatting logic spelled out in-place. The code essentially extends whats's already there, so only the newer parts of the code are provided, with only a bit of the old code for context. Please don't be exasperated by syntactic unfamiliarity - try to get the big picture of what's being done.

Show line numbers

  1-4//...
    5 #include <boost/lambda/lambda.hpp>
    6 #include <boost/lambda/bind.hpp>
    7 #include <algorithm>
 8-16//...
   17 using std::for_each;
   18 using boost::lambda::bind;
   19 using boost::lambda::constant;
   20 using boost::lambda::_1;
   21
   22 struct AccountSummary {
   23     string name;
   24     int age;
   25     string bank;
   26     string branch;
   27     string account_number;
   28     double balance;
   29
   30     AccountSummary() : age(0), balance(0.0) {
   31     }
   32 };
   33
   34 int main()
   35 {
   36     string str = "{name:Chaim Rabinowitz;acc:5629311372;bank:Citibank N.A.;branch:Tel-Aviv;age:69;balance:100000.00}"
   37                  "{name:Mordechai Silberman;acc:46268675539;bank:Carmel Bank;branch:Haifa;age:43;balance:20000.00}";
   38
39-44//...
   45     for (tokenizer::iterator tok_iter = records.begin(); tok_iter != records.end(); ++tok_iter) {
46-85//...
   85         account_data.push_back(next_acc_data);
   86     }
   87
   88     typedef vector<AccountSummary>::value_type elem_type;
   89     for_each(account_data.begin(), account_data.end(),
   90                cout << '\n' << "Block:" << '\n'
   91                     << "--------------------" << '\n'
   92                     << bind(&elem_type::name, _1) << constant('\n')
   93                     << constant("-----------------") << constant('\n')
   94                     << constant("age: ") << bind(&elem_type::age, _1) << constant('\n')
   95                     << constant("bank: ") << bind(&elem_type::bank, _1) << constant('\n')
   96                     << constant("branch: ") << bind(&elem_type::branch, _1) << constant('\n')
   97                     << constant("account_number: ") << bind(&elem_type::account_number, _1) << constant('\n')
   98                     << constant("balance: ") << bind(&elem_type::balance, _1) << constant('\n')
   99                     << constant("-----------------") << constant("\n\n")
  100            );
  101
  102     for_each(account_data.begin(), account_data.end(),
  103                cout << '\n' << "Delimited:" << '\n'
  104                     << "--------------------" << '\n'
  105                     << bind(&elem_type::name, _1) << constant('')
  106                     << constant("age: ") << bind(&elem_type::age, _1) << constant('')
  107                     << constant("bank: ") << bind(&elem_type::bank, _1) << constant('')
  108                     << constant("branch: ") << bind(&elem_type::branch, _1) << constant('')
  109                     << constant("account_number: ") << bind(&elem_type::account_number, _1) << constant('')
  110                     << constant("balance: ") << bind(&elem_type::balance, _1) << constant('\n')
  111            );
  112
  113     for_each(account_data.begin(), account_data.end(),
  114                cout << '\n' << "Summary Information:" << "\n\n"
  115                     << "--------------------" << '\n'
  116                     << constant("name: ") << bind(&elem_type::name, _1) << constant(", ")
  117                     << constant("age: ") << bind(&elem_type::age, _1) << constant('\n')
  118            );
  119
  120     return EXIT_SUCCESS;
  121 }

The output of this code is of the following form:


Block:
--------------------
Chaim Rabinowitz
-----------------
age: 69
bank: Citibank N.A.
branch: Tel-Aviv
account_number: 5629311372
balance: 100000
-----------------

Mordechai Silberman
-----------------
age: 43
bank: Carmel Bank
branch: Haifa
account_number: 46268675539
balance: 20000
-----------------


Delimited:
--------------------
Chaim Rabinowitzage: 69bank: Citibank N.A.branch: Tel-Avivaccount_number: 5629311372balance: 100000
Mordechai Silbermanage: 43bank: Carmel Bankbranch: Haifaaccount_number: 46268675539balance: 20000


Summary Information:
--------------------
name: Chaim Rabinowitz, age: 69
name: Mordechai Silberman, age: 43

If you see, we have printed the data in the AccountSummary records present in the vector in three different formats - Blocked, Delimited and Summary information. We didn't use a for-loop but rather the for_each algorithm to iterate through the records in each vector. The for_each algorithm takes three arguments - the two iterators that define the range of records to be handled, and a function pointer or functor which takes a single element of the container (in this case an AccountSummary object) and does some processing on it. The for_each statements for the three different cases (Blocked, Delimited and Summary) appear at lines 89, 102 and 113 respectively. In each case, the first two arguments are the iterators marking the beginning and past-the-end location of the AccountSummary vector. But for the third argument - we have a Lambda expression or an unnamed function that evaluates to a function object.

A Lambda expression is really just an expression, whose value is a function object. Just as 5+3 is an expression which results in an integer object, the Lambda expression is an expression which generates a callable entity - a functor object. The type of this functor object is a derived type which is of no consequence here. The Lambda expression is really a way of composing smaller function objects into a more complex function object, using overloaded operators to string together the smaller function objects. Thus expressions such as bind(...), constant(...) and _1 are all function objects with overloaded operators between them. Indeed, a Lambda expression is an excellent example of functional composition - the core idea in functional programming. More explanation follows.

In each of these Lambda expressions, the curious looking _1 represents a "placeholder" for the first argument to the function. Because a Lambda expression evaluates to a functor of some sort and this functor has a signature of its own - it is but natural that we need a way to reference the arguments to this functor inside the Lambda expression. Boost's Lambda expressions support upto 9 arguments - by default represented by the placeholders _1 through _9. They are just normal C++ objects (of some type, not important here) and intelligently named using the fact that any valid C identifier can start with an underscore. In this case, there is only one argument to deal with - the current element in the vector - so we only need _1. But we need to get at its members.

An expression like bind(&elem_type::name, _1) creates a functor which returns the value of the data member name of the object represented by _1, whose type is elem_type. Here elem_type is typedef'd to mean the value_type of the vector (i.e. AccountSummary). For example, if you were to use for_each to iterate over a map of string keys and double values, then at each step you'd be getting elements whose type is std::map<string, double>::value_type. This is essentially an std::pair<string, double> with members first (string) and second (double). Then, in order to construct a functor which returns the string key, we would use a bind expression like this:

bind(std::map<string, double>::value_type::first, _1);

We can almost always abridge these expressions by using suitable typedefs. For example:

typedef std::map<string, double>::value_type map_elem;
bind(map_elem::first, _1);

Typically, we should try to use a single typedef'd name, corresponding to the type of _1 (or whichever argument placeholder).

Expressions like constant("name: ") or constant('\n') represent functors which return a constant value. We don't want literals in the lambda expression - rather function objects which return such literals. That's because they need to be invoked for each iteration - not just printed once. However, you would have noticed that "Delimited:" or "Block:" are not enclosed in constant(...). In the output, you should see that name: or age: are repeated for each record - these were enclosed in constant(...) - but "Delimited:" or "Block:" were printed only once - because these were not enclosed in constant(...).

In the same way as constant, there is another construct - var, which can wrap non-literal object instances. We haven't used it here but we could.

When all of these are combined into a single Lambda expression, it represents a callable entity - essentially a generated super-function, built from the above individual functional units. By now, you should have a gut feel of how Lambda expressions work - but I will leave you with another example to squint your eyes at.

The real power of Lambda expressions is the ability to define a callable entity at the site of invocation in a fairly intuitive and succinct manner. There is a syntactic baggage associated - and I feel that's the only factor that weighs against Lambda expressions. But it requires a bit of mastering, and you'd be on your way to leveraging the full power of Lambda expressions.

In order to get a hang of how functional composition works, read the two part article Expression Templates demystified.

6. Platform-independent code to manipulate files on the disk

7. Quick and easy threads

8. An extensible way to handle command-line arguments

9. Matching patterns and replacing text using Regular Expressions

Regular expressions are an important programming tool and is extensively used in applications for filtering and parsing text, and matching names of objects against name patterns. Every language used for application programming owes itself a regular expression facility, and with the slick and fast Boost.Regex, C++ gets its first standard Regular Expression library.

This section cannot serve as an introduction to the theory and syntax of regular expressions. Assuming that you have a working knowledge of Regular expression syntax as used by grep and egrep on Unix, or the Perl language, we get started here with a few short examples of how to use the Boost.Regex library.

To use the Boost.Regex library, you should know its basic object model - the main classes whose objects you need to define and use for your Regex work.

A regular expression is encapsulated by an instance of the class boost::regex. In fact, boost::regex is a typedef for boost::basic_regex<char>. There is a wide character version available - boost::wregex, which is a typedef for boost::basic_regex<wchar_t>.

The result of trying to search for a regular expression in an input can result in matches at several levels - including matches for sub-expressions within regular expressions. All such details of matches are reported back in an object of type match_results. Once again, match_results is a class template, and there are concrete instantiations like cmatch, smatch and their wide character variants (wcmatch and wsmatch) which correspond to specializations for C-strings (char*), C++ strings and wchar_t* strings.

The iterator regex_iterator which helps you run through the input sequence identifying all matches for a particular regular expression. This is a really handy tool.

Finally, the regex_token_iterator iterator adaptor - which makes splitting input based on delimiters (like split in Perl) as easy as it gets in C++.

In our first example of this section, we will write a simple egrep like utility which looks for regular expression patterns in a text stream, and prints out lines matching the pattern. It also prints the line number, and the segment of the line that matched the regular expression (in brackets). The program is invoked with a regular expression and one or more input file names as command-line arguments.
Show line numbers

    1 #include <boost/regex.hpp>
    2 #include <iostream>
    3 #include <fstream>
    4 #include <string>
    5 #include <algorithm>
    6 #include <exception>
    7 
    8 using std::cout;
    9 using std::cerr;
   10 using std::endl;
   11 using std::ifstream;
   12 using std::string;
   13 using std::exception;
   14 
   15 int main(int argc, char *argv[])
   16 {
   17     if ( argc <= 2 ) {
   18         cerr << "Usage: " << argv[0] << " <regex> <file-list>"<< endl;
   19         return EXIT_FAILURE;
   20     }
   21 
   22     string str;
   23     ifstream in;
   24     const char * file_name = 0;
   25 
   26     try {
   27         boost::regex reg(argv[1]);
   28         while ( ( file_name = argv[--argc] )
   29                 && argc > 1 )
   30         {
   31             in.open(file_name, std::ios::in|std::ios::beg);
   32
   33             if ( in ) {
   34                 cout << file_name << endl << "==================="<< endl;
   35                 boost::smatch m;
   36                 int line_no = 0;
   37                 while( getline(in, str) ) {
   38                     ++line_no;
   39                     if ( boost::regex_search(str, m, reg) ) {
   40                         cout << line_no << ": " << str << "  [matches: " << string(m[0].first, m[0].second) << "]" << endl;
   41                     }
   42                 }
   43             } else {
   44                 cerr << file_name << ": Could not read." << endl << "==================="<< endl;
   45                 if ( !in ) in.clear();
   46             }
   47         }
   48     } catch(exception& e) {
   49         cerr << "Exception caught: " << e.what() << endl;
   50         return EXIT_FAILURE;
   51     } catch(...) {
   52         cerr << "Exception caught." << endl;
   53         return EXIT_FAILURE;
   54     }
   55 }

If you notice, the above example has a small issue. It picks up each line, searches for a match for the regular expression, and as soon as the first match is found, it prints the line number, line and the matching segment (in brackets). It then moves to the next line. If there are multiple segments in a single line matching the given regular expression, we miss all of those segments except the first one.

While it is possible to use the same regex_search function in a loop and using string iterators to run through all matches per line of text, the code doesn't look very good. In case you have to run through all distinct segments within an input sequence that match a regular expression, use the regex_iterator. A regex_iterator "points to" a match_result object. A match_result object represents a segment of input that matches a regular expression (as well as sub-expression matches, etc.). Thus, using regex_iterator, we loop over all matches in an input sequence. Here is the slightly altered code ... only the changed lines are highlighted.

Show line numbers

   37 while( getline(in, str) ) {
   38     ++line_no;
   39     boost::sregex_iterator it(str.begin(), str.end(), reg), end;
   40     string matchlist;
   41     while ( it != end ) {
   42         matchlist += (*it++)[0].str() + " ";
   43     }
   44     if ( matchlist.length() ) {
   45         cout << line_no << ": " << str << "  [matches: " << matchlist << "]" << endl;
   46     }
   47 }

There are just three things to note in the above example. On line 39, we instantiate two objects of sregex_iterator - this is a typedef for regex_iterator. The second of these objects (end)is default constructed and serves as an end-marker or null-marker. Finally, on line 42, (*it) is an smatch (match_result) object and (*it)[0] represents the part of input that matched the whole Regular Expression. If there are N sub-expressions within the regular expression, (*it)[1] through (*it)[N] represent parts matching those sub-expressions. The str() member function gives the matching part as a C++ string.

10. Easy date and time measurements

Having come thus far, I must clarify that these ten libraries are in no way the ten most important libraries to know in Boost. There are many others which find no mention in this article - Boost.Interprocess, Boost.Asio, Boost.IoStream, Boost.Graph and the seminal Boost.MPL (The Boost Template Metaprogramming Library) among many others. The reason for not having covered these is that these are more involved libraries and we need to devote more space and time to each one to leverage their power. Therefore, they don't belong to a '10 nifty tricks' list. I'll be back soon with some of these topics.

2 comments:

Michel Boto said...: It's not technically true that you can never leak memory again if you use shared pointers in your code. You're omitting the case of a cyclic reference:

class A {
boost::shared_ptr<B> m_b;
};

class B {
boost::shared_ptr<B> m_a;
};

In this scenario, neither an instance of A nor B will be destroyed so long as the member variables a and b point to each other. In other words, you will leak both instances because the reference count of the shared pointers will always be (at a minimum) 1 and the destructors will never be called.

Here is where another boost class, weak_ptr, comes into importance. If m_b is now a weak_ptr to A instead of shared, then it no longer has an influence on the ownership of the instance of A, which means two things: 1) you can point a at B::m_a and b at A::m_b without worries; and 2) B::m_a might be destroyed at any time, so you have to always check if the pointer is no longer valid by calling if ( m_a.expired() ) or converting it to a shared pointer and checking for NULL with if ( !m_a.lock() ). If you don't want to do that, another alternative is to make sure the code is designed to never have cyclic dependencies. But these are often hard to track down (it might not be A::m_b, but rather A::m_c, C::m_d, etc. till Z::m_b), and in legacy code you may never get the opportunity to redesign.

I would say weak_ptr is in itself important enough to be on your list. These days I don't dare share heap-allocated resources between multiple threads without it. If only one thread has ownership of the resource and can destroy it at any time, it's very handy to have a way of never worrying about dangling pointers or references to invalid instances. As soon as any other thread attempts to lock the weak_ptr and convert it into a shared_ptr, it's clear that it's now NULL and is no longer safe to use. It's true that I've oversimplified the synchronization of the resource a bit, but I'm heading out the door.; 4 Aug 2008, 13:32:00
Arindam Mukherjee said...: You're correct in emphasizing the importantce of weak_ptr - I will include an appropriate example of its use - may be a linked list example.

For the benefit of other readers, there is a small typo in Mozart's Ghost's comment:

class A {
boost::shared_ptr<B> m_b;
};

class B {
//boost::shared_ptr<B> m_a;
boost::shared_ptr<A> m_a;
};; 4 Aug 2008, 15:42:00

Tech Musings and Assorted Thoughts

Saturday, July 12, 2008

Getting Started with Boost