An Introduction to Clang Part 2

By Mark Wilson | Tuesday, July 2, 2013

Introduction

As promised, here is a follow-up to An Introduction to Clang. In that post I mentioned that the cool thing about Clang is that it is library based and offers public APIs that allow one to access information about a C or C++ program with relative ease. In this post we will work through an example that uses Clang's API to write our own "baby IDE" that parses C and C++ code and performs syntax highlighting.

Clang consists of a number of different libraries; the one of primary interest here is libclang. Clang's evolution has been rapid, so most of the libraries and the APIs that make it up are somewhat volatile, which makes it more difficult to build and maintain tools based on them. This is where libclang comes in: it is intended to be a stable C API into Clang, doesn't change nearly as much as the other constituent libraries, and is relatively easy to use. The current documentation leaves something to be desired, though; hopefully this post will help ease that pain a little.

First, we'll talk about a few compiler basics that one needs to know to work with the Clang API. A compiler's basic function is to translate one form of code (your program) into some other useful form. To do so, the compiler needs to read your code, understand it (looking for errors along the way), and then turn it into that other form. Clang translates your C++ code into assembly language or machine code for a target processor, while a Java compiler turns your code into virtual machine bytecode.

The first step in compiling a program is to tokenize it. The compiler scans the text of your code, breaking it into tokens. Tokens include numeric and string literals, punctuation, language keywords, and identifiers such as variable names. For instance, class is a C++ keyword token. The process of tokenization produces a stream of tokens, which is then parsed to ensure correct syntax and to discover the inherent structure of your program.
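To make the idea of a token stream concrete, here is a minimal sketch of a toy tokenizer. It is not how Clang's lexer works; the names ToyKind, ToyToken, and toyTokenize are made up for this illustration, the keyword list is deliberately tiny, and comments, escapes, and multi-character operators are not handled.

```cpp
#include <cctype>
#include <set>
#include <string>
#include <vector>

// Token categories loosely modelled on the kinds libclang reports.
// (Hypothetical names; this is not part of any Clang API.)
enum class ToyKind { Punctuation, Keyword, Identifier, Literal };

struct ToyToken {
    ToyKind kind;
    std::string text;
};

// Splits `src` into tokens. Handles identifiers, a few keywords, integer
// literals, double-quoted strings without escapes, and single-character
// punctuation. Whitespace is skipped; comments are not recognized.
std::vector<ToyToken> toyTokenize(const std::string& src)
{
    static const std::set<std::string> keywords = {
        "class", "if", "else", "int", "return" };

    std::vector<ToyToken> out;
    std::size_t i = 0;
    while (i < src.size()) {
        const unsigned char c = static_cast<unsigned char>(src[i]);
        if (std::isspace(c)) { ++i; continue; }

        const std::size_t start = i;
        if (std::isalpha(c) || c == '_') {
            // Identifier or keyword: letters, digits, underscores.
            while (i < src.size() &&
                   (std::isalnum(static_cast<unsigned char>(src[i])) ||
                    src[i] == '_'))
                ++i;
            const std::string word = src.substr(start, i - start);
            const ToyKind kind = keywords.count(word) ? ToyKind::Keyword
                                                      : ToyKind::Identifier;
            out.push_back({ kind, word });
        } else if (std::isdigit(c)) {
            // Integer literal: a run of digits.
            while (i < src.size() &&
                   std::isdigit(static_cast<unsigned char>(src[i])))
                ++i;
            out.push_back({ ToyKind::Literal, src.substr(start, i - start) });
        } else if (c == '"') {
            // String literal: everything up to the next double quote.
            ++i;
            while (i < src.size() && src[i] != '"') ++i;
            if (i < src.size()) ++i; // include the closing quote if present
            out.push_back({ ToyKind::Literal, src.substr(start, i - start) });
        } else {
            // Anything else is treated as one punctuation character.
            ++i;
            out.push_back({ ToyKind::Punctuation,
                            std::string(1, static_cast<char>(c)) });
        }
    }
    return out;
}
```

Calling toyTokenize("class Foo { int x = 42; };") yields ten tokens: the keyword class, the identifier Foo, punctuation for the braces, equals sign, and semicolons, and the literal 42.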

During the parsing process, compilers create something called an Abstract Syntax Tree (AST), which reflects the syntactic structure of your linear, human-readable code in a tree structure. Each node in the tree represents a C++ construct, such as an expression. An AST is an intermediate form of your program that simplifies the compiler's job. Punctuation and delimiters are removed; for instance, parentheses are implicit in the structure of the AST, as are the braces around function and method bodies. In addition, an AST removes any ambiguities in your code that are inherent to C++ syntax. An AST can be easily annotated with information about your program; as the compiler goes along, it can add whatever information it needs for later steps in the compilation process. Further compiler steps ensure type and const correctness and check other semantic information. Figure 1 shows a simple example of an AST for a C++ if statement. For our purposes in this post we only need to care about tokens and the AST, so we'll end the basics here.
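As a rough illustration of how a tree can stand in for punctuation, here is a hand-built toy AST for the statement if (x > 0) y = 1;. The node labels borrow Clang's class names (IfStmt, BinaryOperator, DeclRefExpr, IntegerLiteral), but the AstNode type and the helper functions are assumptions made for this sketch, and the tree is simplified compared with what Clang would actually produce.

```cpp
#include <cstddef>
#include <memory>
#include <string>
#include <utility>
#include <vector>

// A minimal tree node: a label plus owned children. (Hypothetical type.)
struct AstNode {
    std::string label;
    std::vector<std::unique_ptr<AstNode>> children;
};

std::unique_ptr<AstNode> makeNode(std::string label)
{
    std::unique_ptr<AstNode> node(new AstNode);
    node->label = std::move(label);
    return node;
}

// Builds a simplified tree for:  if (x > 0) y = 1;
// Note that the parentheses and the semicolon do not appear anywhere;
// the nesting of the nodes carries that information instead.
std::unique_ptr<AstNode> buildIfExample()
{
    std::unique_ptr<AstNode> condition = makeNode("BinaryOperator >");
    condition->children.push_back(makeNode("DeclRefExpr x"));
    condition->children.push_back(makeNode("IntegerLiteral 0"));

    std::unique_ptr<AstNode> body = makeNode("BinaryOperator =");
    body->children.push_back(makeNode("DeclRefExpr y"));
    body->children.push_back(makeNode("IntegerLiteral 1"));

    std::unique_ptr<AstNode> ifStmt = makeNode("IfStmt");
    ifStmt->children.push_back(std::move(condition));
    ifStmt->children.push_back(std::move(body));
    return ifStmt;
}

// Renders the tree as an indented outline, two spaces per level.
std::string dump(const AstNode& node, int depth = 0)
{
    std::string out(static_cast<std::size_t>(depth) * 2, ' ');
    out += node.label + "\n";
    for (const std::unique_ptr<AstNode>& child : node.children)
        out += dump(*child, depth + 1);
    return out;
}
```

Calling dump(*buildIfExample()) produces an outline with IfStmt at the top, the two BinaryOperator nodes indented beneath it, and their operands indented one level further.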

Before we move on to source code we need to understand some important terms:

Index

An "index" is a libclang data structure that consists of a set of translation units that would typically be linked together into an executable or library.

Translation Unit

A translation unit is the basic unit of compilation in C++. It consists of the contents of a single source file plus the contents of any header files directly or indirectly included by it, minus those lines that were ignored using conditional preprocessing statements. There may be many translation units in an index.

Cursor

A cursor is a term used in the Clang API for a "pointer" to an element in the AST. The element could be a function or method, a class declaration, a mathematical expression, or any of a number of other C++ constructs. Cursors are hierarchical in nature. For instance, the cursor for a method declaration has the method's parameters as its children.

Figure 1 - Example Abstract Syntax Tree (AST)

As mentioned earlier we will create a simple read-only "baby IDE" application using Qt and libclang. The application will allow you to open a C++ source or header file, which it will parse via libclang, and display the code as color highlighted text; green for C++ keywords, red for string and numeric literals, blue for comments, and black for everything else. Below is a screenshot of the application displaying a C++ header file. The code is displayed in a QTextEdit widget.

Most of the action happens in the MainWindow class, which is sub-classed from QMainWindow. The constructor sets up the GUI (lines 6 and 7), creates a color mapping indexed by token type (lines 9-13), and creates a Clang CXIndex structure (line 15):

Listing 1 - MainWindow constructor in MainWindow.cpp

  1 MainWindow::
  2 MainWindow(QWidget* parent) :
  3     QMainWindow(parent),
  4     transUnit_(0)
  5 {
  6     ui_ = new Ui::MainWindow;
  7     ui_->setupUi(this);
  8 
  9     highlightMap_[CXToken_Punctuation] = QBrush(QColor("black"));
 10     highlightMap_[CXToken_Keyword] = QBrush(QColor("green"));
 11     highlightMap_[CXToken_Identifier] = QBrush(QColor("black"));
 12     highlightMap_[CXToken_Literal] = QBrush(QColor("red"));
 13     highlightMap_[CXToken_Comment] = QBrush(QColor("blue"));
 14 
 15     index_ = clang_createIndex(0, 0);
 16 }

 

Figure 2 - Screen Shot of User Interface

Code highlighting is kicked off when the user activates the Open File menu action, which results in the invocation of on_actionOpen_File_triggered.

On line 4 we invoke a QFileDialog to obtain a C++ source or header file. We open the file, read it into a QByteArray, and populate the QTextEdit widget. Then the interesting part begins. The function clang_parseTranslationUnit (line 15) parses the selected file, using the arguments in the args variable (line 13). Since the .h extension is ambiguous (it could be either a C or a C++ header), we need to tell Clang that we are specifically parsing C++ code; the "-x c++" argument accomplishes this. We also don't want the compiler to think we are building an executable, hence the "-c" argument. You can pass any argument via the clang_parseTranslationUnit function that you would pass on the command line, including successive "-I" options to specify include paths. Once we have a translation unit, we obtain the top, starting cursor (line 19). Then the fun starts on line 20. Clang uses the Visitor Pattern to traverse the AST that is generated by the parsing process; the global function visitor is called for each child of the starting cursor, and its return value decides whether the traversal descends further.

Listing 2 - MainWindow on_actionOpen_File_triggered() slot in MainWindow.cpp

  1 void MainWindow::
  2 on_actionOpen_File_triggered()
  3 {
  4     path_ = QFileDialog::getOpenFileName(this,
  5         "Select a C++ source file", "./exampleSrc", "*.cc *.cpp *.h");
  6 
  7     QFile file(path_);
  8     file.open(QIODevice::ReadOnly);
  9     QByteArray contents = file.readAll();
 10     ui_->codeEditor->setText(contents);
 11     file.close();
 12 
 13     const char* args[] = { "-c", "-x", "c++" };
 14 
 15     transUnit_ = clang_parseTranslationUnit(index_,
 16         path_.toStdString().c_str(), args, 3, 0, 0,
 17             CXTranslationUnit_None);
 18 
 19     CXCursor startCursor = clang_getTranslationUnitCursor(transUnit_);
 20     clang_visitChildren(startCursor, visitor, this);
 21 }
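If you need include paths, the fixed args array in Listing 2 can be replaced by a list built at run time. The following is a small sketch of one way to do that; buildClangArgs and toArgv are hypothetical helpers that are not part of the example project, and the joined "-I<dir>" form is used for each path.

```cpp
#include <string>
#include <vector>

// Hypothetical helper: starts from the arguments used in Listing 2 and
// appends one "-I<dir>" option per include directory.
std::vector<std::string> buildClangArgs(
    const std::vector<std::string>& includeDirs)
{
    std::vector<std::string> args = { "-c", "-x", "c++" };
    for (const std::string& dir : includeDirs)
        args.push_back("-I" + dir);
    return args;
}

// libclang takes its arguments as an array of C strings. The returned
// pointers refer into `args`, so they stay valid only while `args` is
// alive and unmodified.
std::vector<const char*> toArgv(const std::vector<std::string>& args)
{
    std::vector<const char*> argv;
    argv.reserve(args.size());
    for (const std::string& arg : args)
        argv.push_back(arg.c_str());
    return argv;
}
```

The result could then be handed to clang_parseTranslationUnit in place of args and 3, using argv.data() and static_cast<int>(argv.size()).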

The visitor function takes the current cursor and that cursor's parent cursor as parameters, plus clientData, which is a void pointer to any context data that may be needed for your particular application; in this example, clientData is a pointer to the MainWindow widget (line 4). This allows us to call MainWindow::highlightText (line 35). After we get the pointer to the MainWindow widget, we obtain the source location from the cursor (line 11). Part of the parsing process is to record the line number, column, and source file for every token in your code. This is one important aspect of how libclang enables the creation of IDEs. On line 14, we check whether the file the cursor refers to is the file we are viewing. If it isn't, we return CXChildVisit_Continue, which tells the visitor algorithm to go to the next cursor (depending on the type of cursor passed, we could return CXChildVisit_Recurse to traverse the children of the cursor, but we don't need to do that here).

If our cursor is from our file of interest, we obtain the translation unit from the cursor (every cursor has a pointer to its associated translation unit), and then find the source range of the cursor (lines 18 and 19). The source range tells us where the cursor starts and where it ends, as a cursor can span multiple lines. We then tokenize everything in the cursor's range (lines 21-23). Every cursor is made up of one or more tokens. We can now iterate through the tokens that make up our cursor (lines 27-38), determine the exact location of each token (lines 31 and 33), and pass that location, along with the token type, to MainWindow::highlightText (lines 35 and 36). After we have processed the tokens, we tell the visitor algorithm to keep going (line 40); it will stop when it runs out of cursors to process. Note that we iterate through the number of tokens minus 1. The tokenization process returns one token past the tokens that belong to the cursor. I'm not sure why; perhaps it is to provide a "look ahead" capability (I told you the documentation left something to be desired!).

Listing 3 - visitor() function in MainWindow.cpp

  1 CXChildVisitResult
  2 visitor(CXCursor cursor, CXCursor parent, CXClientData clientData)
  3 {
  4     MainWindow* mw = static_cast<MainWindow*>(clientData);
  5 
  6     CXFile file;
  7     unsigned int line;
  8     unsigned int column;
  9     unsigned int offset;
 10 
 11     CXSourceLocation loc = clang_getCursorLocation(cursor);
 12     clang_getFileLocation(loc, &file, &line, &column, &offset);
 13 
 14     if ( QString::fromStdString(GetClangString(clang_getFileName(file))) !=
 15       mw->getCurrentPath() )
 16         return CXChildVisit_Continue;
 17     
 18     CXTranslationUnit tu = clang_Cursor_getTranslationUnit(cursor);
 19     CXSourceRange range = clang_getCursorExtent(cursor);
 20     
 21     CXToken* tokens;
 22     unsigned int numTokens;
 23     clang_tokenize(tu, range, &tokens, &numTokens);
 24     
 25     if ( numTokens > 0 )
 26     {
 27         for (unsigned int i=0; i<numTokens-1; i++)
 28         {
 29             std::string token = GetClangString(clang_getTokenSpelling(tu,
 30                 tokens[i]));
 31             CXSourceLocation tl = clang_getTokenLocation(tu, tokens[i]);
 32             
 33             clang_getFileLocation(tl, &file, &line, &column, &offset);
 34             
 35             mw->highlightText(line, column, token.size(),
 36                 clang_getTokenKind(tokens[i]));
 37         }       
 38     }   
 39     clang_disposeTokens(tu, tokens, numTokens); // release the token array
 40     return CXChildVisit_Continue;
 41 }
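The difference between continuing and recursing is easier to see on a toy tree. The sketch below mimics the shape of the callback protocol described above with made-up names (Visit, ToyNode, visitChildren, visitedNames); it is an illustration of the pattern under those assumptions, not libclang's implementation.

```cpp
#include <functional>
#include <string>
#include <vector>

// The three answers a visitor can give, mirroring the idea behind
// CXChildVisit_Break, CXChildVisit_Continue, and CXChildVisit_Recurse.
enum class Visit { Break, Continue, Recurse };

// A toy tree node standing in for a cursor. (Hypothetical type.)
struct ToyNode {
    std::string name;
    std::vector<ToyNode> children;
};

using ToyVisitor =
    std::function<Visit(const ToyNode& node, const ToyNode& parent)>;

// Calls `visitor` for each child of `parent`. Descends into a child only
// when the visitor answers Recurse. Returns false if a Break stopped the
// traversal early.
bool visitChildren(const ToyNode& parent, const ToyVisitor& visitor)
{
    for (const ToyNode& child : parent.children) {
        switch (visitor(child, parent)) {
        case Visit::Break:
            return false;
        case Visit::Continue:
            break; // move on to the next sibling
        case Visit::Recurse:
            if (!visitChildren(child, visitor))
                return false;
            break;
        }
    }
    return true;
}

// Records the names seen by a visitor that always gives the same answer.
std::vector<std::string> visitedNames(const ToyNode& root, Visit answer)
{
    std::vector<std::string> names;
    visitChildren(root, [&](const ToyNode& node, const ToyNode&) {
        names.push_back(node.name);
        return answer;
    });
    return names;
}
```

For a root with children a and b, where a has a child a1, answering Continue visits only a and b, while answering Recurse visits a, a1, and b in that order.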

MainWindow::highlightText creates a QTextCursor (line 5) on the QTextDocument that contains the source code we are highlighting. It then performs a series of movements to drive the cursor to the correct location (lines 7 to 14), where it then creates a selection (lines 16 and 17). We then tell the QTextEdit widget about the new cursor (line 19), and finally highlight the text (lines 21 to 24).

Listing 4 - MainWindow highlightText() method in MainWindow.cpp

  1 void MainWindow::
  2 highlightText(unsigned int line, unsigned int column, unsigned int stride,
  3     CXTokenKind tokenKind)
  4 {
  5     QTextCursor cursor(ui_->codeEditor->document());
  6 
  7     cursor.movePosition(QTextCursor::Start,
  8         QTextCursor::MoveAnchor);
  9 
 10     cursor.movePosition(QTextCursor::NextBlock,
 11         QTextCursor::MoveAnchor, line-1);
 12 
 13     cursor.movePosition(QTextCursor::Right,
 14         QTextCursor::MoveAnchor, column-1);
 15 
 16     cursor.movePosition(QTextCursor::Right,
 17         QTextCursor::KeepAnchor, stride);
 18 
 19     ui_->codeEditor->setTextCursor(cursor);
 20 
 21     QTextCharFormat format;
 22     format.setForeground(highlightMap_[tokenKind]);
 23 
 24     ui_->codeEditor->setCurrentCharFormat(format);
 25 }
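Listing 4 reaches the token by stepping the cursor block by block and then character by character. An alternative, which is not what the example does, is to turn the 1-based line and column into a flat offset first and then jump there, for example with QTextCursor::setPosition. The sketch below shows only the offset arithmetic on a plain std::string; offsetOf is a hypothetical helper, and it assumes '\n' line endings and one char per column, which may not hold for non-ASCII text.

```cpp
#include <cstddef>
#include <string>

// Converts a 1-based (line, column) pair into a 0-based index into `text`.
// Returns std::string::npos if the position does not exist in the text.
// Assumes lines end in '\n' and each column is exactly one char.
std::size_t offsetOf(const std::string& text, unsigned line, unsigned column)
{
    if (line == 0 || column == 0)
        return std::string::npos;

    // Walk forward to the first character of the requested line.
    std::size_t lineStart = 0;
    for (unsigned current = 1; current < line; ++current) {
        const std::size_t newline = text.find('\n', lineStart);
        if (newline == std::string::npos)
            return std::string::npos; // fewer lines than requested
        lineStart = newline + 1;
    }

    // The column must not run past the end of that line.
    std::size_t lineEnd = text.find('\n', lineStart);
    if (lineEnd == std::string::npos)
        lineEnd = text.size();

    const std::size_t result = lineStart + (column - 1);
    return result <= lineEnd ? result : std::string::npos;
}
```

With the text "int a;\nint bb;\n", line 2, column 5 maps to offset 11, which is the first b of bb.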

You can see that the Clang API makes it easy to explore and understand C++ source code. This posting has barely scratched the surface of what is possible with Clang. In terms of IDE development, the API provides facilities for code completion and refactoring, and provides a lot more information about your source than we used in this example. Having the power of the Clang API also opens up many opportunities for generating code based on C++ source (e.g. serialization code). Plug-ins can also be written for Clang. This opens up the possibility of parsing non-standard code in a C++ source file à la Qt signals and slots. Future postings may explore some of this.

Dependencies

The example code can be downloaded from here.

The example was developed on Ubuntu Linux and should work against Qt 4 and 5 but needs at least Clang 3.3. I had to build 3.3 from source (my distribution had only 3.0 available; type clang --version at a command line to verify what version is installed on your system). It is not difficult to build Clang and here you will find the directions.

You may need to change the INCLUDEPATH and LIBS variables in the clang-blog-example.pro project file if Clang is installed in a different place.

References

  1. Clang Doxygen documentation
  2. The Visitor pattern
  3. A talk from C++ Now 2012 about Clang

Comments

Comment: 

Hi,

I found this article really interesting; thanks for writing it. However, I downloaded the source code and compiled it, but nothing happens. I think there is something missing in the source code, because there is no color when I open a .cpp file. Really, nothing happens. Could you upload the fixed code, please?

Thanks in advance

Comment: 

OK, my bad. I used qmake, which uses the Qt 4 version. I did make clean, then qmake-qt5, and now everything works well. Thanks for your reply.

Hope there will be more posts on Clang and cheers from Cologne. ;)