Trie Tree Implementation


Hello people…! In this post we will implement an amazing data structure, the Trie Tree. Trie Trees are are used to search for all occurrences of a word in a given text very quickly. To be precise, if the length of the word is “L“, the trie tree searches for all occurrences of this data structure in O(L) time, which is very very fast in comparison to many pattern matching algorithms.

But I must mention, this data structure is not exactly used for “pattern matching”, it is used to search for the occurrences of the word in the given text. How these both functionalities differ…? We’ll get to know that shortly. The Trie Tree has many applications. Your browser could be using a trie tree internally to search for words when you press Ctrl + F. So, let’s get started with this data structure…!

The Trie Tree is a very straight forward data structure. It is a simple tree where the nodes have an alphabet associated with them. And the way these nodes are arranged is the best part of the Trie Tree. To get an idea take a close look at the sketch below –

Structure of Trie Tree

Structure of Trie Tree

The arrows in the sketch above indicate how to traverse the trie tree in order to tell if a word exists or not. We travel through the root node, down the tree until we hit a leaf. Picking up a character at every edge, we construct our word. Now, you can easily tell why the time to search for a node in the text will be in the order of length of the word to be searched. It’s obvious…! One would have to go down till the leaf. In the worst case, one would have to travel the height of the tree which would be the length of the longest word in the whole text..! 😉

So, as we got a little idea about working with a Trie Tree, let us move on to the serious part. Let us talk about how this is implemented and then we will talk about three fundamental operations done on the Trie Tree –

  • Insert
  • Delete
  • Search

In C, the Trie Tree is implemented using structures. But the C implementation of this data structure gets a little dirty. So we will switch to C++ but we will keep it as C-ish as possible. As we keep discussing about the implementation, you will notice how many advantages we have when we use C++. This will be our structure of each node in the tree –

struct node
{
	struct node * parent;
	struct node * children[ALPHABETS];
	vector<int> occurrences;
};
  • parent – This points to the node which is the parent of the current node in the tree hierarchy. It may seem useless at the first, but we need this to travel up the tree when we want to delete any word. We’ll get to the deletion operation in a moment.
  • children – It points to the children nodes. It is made an array so that we can have O(1) accessing time. But why is the size 26…? Its simple. Consider a word, “th”, now, what could the third letter possibly be, if it had one…? One among the 26 english alphabets…! So that’s why the size is made 26. If you want to make a trie tree for another language, replace the number 26 with the number of letters in your language.
  • occurrences – This will store the starting indices of the word occurring in the given text. Now, why this is made a vector is that, vector is as good as a Linked List with random access. It is one of the most handy ready made data structures available in the C++ STL Library. If you are not familiar with vectors this is a good place to start.
    If this were C, we would have to give a fixed array size, and we would have a problem if the occurrences of a particular node are more. We could avoid this by putting a Linked List there. But we sacrifice random access and a whole lot of operations get time taking. Moreover, the code will get really really cumbersome to manage if you have a tree and a linked list.

Having got a picture of the implementation, let us look at how the operations are done in a Trie Tree.

Insert Operation

When we do an insert operation, there are a few cases –

  1. The word to be inserted does not exist.
  2. The word to be inserted already exists in the tree.
  3. The word to be inserted does not exists, but as the suffix of a word.

The first case is simple. One would have to traverse till the alphabets of the words have nodes in the trie tree or else create new nodes one-after-the-other. And at the end of the word, i.e., the node for the last alphabet, we will mark it as a leaf and push the starting index into the vector indicating the occurrence of the newly inserted word.

During this course of traversal, we will be cutting off the string of the word we have one-by-one as they are processed. This is done by putting using a vector of characters and popping off one character after the other. This is less code to handle and more efficient as we can use a vector as a queue. This is another advantage of using C++.

After having learnt what to do with the first case, you can guess what we would have to do in the second case. We simply have to add a new value to the occurrences vector at the node corresponding to the last alphabet of the word. We can also know the number of occurrences in constant time, we simply return the size of the vector. This is another advantage of using C++.

To understand the challenge in the third case, let’s take a simple example. What would you do with your trie tree if you wanted to insert the word “face” if the word “facebook” is already there in your tree…? This is the third case. The answer to this is the occurrence vector itself. We simply push the starting index of the word into the vector of that node which corresponds to the last alphabet of the word to be inserted, in the above example, this would be the node of “e”. So, what really tells if there’s a word ending with the alphabet corresponding to the current node is the size of the vector.

So I hope you understand how important our vector is. A lot depends on it…!

Delete Operation

The deletion of a word in the trie tree is similar to the insertion, we have a few cases –

  • Error 404 : Word not found…!
  • Word exists as a stand-alone word.
  • Word exists as a prefix of another word.

If the word is not there at all, we don’t have to do anything. We just have to make sure that we don’t mess up the existing data structure…!

The second case is a little tricky. We would have to delete the word bottom-up. That is, we will delete that part of the word which is not a part of any other word. For example, consider the sketch above. If we were to delete “this”, we would delete the letters ‘i’ and ‘s’ as, ‘h’ is a part of another word. This keeps us away from distorting the data structure. If the word were existing multiple number of times we will simply remove the occurrence from the vector of the concerned node.

In the third case too, we will simply delete the occurrence of the word from the vector. We needn’t write a lot of code as we can use the functions in algorithm header file. This is another advantage of using C++.

Note – When we delete the occurrence of a word, we are not concerned about the validity of the indices stored as occurrences of other words. What I mean to say is, suppose we have 10 words. If we delete the 3rd word, the 5th word or the 9th word is supposed to become the 4rth and the 8th word as far as the original text is concerned. But we will not consider this. The data stored in the trie tree is not meant to me deleted or inserted. The Trie Tree is meant for processing the given text not to manipulate the given text.

Search Operation

The search operation is simple and is as we discussed when we began our discussion about the Trie Tree. We go down the tree traversing the nodes and keep “picking up characters” as we go. And the occurrences vector tells us if a word exists that ends with the alphabet associated with the current node, and if so, it gives us the indices of occurrences and also the number of occurrences.

Besides these basic operations, there is another very interesting operation that is done with the Trie Tree –

  • Lexicographical Sort – If we want to print all the words processed into the trie tree lexicographically, all we have to do is do a Preorder Walk on the tree. This will automatically print all the words in the lexicographical order or the dictionary order. This is due to the very structure and arrangement of nodes in a Trie Tree. Now, I’d like to share another interesting thing about pre-order walk in trees… The Pre-order walk works exactly as a Depth First Search (DFS) in graphs, i.e., the sequence in which both the algorithms visit the nodes is the same. Think about this for a while and word out the two algorithms on an example (you could take mine in the sketch above), and you can see why it is so. You will also notice why the printed words would be lexicographically sorted.

Now, having learned a lot about the trie tree, try coding it in C++. If you are uneasy with C++, you can try it in C, but make sure you try at least 3 times. Trying is very important. I don’t know if you are new to reading my posts, but I insist a lot on trying in every post of mine…! If you have succeeded, you’re fabulous…! If not, check out my code below any try figuring out how just close you were…!!

/* ==========  ========== ========== ========= */
//          Trie Tree Data Structure           //
//                using C++ STL                //
//                                             //
//         Functions follow Pascal Case        //
//           Convention and Variables      	   //
//         follow Camel Case Convention        //
//                                             //
//            Author - Vamsi Sangam            //
//            Theory of Programming            //
/* ========== ========== ========== ========== */

#include <cstdio>
#include <cstdlib>
#include <vector>

#define ALPHABETS 26
#define CASE 'a'
#define MAX_WORD_SIZE 25

using namespace std;

struct Node
{
    struct Node * parent;
    struct Node * children[ALPHABETS];
    vector<int> occurrences;
};

// Inserts a word 'text' into the Trie Tree
// 'trieTree' and marks it's occurence as 'index'.
void InsertWord(struct Node * trieTree, char * word, int index)
{
    struct Node * traverse = trieTree;

    while (*word != '\0') {     // Until there is something to process
        if (traverse->children[*word - CASE] == NULL) {
            // There is no node in 'trieTree' corresponding to this alphabet

            // Allocate using calloc(), so that components are initialised
            traverse->children[*word - CASE] = (struct Node *) calloc(1, sizeof(struct Node));
            traverse->children[*word - CASE]->parent = traverse;  // Assigning parent
        }

        traverse = traverse->children[*word - CASE];
        ++word; // The next alphabet
    }

    traverse->occurrences.push_back(index);      // Mark the occurence of the word
}

// Searches for the occurence of a word in 'trieTree',
// if not found, returns NULL,
// if found, returns poniter pointing to the
// last node of the word in the 'trieTree'
// Complexity -> O(length_of_word_to_be_searched)
struct Node * SearchWord(struct Node * treeNode, char * word)
{
    // Function is very similar to insert() function
    while (*word != '\0') {
        if (treeNode->children[*word - CASE] != NULL) {
            treeNode = treeNode->children[*word - CASE];
            ++word;
        } else {
            break;
        }
    }

    if (*word == '\0' && treeNode->occurrences.size() != 0) {
        // Word found
        return treeNode;
    } else {
        // Word not found
        return NULL;
    }
}

// Searches the word first, if not found, does nothing
// if found, deletes the nodes corresponding to the word
void RemoveWord(struct Node * trieTree, char * word)
{
    struct Node * trieNode = SearchWord(trieTree, word);

    if (trieNode == NULL) {
        // Word not found
        return;
    }

    trieNode->occurrences.pop_back();    // Deleting the occurence

    // 'noChild' indicates if the node is a leaf node
    bool noChild = true;

    int childCount = 0;
    // 'childCount' has the number of children the current node
    // has which actually tells us if the node is associated with
    // another word .This will happen if 'childCount' != 0.
    int i;

    // Checking children of current node
    for (i = 0; i < ALPHABETS; ++i) {
        if (trieNode->children[i] != NULL) {
            noChild = false;
            ++childCount;
        }
    }

    if (!noChild) {
        // The node has children, which means that the word whose
        // occurrence was just removed is a Suffix-Word
        // So, logically no more nodes have to be deleted
        return;
    }

    struct Node * parentNode;     // variable to assist in traversal

    while (trieNode->occurrences.size() == 0 && trieNode->parent != NULL && childCount == 0) {
        // trieNode->occurrences.size() -> tells if the node is associated with another word
        //
        // trieNode->parent != NULL -> is the base case sort-of condition, we simply ran
        // out of nodes to be deleted, as we reached the root
        //
        // childCount -> does the same thing as explained in the beginning, to every node
        // we reach

        parentNode = trieNode->parent;

        for (i = 0; i < ALPHABETS; ++i) {
        	if (parentNode->children[i] != NULL) {
        		++childCount;

        		if (trieNode == parentNode->children[i]) {
	                parentNode->children[i] = NULL;
	                free(trieNode);
	        		trieNode = parentNode;
	            }
			}
        }
    }
}

// Prints the 'trieTree' in a Pre-Order or a DFS manner
// which automatically results in a Lexicographical Order
void LexicographicalPrint(struct Node * trieTree, vector<char> word)
{
    int i;
    bool noChild = true;

    if (trieTree->occurrences.size() != 0) {
        // Condition trie_tree->occurrences.size() != 0,
        // is a neccessary and sufficient condition to
        // tell if a node is associated with a word or not

        vector<char>::iterator charItr = word.begin();

        while (charItr != word.end()) {
            printf("%c", *charItr);
            ++charItr;
        }
        printf(" -> @ index -> ");

        vector<int>::iterator counter = trieTree->occurrences.begin();
        // This is to print the occurences of the word

        while (counter != trieTree->occurrences.end()) {
            printf("%d, ", *counter);
            ++counter;
        }

        printf("\n");
    }

    for (i = 0; i < ALPHABETS; ++i) {
        if (trieTree->children[i] != NULL) {
            noChild = false;
            word.push_back(CASE + i);   // Select a child

            // and explore everything associated with the cild
            LexicographicalPrint(trieTree->children[i], word);
            word.pop_back();
            // Remove the alphabet as we dealt
            // everything associated with it
        }
    }

    word.pop_back();
}

int main()
{
    int n, i;
	vector<char> printUtil;		// Utility variable to print tree

	// Creating the Trie Tree using calloc
	// so that the components are initialised
    struct Node * trieTree = (struct Node *) calloc(1, sizeof(struct Node));
    char word[MAX_WORD_SIZE];

	printf("Enter the number of words-\n");
    scanf("%d", &n);

    for (i = 1; i <= n; ++i) {
        scanf("%s", word);
        InsertWord(trieTree, word, i);
    }

    printf("\n");   // Just to make the output more readable
    LexicographicalPrint(trieTree, printUtil);

    printf("\nEnter the Word to be removed - ");
    scanf("%s", word);
    RemoveWord(trieTree, word);

    printf("\n");   // Just to make the output more readable
    LexicographicalPrint(trieTree, printUtil);

    return 0;
}

The code is highly commented with explanation. It is well described, but if you have any doubts regarding the data structure or the code, feel free to comment them. I have used a few macros. The macro CASE indicates for which case the Trie Tree works. If we mention ‘A’ as the macro, the Trie Tree will work for upper case words only.

Other Implementations

The code is well tested against fairly large input. You can download the test case file here – Trie Tree Input (PDF). You can just clock your code for the insert operations. My code took 1.236 seconds to execute that loop which reads a word and inserts it into the Trie Tree. There are 5000 words in total. The last word in the input file is the word to be deleted.

If you think you can give a suggestion to make my code better, please do comment them too. I appreciate your suggestions. For those there who are struggling to get their code right, “Keep the effort going guys”…! Remember, you won’t learn anything if you keep writing Hello World program again and again. So, keep practicing…! I hope my post has helped you in getting to know about the Trie Tree. If it did, let me know by commenting ! Happy Coding…! 😀

Advertisements

16 thoughts on “Trie Tree Implementation

  1. There’s one doubt I am having though.
    How is a trie useful if one has to atleast read all the characters from the input array of strings to insert the strings one by one into the trie?So the complexity is O(size_of_array).Suppose the input array of strings is {“hello”,world”,”stack”,”overflow”}.And we want to search for “stack”,then we would have to atleast traverse the whole array for inserting the keys into the trie.So complexity is O(size_of_array).We could do this without a trie.
    Please help.
    Thanks.

    Like

    • Ok… I think I get your argument… For the sake of simplicity, let’s say we have ‘N’ words of length ‘L’… In the case of a 2D array… Getting the data structure ready will cost us O(N) time… And subsequent searches for any word will cost us O(NL) time… Now, getting a trie tree ready will cost us O(NL) time, but the subsequent searches cost us only O(L) time… And that is what we are interested in… The fast search… If your requirement demands that you must perform too many searches for words… Then you must think of a trie tree… You don’t construct a trie tree if you want to search only once…. So it is all in the number of operations you want to perform… Let’s say, if you wanted to perform a 100 or more searches for a word.. Then go for a trie tree..! Let me know if you have any more doubts.. 🙂

      Like

Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out / Change )

Twitter picture

You are commenting using your Twitter account. Log Out / Change )

Facebook photo

You are commenting using your Facebook account. Log Out / Change )

Google+ photo

You are commenting using your Google+ account. Log Out / Change )

Connecting to %s