The Forest

Created: March 17, 2021

This Python code has been deprecated as of URS v3.4.0 and has been rewritten in Rust. However, the concepts discussed in this document as well as the implementation are still applicable to the Rust rewrite.

If you are interested in learning more, see Speeding Up Python with Rust for details on how I rewrote this code in Rust and the drastic performance improvements that resulted.

Table of Contents

Introduction

Motivation

I am a self-taught software developer who recently graduated from college and am currently looking for my first full-time job. I do not have a computer science degree, so I have had to teach myself a ton of concepts that I would have learned if I had pursued the degree. A class I wish I had been able to take in college is data structures and algorithms, because that seems to be all the buzz when it comes to the technical interview, which I unfortunately struggle with greatly due to my lack of experience and practice.

Recently (March 2021) I have been teaching myself DSA. Implementing simple examples of each topic within DSA was not so bad (I am currently working on a study guide/reference repository containing these implementations in both Python and Rust that I will make public soon), but practicing LeetCode problems was and still is a difficult process for me. I will continue to power through the struggle, though, because my livelihood and future career depend on it.

While it has not been a smooth journey, I have come to realize how useful DSA is and am implementing what I have learned in a real-world use case. I do not think I would have been able to figure out a solution to the structured comments scraper's prior shortcomings if I had not studied this area within computer science. I recently implemented my first trie and was fascinated by how abstract data structures worked. I immediately realized I needed to use a tree data structure for the structured comments scraper in order to take it to the next level, which is the purpose of this pull request.

Inspiration

The Forest is named after PRAW's CommentForest. The CommentForest does not return comments in a structured format, so I wrote my own implementation of it.

The trie was a huge inspiration for the Forest. I will quickly explain my implementation of the trie node.

class TrieNode():
    def __init__(self, char, is_word):
        self.char = char
        self.is_word = is_word
        self.children = dict()

Each node of the trie contains a character, a boolean flag indicating whether the node marks the end of a word, and a dictionary that maps each child's character to the corresponding child node. I could have used an array and indexed into it to emulate a dictionary, but I figured I could save some access time at the cost of extra space.
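
To show how the children dictionary is used, here is a minimal insertion sketch built on top of TrieNode. This is a simplified illustration, not the exact code from my study repository:

class Trie():
    def __init__(self):
        # The root node does not hold a character of its own.
        self.root = TrieNode(None, False)

    def insert(self, word):
        current_node = self.root

        for char in word:
            # Create a child node for this character if it does not exist yet.
            if char not in current_node.children:
                current_node.children[char] = TrieNode(char, False)

            current_node = current_node.children[char]

        # Mark the final node as the end of a word.
        current_node.is_word = True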

Anyways, the trie implementation is very similar to how the Forest works.

How the Forest Works

The CommentNode

I created a class CommentNode to store each comment's metadata and replies:

class CommentNode():
    def __init__(self, metadata):
        for key, value in metadata.items():
            self.__setattr__(key, value)

        self.replies = []

I used __setattr__() because the root node differs from the standard comment node schema. By using __setattr__(), CommentNode attributes are dynamically set based on the metadata dictionary that is passed in. self.replies holds additional CommentNodes.
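
For example, constructing a CommentNode from a small mock metadata dictionary sets one attribute per key and initializes an empty replies list:

comment_node = CommentNode({
    "body": "A top level comment here.",
    "id": "qwerty1",
    "parent_id": "t3_abc123"
})

comment_node.id          # "qwerty1"
comment_node.parent_id   # "t3_abc123"
comment_node.replies     # []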

The Forest

Next, I created a class Forest which holds the root node and includes methods for insertion.

The Root Node

First, let's go over the root node.

class Forest():
    def __init__(self):
        self.root = CommentNode({ "id": "abc123" })

The only key in the dictionary passed into CommentNode is id, so the root CommentNode will only contain the attributes self.id and self.replies. A mock submission ID is shown here; the actual source code pulls the submission's ID from the URL passed into the -c flag and sets the id value accordingly.
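
Roughly, that looks something like the following sketch, where reddit stands in for an authenticated PRAW instance and url holds the value passed into -c (the actual source handles this differently):

submission = reddit.submission(url = url)

forest = Forest()
forest.root.id = submission.id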

Before I get to the insertion methods, I will explain how comments and their replies are linked.

How PRAW Comments Are Linked

PRAW returns all submission comments in level order. This means all top-level comments are returned first, followed by all second-level replies, then all third-level replies, and so on.

I will create some mock comment objects to demonstrate. Here is a top-level comment corresponding to the mock submission ID. Note the parent_id contains the submission's id, which is stored in self.root.id:

{
  "author": "u/asdfasdfasdfasdf",
  "body": "A top level comment here.",
  "created_utc": "06-06-2006 06:06:06",
  "distinguished": null,
  "edited": false,
  "id": "qwerty1",
  "is_submitter": false,
  "link_id": "t3_asdfgh",
  "parent_id": "t3_abc123",
  "score": 666,
  "stickied": false
}

Here is a second-level reply to the top-level comment. Note the parent_id contains the top-level comment's id:

{
  "author": "u/hjklhjklhjklhjkl",
  "body": "A reply here.",
  "created_utc": "06-06-2006 18:06:06",
  "distinguished": null,
  "edited": false,
  "id": "hjkl234",
  "is_submitter": true,
  "link_id": "t3_1a2b3c",
  "parent_id": "t1_qwerty1",
  "score": 6,
  "stickied": false
}

This pattern continues all the way down to the last level of comments, which makes it easy to link the correct comments together. I do this by calling split("_", 1) on the parent_id and comparing the second item of the resulting list against a node's id. Passing 1 as the maxsplit argument forces at most one split.
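
For example, splitting the parent_ids from the mock comments above yields the IDs to compare against:

"t3_abc123".split("_", 1)[1]    # "abc123" -- matches the submission (root) ID
"t1_qwerty1".split("_", 1)[1]   # "qwerty1" -- matches the top-level comment's ID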

The Insertion Methods

I then defined the methods for CommentNode insertion.

    def _dfs_insert(self, new_comment):
        # Treat the front of the list as the top of the stack.
        stack = []
        stack.append(self.root)

        # Track visited CommentNodes so that nodes are not pushed onto the stack twice.
        visited = set()
        visited.add(self.root)

        found = False
        while not found:
            current_comment = stack.pop(0)

            for reply in current_comment.replies:
                # The second half of the parent_id holds the parent comment's ID.
                if new_comment.parent_id.split("_", 1)[1] == reply.id:
                    reply.replies.append(new_comment)
                    found = True
                else:
                    if reply not in visited:
                        stack.insert(0, reply)
                        visited.add(reply)

    def seed(self, new_comment):
        parent_id = new_comment.parent_id.split("_", 1)[1]

        # Top-level comments are appended directly to the root node's replies;
        # all other comments trigger a depth-first search for their parent node.
        self.root.replies.append(new_comment) \
            if parent_id == getattr(self.root, "id") \
            else self._dfs_insert(new_comment)

I implemented the depth-first search algorithm to find a comment's parent node and insert the comment into the parent node's replies array. I defined a separate visited set to keep track of visited CommentNodes so that nodes that have already been seen are not pushed onto the stack again, which avoids an infinite loop. At first I wrote a recursive version of depth-first search, but I opted for an iterative version because recursion would not scale well for submissions containing large amounts of comments, i.e., it could result in a stack overflow.

Within the seed() method, I first check whether the CommentNode is a top-level comment by comparing its parent ID to the submission ID. Depth-first search is only triggered if the CommentNode is not a top-level comment.
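
Continuing the earlier sketch, seeding the Forest from PRAW's level-order comment stream might look something like the following. The field selection and formatting are simplified here; see Comments.py for the real implementation:

submission.comments.replace_more(limit = None)    # resolve "load more comments" placeholders

for comment in submission.comments.list():        # level order: parents are seen before their replies
    metadata = {
        "author": str(comment.author),
        "body": comment.body,
        "id": comment.id,
        "parent_id": comment.parent_id,
        "score": comment.score
    }

    forest.seed(CommentNode(metadata))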

Serializing the Forest

Since Python's built-in json module can only serialize types that have a direct JSON equivalent (such as dictionaries, lists, strings, and numbers), a custom encoder is necessary to convert the Forest into JSON format. I defined this in Export.py.

from json import JSONEncoder

class EncodeNode(JSONEncoder):
    def default(self, object):
        return object.__dict__

The default() method overrides JSONEncoder's default() method and serializes a CommentNode by returning its underlying dictionary (its __dict__), which the json module can encode directly as a JSON object:

EncodeNode().encode(new_comment)

This ensures the node is correctly encoded before I call the seed() method to insert a new CommentNode (new_comment above) into the replies array of its respective parent CommentNode.

I can then use this custom JSONEncoder subclass while exporting by specifying it within json.dump() with the cls kwarg:

with open(filename, "w", encoding = "utf-8") as results:
    json.dump(data, results, indent = 4, cls = EncodeNode)
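
For example, dumping a single mock node (assuming json has been imported alongside the classes above) shows the encoder recursing through __dict__; the output is shown as a comment:

node = CommentNode({ "id": "qwerty1", "body": "A top level comment here." })

json.dumps(node, indent = 4, cls = EncodeNode)
# {
#     "id": "qwerty1",
#     "body": "A top level comment here.",
#     "replies": []
# }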

This was how the structured comments export was implemented. Refer to the source code located in urs/praw_scrapers/Comments.py to see more. I hope this was somewhat interesting and/or informative. Thanks for reading!