Writing a Quick and Easy Thread-Monitor (Watchdog) in Python [shorts #1]

A thread-monitor, often also referred to as a watchdog, is extremely helpful when building multi-threaded and reliable applications. In its simplest form, a watchdog should detect when one or more threads hang or crash, and it should restart the problematic threads if necessary. Depending on your use-case, you could implement this helper in a variety of ways, and you could add many more features such as a heartbeat function that allows each thread to report its progress to the monitor.
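To make the heartbeat idea a bit more concrete, here is a minimal sketch of how it could look. The names HeartbeatWorker, beat, and is_stale are hypothetical and not part of the watchdog built in this article. The point is that a thread that hangs is still alive from Python's perspective, so a timestamp that the worker refreshes regularly gives the monitor a way to spot it:

```python
import threading
import time


class HeartbeatWorker(threading.Thread):
    """Hypothetical worker that records a timestamp after every unit of work."""

    def __init__(self, iterations, delay=0.01):
        super().__init__()
        self.iterations = iterations
        self.delay = delay
        # time.monotonic() is not affected by system clock adjustments
        self.last_beat = time.monotonic()

    def beat(self):
        # Report progress to whoever is monitoring this thread
        self.last_beat = time.monotonic()

    def run(self):
        for _ in range(self.iterations):
            time.sleep(self.delay)  # stand-in for real work
            self.beat()


def is_stale(worker, timeout):
    """Return True if the worker has not reported within `timeout` seconds."""
    return time.monotonic() - worker.last_beat > timeout
```

A monitor could call is_stale(worker, timeout) in its loop and treat a stale worker the same way it treats a stopped one.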

Writing a custom worker thread using Python

As with pretty much everything in Python, there’s no one-size-fits-all solution to creating threads. I decided to write a new class that is a subclass of threading.Thread:

import time
from threading import Thread


class CustomWorker(Thread):

    def __init__(self, startValue):
        super().__init__()
        self.stopped = False
        self.counter = startValue

    def run(self):
        while self.counter > 0 and not self.stopped:
            print(self.name + ": Remaining tasks: " + str(self.counter))
            self.counter -= 1
            time.sleep(0.5)

As you can see, the custom Thread has a very simple task. The caller supplies it with a starting value and the worker counts down that value roughly twice a second. When the counter reaches zero, the thread stops. The watchdog should detect when this happens, and it should then restart the worker.

Figure 1: The output of this watchdog example. Note how the watchdog notices that the first thread stopped working and then starts a new one. Also note the parallel nature of the program: the output is not necessarily in the correct order. This behavior is normal, and you would need to add some sort of synchronization mechanism to fix it.
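One possible synchronization mechanism is a lock around print. The following is only a sketch, and print_lock and safe_print are names I made up for illustration. Note that a lock keeps the lines of different threads from interleaving, but it does not impose any particular order between threads:

```python
import threading

# Hypothetical module-level lock shared by every thread that prints
print_lock = threading.Lock()


def safe_print(*args, **kwargs):
    """Print while holding the lock so that output lines do not interleave."""
    with print_lock:
        print(*args, **kwargs)
```

The worker and the watchdog would then call safe_print(...) wherever they currently call print(...).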

Note that this is a very simple example. In reality, the thread would most likely not end like this. Instead, it would most likely crash due to a fault condition such as an unhandled exception. Either way, note that I also added a variable that allows the watchdog to gracefully shut the thread down. The thread-monitor can instruct a worker to quit by setting the worker’s stopped variable to True.
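To illustrate the crash case, here is a small sketch with a hypothetical CrashingWorker. Once run() ends, whether normally or because of an exception, is_alive() returns False, which is exactly what the watchdog checks. In this sketch the worker also stores the exception in an error attribute (my own addition) so that a monitor could log the cause before restarting:

```python
import threading


class CrashingWorker(threading.Thread):
    """Hypothetical worker whose run() fails right away."""

    def __init__(self):
        super().__init__()
        self.error = None

    def run(self):
        try:
            raise RuntimeError("simulated fault")
        except Exception as exc:
            # Remember the cause for the monitor, then let the thread end
            self.error = exc


worker = CrashingWorker()
worker.start()
worker.join()
print(worker.is_alive())  # False, because run() has returned
print(repr(worker.error))
```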

Creating a watchdog that monitors the custom threads

Now to the thread-monitor itself. As noted above, this is a simple implementation that spawns a single thread and periodically checks whether that thread is still running. If the thread stopped (for whatever reason), the watchdog spawns a new thread:

# Simple Python watchdog that detects if a thread stopped (e.g., due to an error)
# and restarts the thread if necessary.

import time
from CustomWorker import CustomWorker

# Main entry point
# Start the watchdog here
if __name__ == '__main__':
    # The thread this watchdog controls
    t = False
    try:
        # Run the watchdog endlessly
        while True:
            # If t is False then the thread was either stopped or never started at all
            # Therefore, create a new Instance and assign it to t, then start the thread
            if not t:
                t = CustomWorker(5)
                t.start()
                print("Started the thread!")
            # Check whether t stopped
            if t and not t.is_alive():
                print("Thread is not running!")
                print("Restarting the thread...")
                t = False
            # If t is running, just wait and let the other threads work
            else:
                time.sleep(1.0)
    # Users can exit the watchdog by sending a keyboard interrupt (Ctrl + C)
    except KeyboardInterrupt:
        print("Stopping all worker threads...")

        # If t is currently running, send a stop signal and wait for the
        # thread to finish.
        if t and t.is_alive():
            t.stopped = True

        # Make sure that your custom threads don't block, otherwise the
        # watchdog will never exit
        while t and t.is_alive():
            print("Waiting for a worker to stop...")
            time.sleep(0.5)

        print("Stopped all workers! Stopping the watchdog...")

I’d like to specifically draw your attention to the except KeyboardInterrupt handler (line 30 of the listing). Here, the watchdog exits when users send a keyboard interrupt. Before it does that, the watchdog notifies the worker thread that it should stop. Then, the thread-monitor waits for the thread to finish. It’s important that the thread doesn’t block (e.g., due to file access or a deadlock), because otherwise the watchdog would wait forever. A more sophisticated watchdog could count the number of times it has waited for a thread and force-quit itself (and all child threads) after a certain threshold.
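The threshold idea could be sketched roughly like this. The helper wait_for_stop and its parameters max_cycles and interval are hypothetical and not part of the code above:

```python
import threading
import time


def wait_for_stop(thread, max_cycles=10, interval=0.5):
    """Wait for `thread` to finish, giving up after `max_cycles` checks.

    Returns True if the thread stopped and False if the threshold was reached.
    """
    for _ in range(max_cycles):
        # join() with a timeout returns as soon as the thread ends
        thread.join(timeout=interval)
        if not thread.is_alive():
            return True
    return not thread.is_alive()
```

Keep in mind that Python offers no way to kill a thread from the outside. If the helper returns False, the watchdog can only give up and exit. Creating the workers with daemon=True is one way to let the interpreter shut down even though such a thread is still running.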

Download the source code

You can download the source code from this GitHub repository. Feel free to share the code as you like, but please share this article along with the code if it helped you!

Tips and tricks

Make sure the watchdog is as simple as possible. There should be little to no chance that the watchdog thread itself halts, crashes, or blocks under normal circumstances.

Make sure that the worker threads are non-blocking or at least provide a way to stop them gracefully (e.g., using events).

This is a simple implementation, and you can make it as complex as you wish. I recommend keeping it simple, though.
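As an example of the event-based approach mentioned in the tips, here is a sketch of a worker that uses threading.Event instead of a plain boolean. EventWorker, stop, and stop_event are names I chose for illustration. The advantage is that Event.wait() returns as soon as the event is set, so the worker reacts to a stop request immediately instead of finishing its sleep first:

```python
import threading


class EventWorker(threading.Thread):
    """Hypothetical variant of CustomWorker that uses an Event as its stop signal."""

    def __init__(self, start_value, interval=0.5):
        super().__init__()
        self.counter = start_value
        self.interval = interval
        self.stop_event = threading.Event()

    def stop(self):
        # Ask the worker to finish; safe to call from any thread
        self.stop_event.set()

    def run(self):
        while self.counter > 0 and not self.stop_event.is_set():
            self.counter -= 1
            # Unlike time.sleep(), wait() returns early once stop() is called
            self.stop_event.wait(self.interval)
```

With this variant, the watchdog would call t.stop() instead of setting t.stopped = True.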
