I’ve been working on some project management tools recently. The idea is to manage multiple projects which follow the same pattern – the same set of tasks, but re-used.

Data-model wise, I have 3 main tables – “Task”, which represents the tasks shared between all projects, “Job” which represents one particular project-task pairing, and “Task_dependency”, which represents dependencies between tasks.

An example would be building houses – “Construct Roof” might be one task, while “Construct the roof of house number 7″ would be a job matching that task to a single project. Meanwhile, there might be an entry in “Task_dependency” stating that the task “Construct Roof” depends on the task “Build Walls”.

So I started working on some of the algorithms needed for common tasks. One of my first concerns was data integrity – using SQL alone it’s hard to impose a restraint like “no task can end up indirectly depending on itself”. The dependencies must form a Directed Acyclic Graph (DAG), with tasks as the “nodes” and dependencies as the “edges”. I decided that the best way to handle this was to have my application check any updates to the task_dependency table for loops, and the easiest way to do that was to start at the task whose dependencies are to be updated, and work backwards along the dependencies to see if we can find our way back to that task. We could work forwards instead, but I couldn’t see any benefits to either direction.

The more difficult part was writing an auto-scheduler. Given a fixed date (the start or finish date), and durations for each task, I needed to plan out a schedule that avoided any dependency conflicts. I spent a while trying to find ways to sort the tasks by dependency, and looking at SQL Common Table Expressions (CTE), and in particular the phrase “WITH RECURSIVE”, before eventually deciding that this wasn’t much help to me. The lack of support for aggregation and non-linear recursion meant that I’d end up trying to find every path from every start node to every node in the graph, and then selecting the longest one for each node. Perhaps unsurprisingly, PostgreSQL wasn’t managing to optimise that.

A few weeks after writing my own algorithm using TCL, I found a wikipedia article on topological sorting, which described almost exactly what I was trying to do, even though I’d never heard the phrase before. Since I was trying to find the longest path at the same time as sorting, my home-made algorithm doesn’t look exactly like either of the ones described, but in principle it’s similar to the breadth-first approach.

Once tasks are sorted topologically, working from a given date was comparatively easy. I’ve ended up working backwards from a chosen finish date (this being what the client required), and scheduling each task as late as possible without causing any conflicts. By iterating over the tasks in reverse topological order, I can ensure that each task has its start and finish date calculated before the tasks it depends on.

In addition to this, I’m sorting each task by the longest path from that task node to the finish (as far as possible while maintaining topological order), and by displaying tasks in this this order, tasks which are scheduled later tend to appear further down the screen.

If you’re working forward from a start date, scheduling each task as early as possible, I’d recommend displaying tasks in topological order first, and in order of longest path from start second.