30 minute read
Software systems are built from units provided by the language. Some languages have functions and data; others have classes and objects. Some group these into modules. Some allow all of these types, trying to be everything to everyone. Some have none of these.
Programming is combining the language's available units, both within a type of unit (functions call other functions) and between types of units (functions are grouped into modules). This article explores several of these structural relationships, each of which is structured as a graph. Experienced programmers are necessarily good at thinking about graphs, though they may not use the technical term "graph". (Note: no experience with graph theory is necessary here; we won't do any formal mathematical analysis and the relevant ideas will be introduced from scratch.)
We'll examine many specific aspects of software structure:
- Programming language syntax itself is arranged in a tree, which is a simple type of graph.
- Module relationships and function call relationships are arranged in a graph.
- Humans think about graphs constantly and intuitively, whether they're programmers or not.
- Some recursive functions are more confusing than others as a direct result of their call graphs.
- Changing code requires analyzing the call graph, even if we don't think of it in those terms.
- Most systems contain "ubermodules" that are referenced by most of the rest of the system.
- Ubermodules usually have more code churn and more lines of code than other modules. This makes them confusing: they're referenced from many places, are large, and are constantly changing.
- Packages, one of the highest level of software organization, also has a graph structure, with the same primary problem that we see within individual systems. We have to balance the size of packages (or modules, or functions) against the complexity of connections between packages (or modules, or functions). Different package ecosystems have taken wildly different approaches to this granularity problem.
- Programming requires us to see these graphs implicitly. By thinking about them explicitly, we gain a leg up on complexity.
Programming language structure
If we know at least one "curly brace language" like C or JavaScript, our minds recognize the structure of code like:
if (x == 0) {
f();
} else {
g();
}
Without conscious thought, we see the condition inside the parentheses, the true clause inside the first set of curly braces, the else clause in the second set of curly braces, and so on. These elements are arranged in a tree structure. That tree is what we recognize without thinking.
The if
keyword has three children: a conditional expression inside parentheses, a "true" branch inside curly braces, and a "false" branch also inside curly braces.
The ==
operator has two children: the compared values.
A function call, shown as "[call]" below, has one or more children: the called function is the first, and zero or more arguments are the rest.
By drawing these dependencies, we get a diagram representing the tree structure of the code.
A compiler's parser implements rules for extracting abstract syntax trees like this one from the code's text. Those rules are rigid; the compiler will never allow deviations from them. Still, an experienced programmer can reliably create valid syntax – 99.9% of the time, let's say, if they're being careful. If the syntax rules are rigid, but a programmer can reliably create text that follows them, then the programmer's mind must contain near-perfect analogues of the parsing rules. We understand the structure, even if the understanding happens automatically.
As we dig into software structure, we'll see many more diagrams like the one above – nodes connected by arrows. They'll depict functions calling each other, modules importing each other, and so on. These are structures that we understand without thinking about them consciously, just as we understand programming language syntax without writing the syntax trees explicitly. Our goal here is to bring software structure into conscious thought, filling in blind spots that we all have.
Code is mostly graphs
The design of a system lies in (1) its units and (2) the connections between those units.
For example, imagine an OrderHistory
module summarizing users' order histories.
It will relate User
s to Transaction
s, so it must reference both of them in some way:
Nodes (ovals) here represent classes or modules, and edges (arrows) represent function or method call relationships.
Some method in OrderHistory
calls a method in User
; some other method, or maybe even the same method, calls a method in Transaction
.
At this level of granularity, we don't care which methods they are; we're concerned only with the relationships between the modules.
The diagram above is a graph because it's made of nodes and edges. It's also a tree because each node has only one parent – one incoming arrow. Now we introduce a complication that prevents the graph from being a tree.
Suppose that User
has a latest_transaction_date
method calling into Transaction
directly; it doesn't use OrderHistory
.
Now, there are multiple paths from OrderHistory
to Transaction
: we can go directly, or we can go via User
.
We now abandon this example, though we'll return to module graphs later with a real-world example. To get a sense of the generality of graphs, we'll briefly analyze two more examples with exactly the same graph shape: a function call graph and an inheritance hierarchy.
Call graphs indicate which functions call which other functions; or, in object-oriented terminology, which methods call which other methods.
Perhaps we have a method is_allowed_to_access
calling is_admin
.
But both is_allowed_to_access
and is_admin
check the user's access rights, so they both call find_user
.
This gives us the call graph:
The shape is the same as what we saw for module relationships between User
, OrderHistory
, and Transaction
, but the meaning here is different: these are calls instead of module dependencies.
Design of modules and functions are different skills, but they do reuse our graph reasoning ability, whether consciously (as we've done here) or unconsciously (as in most everyday programming).
We'll examine one more example: inheritance hierarchies in object-oriented languages. Suppose that we're modeling hardware devices in Python. We have a USBMouse class inheriting from USBDevice.
class USBMouse(USBDevice):
...
This seems like a trivial inheritance graph with two nodes, where USBMouse
depends on USBDevice
.
There's a complication, though: all Python classes inherit from the base object
class.
With object
considered, the full inheritance graph has the same structure as the other examples above:
The visual orientation here is different to match convention: base classes are usually higher up in the diagram. Still, the structure is the same as our previous examples. We could move these nodes and edges around to look just like the previous examples without changing any of the connections.
Graph relationships exist at all levels of scale, from the data dependencies between expressions in a function up to the dependencies between publicly published packages. This is good news for us: learning about one aspect of software structure reinforces our ability to think about others. It also means that all programmers eventually learn to think about graph relationships, even if that skill remains entirely automatic and subconscious. Programming requires understanding which functions call which other functions, which modules import which other modules, and which classes inherit from which other classes. Getting better at programming means, in part, getting better at thinking through more of those relationships without jumping to the functions and modules, examining them one by one.
Before we move into more detailed examples, here's a list of several places where graphs show up in programming. Some of these will get their own sections; others are only mentioned here to show the breadth of graphs' impact on software structure.
- Module imports form a module dependency graph.
- Function calls form a function call graph. (Methods are special cases of functions. The two are equivalent at this level of analysis.)
- Class inheritance forms an inheritance graph. (It's usually called an inheritance hierarchy, but the hierarchy is a graph in any language with either multiple inheritance or mixins.)
- Package dependencies in a package manager like NPM, RubyGems, PyPI, etc. form a package dependency graph.
- The execution order of statements in a function (i.e., which lines can execute after which other lines) form a control flow graph.
- The flow of data through a program forms a data-flow graph.
Graphs in thought
Any system of entities and relationships is a type of a graph: friendships among people, course prerequisites at a university, or stations in a subway system. Graphs show up constantly in life. We rarely think about them consciously as graphs, but we've very good at thinking about them implicitly, just as we are for software structure.
To get from St. Paul's Cathedral to Regent's Park in London, you can take the Central tube line to Oxford Circus and change to the Bakerloo line. If service interruptions prevent the Central line from stopping at Oxford Circus, the Piccadilly line can be used as a detour at the cost of a second change.
The two sentences above describe a small section of the graph composed of London tube stations (nodes) and subway lines (edges):
The modern London tube map prioritizes the transit network's abstract structure over the physical layout of the city, just as the above diagram does. Relative distances on the map don't necessarily correspond to physical distance between stations, curves on the map don't necessarily correspond to curves in the physical tracks, and so on.
Scans of London tube maps before (1908) and after (1933) Beck's redesign are shown below. The orginal maps were confusing because complications of the physical world were shown: physically-close stations were drawn close together, the lines were shown to curve as they do in reality, etc. Beck's redesign shows logical structure rather than physical structure, allowing us to plot a course more easily. We don't care whether the Central line's tracks curve or not, and we rarely care about the walking distance between stations; we care about how they connect.