
Software systems are built from units provided by the language. Some languages have functions and data; others have classes and objects. Some group these into modules. Some allow all of these types, trying to be everything to everyone. Some have none of these.

Programming is combining the language's available units, both within a type of unit (functions call other functions) and between types of units (functions are grouped into modules). This article explores several of these structural relationships, each of which is structured as a graph. Experienced programmers are necessarily good at thinking about graphs, though they may not use the technical term "graph". (Note: no experience with graph theory is necessary here; we won't do any formal mathematical analysis and the relevant ideas will be introduced from scratch.)

We'll examine many specific aspects of software structure:

  • Programming language syntax itself is arranged in a tree, which is a simple type of graph.
  • Module relationships and function call relationships are arranged in a graph.
  • Humans think about graphs constantly and intuitively, whether they're programmers or not.
  • Some recursive functions are more confusing than others as a direct result of their call graphs.
  • Changing code requires analyzing the call graph, even if we don't think of it in those terms.
  • Most systems contain "ubermodules" that are referenced by most of the rest of the system.
  • Ubermodules usually have more code churn and more lines of code than other modules. This makes them confusing: they're referenced from many places, are large, and are constantly changing.
  • Packages, which sit at one of the highest levels of software organization, also have a graph structure, with the same primary problem that we see within individual systems. We have to balance the size of packages (or modules, or functions) against the complexity of connections between packages (or modules, or functions). Different package ecosystems have taken wildly different approaches to this granularity problem.
  • Programming requires us to see these graphs implicitly. By thinking about them explicitly, we gain a leg up on complexity.

Programming language structure

If we know at least one "curly brace language" like C or JavaScript, our minds recognize the structure of code like:

if (x == 0) {
  f();
} else {
  g();
}

Without conscious thought, we see the condition inside the parentheses, the true clause inside the first set of curly braces, the else clause in the second set of curly braces, and so on. These elements are arranged in a tree structure. That tree is what we recognize without thinking.

The if keyword has three children: a conditional expression inside parentheses, a "true" branch inside curly braces, and a "false" branch also inside curly braces. The == operator has two children: the compared values. A function call, shown as "[call]" below, has one or more children: the called function is the first, and zero or more arguments are the rest. By drawing these dependencies, we get a diagram representing the tree structure of the code.
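
To make the tree concrete, here's one possible representation as nested JavaScript objects. (This is an illustrative sketch; real parsers produce richer node types than this.)

var tree = {
  type: "if",
  condition: {type: "==", left: "x", right: "0"},          // two children
  trueBranch: {type: "call", function: "f", args: []},     // one or more children
  falseBranch: {type: "call", function: "g", args: []}
};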

A compiler's parser implements rules for extracting abstract syntax trees like this one from the code's text. Those rules are rigid; the compiler will never allow deviations from them. Still, an experienced programmer can reliably create valid syntax – 99.9% of the time, let's say, if they're being careful. If the syntax rules are rigid, but a programmer can reliably create text that follows them, then the programmer's mind must contain near-perfect analogues of the parsing rules. We understand the structure, even if the understanding happens automatically.

As we dig into software structure, we'll see many more diagrams like the one above – nodes connected by arrows. They'll depict functions calling each other, modules importing each other, and so on. These are structures that we understand without thinking about them consciously, just as we understand programming language syntax without writing the syntax trees explicitly. Our goal here is to bring software structure into conscious thought, filling in blind spots that we all have.

Code is mostly graphs

The design of a system lies in (1) its units and (2) the connections between those units. For example, imagine an OrderHistory module summarizing users' order histories. It will relate Users to Transactions, so it must reference both of them in some way:

Nodes (ovals) here represent classes or modules, and edges (arrows) represent function or method call relationships. Some method in OrderHistory calls a method in User; some other method, or maybe even the same method, calls a method in Transaction. At this level of granularity, we don't care which methods they are; we're concerned only with the relationships between the modules.

The diagram above is a graph because it's made of nodes and edges. It's also a tree because each node has at most one parent – at most one incoming arrow. Now we introduce a complication that prevents the graph from being a tree.

Suppose that User has a latest_transaction_date method calling into Transaction directly; it doesn't use OrderHistory. Now, there are multiple paths from OrderHistory to Transaction: we can go directly, or we can go via User.
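
Here's a minimal sketch of those two paths in code. (The module and method names echo the example; the bodies are illustrative stand-ins.)

var Transaction = {
  datesFor: function(userId) { return []; }  // stand-in; a real module would query storage
};

var User = {
  latest_transaction_date: function(userId) {
    return Transaction.datesFor(userId).pop();  // User -> Transaction, directly
  }
};

var OrderHistory = {
  summarize: function(userId) {
    return {
      count: Transaction.datesFor(userId).length,       // OrderHistory -> Transaction
      latest: User.latest_transaction_date(userId)      // OrderHistory -> User (-> Transaction)
    };
  }
};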

We now abandon this example, though we'll return to module graphs later with a real-world example. To get a sense of the generality of graphs, we'll briefly analyze two more examples with exactly the same graph shape: a function call graph and an inheritance hierarchy.

Call graphs indicate which functions call which other functions; or, in object-oriented terminology, which methods call which other methods. Perhaps we have a method is_allowed_to_access calling is_admin. But both is_allowed_to_access and is_admin check the user's access rights, so they both call find_user. This gives us the call graph:

The shape is the same as what we saw for module relationships between User, OrderHistory, and Transaction, but the meaning here is different: these are calls instead of module dependencies. Module design and function design are different skills, but they both reuse our graph reasoning ability, whether consciously (as we've done here) or unconsciously (as in most everyday programming).
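
For concreteness, the three functions might look something like this sketch (the bodies are invented; only the call relationships come from the example):

function find_user(userId) {
  return {id: userId, admin: false};  // stand-in for a real lookup
}

function is_admin(userId) {
  return find_user(userId).admin;     // is_admin -> find_user
}

function is_allowed_to_access(userId, resource) {
  var user = find_user(userId);       // is_allowed_to_access -> find_user
  return is_admin(userId) || user.id === resource.ownerId;  // -> is_admin
}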

We'll examine one more example: inheritance hierarchies in object-oriented languages. Suppose that we're modeling hardware devices in Python. We have a USBMouse class inheriting from USBDevice.

class USBMouse(USBDevice):
    ...

This seems like a trivial inheritance graph with two nodes, where USBMouse depends on USBDevice. There's a complication, though: all Python classes inherit from the base object class. With object considered, the full inheritance graph has the same structure as the other examples above:

The visual orientation here is different to match convention: base classes are usually higher up in the diagram. Still, the structure is the same as our previous examples. We could move these nodes and edges around to look just like the previous examples without changing any of the connections.

Graph relationships exist at all levels of scale, from the data dependencies between expressions in a function up to the dependencies between publicly published packages. This is good news for us: learning about one aspect of software structure reinforces our ability to think about others. It also means that all programmers eventually learn to think about graph relationships, even if that skill remains entirely automatic and subconscious. Programming requires understanding which functions call which other functions, which modules import which other modules, and which classes inherit from which other classes. Getting better at programming means, in part, getting better at thinking through more of those relationships without jumping to the functions and modules, examining them one by one.

Before we move into more detailed examples, here's a list of several places where graphs show up in programming. Some of these will get their own sections; others are only mentioned here to show the breadth of graphs' impact on software structure.

  • Module imports form a module dependency graph.
  • Function calls form a function call graph. (Methods are special cases of functions. The two are equivalent at this level of analysis.)
  • Class inheritance forms an inheritance graph. (It's usually called an inheritance hierarchy, but the hierarchy is a graph in any language with either multiple inheritance or mixins.)
  • Package dependencies in a package manager like NPM, RubyGems, PyPI, etc. form a package dependency graph.
  • The execution order of statements in a function (i.e., which lines can execute after which other lines) forms a control flow graph.
  • The flow of data through a program forms a data-flow graph.

Graphs in thought

Any system of entities and relationships is a type of graph: friendships among people, course prerequisites at a university, or stations in a subway system. Graphs show up constantly in life. We rarely think about them consciously as graphs, but we're very good at thinking about them implicitly, just as we are for software structure.

To get from St. Paul's Cathedral to Regent's Park in London, you can take the Central tube line to Oxford Circus and change to the Bakerloo line. If service interruptions prevent the Central line from stopping at Oxford Circus, the Piccadilly line can be used as a detour at the cost of a second change.

The two sentences above describe a small section of the graph composed of London tube stations (nodes) and subway lines (edges):

The modern London tube map prioritizes the transit network's abstract structure over the physical layout of the city, just as the above diagram does. Relative distances on the map don't necessarily correspond to physical distance between stations, curves on the map don't necessarily correspond to curves in the physical tracks, and so on.

Scans of London tube maps before (1908) and after (1933) Beck's redesign are shown below. The original maps were confusing because complications of the physical world were shown: physically-close stations were drawn close together, the lines were shown to curve as they do in reality, etc. Beck's redesign shows logical structure rather than physical structure, allowing us to plot a course more easily. We don't care whether the Central line's tracks curve or not, and we rarely care about the walking distance between stations; we care about how they connect.

A first-time visitor will follow the graph literally, tracing different connections. They might even look at a physical map of the city, trying to reconcile the tube and street maps. (Don't do it; it will confuse you!)

With time, we internalize abstract graphs like the Tube's, just as we internalize programming language syntax trees. A long-time programmer generates syntax with few errors; a long-time Londoner knows how to get from one station to another, even with reroutes cropping up.

Both types of understanding are important in software, transit networks, and elsewhere. We have to write functions and to walk down streets. But when considering a software system as a whole, or when moving across large distances in London, fine-grained details are distractions. We need the abstract maps to mitigate our finite minds.

To understand software systems, programmers commonly think at four different levels of abstraction. First, the code in all of its complexity: function bodies, statements, expressions, and so on. Second, the call graph, which discards the functions' internals. Third, the module interaction graph, which discards individual functions entirely. Fourth, the package dependency graph, which discards all internal structure of the packages, and is shared by all users of a language.

The next several sections will dig into each of those in detail. We'll find similar graph structures in each, but we'll also find nuanced differences that stop us from making tempting, but naive, generalizations.

The call graph

Call graphs are the most familiar of software's many graphs. Here, nodes represent functions and edges represent calls between those functions. At this level of analysis, we don't care which line of function A calls function B; that's abstracted away. We only care that the call happens somewhere inside of A.

As an example, Underscore.js defines its first function as shown below. (If the details of these functions' internals aren't interesting, skipping them is OK; we're concerned only with the structural relationships here.)

_.first = function(array, n, guard) {
  if (array == null) return void 0;
  if (n == null || guard) return array[0];
  return _.initial(array, array.length - n);
};

This references the function initial, which returns a list with the last n elements removed.

_.initial = function(array, n, guard) {
  return slice.call(array, 0, Math.max(0, array.length - (n == null || guard ? 1 : n)));
};

initial in turn references slice and Math.max. We can visualize this small piece of Underscore as a call graph:

This graph is acyclic: we can't start at a node, follow a series of edges, and end up back at the node that we started at. Acyclic call graphs are desirable because they let us put function calls in a sensible order. In this example, we can think of first as the entry point; it sets off the series of calls to other functions. We can abstract those details away if we like, thinking of first as a black box, ignoring its call to initial which calls yet more functions.

Things are more difficult when we consider call graph cycles, which correspond to recursive function calls. For example, the recursive factorial function calls itself:

function fact(n) {
  if (n == 0) {
    return 1;
  } else {
    return n * fact(n - 1);
  }
}

Its call graph is the smallest possible cyclic graph. We can begin at fact, follow the only available edge, and end up back at fact, forming a cycle:

Cyclic call graphs become confusing quickly, mirroring the confusion of recursion in general. With many calls between the beginning and end of the recursive call relationship, we might not even notice the recursion. In this example, we may miss the recursion until we examine all of g, h, i, and j, which calls back to g.

This is why recursion involving multiple functions is confusing: the recursion is only visible after we trace through several calls. Our graph diagram above shows the recursion clearly, but that same recursion is invisible when looking at any given function in the code.
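
As a sketch of how this happens, here are four functions whose cycle is invisible in any single definition (the logic is made up; only the call structure matters):

function g(n) { return n <= 0 ? 0 : h(n); }  // looks like a plain call to h
function h(n) { return i(n) + 1; }           // looks like a plain call to i
function i(n) { return j(n) * 2; }           // looks like a plain call to j
function j(n) { return g(n - 1); }           // ...but this closes the cycle back to g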

Reasoning about call graphs

Every program's call graph is different, so we don't memorize them like we memorize, say, language syntax. However, conscious thought about graphs is useful, especially in confusing edge cases like recursion. How many edges – calls – are there in the recursive loop? One is fine if it's justified, as in fact. Two might throw people off. With three, we should add a signpost comment warning future developers that something unusual is happening.

We also rarely think about call graphs consciously during maintenance, but we perform maintenance according to the graph even if we don't realize it. Suppose that we want to change Underscore's initial function, for example. Changing initial might break first, which uses it. We examine first: did its behavior change when we changed initial? Maybe we find that its behavior did change, and that the change is desirable – it fixed a bug. Now, we have to ask the same question again: which functions are affected by this change to first's behavior? Are any such changes desirable? And so on, expanding outward through the call graph.

The fancy computer science term for this is breadth-first search. We change the function; then we examine each of the functions that used it, noting the ones whose behavior changed; then we examine functions affected by those changes; and so on. We're following edges "backwards" here: after changing initial, we notice that first calls it – points to it in the graph – so we examine first as well.

We rarely think "I will now do a backwards breadth-first search on the call graph"; and yet, that's exactly what we find ourselves doing. If we don't do this, we'll sometimes change functions' behavior accidentally, introducing bugs.
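
Here's a sketch of that backwards breadth-first search over a call graph represented as an adjacency list. (The representation and function names are assumptions for illustration, reusing the Underscore example from earlier.)

// Caller -> callees. "max" here stands for Math.max.
var calls = {
  first: ["initial"],
  initial: ["slice", "max"],
  slice: [],
  max: []
};

// Which functions might be affected by a change to `changed`?
// Follow call edges backwards, breadth-first.
function affectedBy(changed) {
  var affected = [changed];
  var queue = [changed];
  while (queue.length > 0) {
    var current = queue.shift();
    Object.keys(calls).forEach(function(caller) {
      if (calls[caller].indexOf(current) !== -1 && affected.indexOf(caller) === -1) {
        affected.push(caller);  // the caller's behavior may have changed too
        queue.push(caller);
      }
    });
  }
  return affected;
}

// affectedBy("initial") => ["initial", "first"]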

Tracking these changes is a big maintenance burden, but thinking about the graph can reduce the cost. For example, if we avoid writing functions called from 500 different places, we won't find ourselves checking 500 call sites for impacts from a change. Or, to be more honest about software maintenance: we won't find ourselves avoiding ever changing that function. Experienced programmers intuitively distrust complex functions called from 500 places: complex functions tend to change, and a change puts all 500 of those call sites in question.

Many other call graph structures suggest danger. We've seen two: cycles across many nodes, forming a non-obvious recursive loop; and complex functions with many incoming edges – 500, say. More exist, but cataloging them is tiresome and probably not valuable.

Instead of trying to memorize a catalog of graph shapes, it's better to just keep an eye on the call graph, and on other graphs that we'll see soon. When a particular change is difficult, ask: what's the structure here? Is there some particular notable shape, or is it just a boring tree of calls or modules? Over time, certain shapes will recur; patterns will become clear. Noticing them early can save frustration and maintenance costs later.

Module interaction graphs

Call graphs are important, but they become unwieldy beyond a few dozen functions or so. When tracking the call graph in our heads, it becomes unwieldy even faster. To analyze large-scale design properties, we need to abstract further than the function level. Instead of asking "which functions call which other functions?", we can ask "which modules call functions in which other modules?" This gives us a module interaction graph.

As before, we'll ignore the specific internal details of the modules. For the function call graph, we didn't note which line of code the call came from. For the module interaction graph, we won't note which specific function in module A called which function in module B.

Our example will be rubygems.org, the website of Ruby's package repository. It's 3,365 lines of Rails code, which is big enough to show real-world complications but still small enough to be tractable as an example. There are 389 method definitions, and many more method call points than that, making the call graph too large for direct consideration.

Rubygems.org's module interaction graph is shown below, where an edge from A to B indicates a call from some function in module A to some function in B. There's a lot of complexity here, so here's a suggested goal: scan through the diagram, looking at the number of edges coming into and out of a given node. Ask yourself: which modules are most heavily connected to other modules?

(Now, some notes on this graph for the curious. These can be skipped safely. The diagram was generated by using Ruby's set_trace_func to observe method calls during a full run of the test suite. rubygems.org has good test coverage, so this should present a good approximation of the true runtime interactions in production. A minority of tests failed for environmental reasons, but that will have only a small effect on this graph. The results are limited to function call relationships between modules; inheritance isn't considered. About ten nodes not connected to the main graph were removed. One node, Redirector, was removed because it referenced every controller and caused layout trouble. None of the removed nodes had significant design implications. Revision ec03f00 of the RubyGems.org repository was used, limited to the app directory, which contains 3,365 lines of Ruby.)

The most prominent features of this graph are three ubermodules with disproportionately many incoming edges: User, Version, and Rubygem. User classes in particular are often highly connected to other modules; most web applications have one like this. This makes intuitive sense: most functionality concerns users in some way, making User an easy dumping ground for application code. (Sometimes, users are represented by both a User and a Profile, in which case the latter usually becomes the dumping ground.)

The other two ubermodules, Rubygem and Version, represent a Ruby package and a version of a package, respectively. Rubygems.org allows users to publish versions of Ruby gems, so it's no surprise that User, Rubygem, and Version are referenced frequently. Most web applications will have a handful of heavily-connected, domain-specific classes like this. The number varies, but few applications have none.

Structure of ubermodules

Ubermodules in application designs get two main types of reactions. One reaction is: it's fine; the application is about users uploading versions of Ruby gems, so User and Rubygem and Version should be referenced everywhere. This is a common sense argument and apt in many cases.

Small applications and some medium-size applications – 5,000 lines of Ruby, say – do fine with this kind of design. Many 2,000-line applications could even be written as a single long source file without suffering much. In a simple system, there's just not enough complexity to cause big problems.

More complex systems are a different story, with the argument against ubermodules growing as the system grows. Seeing why requires us to dig into other structural properties of the system.

From experience, I can confidently predict two facts about rubygems.org. First, all three of User, Version, and Rubygem will be among the top five most-changed source files. Second, the same will be true for line length: these three files will be among the five longest modules.

We'll check my predictions with some shell commands. The first lists source files along with the number of commits that have changed them, printing the top five most-changed files. (It's OK to skip the command itself if it's not interesting. If you do read it, note that > lines are continuations of the command to fit it on the screen.)

$ find app -name '*.rb' |
> while read f; do
>   echo "$(git log --oneline $f | wc -l) $f"
> done |
> sort -n |
> tail -5
      87 app/models/pusher.rb
      92 app/helpers/rubygems_helper.rb
     111 app/models/user.rb
     235 app/models/version.rb
     300 app/models/rubygem.rb

My "in the top five" guess was too conservative! The three most heavily-referenced modules in the module graph – User, Version, and Rubygem – are also the three most frequently changed source files. Now for the second prediction: the ubermodules will be among the longest source files. This command lists source files and their length, limited to the top five.

$ find app -name '*.rb' |
> while read x; do
>   wc -l $x
> done |
> sort -n |
> tail -5
137 app/models/pusher.rb
142 app/models/concerns/rubygem_searchable.rb
159 app/models/user.rb
318 app/models/rubygem.rb
372 app/models/version.rb

Again, I was too conservative: the three ubermodules are also the longest in the entire application.

These two facts will be true in most applications; it seems like a natural law of software development. In software systems, the most highly-connected modules are also the most frequently changed and the longest in terms of lines.

Arguments against ubermodules

Now we can analyze the argument against this kind of design for larger systems. As the system grows in size, the design trends will continue. Ubermodules like User will grow even more incoming references, change even more frequently, and grow even longer in lines.

Suppose that, ten years from now, rubygems.org is ten times larger. Instead of 25 methods in 159 lines of code, User now has 250 methods in 1,590 lines of code. There are now 90 modules referencing User instead of 9.

(Two quick notes about this imagined system. First, 250 methods across 1,590 lines of code is conservative; ask friends who maintain large legacy Rails apps about their User or Profile classes! Second, we're naively assuming linear growth of User's connectedness, change frequency, and length in lines. In reality, that growth is probably superlinear, but the linear assumption is sufficient for the points made here.)

Changing any method requires analyzing impact on the rest of the system, ideally done in two stages. First, we ask "which functions in this module call the changed function?" We examine each of those functions in case their behavior was affected. Then, we ask "which modules use this module?". Again, we examine each module that might be affected by the change.

This process breaks down with our typical lop-sided designs. In our case, there are no module boundaries between User's 250 methods, so changing one of them puts them all in question. We also have 90 incoming modules to consider: did our change break any of them? We don't get the full benefit of modules because each change to User puts so much of the system in question.

As the system grows, ubermodules grow more and more connected to the rest of the system, while their change frequency simultaneously grows. In practice, this causes maintenance to break down. We stop asking "what other functions do I have to change for this to work?" because there are just too many to consider.

The methods in User begin to diverge, growing inconsistencies and special cases. Some raise exceptions for errors, but others return nil, because one or the other was expedient at some point in the past. Some assume that necessary database records already exist, but others implicitly create them when they don't exist.

Inconsistency makes us uncertain, so we begin to program defensively, adding paranoid checks. Many methods try to catch exceptions that may never be thrown, because we're afraid of such an exception taking the system down. Other methods check for a database record's existence, only saving it if needed, instead of consistently saving or not saving. And so on. Over time, User becomes not only the largest and most-changed module, but also the module that we're most afraid to change.

(Again, if this isn't familiar, and you know someone who works on a legacy Rails app, ask them about their User or Profile class! Better yet, sit down with them and search User or Profile for the keyword if, asking exactly why each conditional exists.)

Improving the module graph

Ubermodules like User are a nearly universal disease of the graph, manifesting as a vague fear about our changes. It's hard to see the problem in any more detail while looking at a text editor; we just can't see enough of the system at one time.

Stepping back to the module graph clarifies these phenomena. How complex is the call graph within a module? How heavily connected is the module interaction graph? Are there nodes in these graphs that are disproportionately large, or disproportionately connected, or both?

Ubermodules indicate lop-sided design graphs. They have too much internal complexity (250 methods in our imagined rubygems.org of the future). They're referenced by too many other modules (90 incoming edges). They change too frequently (#1 in terms of commits modifying them). They contain too much text (#1 in terms of lines of code). All of these properties are highly correlated, with multiplicative effects on our confusion.

How do we fix this? Returning to an earlier example: suppose that we extract OrderHistory from User in a standard ecommerce site. Now, any module accessing order history loses its module interaction edge to User. When order history code changes, we don't have to ask which of the 90 modules referencing User might be affected. Only a few modules will care about order history, so our change analysis is limited to those.

With small modules like OrderHistory, we can easily make behavioral guarantees. We might decide that OrderHistory is read-only; it won't write to the database. Or, we might decide that its methods always raise exceptions on errors; they never return nil. Now we don't need to guess about those properties when calling OrderHistory from the outside, even though we may choose other constraints for other modules. We can be more confident and less defensive when calling methods, both internally within OrderHistory and from the outside. Repeat that a few times and the system starts to feel more maintainable.
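
As a sketch, an extracted OrderHistory with those guarantees might look like this (find_transactions is a hypothetical helper; the guarantees, not the details, are the point):

var OrderHistory = {
  // Guarantee 1: read-only -- no method here ever writes to the database.
  // Guarantee 2: errors raise exceptions -- no method here ever returns null.
  latest_transaction_date: function(userId) {
    var transactions = find_transactions(userId);  // hypothetical read-only query
    if (transactions.length === 0) {
      throw new Error("no transactions for user " + userId);
    }
    return transactions[transactions.length - 1].date;
  }
};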

The solution, in short, is simple on its face: break the large modules down into smaller pieces, and make the methods in those small modules similar to each other. In practice, this is subtle and difficult. However, it's important to avoid worrying over which module to extract first. In advanced ubermodule cases, extracting almost any module can be beneficial, even if it's not the ideal extraction. The first step is to introduce some kind of module boundary. We can always move the methods around later.

The package dependency graph

Most programming languages have some kind of package manager. Our example will be RubyGems, which manages Ruby packages called gems.

Among many other features, RubyGems allows gems to specify dependencies. Gems can depend on other gems depending on yet more gems, just as functions can depend on other functions depending on yet more functions. As with functions, this forms a graph – the package dependency graph.

We'll use the Rails gem as an example to explore packaging. It's a large and stable package, with a nontrivial but still tractable dependency graph.

As of this writing, the current version of Rails is 5.0.2. There are 11 explicit dependencies, 10 of which are owned by the Rails organization itself. Many of those packages have their own dependencies, with 38 total packages in the dependency graph.

Rails 5.0.2's full dependency graph is shown below. Nodes with solid edges are gems owned by the Rails organization on GitHub; dashed nodes are third-party gems. There's no need to read through this exhaustively, but do make note of the one node whose connectedness stands out.

The activesupport gem is a clear outlier in this dependency graph, like User is in most web applications' module graphs. Rails developers will know the reason: activesupport is a junk drawer of useful utilities, mostly extensions to the Ruby standard library. It's heavily used by Rails itself, applications using Rails, and even other libraries unrelated to Rails.

For the internal design of systems, an ubernode like this is widely considered bad design. In packaging, the situation is less clear; there's little consensus about any property of the package graph.

Package granularity

During their rise to popularity, the Ruby and Rails ecosystems were thought to take package decomposition too far. Both Rails users and Rails detractors would lament projects depending on 100 or 200 or, for large applications, perhaps 500 packages. Ten years later, react-starter-kit, a JavaScript application starter kit, has 1,090 dependencies but no actual user-facing behavior. By comparison, a newly generated Rails 4.2.8 project depends on 55 packages: the 38 shown in the graph above, plus 17 more included as default, but optional, dependencies.

This isn't exactly an apples to apples comparison; no comparison between large projects can be. Still, it's illustrative of the trend: no project in the Ruby world has 1,090 dependencies, and even simple Node modules have more dependencies than all of Rails combined. Package granularity is unambiguously different between these two ecosystems.

It's difficult to draw any conclusion about package granularity from where we stand now. Maybe it will keep increasing. A Rails project's 100 or 500 packages were once ridiculed but now seem conservative. Will a React project's 1,000 or even 5,000 packages eventually be uncontentious or even conservative? Maybe, but there are reasons to be skeptical.

If a dependency becomes unmaintained or introduces breaking changes, then Rails and its many users are impacted. As a result, Rails' maintainers have to keep most or all of the dependency graph shown above in their heads. Likewise, the maintainers of Rails' dependencies know at least one small corner of the Rails dependency graph: they know that Rails depends on them, so they usually avoid sudden changes that will break Rails.

There have been hints of long-term dependency problems even in the Rails of today. For example, Rails' asset pipeline is implemented by the Sprockets Ruby gem. The main developer of Sprockets was responsible for more than 70% of commits, but left abruptly in 2016. Rails also depends on erubis, an implementation of the ERB templating language whose last commit was six years ago as of 2017.

If these projects were internal to Rails, they would presumably have more contributors and less susceptibility to abandonment. Rails' authors saved time up front by reusing outside packages, but they also exposed themselves to package abandonment risk. Increasing package decomposition, as in Node, increases this abandonment risk; there are simply more maintainers involved. A larger dependency graph also means more dependencies of dependencies – transitive dependencies – so there may be a third-party package "between" us and an unmaintained package.

This raises a question: can't we just fork the unmaintained packages? Yes, strictly speaking, but the devil is in the details. What if a dependency of a dependency of a dependency has a severe security bug, but multiple packages between us and that package are unmaintained? Do we fork them all? If so, we have to learn several packages' guts, each in its author's own style. We have to fork them all, modifying each to use our fork of the next package in the chain.

If we do fork and publish our own security-fixed chain of these packages, other people will begin using them as soon as we publish them. We just became maintainers of code that we don't actually know. If we eventually abandon those forks, we make the community's abandonment problem even worse. Now everyone who used our forked versions is relying on an unmaintained set of packages modified by someone who didn't understand them well.

In short, "just fork it" is usually wishful thinking; it ignores the real complexities of the situation.

On the bright side, smaller packages certainly mean less impact from any given direct dependency going unmaintained. We may be able to use that fact to mitigate the downsides of Node-style fine-grained packaging.

Suppose that we avoid deep dependency chains, preferring shallow but wide dependency graphs. A wide and shallow graph means fewer packages between us and any other package, leaving fewer opportunities for distant unmaintained packages. There may be other pitfalls in that approach, and we may not even try it, but there are at least potential mitigations to the downsides of small packages.

Like everything else in software, package ecosystem design is primarily the act of balancing graph complexity at different scales. The big question right now is: how large should nodes be? Packaging is a globally social process, which complicates the issue, but some of the fundamental questions are the same as for functions or modules. How connected are nodes? How large are nodes? Are there ubernodes with inordinate numbers of incoming connections? Are there cycles, and how many nodes are involved? Etc.

The utility of graph analysis

Most properties of software systems are structured as graphs; those examined here are a small sample. When programming in a text editor, these graphs remain implicit. We see functions calling other functions, but we don't see the bigger picture unless we intentionally step back.

Thinking explicitly about graphs becomes more valuable as systems grow. That doesn't necessarily require automated analysis as done for many of the graph diagrams in this article; it can mean drawing out the relationships as you understand them, or even constructing the graph in your head.

Intentional analysis of the graph can lead to many insights, like:

  • Why do so many classes reference User? Are they relying on several discrete ideas erroneously buried within User, and should we extract those ideas into new modules? (That is, do the nodes represent clear concepts?)
  • Why do the arrows in this call graph go like initialize -> load -> analyze? Analyzing isn't part of loading, and loading isn't part of initializing, yet that's how it's implemented. Can we introduce a new function that calls these three in sequence, expressing that linear process better? (That is, do the edges represent relationships as we understand them, or are they the accidental results of historical changes?)
  • Why do modules referencing Transaction also reference User in so many cases? Is there a subset of User and Transaction's behaviors that can be extracted as, say, OrderHistory? (That is, do other modules connect to sets of related modules in ways that indicate missing abstractions?)

Graphs are omnipresent in software. Sometimes, thinking about them is undeniably important (module interaction graphs in medium to large systems). Sometimes, they may be overshadowed by external factors (social dynamics complicating package dependency ecosystems). Because they show up frequently, sometimes in critical ways, learning to think about them pays for itself over and over again. It shows us the structure in our systems, allowing us to understand them at multiple levels of meaning simultaneously.

This article is part of The Programmer's Compendium.
