Destroy All Software Blog

Test Isolation Is About Avoiding Mocks

Posted by Gary Bernhardt on 2014-05-15

Isolated testing has an easily identified villain: the deeply-nested mock object. Around 21:35 in this discussion, Kent Beck mentions code with "mocks returning mocks returning mocks" and their stifling effect on refactoring. Kent is right: nested mocks make refactoring and maintenance difficult. They're usually a bad idea.

The clarification that's missing from most discussions of mocking, including the one in that video, is that experienced users of mocks rarely nest them deeply. Avoiding numerous or deeply nested mocks is the principal design activity of isolated TDD. Since none of the people in the video isolate or use mocks heavily, it's unsurprising that no one brought this up.

Years ago, when I was new to all of this, I did some truly terrible mocking. There's even a Destroy All Software screencast where I show some of that code as a cautionary tale. The mocks were numerous, deeply nested, and tightly coupled to objects' internals. Refactoring always meant rewriting many deep mocks in tests all over the system. I began my test isolation journey by doing it very, very badly.

I'll be merciful and leave that code out of this post; here's a much less painful method to think about. It computes the average value of all of a user's past transactions, which are related to the user via purchase records. (Some transactions, like refunds, don't correspond to a purchase, which is why there are two tables instead of one.)

def average_transaction_amount(user)
  purchases = user.purchases
  total = purchases.map(&:transaction).map(&:amount).sum
  total / purchases.count
end

This method has deep knowledge about the database's structure. It knows that users have many purchases; that purchases have one transaction; and that transactions have amounts. In the absence of significant mitigating factors, this is not good design. Other parts of the system will inevitably need to access, for example, a purchase's transaction. This deep structural knowledge will be duplicated.

When the schema changes, there will be a significant maintenance cost. That cost will be paid in the form of effort spent updating the duplicated points, and in risk of missing one of the updates. (Database schemas are only slightly special; this situation wouldn't change significantly if we were coupled to plain old object relationships.)
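
To make that duplication concrete, imagine a second method elsewhere in the system (hypothetical; it isn't from the original code) that needs the same structural knowledge:

def largest_transaction_amount(user)
  # Hypothetical example: this method duplicates the same knowledge of the
  # purchase-to-transaction relationship, so a schema change that breaks
  # average_transaction_amount breaks this method too.
  user.purchases.map { |purchase| purchase.transaction.amount }.max
end

Both methods now have to change together whenever the purchase-to-transaction relationship does.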

To tackle one small piece of the problem, we can remove knowledge of the purchase-to-transaction relationship. We might add an amount method to the purchase, with that method delegating to the associated transaction:

class Purchase < ActiveRecord::Base
  delegate :amount, to: :transaction
end

With this change, code that needs a purchase's amount (like the average_transaction_amount method above) doesn't have to know of, and couple to, the transaction's existence. The map(&:transaction) call disappears from the call chain. Coupling is reduced.

This isn't motivated by some abstract desire to reduce the number of dots in a method. It's motivated by reduced maintenance costs: fewer updates to the code, and fewer potential mistakes, when the purchase-to-transaction relationship changes.

Similar extractions can be done to remove any of the steps in the original method chain while leaving the rest of the chain intact. The obvious refactoring to has_many :through is missing, but it's just another example of the same idea: it reduces coupling and improves the design. After hundreds of small adjustments like this, a system will be much easier to change.
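
For reference, that has_many :through refactoring would look roughly like this (a sketch; the association names are assumed rather than taken from the original schema):

class User < ActiveRecord::Base
  has_many :purchases
  # Sketch of the has_many :through refactoring mentioned above.
  has_many :transactions, through: :purchases
end

With that association in place, callers can ask a user for its transactions without knowing that purchases sit in between.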

When testing the average_transaction_amount method in isolation, we have very little flexibility in test structure; that's how isolation works. If we want to isolate strictly, removing all dependencies at test time, then we have to write something like:

it "computes average transaction amounts" do
  user = stub(:purchases => [
    stub(:transaction => stub(:amount => 40.0)),
    stub(:transaction => stub(:amount => 80.0)),
  average_transaction_amount(user).should == 60.0

My hope is that this is at least somewhat hard to follow. Despite that, it's characteristic of tests that many people write. But is it good?

No. It's bad test design. Those nested stubs are telling us something about the method under test: it reaches deep into its user argument. The code under test can only traverse data that the test creates for it, so deep traversal of objects in the production code leads to deeply nested mocks in the tests. This is true even if the deep traversal isn't otherwise obvious.

In a hundred-line class with a dozen methods, the object traversal is often spread across many methods, each of which traverses only one level. That class is deeply, but invisibly, coupled to its collaborators. A glance at the isolated test tells us this, but getting that information from the code would require a slow, careful reading.
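
Here's the shape I mean, with invented names. No single method looks suspicious, but the class still reaches three levels into its collaborator:

class TransactionReport
  def initialize(user)
    @user = user
  end

  def total
    amounts.sum
  end

  private

  # Each private method traverses only one level...
  def amounts
    transactions.map(&:amount)
  end

  def transactions
    purchases.map(&:transaction)
  end

  def purchases
    @user.purchases
  end
  # ...but an isolated test of `total` still needs a triply-nested stub.
end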

This is a large part of the claim that isolated tests drive better design. Because an isolated test must set up all collaborators as mocks, the only way to reduce the mock complexity is to reduce the depth of collaboration. If we mock three levels deep, and do nothing to reduce that mock depth, then we're spending the effort to isolate without getting the benefit. If, however, we change the design to reduce coupling, then the mock depth will also be reduced, and isolation will earn its keep.

Deeply nested mocks tell us little about mocks, just as deeply nested conditionals tell us little about structured programming. Proficient users of structured programming rarely write deeply nested conditionals; proficient users of mocks rarely write deeply nested mocks.

In both mocking and structured programming, highly nested structure is a visual indication of a potential design problem. In both cases, the problem is hard to see otherwise because it's only expressed in implicit relationships between non-sequential lines of code. Control flow complexity is hard to gauge when we have to trace GOTOs; coupling is hard to gauge when we have to trace calls. Structured programming clarifies control flow; isolated tests clarify coupling.

If the primary design benefit of isolated unit testing is the mocks' visualization of interactions, can isolated tests be obsoleted by sufficiently advanced analysis and visualization tools? This is equivalent to asking: Is merely seeing the design problem sufficient?

Consider function size. It's always visible at a glance. Most of us think that small functions are better, yet hundred-plus line functions are common. Even those of us who like small functions write large ones and regret it. Why don't we react to the clearly-visible size of these functions? Why don't we decompose them?

The reason is that there's no pressure exerted on us. Writing large functions or classes requires less typing than writing many small functions or classes. It's easy to be lazy and give in to the false economy of adding just one more line to a function. With a suite of integration tests, adding one more line to an existing function rarely matters to the test because the test sees few of the system's internal boundaries.

An isolated test for a large, complicated component requires much more effort than a test for a small one. I've been doing this for a long time, relatively speaking, but isolating a nontrivial fifty-line function with ten collaborators would be so annoying that I probably wouldn't even attempt it outside of extreme situations. The cognitive burden of isolation grows very quickly when a function has more than a few collaborators or a couple levels of chained method calls.

The cost of isolating complex code counteracts the desire to be lazy and avoid extracting a new method or a new class. Choosing to isolate is a conscious choice to augment our programming with disciplined structure and visualization, just as choosing structured programming is. It reins in the size of functions and classes: it will frequently force us to either decompose our large components or give up on isolated testing.

To make that concrete, the code below repeats the original method and its test, this time assuming that we've extracted an amount method on the purchase, as was shown above:

def average_transaction_amount(user)
  purchases = user.purchases
  total = purchases.map(&:amount).sum
  total / purchases.count
end

it "computes average transaction amounts" do
  user = stub(:purchases => [stub(:amount => 40.0)),
                             stub(:amount => 80.0))])
  average_transaction_amount(user).should == 60.0

That small change makes a big difference in test readability. The production code has slightly less coupling to the database schema, but the test is much more intelligible as a result. Design isn't just reflected in isolated test setup; it's magnified. Isolated tests are a microscope for object interaction.

When I wrote mock-obsessed tests in that Python system years ago, I was easily averaging two mocks per test, and the true average may well have been three. Fast forward to the 2010s: Destroy All Software's Rails app has 197 tests, with 99 total mocks used. (Those are 83 stub objects and 16 mock expectations.) 79% of DAS' tests use no mocks or stubs whatsoever, but a small number of tests use several.

Many of those mock-free tests are on models, which I test in integration with the database. (I use skinny controllers and skinny models, a combination far too subtle to explain here. My model design practices are explained in a pair of screencasts: 1, 2.)

As an example of a more mock-heavy test, DAS' ChargePurchase class coordinates the payment processor, the user model, the credit card failure model (they're logged for auditing), the credit card, the mailer, and some constants, ultimately producing an order object. It's about fifty lines long with only three branches, all concerned with handling downstream errors. Most of its work is gluing pieces together to express the process of charging.

Despite doing little work itself, ChargePurchase has nine collaborators: six classes referenced by name and three arguments passed in. Most of those collaborators get stubbed (mocks aren't stubs, but I'm being sloppy with my terminology in this post to match common use). None of the stubs are nested, but that's still quite a bit of setup. I tolerate the unusual level of test setup complexity here because I like having the whole purchase-charging process centralized, allowing me to read through it linearly.
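
I won't reproduce the real test here, but its shape is roughly the following. The names and details are illustrative, not DAS' actual code; the point is that the setup is heavy but every stub is flat:

it "charges the card and produces an order" do
  # Illustrative only: the real test's collaborators and signatures differ.
  # Note that no stub returns another stub.
  user = stub(:email => "buyer@example.com")
  credit_card = stub(:number => "4242424242424242")
  payment_processor = stub(:charge => true)
  mailer = stub(:deliver_receipt => nil)

  order = ChargePurchase.charge!(user, credit_card,
                                 :payment_processor => payment_processor,
                                 :mailer => mailer)

  order.should_not be_nil
end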

ChargePurchase shows that test setup pain is only part of my design heuristic. Sometimes, as in this case, listening to it too closely will lead to code that's less understandable, so I don't listen. Mock setup exposes coupling, remember; not cohesion or other design properties. We're still on our own for those.

My use of mocks has changed significantly over time: even DAS' mock counts would be much lower if I built it today. First, most of its controllers are unit tested, sometimes with partial mocking. Today, I'd write thin controllers that are only tested at the system level. (Like so many of these points, I discuss that in a screencast: Web Apps: When to Test in Isolation.)

Second, DAS' objects call each other whenever it's immediately convenient for them. Today, I'd extract a functional core wherever it was natural, testing the core in isolation with no test doubles at all. (The Functional Core, Imperative Shell screencast is about that.)
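
As a sketch of what that extraction looks like (the names here are invented), the decision-making becomes a pure function whose isolated test needs no doubles at all:

module TransactionStats
  # Functional-core sketch: a pure function of plain values, so its
  # isolated test needs neither the database nor any stubs.
  def self.average(amounts)
    amounts.sum / amounts.count
  end
end

it "computes average transaction amounts" do
  TransactionStats.average([40.0, 80.0]).should == 60.0
end

The imperative shell is left with the job of fetching the amounts and passing them in.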

I doubt that anything is stubbed three levels deep, although I haven't done a rigorous audit. Maybe there are a couple of system boundaries where I did a bad job in the early days and stubbed deeply. In any case, it's rare for something to be stubbed even two levels deep. When I spend the effort to isolate with mocks, I'm doing it because I want to listen to the design feedback, not fight it.

In addition to avoiding nested mocks, I've been using fewer mocks over time, even when I'm writing isolated tests. The old Python system I mentioned had multiple mocks per test on average. Early DAS code written under time pressure averages around one mock per test. Later DAS code is under half a mock per test. Moving into late 2013, all of Selecta's logic is tested in isolation with no test doubles at all. (That's Functional Core, Imperative Shell again.)

This post was triggered by Kent's comment about triply-nested mocks. I doubt that he intended to claim that mocking three levels deep is inherent to, or even common in, isolated testing. However, many others have proposed exactly that straw man: they misrepresent isolated testing in order to discredit it, presenting deep mocks, which good isolated testing avoids, as though they were fundamental to it. That straw man is at the root of the claim that mocking inherently makes tests fragile and refactoring difficult. That claim is very true of deep mocks, but not very true of mock-based isolation done well, and it certainly isn't true of isolation done without mocks.

In a very real sense, isolated testing done test-first exposes design mistakes before they're made. It translates coupling distributed throughout the module into mock setup centralized in the test, and it does that before the coupling is even written down. With practice, that preemptive design feedback becomes internalized in the programmer, granting some of the benefit even when not writing tests. There may be other paths to that skill, but I'm still learning from my tests after seven years of isolating around 50% of the time. This path also happens to produce a trail of sub-millisecond tests fully covering every component designed using it, which is alright with me.