
<!doctype html><html lang=en><head><meta charset=utf-8><title>SRE School: No Haunted Forests</title><meta property=og:title content="SRE School: No Haunted Forests"><meta name=twitter:title content="SRE School: No Haunted Forests"><meta property=og:type content=article><meta name=author content="John Millikin"><meta property=og:site_name content="John Millikin"><link rel=icon type=image/ href="/static/favicon.ico?v=52992a312ad6032f7d43a8be7fb00b0820a921cb"><meta name=viewport content="width=device-width,minimum-scale=1,initial-scale=1"><script type=application/javascript src="/static/site.js?v=52992a312ad6032f7d43a8be7fb00b0820a921cb"></script><link rel=canonical href=><meta property=og:url href=><meta name=twitter:url href=><meta name=twitter:domain><link rel=stylesheet href="/static/prism.css?v=52992a312ad6032f7d43a8be7fb00b0820a921cb"><script type=application/javascript src="/static/prism.js?v=52992a312ad6032f7d43a8be7fb00b0820a921cb" data-manual></script><template id=nav-template><style>.subtitle{font-weight:400;font-style:italic;font-family:palatino,serif;white-space:normal;font-size:14px;padding:5px 0 0 8px}a{color:#f7f7f7}nav ul{margin:0;padding:0}nav ul li{display:inline-block;list-style-type:none}nav .caret{content:"";display:inline-block;height:0;width:0;vertical-align:middle}nav ul.nav-level-0>li>a>.caret{border-top:4px solid #f7f7f7;border-right:4px solid transparent;border-left:4px solid transparent}nav ul.nav-level-1 li>a>.caret{border-bottom:4px solid transparent;border-top:4px solid transparent;border-right:4px solid transparent;border-left:4px solid #f7f7f7;margin:0 0 0 8px}nav ul.nav-level-1 li:hover>a>.caret{border-left:4px solid #1b1b1b}nav ul>li{border:1px solid #3b5569}nav ul>li a{display:block;padding:12px;text-decoration:none}nav ul>li>div{background-color:#5c7396;color:#f7f7f7;border:1px solid #f7f7f7;display:none;margin:0;opacity:0;position:absolute;visibility:hidden;white-space:pre;max-width:350px}nav ul>li>div ul>li{display:block}nav 
ul>li>div ul>li>a{display:block;text-decoration:none}nav ul>li>div ul>li:hover>a{background-color:#8ca9c0;color:#1b1b1b!important}nav ul>li:hover>div{display:block;opacity:1;visibility:visible}nav ul.nav-level-0>li>a{text-transform:uppercase;letter-spacing:1px}nav ul.nav-level-0>li:hover{border:1px solid #f7f7f7;border-radius:4px}nav .nav-software{width:150px}nav ul.nav-level-1 li>div{margin-top:-47px;left:150px;min-width:350px}</style><nav><ul class=nav-level-0><li><a href=/>Home</a></li><li><a role=button>References <span class=caret></span></a><div><ul class=nav-level-1><li><a href=/bazel/toolchains>Bazel Toolchains</a></li><li><a href=/effective-grpc>Effective gRPC</a></li><li><a href=/more-effective-go>(More) Effective Go</a></li><li><a href=/monads>Monads<div class=subtitle>Every Haskell programmer is required to post one bad monad tutorial.</div></a></li><li><a href=/the-fuse-protocol>The FUSE Protocol<div class=subtitle>I just want to virtualize a filesystems. Why is this so hard?</div></a></li><li><a href=/unix-syscalls>UNIX Syscalls<div class=subtitle>BSD, Linux, and 40 years of bad decisions walk into a bar.</div></a></li></ul></div></li><li><a role=button>Software <span class=caret></span></a><div class=nav-software><ul class=nav-level-1><li><a role=button>Obsolete <span class=caret></span></a><div><ul class=nav-level-2><li><a href=/software/anansi>Anansi<div class=subtitle>A NoWeb-inspired literate programming preprocessor</div></a></li><li><a href=/software/chell>Chell<div class=subtitle>A quiet test runner for Haskell</div></a></li><li><a href=/software/copper>Copper<div class=subtitle>Crash-proof unit tests for C and C++</div></a></li><li><a href=/software/daily-wtf-contest-entry>Daily WTF Contest Entry<div class=subtitle>For the Worse Than Failure Programming Contest (2007)</div></a></li><li><a href=/software/haskell-cpython>haskell-cpython<div class=subtitle>Calling Python libraries from Haskell</div></a></li><li><a 
href=/software/haskell-dbus>haskell-dbus<div class=subtitle>D-Bus implementation for Haskell</div></a></li><li><a href=/software/haskell-enumerator>haskell-enumerator<div class=subtitle>An implementation of Oleg Kiselyov’s left-fold enumerators</div></a></li><li><a href=/software/haskell-gnome-keyring>haskell-gnome-keyring<div class=subtitle>GNOME Keyring bindings for Haskell</div></a></li><li><a href=/software/haskell-ncurses>haskell-ncurses<div class=subtitle>NCurses bindings for Haskell</div></a></li><li><a href=/software/haskell-options>haskell-options<div class=subtitle>An easy-to-use command-line option parser for Haskell</div></a></li><li><a href=/software/haskell-re2>haskell-re2<div class=subtitle>re2 bindings for Haskell</div></a></li></ul></div></li><li><a href=/shell-snippets>Shell Snippets</a></li></ul></div></li><li><a role=button>SRE School <span class=caret></span></a><div><ul class=nav-level-1><li><a href=/sre-school/health-checking>Health Checking</a></li><li><a href=/sre-school/instrumentation>Instrumentation</a></li><li><a href=/sre-school/no-haunted-forests>No Haunted Forests</a></li></ul></div></li><li><a role=button style="padding:6px 12px 11px">🤔 <span class=caret></span></a><div><ul class=nav-level-1><li><a href=/🤔/error-beneath-the-wavs>Error Beneath the WAVs<div class=subtitle>Does your CD drive have enough sheep?</div></a></li><li><a href=/🤔/why-i-ripped-the-same-cd-300-times>Why I Ripped The Same CD 300 Times<div class=subtitle>"Plumbing the depths of obsession" – Jeff Atwood</div></a></li><li><a href=/🤔/case-report-surugaya-mojibake>Mojibake in Surugaya Javascript<div class=subtitle>Bad unicode breaks an e-shop.</div></a></li><li><a href=/🤔/python-lambda-only>Python (Lambda Only)<div class=subtitle>Approved by the Ministry of Silly Styles.</div></a></li><li><a href=/🤔/recreators-episode-21>Re:Creators Episode 21<div class=subtitle>Lets nitpick Latin grammar in a cartoon.</div></a></li></ul></div></li><li><a role=button>Other <span 
class=caret></span></a><div><ul class=nav-level-1><li><a href=/links>Links</a></li><li><a href=/reddit-the-good-parts>Reddit: The Good Parts</a></li><li><a href=/reddit-front-page-2018>Reddit Front Page (2018)<div class=subtitle>Removing 9gag from Reddit is like unscrambling an egg.</div></a></li></ul></div></li><li style=vertical-align:sub><a href=/changes.xml title="Change Feed" style=padding:8px data-no-instant><img src=/static/feed-icon-24x24.png style=width:19px;height:19px></a></li></ul></nav></template></head><body><blog-layout><blog-article posted=2018-11-01T06:19:20Z><h1 slot=title>SRE School: No Haunted Forests</h1><p>All industrial codebases contain bad code. To err is human, and situations get very human when you're staring down the barrel of a launch deadline. You've heard the euphemism <i>tech debt</i>, where, like a car loan, you hold a recurring obligation in exchange for immediate liquidity. But this is misleading: bad code is not merely overhead, it also reduces optionality for all teams that come in contact with it. Imagine being unable to get indoor plumbing because your neighbor has a mortgage!</p><p>Thus a better analogy for bad code is a haunted forest. Bad code negatively affects everything around it, so engineers will write ad-hoc scripts and shims to protect themselves from direct contact with the bad code. After the authors move to other projects, their hard work will join the forest.</p><p>Healthy engineering orgs do not tolerate the presence of haunted forests. 
When one is discovered, you must move vigorously to contain, understand, and eradicate it.</p><p>Make this the motto of your team: No Haunted Forests!</p><div style="float:left;margin:0 2em 2em 0"><img src=/sre-school/no-haunted-forests/322330_20181030192733_1.png style=max-width:400px><p style=margin-top:.5em><i>Engineer debugging a Puppet manifest (2018, colorized)</i></p></div><blog-section><h2 slot=title>Identifying a Haunted Forest</h2><p>Not all intimidating or unmaintained codebases are haunted forests. Code may simply be difficult for a newcomer to get up to speed on, or it might be a stable implementation of some RFC. A few rules of thumb for identifying code worthy of a complete rewrite:</p><ul><li>Nobody at the company understands how the code should<blog-footnote-ref>[<a href=#fn:1>1</a>]</blog-footnote-ref> behave.</li><li>It is obvious to everyone on the team<blog-footnote-ref>[<a href=#fn:2>2</a>]</blog-footnote-ref> that the current implementation is not acceptable.</li><li>The project's missing features or erroneous behavior are impacting other teams.</li><li>At least one competent engineer has attempted to improve the existing codebase, and failed for technical reasons.</li><li>The codebase is resistant to static analysis, unit testing, interactive debuggers, and other fundamental tooling.</li></ul></blog-section><blog-section><h2 slot=title>Haunted Environmentalists</h2><p>Fresh graduates often push for a rewrite at the first sign of complexity, because they've spent the last four years in an environment where codebase lifetimes are measured in weeks. After their first unsuccessful rewrite they will evolve into Junior Engineers, repeating the parable of <a href=>Chesterton's Fence</a> and linking to that old Joel Spolsky thunkpiece about Netscape<blog-footnote-ref>[<a href=#fn:3>3</a>]</blog-footnote-ref>.</p><p>Be careful not to confuse this reactive anti-rewrite sentiment with true objections to your particular rewrite. 
Remind them that Joel wrote it when source control meant <a href=>CVS</a>.</p></blog-section><blog-section><h2 slot=title>Clearing Haunted Forests</h2><p>Rewriting an existing codebase should be modeled as a special case of a migration. Don't try to replace the whole thing at once: systematize how users interact with the existing code, insert strong API boundaries between subsystems, and make changes intentionally.</p><p><b>User Interaction</b> will make or break your rewrite. You must understand what the touch-points are for users of the existing system so that you can maintain <a href=>UI Compatibility</a> and avoid exposing them to unplanned changes. Often rewrites mandate some changes, so try to put them all near the start (if you know what the final state should be) or delay them to the end (when you can make it seem like a big-bang migration). If the user-facing changes are significant, see if you can arrange for separate opt-in and opt-out periods during which both interaction modes co-exist.</p><p><b>Subsystem API Boundaries</b> let you carve up the old system into chunks that are easier to reason about. Be fairly strict about this: run the components in separate processes, separate machines, or whatever is needed to guarantee that your new API is the only mechanism they have to communicate. Do this recursively until the components are small enough that rewriting them from scratch is tedious instead of frightening.</p><p><b>Intentional Changes</b> happen when the new codebase's behavior is forced to deviate from the old. At this point you should have a good idea which behavior, if either, is correct. If there's no single correct behavior, it's fine to settle for "predictable" or (in the limit) "deterministic". By making changes intentionally you minimize the chances of forced rollbacks, and may even be able to detect users depending on the old behavior.</p><p>Work incrementally. 
A good rewrite is valid and fully functional at any given checkpoint, whether those checkpoints are commits, nightly builds, or tagged releases. The important thing is that you never get into a state where you're forced to roll back a functional part of the new system due to breakage in another part.</p></blog-section><blog-section><h2 slot=title>Common Features of Haunted Forests</h2><p>All bad code is bad in its own special way, but there are some properties that are especially likely to make it hard to refactor incrementally. These are generally programming styles that hide state, obscure control flow, or permit type confusion.</p><p><b>Hidden State</b> means mutable <a href=>global variables</a> and <a href=>dynamic scoping</a>. Both of these inhibit a reader's understanding of what code will do, and force them to resort to logging or debuggers. They're like catnip for junior developers, who value succinct code but haven't yet been forced to debug someone else's succinct code at 3 AM on a Sunday.</p><p><b>Non-Local Control Flow</b> prevents a reader from understanding what path execution will take. In the old days this meant <code>setjmp</code> and <code>longjmp</code>, but nowadays you'll see it in the form of callbacks and event loops. Python's <a href=>Twisted</a> and Ruby's <a href=>EventMachine</a> can easily turn into global callback dispatchers, preventing static analysis and rendering stack traces useless.</p><p><b>Dynamic Types</b> require careful and thoughtful programming practices to avoid turning into "type soup". Highly magical metaprogramming features like <code>getattr</code> or <code>method_missing</code> are trivially easy to abuse in ways that make even trivial bug fixes too risky to attempt. Tooling such as <a href=>Mypy</a> and <a href=>Flow</a> can help here, but introducing them into an existing haunted forest is unlikely to have significant impact. 
Use them in the new codebase from the start, and they might be able to reclaim portions of the original code.</p><p><b>Distributed Systems</b> can become haunted forests through sheer size, if no single person is capable of understanding the entire API surface they provide. Note that microservices don't automatically prevent this, because merely splitting up a monolith turns the internal structure into API surface. Each of the above per-process issues has distributed analogues: for example, S3 is global mutable state and JSON-over-HTTP is dynamically typed.</p></blog-section><blog-footnotes slot=footnotes><hr><ol><li id=fn:1><p>A codebase where nobody knows what behavior it <i>currently has</i> is materially different from one where nobody understands what behavior it <i>should have</i>. The former doesn't need to be rewritten, because you can grind its test coverage up and then safely refactor.</p></li><li id=fn:2><p>You will sometimes hear objections from people who have not worked directly on the bad code, but have opinions about it anyway. Let them know that they're welcome to help out and you can arrange for a temporary rotation into the role of Forest Ranger.</p></li><li id=fn:3><p>The <i>real</i> reason Netscape failed is that they wrote a dreadful browser, then spent three years writing a second dreadful browser. The fourth rewrite (Firefox) briefly had a chance at being the most popular browser, until Google's rewrite of <a href=>Konqueror</a> took the lead. The moral of this story: rewrites are a good idea if the new version will be better.</p></li></ol></blog-footnotes></blog-article></blog-layout></body></html>
