Architecture refactoring

Feb 23, 2010 | George Fairbanks

Here is a short excerpt from my forthcoming book on software architecture:

Over time, a team understands increasingly well how a system should be
designed. This is true regardless of which kind of process a team
follows (e.g., waterfall or iterative processes). In the beginning,
they know and understand less. After some work (design, prototyping,
iterations, etc.) they have better-grounded opinions on suitable designs.

Once they recognize that their code does not represent the best design
(e.g., by detecting code smells), they have two choices. One is to
ignore the divergence, which yields technical debt. If allowed to
accumulate, the system will become a Big Ball of Mud. The other is to
refactor the code, which keeps it maintainable. This second option is
well described by Brian Foote and William Opdyke in their patterns on
software refactoring.
Refactoring, by definition, means re-design, and its scale can
vary. Sometimes a refactoring involves just a handful of objects or
some localized code. Other times it involves more sweeping
architectural changes, called architecture refactoring. Since little
published guidance exists, architecture refactoring is generally
performed ad hoc.

Example: Rackspace

The Rackspace company manages hosted computers that serve
email. Customers will call up for help when they experience
problems. To help the customer, Rackspace must search the log files
that record what has happened during email processing. Because the
volume of emails they handle kept increasing, Rackspace built three
successive systems to handle these customer queries.

Version 1: Local log files

The first version of the program was simple. There were already dozens
of email servers generating log files. Rackspace wrote a script that
would use ssh to connect to each machine and execute a grep query on
the mail log file. Engineers could control the search results by
adjusting the grep query.
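Rackspace's actual script is not published; a minimal sketch of the approach in Python (the hostnames, log path, and function names here are illustrative assumptions) might look like this:

```python
import shlex
import subprocess

MAIL_LOG = "/var/log/mail.log"  # hypothetical location of each server's mail log


def build_search_command(host, pattern):
    """Build the ssh command that greps one email server's log remotely.

    shlex.quote protects the pattern from the remote shell, since ssh
    joins its arguments into a single command line on the far side.
    """
    return ["ssh", host, "grep", "-e", shlex.quote(pattern), MAIL_LOG]


def search_all_servers(hosts, pattern):
    """Run the grep on every server in turn and collect matching lines."""
    results = {}
    for host in hosts:
        proc = subprocess.run(build_search_command(host, pattern),
                              capture_output=True, text=True)
        results[host] = proc.stdout.splitlines()
    return results
```

The appeal of this design is that the "query language" is just grep, so engineers could adjust searches freely; the drawback, as noted above, is that every search burns CPU on the production email servers themselves.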

This version initially worked well, but over time the number of
searches increased and the overhead of running those searches was
noticeable on the email servers. Also, it required an engineer to
perform the search rather than a support tech.

Version 2: Central database

The second version addressed the drawbacks of the first version by
moving the log data off of the email servers and by making it
searchable by support techs. Every few minutes, each email server
would send its recent log data to a central machine where it was
loaded into a relational database. Support techs had access to the
database data via a web-based interface.

Rackspace was now handling hundreds of email servers so the volume of
log data was significant. Their challenge was to get the log data into
the database as quickly and efficiently as possible. They settled on
bulk record insertion into merge tables, which enabled loading of the
data in only two or three minutes. Only three days of logs were kept
so that the database size would not hinder performance.
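The report does not give Rackspace's schema, and their MySQL merge tables have no direct SQLite equivalent, but the core idea, batching many records into one bulk insert rather than inserting rows one at a time, can be sketched with Python's built-in sqlite3 module (the table layout is a made-up example):

```python
import sqlite3


def bulk_load(conn, rows):
    """Insert a whole batch of log records in a single transaction.

    Batched loads are what let Rackspace ingest minutes of log data
    from hundreds of servers in only two or three minutes.
    """
    with conn:  # one transaction for the entire batch
        conn.executemany(
            "INSERT INTO maillog (ts, host, message) VALUES (?, ?, ?)",
            rows,
        )


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE maillog (ts TEXT, host TEXT, message TEXT)")
bulk_load(conn, [
    ("2010-02-23T10:00:00", "mail01", "delivered to user@example.com"),
    ("2010-02-23T10:00:02", "mail02", "bounced: unknown user"),
])
count = conn.execute("SELECT COUNT(*) FROM maillog").fetchone()[0]
```

Pruning would work the same way: a periodic `DELETE` of rows older than three days keeps the table small enough to query quickly.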

This system too encountered problems. The database server was a single
machine and, because of the constant loading of data and heavy query
volume, it was pushed to its limit with heavy CPU and disk
loads. Wildcard searches were prohibited because of the extra load
they put on the server. As the amount of log data grew, searches
became slower. The server encountered seemingly random failures that
were increasingly frequent. Dropped log data was gone forever since it
was not backed up. These problems led to a loss of confidence in the
system.
Version 3: Indexing cluster

The third version addressed the drawbacks of the second by saving log
data into a distributed file system and by parallelizing the indexing
of log data. Instead of running on a single powerful machine, it uses
ten commodity machines. Log data from the email servers is streamed
into the Hadoop Distributed File System which keeps three copies of
everything on different disks. At the time of their report, it had
over six terabytes of data spanning thirty disk drives, which
represents six months of search indexes.

Indexing is performed using Hadoop, which divides (or “maps”) the
input data, schedules it to be indexed in jobs, then combines (or
“reduces”) the partial results into a complete index. Jobs run every
ten minutes and take about five minutes to complete, so index results
are about fifteen minutes stale. Rackspace is able to index over 140
gigabytes of log data per day and has executed over 150,000 jobs since
they started the system.
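The report does not include Rackspace's Hadoop job code, but the map/reduce split it describes can be illustrated in miniature with plain Python (toy log lines, no distribution or scheduling):

```python
from collections import defaultdict


def map_phase(log_lines):
    """Map: emit a (term, line_number) pair for every word in every line."""
    for lineno, line in enumerate(log_lines):
        for term in line.lower().split():
            yield term, lineno


def reduce_phase(pairs):
    """Reduce: merge the emitted pairs into a term -> line numbers index."""
    index = defaultdict(set)
    for term, lineno in pairs:
        index[term].add(lineno)
    return {term: sorted(nums) for term, nums in index.items()}


logs = [
    "delivered to alice@example.com",
    "bounced from bob@example.com",
    "delivered to bob@example.com",
]
index = reduce_phase(map_phase(logs))
```

In the real system, Hadoop runs many map tasks in parallel over chunks of the distributed log data and shuffles their output to reduce tasks, which is what lets ten commodity machines index over 140 gigabytes per day.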

As in the second system, support techs have access via a web interface
much like a web search engine interface. Query results are available
within seconds. When engineers think up new questions about the data,
they can write a new kind of job and have their answer within a few
hours.
Architecture Refactoring

This example is best thought of as architecture refactoring. Each
refactoring of the architecture was precipitated by a pressing failure
risk. Object-level refactorings take a negligible amount of time and
therefore need little justification, so you should just go ahead and
rename that variable to be more expressive of its intent. An
architecture refactoring is expensive, so it requires a significant
risk to justify it.

Two important lessons are apparent. First, **design does not
exclusively happen up-front**. It is often reasonable to spend time
up-front making the best choices you can, but it is optimistic to
think you know enough to get all those design decisions right. You
should anticipate spending time designing after your project's
inception.
Second, **failure risk can guide architecture refactoring**. By the
time it is implemented, nearly every system is out of date compared to
the best thinking of its developers. That is, some technical debt
exists. Perhaps, in hindsight, you wish you had chosen a different
architecture. Risks can help you decide how bad it will be if you keep
your current architecture.


George Fairbanks is a software developer, designer, and architect living in New York City.

+1-303-834-7760 (Recruiters: Please do not call)
Twitter: @GHFairbanks