Architecture refactoring

Here is a short excerpt from my forthcoming book on software architecture:

Over time, a team understands increasingly well how a system should be designed. This is true regardless of which kind of process a team follows (e.g., waterfall or iterative processes). In the beginning, they know and understand less. After some work (design, prototyping, iterations, etc.) they have better-grounded opinions on suitable designs.

Once they recognize that their code does not represent the best design (e.g., by detecting code smells), they have two choices. One is to ignore the divergence, which yields technical debt. If that debt is allowed to accumulate, the system will become a Big Ball of Mud. The other is to refactor the code, which keeps it maintainable. This second option is well described by Brian Foote and William Opdyke in their patterns on software lifecycle.

Refactoring, by definition, means redesign, and its scale can vary. Sometimes a refactoring involves just a handful of objects or some localized code. Other times it involves more sweeping architectural changes, which is called architecture refactoring. Since little published guidance exists, architecture refactoring is generally performed ad hoc.

Example: Rackspace

The Rackspace company manages hosted computers that serve email. Customers call for help when they experience problems. To help a customer, Rackspace must search the log files that record what happened during email processing. Because the volume of email they handled kept increasing, Rackspace built three successive systems to handle the customer queries.

Version 1: Local log files

The first version of the program was simple. There were already dozens of email servers generating log files. Rackspace wrote a script that would use ssh to connect to each machine and execute a grep query on the mail log file. Engineers could control the search results by adjusting the grep query.
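A minimal sketch of what such a script might look like, written in Python rather than whatever Rackspace actually used. The host names, the log path `/var/log/mail.log`, and the function names are all assumptions made for illustration; the source only says that the script used ssh and grep.

```python
import shlex
import subprocess
from typing import Callable, Dict, List

# Hypothetical path: the source does not say where the mail log lives.
DEFAULT_LOG_PATH = "/var/log/mail.log"


def build_grep_command(host: str, pattern: str,
                       log_path: str = DEFAULT_LOG_PATH) -> List[str]:
    """Build the argv that runs grep on one remote host over ssh."""
    # ssh hands the remote command to a shell, so quote both arguments.
    # "-e" keeps a pattern that starts with "-" from being read as an option.
    remote = f"grep -e {shlex.quote(pattern)} {shlex.quote(log_path)}"
    return ["ssh", host, remote]


def run_command(argv: List[str]) -> str:
    """Run one command and return its standard output."""
    result = subprocess.run(argv, capture_output=True, text=True, timeout=30)
    return result.stdout


def search_all(hosts: List[str], pattern: str,
               runner: Callable[[List[str]], str] = run_command
               ) -> Dict[str, List[str]]:
    """Query every host in turn and collect the matching lines per host."""
    matches: Dict[str, List[str]] = {}
    for host in hosts:
        output = runner(build_grep_command(host, pattern))
        matches[host] = output.splitlines()
    return matches
```

Note that every search touches every email server, one after another, so the cost of a query grows with both the number of servers and the number of searches.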

This version initially worked well, but over time the number of searches increased and the overhead of running those searches was noticeable on the email servers. Also, it required an engineer to perform the search rather than a support tech.

Version 2: Central database

The second version addressed the drawbacks of the first version by moving the log data off of the email servers and by making it searchable by support techs. Every few minutes, each email server would send its recent log data to a central machine where it was loaded into a relational database. Support techs had access to the database data via a web-based interface.

Rackspace was now handling hundreds of email servers so the volume of log data was significant. Their challenge was to get the log data into the database as quickly and efficiently as possible. They settled on bulk record insertion into merge tables, which enabled loading of the data in only two or three minutes. Only three days of logs were kept so that the database size would not hinder performance.
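The load-then-expire cycle can be sketched as follows. This is an illustration only: it uses Python's built-in `sqlite3` module in place of Rackspace's database and its merge tables, and the table name, column names, and function names are hypothetical. Only the bulk insertion and the three-day retention window come from the text.

```python
import sqlite3
from datetime import datetime, timedelta
from typing import Iterable, Tuple

# From the text: only three days of logs were kept.
RETENTION = timedelta(days=3)

# (ISO-8601 timestamp, host name, log message); a hypothetical row layout.
LogRow = Tuple[str, str, str]


def open_db(path: str = ":memory:") -> sqlite3.Connection:
    """Open the database and create the (hypothetical) log table."""
    conn = sqlite3.connect(path)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS mail_log ("
        "logged_at TEXT NOT NULL, host TEXT NOT NULL, message TEXT NOT NULL)"
    )
    conn.execute(
        "CREATE INDEX IF NOT EXISTS idx_mail_log_time ON mail_log (logged_at)"
    )
    return conn


def load_batch(conn: sqlite3.Connection, rows: Iterable[LogRow],
               now: datetime) -> int:
    """Insert one batch in a single transaction, then drop expired rows.

    Returns the number of rows left in the table.
    """
    cutoff = (now - RETENTION).isoformat(timespec="seconds")
    with conn:  # commits on success, rolls back on error
        conn.executemany(
            "INSERT INTO mail_log (logged_at, host, message) VALUES (?, ?, ?)",
            rows,
        )
        # ISO-8601 strings in one format sort in chronological order.
        conn.execute("DELETE FROM mail_log WHERE logged_at < ?", (cutoff,))
    return conn.execute("SELECT COUNT(*) FROM mail_log").fetchone()[0]
```

Both the loading and the searching happen on the same database, so they compete for the same CPU and disk.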

This system too encountered problems. The database server was a single machine and, because of the constant loading of data and heavy query volume, it was pushed to its limit with heavy CPU and disk loads. Wildcard searches were prohibited because of the extra load they put on the server. As the amount of log data grew, searches became slower. The server encountered seemingly random failures that were increasingly frequent. Dropped log data was gone forever since it was not backed up. These problems led to a loss of confidence in the system.

Version 3: Indexing cluster

The third version addressed the drawbacks of the second by saving log data into a distributed file system and by parallelizing the indexing of log data. Instead of running on a single powerful machine, it uses ten commodity machines. Log data from the email servers is streamed into the Hadoop Distributed File System, which keeps three copies of everything on different disks. At the time of their report, it had over six terabytes of data spanning thirty disk drives, which represents six months of search indexes.

Indexing is performed using Hadoop, which divides (or “maps”) the input data, schedules it to be indexed in jobs, then combines (or “reduces”) the partial results into a complete index. Jobs run every ten minutes and take about five minutes to complete, so a log entry that arrives just after a job starts waits about ten minutes for the next job and five more for it to finish; index results can therefore be up to about fifteen minutes stale. Rackspace is able to index over 140 gigabytes of log data per day and has executed over 150,000 jobs since they started the system.
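To make the map and reduce steps concrete, here is a single-process sketch in plain Python. It does not use Hadoop, and the kind of index it builds (a simple inverted index from each term to the log slices that contain it) is an assumption; the source does not describe the structure of Rackspace's index. All function names are hypothetical.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Set, Tuple


def map_split(split_id: str, lines: Iterable[str]) -> List[Tuple[str, str]]:
    """Map step: emit (term, split_id) pairs for one slice of the log."""
    pairs: List[Tuple[str, str]] = []
    for line in lines:
        for term in line.lower().split():
            pairs.append((term, split_id))
    return pairs


def reduce_pairs(pairs: Iterable[Tuple[str, str]]) -> Dict[str, List[str]]:
    """Reduce step: merge the partial results into one index."""
    index: Dict[str, Set[str]] = defaultdict(set)
    for term, split_id in pairs:
        index[term].add(split_id)
    # Sort the ids so the result is deterministic.
    return {term: sorted(ids) for term, ids in index.items()}


def build_index(splits: Dict[str, List[str]]) -> Dict[str, List[str]]:
    """Run the map step on every split, then reduce once.

    In a real cluster the map calls would run in parallel on different
    machines; here they run one after another.
    """
    pairs: List[Tuple[str, str]] = []
    for split_id, lines in splits.items():
        pairs.extend(map_split(split_id, lines))
    return reduce_pairs(pairs)
```

Because each call to `map_split` depends only on its own slice of the input, the work can be spread across as many machines as there are slices, which is what lets the third version scale where the single database server could not.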

As in the second system, support techs have access via a web interface much like a web search engine interface. Query results are available within seconds. When engineers think up new questions about the data, they can write a new kind of job and have their answer within a few hours.

Architecture Refactoring

This example is best thought of as architecture refactoring. Each refactoring of the architecture was precipitated by a pressing failure risk. Object-level refactorings take a negligible amount of time and therefore need little justification, so you should just go ahead and rename that variable to be more expressive of its intent. An architecture refactoring is expensive, so it requires a significant risk to justify it.

Two important lessons are apparent. First, *design does not exclusively happen up-front*. It is often reasonable to spend time up-front making the best choices you can, but it is optimistic to think you know enough to get all those design decisions right. You should anticipate spending time designing after your project’s inception.

Second, *failure risk can guide architecture refactoring*. By the time it is implemented, nearly every system is out of date compared to the best thinking of its developers. That is, some technical debt exists. Perhaps, in hindsight, you wish you had chosen a different architecture. Risks can help you decide how bad it will be if you keep your current architecture.