The more I deal with infrastructure architecture, the more I think that somewhere along the way we have overspecialized. There are too many architects who have never lived with a system in production or spent time on an operations team. Likewise, there are a lot of operations people who insulate themselves from the specification and development of systems for which they will ultimately take responsibility.
The net result is a poor fit between hardware and software, and the overall availability of the application suffers.
Here’s a recent example.
We’re trying to address the general issue of flowing data from production back into pre-production systems: QA, production support, development, and staging. The first attempt took six days to complete. Since the requirements of the QA environment stipulate that the data should be no more than one week out of date relative to production, that’s a big problem. On further investigation, it turned out that the DBA executing the process spent most of that time running scp from one host to another. It’s a lot of data, so in one respect 10-hour copies are reasonable.
But the DBA had never been told about the storage architecture. That’s the domain of a separate "enterprise service" group. They are fairly protective of their domain and do not often allow their architecture documents to be distributed. They want to reserve the right to change them at will. Now, they will be quite helpful if you approach them with a storage problem, but the trick is knowing when you have a storage problem on your hands.
You see, the servers that the DBA was copying files from and to are all on the same SAN. An scp from one host on the SAN to another host on the SAN pushes the data across the network between machines that already share the same storage, which is pretty redundant.
There’s an alternative solution that involves a few simple steps: take a database snapshot onto a set of disks with mirrors, split the mirrors, join them onto another set of mirrors, and then do an RMAN "recovery" from that snapshot into the target database. Total execution time is about four hours.
From six days to four hours, just by restating the problem to the right people.
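To put numbers on why the six-day refresh was such a problem, here is a back-of-the-envelope sketch. It assumes (my simplification, not something measured) that the copy reflects production as of the moment the refresh starts, so the data is already as old as the refresh took by the time it lands in QA. The function name `usable_window` is just illustrative.

```python
from datetime import timedelta

# The QA requirement from above: data no more than one week behind production.
MAX_STALENESS = timedelta(days=7)


def usable_window(refresh_duration: timedelta,
                  max_staleness: timedelta = MAX_STALENESS) -> timedelta:
    """How long a refreshed copy stays within the staleness limit.

    Assumption: the copy reflects production as of the start of the refresh,
    so its age when the refresh finishes equals the refresh duration.
    """
    return max(max_staleness - refresh_duration, timedelta(0))


scp_based = timedelta(days=6)        # the first attempt
snapshot_based = timedelta(hours=4)  # the split-mirror approach

print(usable_window(scp_based))       # 1 day, 0:00:00
print(usable_window(snapshot_based))  # 6 days, 20:00:00
print(scp_based / snapshot_based)     # 36.0
```

Under that assumption, the scp-based refresh leaves QA with one day of compliant data before the next refresh is already overdue, while the snapshot-based one leaves almost the whole week.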
This is not intended to criticize any of the individuals involved. Far from it: they are all top-notch professionals. But the solution required merging the domains of knowledge from these two groups, and the organizational structure explicitly discouraged that merging.
Another recent example.
One of my favorite conferences is the Colorado Software Summit. It’s a very small, intensely technical crowd. I sometimes think half the participants are also speakers. There’s a year-round mailing list for people who are interested in, or have been to, the Summit. These are very skilled and talented people, easily in the top 1% of the software development field.
Even there, I occasionally see questions about how to handle things like transparent database connection failover. I’ll admit that’s not exactly a journeyman topic. Bring it up at a party and you’ll have plenty of open space to move around in. What surprised me is that there are some fairly standard infrastructure patterns for enabling database connection failover that weren’t known to people with decades of experience in the field. (E.g., cluster software reassigns ownership of a virtual IP address to one node or the other, with all applications using the virtual IP address for connections.)
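From the application's side, the virtual IP pattern can be sketched roughly like this. It is a minimal illustration, not any particular driver's or cluster product's API: the address `10.0.0.50`, the `connect_with_retry` helper, and the `make_flaky_connect` stand-in are all made up for the example. The point is that the application only ever knows one address and simply retries it while the cluster software moves ownership to the surviving node.

```python
import time
from typing import Callable, Optional, TypeVar

T = TypeVar("T")

# Hypothetical virtual IP; the cluster software decides which node owns it.
DB_VIRTUAL_IP = "10.0.0.50"


def connect_with_retry(connect: Callable[[str], T],
                       address: str = DB_VIRTUAL_IP,
                       attempts: int = 5,
                       delay_seconds: float = 1.0,
                       sleep: Callable[[float], None] = time.sleep) -> T:
    """Keep retrying the same address until a node answers or attempts run out.

    `connect` stands in for whatever the real database driver provides.
    """
    last_error: Optional[Exception] = None
    for attempt in range(1, attempts + 1):
        try:
            return connect(address)
        except ConnectionError as exc:
            last_error = exc
            if attempt < attempts:
                sleep(delay_seconds)  # give the cluster time to move the address
    raise ConnectionError(
        f"could not reach {address} after {attempts} attempts"
    ) from last_error


def make_flaky_connect(failures: int) -> Callable[[str], str]:
    """Simulate a failover: the first `failures` connection attempts are refused."""
    calls = {"count": 0}

    def connect(address: str) -> str:
        calls["count"] += 1
        if calls["count"] <= failures:
            raise ConnectionError("virtual IP not yet reassigned")
        return f"connected to {address}"

    return connect


if __name__ == "__main__":
    # Two refused attempts, then the surviving node owns the address.
    print(connect_with_retry(make_flaky_connect(2), delay_seconds=0.1))
```

Nothing in the application names a physical database host, which is what makes the failover transparent: the retry loop has no idea which node ended up answering.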
This tells me that we’ve overspecialized, or at least that the groups are not talking nearly enough. I don’t think it’s possible for one person to be an expert in high availability, infrastructure architecture, enterprise data management, storage solutions, OOA/D, web design, and network architecture all at once. Somehow, we need to find an effective way to create joint solutions, so that we have neither software developed in complete ignorance of its deployment architecture nor infrastructure investments that the software cannot use. We need closer ties between operations, architecture, and development.