On Organizing People and Work at a VoIP Service Provider

VoIP service providers these days face two technical challenges: huge flexibility, and the lack of any single integrated solution with interop-tested partner devices. You can't just buy a "switch", plug in some TDM/SONET transport and turn up "smart remotes" made by the switch manufacturer.

Even integrated VoIP systems like MetaSwitch leave a lot of design space:

-- What signaling protocols?

-- Which of the many packet networks will be used (Ethernet-only, DSL, a mix of IP transports)?

-- How will high-performance network capacity be ensured? (Prioritization? Reserved bandwidth? Classification by DSCP? By port number?)

-- How will unauthorized access be prevented? (SBC? Firewall? No default route?)

-- What CPE will be used? And which features will be used? Which signaling variants will be used for, say, putting calls on hold?

-- How will you detect faults? Dry alarm contacts? End-to-end testing? SNMP traps? SNMP polling? Threshold/Statistical-norm monitoring?
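To make that last option concrete, here is a minimal sketch of the difference between threshold monitoring and statistical-norm monitoring. It is an illustration only: the function names, the calls-per-minute metric, the sample readings, and the 3-sigma cutoff are all assumptions of mine, not anything a particular product does.

```python
from statistics import mean, stdev


def over_threshold(sample, limit):
    """Threshold monitoring: alarm only when a fixed limit is exceeded."""
    return sample > limit


def out_of_norm(history, sample, k=3.0):
    """Statistical-norm monitoring: alarm when the sample is more than
    k standard deviations away from the mean of recent history."""
    if len(history) < 2:
        return False  # not enough data to estimate a norm
    mu = mean(history)
    sigma = stdev(history)
    if sigma == 0:
        return sample != mu  # perfectly flat history: any change is unusual
    return abs(sample - mu) > k * sigma


# Hypothetical calls-per-minute readings from a normal period.
history = [100, 104, 98, 101, 97, 102]

print(over_threshold(250, limit=500))  # False: still under the fixed limit
print(out_of_norm(history, 250))       # True: far outside the recent norm
print(out_of_norm(history, 20))        # True: a sudden drop is flagged too
```

The point of the comparison is that a fixed limit only catches the failures you anticipated, while a norm-based check also flags "this is weird for us", including traffic that suddenly disappears.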

My point is that there's a lot to it. Once you outgrow trivially small, you need to split up the work.

I've been brainstorming about ways to organize people to design and operate a VoIP carrier. If you need more than one person, then there's a question: how do you divide work among the people?

This reminds me of Fred Brooks's essays on organizing the development of large software systems. Designing and operating a voice network is somewhat more constrained than building a large software system. But it's still important to have overall coherence.

I wonder how carriers would think about this if VoIP were delivered in a few 84-inch-tall cabinets with opaque slide-in cards? I.e., what if it looked like a 5ESS or DMS?

But VoIP isn't built that way. A cabinet at a VoIP carrier might have equipment from half a dozen independent vendors. And the carrier has to deal with them separately.

Below are some models I've considered. I evaluate the models with these factors in mind:

(a) Does any ONE person within the technical team claim responsibility for correct operation of the whole product/service? This is critical.

(b) Is there normally enough work to keep a person busy? ...intellectually stim-yoo-lated?

(c) Can the model accommodate staff turnover, vacation, and around-the-clock emergency work?

(d) Is authority to design the system and fix problems vested in one group, or spread out all over the place?

Though competent engineers should be able to run the whole platform, these tasks are fairly easy to factor out of the engineering group:

(a) Watch the NOC screen and determine when something seems wrong

(b) Take tier-1 customer calls/complaints

(c) On-site installation

One thing I don't consider here is busywork-avoidance: how do you ensure engineers spend their time doing important things? That's hard, because people are really good at thinking of minor enhancements and seeking input. For example, if I'm allowed to spend three hours writing a blog article, it might end up really long and thorough, but it might only be 20% better than the one-hour article. But imagine if I'm given all day to write it! It might be a full 30% better than the one-hour article. But it WILL be better.

If this tendency to comprehensiveness is allowed to run unrestrained, engineers will never get a product off the ground. There's always a new way to look at the problem and build the system.

Alternatively, we might have a reasonable understanding of the problem, but one more group conference call might make things slightly better.

I call work that doesn't lead to significantly better results "busywork". And I don't really address it directly in the organizational structures described here.

------------

The Integrated Model: everybody in the team can do everything. There's a senior-most engineer overall, but he can delegate any task to any member of the group. Some members have more experience on a specific topic. The compensation structure encourages teaching each other, but encourages doing the work yourself even more.

Advantages:

-- Every engineer understands the whole system, and can work on any problem

-- Often intellectually satisfying to engineers

-- Ensures whole team stays up-to-date and no member becomes obsolete

-- Senior-most engineer has overall responsibility for a working design

-- Scales as overall workload increases, since any added engineer can pick up any task

Limitations:

-- Requires fearless, hard-working staff

-- Strong leadership required to avoid over-specialization

-- Long training process

-- Engineers might feel like jacks of all trades, but masters of none.

-- Certain tasks have to be arbitrarily assigned. E.g., who will monitor for critical software updates? Otherwise these tasks might go un-done.

-------

Layered model: Responsibility is broken into layers mimicking the network. Somebody is responsible for layers 1-3 (cables, switches, transport, routing), somebody else is responsible for the next layer up (operating systems, loosely "layer 4"), and somebody else does the application layer.

Advantages:

-- The top-layer application engineer has responsibility for the overall product.

-- Exploits deep expertise in specializations. E.g., the network guy understands networking regardless of who manufactured the device; the application guy knows RTP, codecs, and how the whole real-time audio process works.

-- Encourages cross-platform experience. E.g., the SIP guy knows SIP across all platforms.

-- Once the layers are divided up amongst the staff, the responsibilities are well defined because they match the software/system responsibilities.

Limitations:

-- The top-layer application engineer may not have knowledge/resources to troubleshoot all problems, since he's specialized in only one area.

-- Some layers won't have enough to do, while other layers will be overworked. So this model may only make sense if you have a group large enough that even the person in the slowest job stays busy.

-- ...but it's not clear how to divide work as you scale up. E.g., what if there's too much work for one application-layer guy?

---------

Device model: each physical object is "owned" by somebody, who's responsible for everything on it.

Advantages:

-- Neatly fits asset ownership within an organization (maybe)

-- Engineers can have deep experience in their devices. E.g., Cisco cert here, Nortel cert there

-- Easy to point fingers...I mean, divide responsibility...corresponding to physical interfaces on the devices.

Limitations:

-- Wastes expertise. E.g., if one guy owns a Cisco router and another owns a Nortel router, you've got two people with largely overlapping experience, each with half a job, and probably bored with it.

-- Nobody really understands the whole product

-- Easy to point fingers...I mean, divide responsibility...corresponding to physical interfaces on the devices. "The packet is leaving my box...it must be on your end."

---------

The SME model: the system is broken into specialties by subject matter, and individuals are assigned to be the subject-matter experts (SMEs) in those areas. Specialties might include: SBC, App Server, Net Server, Billing, Provisioning, Client Call Control, PSTN access, SIP, MGCP, fault detection, etc.

Advantages:

-- Promotes deep expertise in the subject matter

-- Matches the academic model of dividing up research topics

Limitations:

-- Very weak on design -- no engineer has overall design authority or comprehension

-- SMEs may know their area in the abstract, but have trouble with the overall application

-- Being an SME may not be enough to keep a person busy

----

The Knuth model (SME-graph): Like the SME model, but each specialty has at least two experts, and each expert has at least two specialties. The purpose is to ensure all the specialties are connected, so it's best to require dissimilar expertise.
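The two rules, plus the "all connected" goal, are easy to check mechanically. Here is a minimal sketch, assuming the assignments are kept as a simple mapping from each engineer to a set of specialties. The function name, the engineer names, and the sample assignments are hypothetical, made up for illustration.

```python
def check_sme_graph(assignments, all_specialties):
    """Return a list of rule violations (an empty list means the plan is OK).

    assignments: dict mapping each expert to a set of specialties.
    all_specialties: every specialty that needs coverage.
    """
    problems = []
    experts_by_specialty = {s: set() for s in all_specialties}

    # Rule 1: each expert has at least two specialties.
    for expert, specialties in sorted(assignments.items()):
        if len(specialties) < 2:
            problems.append(f"{expert} has fewer than two specialties")
        for s in specialties:
            experts_by_specialty.setdefault(s, set()).add(expert)

    # Rule 2: each specialty has at least two experts.
    for s, experts in sorted(experts_by_specialty.items()):
        if len(experts) < 2:
            problems.append(f"{s} has fewer than two experts")

    # Goal: all specialties are connected. Two specialties are linked
    # when some expert holds both; walk the graph from one specialty.
    if experts_by_specialty:
        start = min(experts_by_specialty)
        seen = {start}
        stack = [start]
        while stack:
            current = stack.pop()
            for expert in experts_by_specialty[current]:
                for other in assignments[expert]:
                    if other not in seen:
                        seen.add(other)
                        stack.append(other)
        unreached = sorted(set(experts_by_specialty) - seen)
        if unreached:
            problems.append(
                f"not connected to {start}: " + ", ".join(unreached)
            )
    return problems


# Hypothetical three-person team covering three specialties.
team = {
    "alice": {"SBC", "Billing"},
    "bob": {"SBC", "SIP"},
    "carol": {"SIP", "Billing"},
}
print(check_sme_graph(team, ["SBC", "SIP", "Billing"]))  # []
```

Note that the two counting rules alone don't guarantee connectivity: two pairs of engineers who each share the same two specialties satisfy both rules but form two islands, which is why dissimilar expertise matters.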

Advantages:

-- Overall system is owned by the whole team, so decisions aren't made in isolation

Limitations:

-- No single technical mind is responsible for overall product operation and uniform design. This may result in an overly complicated, fragile system.

---------

Network Region Model: The network is divided into segments, and engineers are assigned regions. Then they do everything within their region. E.g., one engineer might handle the customer access network, another handles the outside of the SBC (i.e. the core network part connected to the Internet), and another handles the inside of the core.

Advantages:

-- Engineers work across functions within their network region, e.g., physical layer, switching, routing, application, performance, etc.

Limitations:

-- Nobody technical owns the whole product