On June 23, 2008, I'll be reprising my popular Incident Command for IT: What We Can Learn from the Fire Department talk for the O'Reilly Velocity Web Performance and Operations Conference. From the blurb for my talk:

The Incident Command System (ICS) is used by public safety agencies nationwide (fire departments, police departments, Coast Guard, etc.) to manage emergency responses to events ranging from single-vehicle car crashes to wildfires involving thousands of personnel. It provides a standardized organizational structure and set of operating principles for quickly and effectively coordinating the efforts of multiple parties in response to an evolving incident even as the response changes in scope, scale, and focus.

In this talk, Brent will introduce the concepts and principles of ICS, and discuss how these can be applied to IT events, such as security incidents and service outages.

As I've mentioned before, besides my professional work in the networking field, I do a lot of volunteer emergency services work. For example, I used to be one of only about 40 fully-qualified air search and rescue Incident Commanders in the California Wing of the Civil Air Patrol, and I currently help teach community disaster preparedness classes for the Mountain View Fire Department. So, I have a fair understanding of the tools (methods, structures, and principles) that such agencies use to organize themselves to deal with emergencies, and I've long pondered how some of those tools could be applied to emergencies in information technology.

The conference program looks pretty interesting; I'm looking forward to sitting in on several of the other talks.

Velocity, the Web Performance and Operations Conference 2008

A few weeks ago, I installed a NetGear ReadyNAS NV+ file server appliance for my home office. So far, I'm loving it!

For about $1000, I've got a 1TB RAID fileserver. Since mine is currently configured with two 500GB drives, I've got 500GB of useable RAID-protected space. There are 2 still-open SATA drive slots, to which I can add more drives; if I use 500GB drives, I could get to 2TB of total space, 1.5TB of that being useable and RAID-protected. When I want to add more drives, I just plug them in; the box will take care of all the details of configuring the new drives, updating the RAID configuration, shuffling data as necessary between the drives, etc. Furthermore, the new drives don't have to be identical to the existing ones; they can be whatever size SATA drive offers the best price-performance at the time, and the box will "do the right thing" to maximize the amount of RAID-protected space available to me.
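The capacity figures above follow a simple pattern: with equal-size drives and protection against a single drive failure, one drive's worth of space goes to redundancy. Here is a minimal sketch of that arithmetic in Python. The function name is made up for illustration, and it deliberately handles only equal-size drives; how the box allocates space across mixed-size drives is not something this sketch tries to model.

```python
def usable_capacity_gb(drive_sizes_gb):
    """Usable space with single-drive redundancy, assuming equal-size drives.

    Hypothetical helper for illustration only; it is not the appliance's
    own algorithm, and it does not model mixed-size drives.
    """
    if len(drive_sizes_gb) < 2:
        raise ValueError("need at least two drives for redundancy")
    if len(set(drive_sizes_gb)) != 1:
        raise ValueError("this sketch only handles equal-size drives")
    # One drive's worth of space is consumed by redundancy (mirror or parity).
    return sum(drive_sizes_gb) - drive_sizes_gb[0]


print(usable_capacity_gb([500, 500]))  # 500  (the current two-drive setup)
print(usable_capacity_gb([500] * 4))   # 1500 (all four slots filled)
```
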

The device includes a 10/100/1000 Ethernet port, and supports Jumbo Frames for high performance. Management is through a fairly nice web interface. It will serve files via NFS, SMB (for Windows), AFP (for Macintosh), and even HTTP (for convenient access from elsewhere). It also supports access via rcp/rsync, both inbound and outbound. You can set up user accounts on the device itself, or tie it in to a Windows domain or Active Directory server.

Other cool features include the built-in iTunes and Slingbox servers, so that if you store your music on the box, it's available to iTunes or Slingbox anywhere on your LAN.

The device includes 3 USB ports (2 on the back, 1 on the front), which you can use for a number of things:

  • UPS monitoring -- if you connect a supported UPS, it will monitor the UPS status, and automatically do a clean shutdown if the batteries run low
  • Printer sharing -- enables you to share a USB printer among computers on your LAN
  • External hard disk -- for backups. You can even set up a button on the front to enable "one-touch" backups to an external hard disk, which you can then store off-site.

Setup was fairly straightforward. I moved all my stuff from external drives on my laptop (backups and iTunes library) to the server, and have been running nightly backups to it for the past couple of weeks. I haven't done any careful performance benchmarking, but Retrospect backups from my Mac laptop via AFP over a GigE connection seem to be about twice as fast as from that same laptop to an external disk via FireWire.

The device is reasonably quiet; certainly no noisier than the pair of external FireWire disks for my Mac that I'd previously been using.

One of the things I like about the server is that you can tell it to email you (at multiple addresses) about any significant events. That's cool, since it means I don't have to periodically scan the logs. I have it set up to email directly to my phone, as well as my normal email address.

About the only annoyance so far is that "UPS activated/deactivated" is one of the "significant events" that it emails about. Unfortunately, since my UPS is on the same electrical circuit as my laser printer, whenever the laser printer warms up (i.e., the first time I print something when the laser printer has been idle for more than a couple of hours), the UPS trips for a couple of seconds, and I get two messages to my phone in quick succession: "UPS activated" and "UPS deactivated".

There's a very detailed review, including lots of photos, screenshots, and the like, available at Barry's Rigs 'n Reviews. I bought my unit from eAegis because they offered a good price that included initial setup and burn-in testing at their end. The product is also available from Amazon in various configurations.

Great Circle has moved to Alameda

Great Circle Associates has moved once again, this time to Alameda. We had fun in San Francisco, but hopefully this will be our last move for a while! (Though, of course, helping clients move is one of our key consulting services.)

We're still serving clients in Silicon Valley and throughout the rest of the world.

Our new contact info (our mailing address has changed, but our phone numbers remain the same):

Great Circle Associates, Inc.
2608 Buena Vista Ave.
Alameda, CA 94501

WWW: www.greatcircle.com
USA Toll Free: 877 GRT CRCL (877 478 2725)
International: +1 415 861 3588
Fax: +1 415 552 2982

Managing patch cables


A colleague recently asked if I had any recommendations for software to manage lists of patch cables and ports, and I think my answer might have surprised him...

There are a variety of Visio add-ons and standalone tools available, and while I've often found tools like that useful for planning initial installations, they haven't been so useful for ongoing maintenance. The problem is, whatever documentation you create gets out of date pretty quickly, unless you're very disciplined about it, which almost nobody is...

Instead, I've found it most useful to simply follow good cable management practices:

  • Labelling both ends of all cables with a unique identifier, but not what it's currently used for (because that will inevitably change, and the only thing worse than no label is an incorrect one)
  • Always taking the time to dress cables in neatly, rather than draping them haphazardly.
  • Removing cables when you disconnect one end, rather than just leaving them hanging.
  • Using easy-to-change physical cable management systems, like clips and velcro, rather than hard-to-change systems like cable ties, so that it's easy to "do it right" (see the above 2 points).
  • Using cables of just the right length, rather than too-long cables whose excess you then somehow have to manage (or too-short cables with a patch hidden somewhere inaccessible in the middle of the run). This means keeping a selection of cables available in various lengths, so that the right one is available when you need it.
  • Having and religiously following a color coding scheme.

As usual, Limoncelli and Hogan offer good advice on this topic in their indispensable book The Practice of System and Network Administration (chapter 17, especially sections 17.1.7 and 17.1.8).

After Hurricane Katrina last year, I spent some time in Mississippi and Louisiana working on disaster relief efforts, along with many other high-tech professionals. Most of us were part of hastily organized, ad hoc efforts, with little or nothing in the way of pre-event planning, training, and preparation.

I'm now proud to be part of a new organization that is in the process of forming, called TechReach International. TechReach's goal is to deploy no-cost telecommunications services into humanitarian relief efforts (both domestic disaster relief operations, and international humanitarian operations), using trained/certified volunteer communications specialists and state of the art technology.

TechReach will be hosting a "Simulation Day" in Mountain View (on the Intuit campus) on Tuesday, 19 Sep 2006, 12:30pm-6:00pm. We invite you to stop by, meet us, learn about the organization, see demos of various disaster relief communication and networking technologies, hear a variety of interesting speakers, and contemplate joining or supporting us. Full details are on the web site at http://www.techri.org/.

I hope to see you there!

Great Circle Associates has moved to San Francisco, though we're still serving clients in Silicon Valley and throughout the rest of the world.

Our new contact info:

Great Circle Associates, Inc.
519 Duboce Ave.
San Francisco, CA 94117

WWW: www.greatcircle.com
USA Toll Free: 877 GRT CRCL (877 478 2725)
International: +1 415 861 3588
Fax: +1 415 552 2982

LOPSA board member Trey Harris has posted an excellent message outlining his thoughts on effective organization and scheduling for groups of sysadmins in a high-interrupt, high-profile, high-availability environment (Amazon, Google, etc.).

Trey's message was part of a very interesting discussion taking place on the LOPSA "discuss" mailing list regarding "interruptions coverage" for sysadmins. The basic question under discussion is, given that much of system administration work is by nature interrupt-driven, how can an organization best shield some of its sysadmins' time from these interrupts, so that the sysadmins can get long-term work done (and maintain their own sanity!)? To read the whole thread, search for "Interruptions coverage" in the list's archive.

I think that this discussion (and Trey's contribution in particular) is an excellent example of the sort of thoughtful discussions from experienced professionals which you can expect from LOPSA, which is why I'm encouraging everyone involved with system administration to join and support this important new organization.

USENIX has made available an audio recording (in MP3 format) of the Incident Command for IT: What We Can Learn from the Fire Department invited talk (Adobe Acrobat PDF format) that I did at the 2005 LISA conference a couple of weeks ago. You'll want to skip past approximately the first 3 minutes (2 minutes, 56 seconds, to be exact) of the recording, which are silence and administrivia announcements from before the start of the presentation; it would have been nice if USENIX had edited that out, but they didn't.

I'm back home today after two weeks on the road, and it's good to be home. The first week, I was in San Diego for the annual USENIX LISA conference (where LOPSA was a major topic of discussion), then I was home just long enough to do laundry and repack before heading out to New York City for the Interop conference and exhibition (where I was helping Splunk showcase their product in the InteropNet NOC).


At LISA, I participated in the Social Technologies and Advanced Technologies Workshops (small, day-long discussions among senior practitioners who are especially interested in a particular topic), gave an invited talk on Incident Command for IT: What We Can Learn from the Fire Department (Adobe Acrobat PDF version), and chaired a Birds of a Feather session on network automation.

Much of the "hallway track" discussion at LISA, of course, was about the implosion of SAGE and subsequent formation of LOPSA, the League of Professional System Administrators. I think that LOPSA has a chance to become a world-class organization, but the first year or so is going to be critical, and it isn't going to happen if everybody takes a "wait and see" attitude. What's going to matter most are money and membership numbers. To try to make sure LOPSA reaches critical mass, I've joined as a Platinum Individual Sponsor, and I encourage others to join/donate/volunteer as they are able.

Interop and Splunk

At Interop, I was working in the NOC for the InteropNet show network, helping Splunk showcase their product. Splunk's product is a really powerful troubleshooting tool for interactively working your way through huge volumes of any sort of text-based system logs (syslog, SNMP traps, whatever). It is designed as a tool for use by smart people who understand what the log messages mean, if only they could wade through the flood; the software doesn't try to understand the messages itself, just makes it easier for a system/network/security administrator to navigate through the flood of messages and find the needle in the haystack. There's a very powerful free version of their software (the paid version adds various features, but the free version is fully functional by itself) available from the Splunk web site; check it out!

If you are going to be at the USENIX/SAGE LISA conference in San Diego in early December, I've scheduled a Network Automation BoF (Birds of a Feather session, where folks interested in a particular topic get together to chat about it) for Thursday night, 8 December 2005, 8:00-9:00pm (right after the conference reception). Right now, they've got us scheduled in Garden Salon 1, but that's subject to change, so check the scheduling board at the conference. I hope to see you there!

BoF info:

Automating Network Configuration & Management

Organizer/Moderator: Brent Chapman, Great Circle Associates
Thursday, 8 December 2005, 8:00 pm-9:00 pm, Garden Salon 1

What's the state of the art for automated network configuration and management? What systems and tools are available, either freely or commercially? Where are these issues being considered and discussed?

Over the last 15 years or so, much of the research in the system administration field has focused on automation. It's now well accepted that a well-run operation doesn't manage 10,000 servers individually, but rather uses tools like cfengine to manage definitions of those servers and then create instances of those servers as needed. In the networking world, though, most of us seem to be still manually configuring (and reconfiguring) every device.

Further info: