Rob England: Problem Management Defined

Railways (railroads) remind us of how the real world works.

In our last article, we left Cherry Valley, Illinois in its own little piece of hell.

For those who missed the article, in 2009 a Canadian National railroad train carrying eight million litres of ethanol derailed at a level crossing in the little town of Cherry Valley after torrential rain washed out the roadbed beneath the track. 19 tankers of ethanol derailed, 13 of them split or spilled, and the mess somehow caught fire in the downpour.

One person in the cars waiting at the crossing died and several more were seriously injured.

Incidents vs. Problems

In that previous article we looked at Incident Management. As I said then, an incident is an interruption to service and a problem is an underlying cause of incidents. Incident Management is concerned with the restoration of expected levels of service to the users. Problem Management is concerned with removing the underlying causes. I also mentioned that ITIL doesn’t see it that crisply delineated. Anyway, let us return to Cherry Valley…

One group of people worked inside office buildings making sure the trains kept rolling around the obstruction so that the railroad met its service obligations to its users. This was the Incident Management practice: restoring service to the users, focusing on perishable deliveries such as livestock and fruit.

Another group thrashed around in the chaos that was Cherry Valley, trying to fix a situation that was very very broken. Their initial goal was containment: save and treat people in vehicles, evacuate surrounding houses, stop the fire, stop the spills, move the other 100 tank-cars of ethanol away, get rid of all this damn flooding and mud.

The Shoo-fly

The intermediate goal was repair and restore: get trains running again. Often this is done with a “shoo-fly”: a temporary stretch of track laid around the break, which trains inch gingerly across whilst more permanent repairs are effected. This is not a Workaround as we use the term in ITSM. The Workaround was to get trains onto alternate routes or pass freight to other companies. A shoo-fly is temporary infrastructure: it is part of the problem fix just as a temporary VM server instance would be. While freight ran on other roads or on a shoo-fly, they would crane the derailed tankers back onto the track or cart them away, then start the big job of rebuilding the road-base that had washed away – hopefully with better drains this time – and relaying the track. Compared to civil engineering, our IT repairs look quick, and certainly less strenuous.

Which brings us to the longer-term goal: permanent remediation of the problem. Not only does the permanent fix include new rail roadbed and proper drainage; the accident report makes it clear that CN’s procedures and communications were deficient as well. Cherry Valley locals were calling 911 an hour beforehand to report the wash-out.

Damage Limitation

We will talk more about the root causes and long term improvement later. Let’s stay in Cherry Valley for now. It is important to note that the lives and property the emergency responders were saving were unconnected to the services, users or customers of the railroad. All the people working on all these aspects of the problem had only a secondary interest in the timeliness of pigs and oranges and expensive petrol. They were not measured on freight delivery times: they were measured on speed, quality and permanence of the fix, and prevention of any further damage.

If you read the books and listen to the pundits you will get more complex models that seem to imply that everything done until trains once more rolled smoothly through Cherry Valley is Incident Management. I beg to differ. To me it is pretty clear: Incident and Problem practices are delineated by different activities, teams, skills, techniques, tools, goals and metrics. Incident: user service levels. Problem: causes.

While I am arguing with ITIL definitions, let’s look at another aspect of Incidents. ITIL says that something broken is an Incident if it could potentially cause a service interruption in future. Once again this ignores the purpose, roles, skills and tools of Incident Management and Problem Management. Such a fault is clearly a Problem, a (future) cause of an Incident.

(Incidentally, it is hard to imagine many faults in IT that aren’t potentially the cause of a future interruption or degradation of service. If we follow this reasoning to its absurd conclusion, every fault is an incident and nothing is a problem).

Perhaps one reason ITIL hangs these “potential incidents” where it does is another odd definition: ITIL says a Problem is the cause of “one or more incidents”. What’s odd about that? ITIL promotes proactive (better called pre-emptive) problem management, and yet apparently we need to wait until something causes at least one incident before we can start treating it as a problem. I think the washout in Cherry Valley was a problem long before train U70691-18 barrelled into town. (Actually, proactive problem management was dropped from ITIL V3, but it was hastily restored in ITIL 2011.)
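
The distinction argued here can be sketched as a small data model. What follows is a minimal illustration in Python, not anything taken from ITIL or from a real ITSM tool: the class names, the fields and the example records are all assumptions made for the sake of the sketch. It shows a Problem record that exists with no Incidents attached, and two practices that are measured on different things.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Incident:
    """An interruption to service (hypothetical record, for illustration)."""
    service: str
    description: str
    service_restored: bool = False  # what Incident Management is measured on


@dataclass
class Problem:
    """An underlying cause of incidents, actual or potential (hypothetical)."""
    cause: str
    incidents: List[Incident] = field(default_factory=list)  # may be empty
    permanently_fixed: bool = False  # what Problem Management is measured on

    def is_preemptive(self) -> bool:
        # True while the problem has not yet interrupted any service.
        return not self.incidents


# The washout is a problem from the moment it is reported,
# before any train has reached it.
washout = Problem(cause="Roadbed washed out at the level crossing")
assert washout.is_preemptive()

# Once the train derails, service is interrupted: now there is an incident too.
late_freight = Incident(
    service="Bulk goods transport",  # illustrative service name
    description="Line closed; deliveries delayed",
)
washout.incidents.append(late_freight)

# Rerouting trains restores service, but the cause is still there.
late_freight.service_restored = True
print(washout.is_preemptive())    # False
print(washout.permanently_fixed)  # False
```

In this sketch, closing the incident does nothing to the problem record: the two are linked, but each has its own goal and its own finish line.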

Human Eyeball

One of my favourite railroad illustrations is about watching trains. When a train rolls by, keep an eye on nearby staff: those on platforms, down by the track, on waiting trains. On most railroads, staff will stop what they are doing and watch the train – the whole train, watching until it has all gone by. In the old days they would wave to the guard (conductor) on the back of the train. Nowadays they may say something to the driver via radio.

Laziness? Sociability? Railfans? Possibly. But quite likely it is part of their job – it may well be company policy that everybody watches every passing train. The reason is visual inspection. Even in these days of radio telemetry from the FRED (Flashing Rear End Device, a little box on the back that replaces the caboose/guardsvan of old) and track-side detectors for cracked wheels and hotboxes (overheating bearings), there is still no substitute for the good old human eyeball for spotting anything from joyriders to dragging equipment. It is everyone’s responsibility to watch and report: not a bad policy in IT either.

What they are spotting are Problems. The train is still rolling so the service hasn’t been interrupted … yet.

Other Problems make themselves known by interrupting the service. A faulty signal stops a train. In the extreme case the roadbed washes away. We can come up with differing names for things that have and haven’t interrupted/degraded service yet, but I think that is arguing about angels dancing on pinheads. They are all Problems to me: the same crews of people with heavy machinery turn out to fix them while the trains roll by delivering they care not what to whom. Oh sure, they have a customer focus: they care that the trains are indeed rolling and on time, but the individual service levels and customer satisfaction are not their direct concern. There are people in cozy offices who deal with the details of service levels and incidents.

Next time we will return to the once-again sleepy Cherry Valley to discuss the root causes of this accident.

Rob England: Incident Management at Cherry Valley, Illinois

It had been raining for days in and around Rockford, Illinois that Friday afternoon in 2009, some of the heaviest rain locals had ever seen. Around 7:30 that night, people in Cherry Valley – a nearby dormitory suburb – began calling various emergency services: the water that had been flooding the road and tracks had broken through the Canadian National railroad’s line, washing away the trackbed.

An hour later, in driving rain, freight train U70691-18 came through the level crossing in Cherry Valley at 36 m.p.h., pulling 114 cars (wagons) mostly full of fuel ethanol – 8 million litres of it – bound for Chicago. Although ten cross-ties (sleepers) dangled in mid-air above running water just beyond the crossing, somehow two locomotives and about half the train bounced across the breach before a rail weld fractured and cars began derailing. As the train tore in half the brakes went into emergency stop. 19 ethanol tank-cars derailed, 13 of them breaching and catching fire.

In a future article we will look at the story behind why one person waiting in a car at the Cherry Valley crossing died in the resulting conflagration, 600 homes were evacuated and $7.9M in damages were caused.

Today we will focus on the rail traffic controller (RTC) who was the on-duty train dispatcher at CN’s Southern Operations Control Center in Homewood, Illinois. We won’t be concerned for now with the RTC’s role in the accident: we will talk about that next time. For now, we are interested in what he and his colleagues had to do after the accident.

While firemen battled to prevent the other cars going up in what could have been the mother of all ethanol fires, and paramedics dealt with the dead and injured, and police struggled to evacuate houses and deal with the road traffic chaos – all in torrential rain and widespread surface flooding – the RTC sat in a silent heated office 100 miles away watching computer monitors. All hell was breaking loose there too. Some of the heaviest rail traffic in the world – most of it freight – flows through and around Chicago; and one of the major arteries had just closed.

Back in an earlier article we talked about the services of a railroad. One of the major services is delivering goods, on time. Nobody likes to store materials if they can help it: railroads deliver “just in time”, such as giant ethanol trains, and the “hotshot” trans-continental double-stack container trains with nine locomotives that get rail-fans like me all excited. Some of the goods carried are perishables: fruit and vegetables from California, stock and meat from the Midwest, all flowing east to the population centres of the USA.

The railroad had made commitments regarding the delivery of those goods: what we would call Service Level Targets. Those SLTs were enshrined in contractual arrangements – Service Level Agreements – with penalty clauses. And now trains were late: SLTs were being breached.

A number of RTCs and other staff in Homewood switched into familiar routines:

  • The US rail network is complex – a true network. Trains were rescheduled onto alternate routes, and traffic on those routes was bunched as tightly as the rules allowed to create extra capacity.
  • Partner managers got on the phone to the Union Pacific and BNSF railroads to negotiate capacity on their lines under reciprocal agreements already in place for situations just such as this one.
  • Customer relations staff called clients to negotiate new delivery times.
  • Traffic managers searched rail yard inventories for alternate stock of ethanol that could be delivered early.
  • Crew managers told crews to pick up their trains in new locations and organised transport to get them there.

Fairly quickly, service was restored: oranges got squeezed in Manhattan, pigs and cows went to their deaths, and corn hootch got burnt in cars instead of all over the road in Cherry Valley.

This is Incident Management.

None of it had anything to do with what was happening in the little piece of hell that Cherry Valley had become. The people in heavy waterproofs, hi-viz and helmets, splashing around in the dark and rain, saving lives and property and trying to restore some semblance of local order – that’s not Incident Management.

At least I don’t think it is. I think they had a problem.

An incident is an interruption to service and a problem is an underlying cause of incidents. Incident Management is concerned with the restoration of expected levels of service to the users. Problem Management is concerned with removing the underlying causes.

To me that is a simple definition that works well. If you read the books and listen to the pundits you will get more complex models that seem to imply that everything done until trains once more rolled smoothly through Cherry Valley is Incident Management. I beg to differ. If the customer gets steak and orange juice then Cherry Valley could be still burning for all they care: Incident Management has met its goals.

Rob England: What is a Technical Service Catalogue?

Amtrak 14th Street Coach Yard (Chicago, IL, US)

We are looking at railways (railroads) as a useful case study for talking about service management.

Last time we looked at the service catalogue of a railway.

We concluded that first and foremost, a service catalogue describes what a service provider does.

How often and what flavour are only options to a service or package of services.

ITIL refers to a technical service catalogue (TSC).  Where does that fit?

One thing everyone agrees on is the audience: a TSC is for the internal staff of the service provider, to provide them with supplementary information about services – technical information – that the customers and users don’t need to see.

But the scope of a TSC – what services go into it – is a source of much debate, which can be crudely categorised into two camps:

  1. TSC is a technical view of the service catalogue
  2. TSC is a catalogue of technical services

Those are two very different things.  Let me declare my position up front: I believe the answer is #1, a technical view of the service catalogue.  ITIL V3 was ambiguous but ITIL 2011 comes down clearly with #2.  This is unfortunate, as we’ll discuss.

Go back to what a service catalogue is: a description of what a service provider provides to their customers (and hence their users). A good way of thinking of a service in this context is as something that crosses a boundary: we treat the service provider as a black box, and the services are what come out of that box. A service catalogue is associated with a certain entity, and it describes the services that cross the boundary of that entity. If they don’t come out, they aren’t a service, for that entity, depending on where we choose to draw the boundary. To define what the services are, first define the boundary of the service provider.

Think of our railroad example from last time.  A railway’s service catalogue is some or all of:

  • Container transport
  • Bulk goods transport (especially coal, stone and ore)
  • Less-than-container-load (parcel) transport
  • Priority and perishables transport (customers don’t send fruit as regular containers or parcels: they need it cold and fast)
  • Door-to-door (trucks for the “last mile”)
  • Livestock transport
  • Passenger transport
  • etc etc

A railway provides other functions:

  • track gangs who maintain the trackwork
  • dispatchers who control the movement of trains
  • yard crews who shuffle and shift rolling stock within the yard limits
  • hostlers who prepare and park locomotives

It is clear that these are not services provided by the railway to its customers.  They are internal functions.

A railway provides track, rolling stock, tickets and stations, but these aren’t services either: they are equipment to support and enable the services.

A passenger railway provides

  • on train security
  • ticket collectors
  • porters
  • dining car attendants
  • passenger car cleaners

and a freight railway provides

  • container loading
  • consignment tracking
  • customs clearance
  • waybill paperwork

These all touch the user or customer, so are these services?  Not unless the customer pays for them separately as services or options to services.  In general these systems are just components of a service which the customer or user happens to be able to see.

So why then do some IT people insist that a technical service catalogue should list such “services” as networks, security or AV? (ITIL calls these “supporting services”). If the networks team wants to have their own catalogue of the services that only they provide, then they are drawing their own boundary around just their function, in which case it is not part of a technical service catalogue for all of IT, it is a service catalogue specifically for the networking team.  It is not a service provided by IT to the customer.

A technical service catalogue should be a technical view of the same set of services as any other type of service catalogue for the particular entity in question.   The difference is that it provides an internal technical view of the services, with additional information useful to technical staff when providing or supporting the services.  It includes information a customer or user doesn’t want or need to see.
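
One way to picture this is as a single list of services with two views over it. The sketch below is in Python and is purely illustrative: the Service class, its fields and the sample entries are assumptions, not a structure defined by ITIL or by any catalogue tool. Tickets, porters and stations appear only as components inside a service's technical detail, never as catalogue entries of their own.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class Service:
    """One service that crosses the provider's boundary (hypothetical model)."""
    name: str
    customer_description: str
    # Internal detail: components that underpin the service, not services.
    components: List[str] = field(default_factory=list)


CATALOGUE = [
    Service(
        name="Passenger transport",
        customer_description="Scheduled travel between stations",
        components=["tickets", "porters", "stations", "dining car attendants"],
    ),
    Service(
        name="Container transport",
        customer_description="Containers moved between terminals",
        components=["container loading", "consignment tracking", "waybills"],
    ),
]


def customer_view(catalogue: List[Service]) -> Dict[str, str]:
    # What customers and users see: the service and what it does for them.
    return {s.name: s.customer_description for s in catalogue}


def technical_view(catalogue: List[Service]) -> Dict[str, List[str]]:
    # What internal staff see: the same services, plus supporting detail.
    return {s.name: s.components for s in catalogue}


# Both views list exactly the same services; only the detail differs.
assert customer_view(CATALOGUE).keys() == technical_view(CATALOGUE).keys()
print(sorted(technical_view(CATALOGUE)))
# ['Container transport', 'Passenger transport']
```

Because both views are derived from the same list, a component such as "tickets" cannot drift into the catalogue as a service in its own right without someone deliberately adding it as one.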

A technical service catalogue for a railway would indeed refer to tickets and porters and stations and yard procedures and waybills, but only as components of the services provided – referred to within the information about those services – not listed as services in their own right.  I’m all for “supporting services” to be described within a service catalogue, but not as services.  They are part of the information about a service.  Supporting services aren’t services: they are component systems – CIs – underpinning the real services we deliver to our customers.

By adopting the concept of “supporting services” and allowing these to be called services within the catalogue of a wider entity that does not provide these services to a customer, ITIL 2011 contradicts its own description of “service”.

Service Design 4.2.4.3 says:

Supporting services: IT services that support or ‘underpin’ the customer-facing services. These are typically invisible to the customer… such as infrastructure services, network services, application services or technical services.

Yet in the prior section, 4.2.4.2, such IT systems are clearly not a service:

IT staff often confuse a ‘service’ as perceived by the customer with an IT system.  In many cases one ‘service’ can be made up of other ‘services’ and so on, which are themselves made up of one or more IT systems within an overall infrastructure…  A good starting point is often to ask customers what IT services they use and how those services map onto and support their business processes

And of course it contradicts the generic ITIL definition of a service as something that delivers value to customers.  This is important because the concept of “supporting service” allows internal units within the service provider to limit their care and concern to focus on the “supporting service” they provide and allows them to become detached from the actual services provided to the customer.  There is no SLA applicable to “their” service, and it quite likely isn’t considered by service level reporting.

A railway ticket inspector shouldn’t ignore security because that is not part of “his” service. A yard hostler should make sure he doesn’t obstruct the expeditious handling of rolling stock when moving locomotives, even though rolling stock isn’t part of “his” service. The idea of “supporting service” allows and encourages an “I’m alright Jack” mentality which goes against everything we are trying to achieve with service management.

It is possible that Lou Hunnebeck and the team writing Service Design agree with me: that they intend there to be a distinction between supporting services and IT systems. If so, that distinction is opaque. And they should have thought more about how the “internal” services model would be misused: the problem I’m describing was common before ITIL 2011.

There is the case where the supporting services really are services: provided to us by a third party in support of our services to our customer.  For example, a railway often pays another company:

  • to clean the carriages out
  • to provide food for the bistro car
  • to repair rolling stock
  • to provide the trucking over “the last mile”

Where we bundle these activities as part of our service to a customer and treat them as an Underpinning Contract, then from the perspective of the services in our service catalogue – i.e. from the perspective of our customer – these are not services: they are CIs that should not be catalogued here. If this scenario – and only this scenario – is what Service Design means by a “supporting service”, I can’t see that called out explicitly anywhere.

A technical service catalogue should be a technical view of the services that we provide to our customers. I wish ITIL had stuck to that clear simple model of a catalogue and kept IT focused on what we are there for.

Rob England: "What is Service Management?"

Tenuous link: One of Rob's passions outside of ITSM is trains. The ITSM Review offices are in sunny Swindon in the UK, home of Isambard Kingdom Brunel's workshops which powered the Great Western Railway.

Editor’s Note: We are very pleased to welcome Rob England (a.k.a The IT Skeptic) as regular columnist at The ITSM Review.

Service Management

Railways provide a useful analogy for understanding what service management is and how it works.

What is a railway for? (or “railroad” for our American readers)

If you said “to move people and/or goods” you are only partly right.  On the right track (pun intended) but not there yet.

How should it move goods and passengers?  With maximum quality?  Or at minimum cost?  The answer to that is “it depends”.  It depends on what the customer wants.

A customer is one who pays for the service of the railway.  That isn’t always the same as the one who buys the ticket or books the freight.  Many railways receive public funding, so the government or other body is effectively also a customer: they are paying for part of the service.  Not all customers are users of the services.

The railway is answerable not only to its customers.  It is also answerable to its owners and the governors they delegate authority to.  The owners may not want the same things as the customers at all.  For example, railways are often required to provide a passenger service as a requirement of gaining the right to operate.  These passenger services are often unprofitable: the money is in the freight services.  Guess how often such passenger services meet the needs of the paying ticket-holders.

So a railway exists to provide a service that moves people and/or goods to meet the needs of its governors and customers.

You are in the service business

If you were operating a railway, what activities would you have to manage in order to ensure you meet the needs of your governors and customers?  There would be some activities that are unique to railways, such as scheduling, servicing rolling stock, dispatching trains and so on.  But the bulk of the activities involved in operating a railway are the same as operating any business: reporting, financials, HR, marketing, IT, procurement… and delivering your services.  It doesn’t matter whether an organisation’s services are transporting goods, providing accommodation, building houses or catching fish.  They all serve customers and they all perform a similar set of activities to manage that service.

Whether you build roads or map them, operate ports or use them, build houses or sell them, plan weddings or sing at them, care for kids or clothe them, sell PCs or scrap them, you are in a service business, even if you may not be in a “service industry”.

We aren’t talking about over-the-counter “may I help you?” service, how to develop the customer service interface, the experience of contact.  Service Management is about the end-to-end process of providing services.  It covers such things as:

| Service management activity | What it covers | Rail examples |
| --- | --- | --- |
| Delivering | Executing a service for users | Food service, engine drivers, shunters |
| Operating | Running the infrastructure that makes the services work | Signaling, track maintenance, security guards |
| Supporting | Responding to user requests for service or help, and resolving them | Ticket sales, call centre, guard, repair crews |
| Cataloguing | Providing information about what services are available | Timetables, websites, brochures |
| Customer relations | Maintaining relationships with customers | Customer account managers, sales, public relations |
| Measuring | Monitoring and reporting service metrics | Punctuality, traffic volumes, profitability |
| Planning | Proposing, choosing and strategising new services, improvements and retirements | Routes, trains, schedules, freight deals, specialised cars (e.g. refrigerated) |
| Designing | How the service will work, what infrastructure it needs | Developing a new schedule, specifying new equipment |
| Building | Creating the infrastructure, mechanisms, and processes to deliver a service | Ordering or constructing new rolling stock, laying track, hiring and training staff, printing collateral |
| Implementing | Rolling out the new service, going “live” | Commissioning new rolling stock, publishing new or changed schedules, deploying staff, rolling trains |
| Assuring | Protecting the organisation, its staff, customers and users. Making sure the service is safe for people, compliance and profits. | Track safety programmes, risk register, ticket inspection, financial and quality audits |
| Improving | Making service better: identifying, planning and managing improvement to efficiency and effectiveness | Quality programme, cost control, regular maintenance schedules |
| Governing | Directing, monitoring and evaluating the management and execution of the services | Corporate vision and goals, high-level policy, risk profile, annual report |

Service Management says the most important thing you do is deliver services to your customers.  Moreover, everything you do should be considered in terms of the services you provide to your customers.

‘Outside-In’ Thinking

Adopting a service management approach can have a profound effect on the way your business works and your staff think. It takes us away from that introverted, bottom-up thinking that begins with what we have and what we do and eventually works its way up and out to what we deliver to the customer. Instead, with service management we change our point of view from concentrating on the internal “plumbing” of our business, moving instead to a focus on what “comes out of the pipe” – what we provide. We take an “outside-in” view. Starting from this external perspective we then work our way top-down into the service organisation to derive what we need and what we have to do in order to provide that service.

Service management isn’t one subset of the business; it is not one activity at the end of the main supply chain.  It is a different way of seeing the whole supply chain, the whole business that produces the services, by seeing it initially from the outside, from the customer’s point of view.  Therefore any discussion of Service Management may stray into general business management topics.

Seeing our business in terms of the services it provides can’t help but make us better at providing them.

To a customer, “better” means more useful and more reliable, i.e. more valuable and better quality.  

From the service-provider’s point of view, “better” means more effective and more efficient, i.e. better results and cheaper.  

Follow along in this series of articles as we look at Service Management through the lens of railways and how they operate.  We hope it will provide a fun and useful way to understand this thing called Service Management.

© Copyright 2012 Two Hills Ltd.

'Basic Service Management' by Rob England (a.k.a The IT Skeptic)

This is a quick review of Rob England’s book ‘Basic Service Management’.

You can find out more about Rob’s book and the TIPU method here: www.basicsm.com. If you want to share your own review please add a comment below.

In my opinion this is a well written introduction to service management.

This book might have also been called:

  • ‘Service Management in a nutshell’
  • ‘An introduction to Service Management’
  • ‘Service Management for Business Owners’
  • ‘The book on Service Management that you buy for your boss’ or
  • ‘How to introduce someone to service management without scaring the bejesus out of them by banging on about ITIL or other IT geekery’

I read this in one sitting and I’m not a fast reader. It is quick, accessible and thought provoking.

It is not an ITSM or IT book per se, in fact I think the best recipient of this book is a non-IT business owner or service owner who wants to appreciate the benefits of service management.

As an ITSM professional, this is the sort of book you need to send to those you wish to educate and influence about your chosen profession. Or as one Amazon reviewer put it: “I recommend reading it before you get lost in ITIL”. This would also be useful to an entrepreneur looking to start or scale their business.

Why Service Management?

“If you are reading this book, you probably don’t manage your services so much. That gives you an opportunity to increase revenues and profitability: improving your service brings increased efficiency and effectiveness. That means increased returns for much less investment than from improving your products or equipment”.

Rob England, The IT Skeptic

Rob is a great wordsmith and well respected in the ITSM industry – my only criticism of this book is that I wish he had used the power of metaphor, storytelling or examples to describe his seven practice areas. The second half of the book tends to slide into a glossary of his basic service management terms and bullet points. I thought this might have been a perfect opportunity for Rob to use some examples in order to reinforce his message and walk the reader through his ‘Seven Areas’ rather than explaining principles in purely theoretical terms.

In the ‘How to Use this Book’ section Rob urges the reader to “Read it, It is short”. In a similar fashion my advice to you as an ITSM professional is, “Buy it, it is good”.

Have you read Rob’s book? Please share your opinion in the comments below.
