Problem management challenges and critical success factors

Following his presentation on “problem management challenges and critical success factors” at the 8th annual itSMF Estonia conference in December, Tõnu Vahtra, Head of Service Operations at Playtech (the world’s largest publicly-traded online gambling software supplier), gives us his advice on understanding problem management, steps to follow when implementing the process, and how to make it successful.

Tõnu Vahtra

Problem management is not a standalone process

Incident management and event management

Problem management cannot exist without the incident management process, and there is a strong correlation between incident management maturity and problem management efficiency and results. Incident management needs to ensure that problems are detected and properly documented (e.g. the basic incident management requirement that all requests need to be registered). Incident management works back-to-back with the event management process; if both of these processes are KPI managed, any anomalies in alarm or incident trends can be valuable input to problem management. While restoring service during an incident, incident management also has to ensure that relevant information is collected during or right after resolution (e.g. a server memory dump before a restart) so that more information is available to identify the incident’s root cause(s).

Critical incident management

Problem management at Playtech gains a lot from the critical incident management function, which is carried out by dedicated Critical Incident Managers who have the widest logical understanding of all products and services and years of experience with solving critical incidents. They perform incident post-mortem analysis following all major incidents, and they also start the initial root cause analysis (RCA) before handing this task over to problem management. RCA is handed over to Problem Managers within 24 hours of the end of the incident; during that time the Critical Incident Manager collects and organizes all available information about the incident. Critical Incident Managers usually have no difficulty allocating support and troubleshooting resources from all support levels, as critical incident troubleshooting and initial preventive measures are considered the highest priority under a mandate from the highest level of corporate management. All of the above ensures high-quality input for problem management in a timely manner.

Change management and knowledge management

In the Error Control phase, the two most important processes for problem management are change management and knowledge management. Most action items identified during RCA are implemented through change management. The stronger that process, the less problem management has to be involved directly in change planning (providing abstract goals rather than a concrete action plan or task list for implementation) and the smaller the risk of additional incidents during change implementation. Change management also needs the capability and a documented process flow to implement emergency changes in an organized way and with minimum impact, so that recurring critical incidents are stopped as fast as possible.

Knowledge management is vital for incident management: it ensures that service desk specialists can quickly find and apply specific workarounds for known errors while problem management is still working on their resolution. Regular input and close attention are needed from problem management to ensure that every stakeholder of the known error database (KEDB) can easily locate information relevant to his or her role, that all units are aware of information relevant to them, and that all information in the KEDB is relevant and up to date. At Playtech, problem management also manages process errors identified through root cause analysis. Process improvements only last when they are properly documented, communicated to all relevant stakeholders, and backed by additional controls that detect deviations from the optimal process. Local and cross-disciplinary knowledge management for process knowledge has an important role here.

Defect management

Problem management has to go beyond ITSM processes in a software development and services corporation like Playtech and also integrate with the software development lifecycle (SDLC). For this purpose, Playtech has established a separate defect management sub-process under problem management. Defect management manages the lifecycle of all significant software defects identified in production environments and aligns defect fixing expectations between the business and development departments. Defect Managers ensure a consistent, prioritized overview of all significant outstanding software defects, which ensures optimal usage of development resources and minimizes the overall business impact of defects. They act as a single point of contact for all defect-related communication and ensure high transparency of the defect fixing process and fix ETAs. Defect Managers define the defect prioritization framework between key business and development stakeholders and govern the agreed targets.

Software problem management

Problem management leads the software problem management process through defect management. Under the software problem management process (which is usually run by a quality assurance team in the relevant development units), development teams perform root cause analysis for defects highlighted for RCA by problem management or raised internally. Every defect is analyzed from two aspects: firstly, why the defect was created by development, and secondly, why it was not identified during internal QA and was instead first reported from the production environment. Root causes and action items are defined from both questions and tracked with relevant stakeholders. This process ensures that similar defects will not be created, or will be identified internally, in the future. Even more importantly, there is a direct feedback channel from the field to the developer or team who created the defect, so that they get a full understanding of the business implications of their activities.

Important steps to take problem management to the next level

The problem management unit has to become more proactive and get more involved in the service design and service transition phases, to identify and eliminate problems before they reach production environments. Problem management needs resources to contribute to pre-production risk management and, even more importantly, this involvement has to be valued and enforced by corporate senior management, as it may take additional resources and delay time-to-market in some situations.

The Problem Management Team itself can free up resources for proactive tasks by reducing its direct participation in reactive problem management activities. This has to be done by advocating the problem management mindset across the entire corporation (encouraging people to think in terms of cause and effect, with the desire to understand the causes of issues and push their resolution for continuous improvement), so that each major domain has its own Problem Coordinators who identify root causes and track action items independently, and problem management can take more of a defining and governing role. To demonstrate the value created by problem management, and to enlist more people to spread problem management ideas so that they go viral, it is essential to visualize the process and explain the relations between incidents, root causes and action items to all stakeholders, so that they understand how their tasks contribute to the bigger picture.

There is a high number of operationally independent problem management stakeholders at Playtech, and implementing a KPI framework that is fit to measure and achieve problem management goals, and is applicable to all major stakeholders both individually and across stakeholders, seems an almost impossible task. The saying “You get what you measure” is very true in problem management: no stakeholder wants to be measured by problems that involve other stakeholders, so they take action to remove such problems from their statistics instead of focusing on the problem and its solution. At the same time, problem management tends to be most inefficient and difficult for problems that spread across multiple divisions. A Problem Manager’s role and assertiveness in facilitating a constructive and systematic process towards the resolution of such problems is crucial. Problem management still needs to find a creative approach to reflecting such problems in KPI reports, to present them as part of the big picture and sell them to executive management in order to gain sponsorship for major improvement tasks. These tasks compete for the same resources with business development projects, which have a much clearer ROI.

No problem exists in isolation. The problem records in the KEDB can be related to specific categories or domains and also related hierarchically to each other (there can be major principal problems that consist of smaller problems), and specific action items can contribute to the resolution of more than one problem. Problem categories cannot be restricted to a fixed list, as a problem can have multiple triggers and causes. It should be possible to relate a problem record to all interested stakeholders, and for this dynamic tagging seems to be a better approach than a limited number of categories (for example, to list all problems related to a big project). Instead of being examined in isolation, each problem should be approached and prioritized in the right context, fully considering its implications and surroundings. No ITSM tool today provides full capabilities for problem tagging or for creating the relations mentioned above without development, not to mention visualizing such relations, which would be a powerful aid in trend or what-if analysis and problem prioritization. Playtech is still looking for the optimal problem categorization model and a tool that would enable the use of such a model.
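As a rough illustration of the kind of model described above, here is a minimal Python sketch of problem records with free-form tags, parent links to a principal problem, and action items shared between problems. It is not Playtech's model or any ITSM tool's API; every class, field, function name and sample record is a hypothetical chosen for this example.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class Problem:
    # Hypothetical record layout, for illustration only.
    problem_id: str
    title: str
    tags: set[str] = field(default_factory=set)  # free-form tags, not a fixed category list
    parent_id: Optional[str] = None              # link to a larger "principal" problem


@dataclass
class ActionItem:
    action_id: str
    description: str
    problem_ids: set[str] = field(default_factory=set)  # one action may serve several problems


def problems_with_tag(problems, tag):
    """Return the IDs of all problems carrying the given tag, sorted."""
    return sorted(p.problem_id for p in problems if tag in p.tags)


def sub_problems(problems, root_id):
    """Return the IDs of every problem below root_id in the hierarchy, sorted."""
    children = {}
    for p in problems:
        children.setdefault(p.parent_id, []).append(p.problem_id)
    found, stack = [], [root_id]
    while stack:
        for child in children.get(stack.pop(), []):
            found.append(child)
            stack.append(child)
    return sorted(found)


def actions_by_reach(actions):
    """Order action items by how many problems they contribute to (most first)."""
    return sorted(actions, key=lambda a: (-len(a.problem_ids), a.action_id))


# Made-up sample data: one principal problem with smaller problems beneath it.
problems = [
    Problem("P1", "Unstable payment platform", {"payments", "project-x"}),
    Problem("P2", "Connection pool exhaustion", {"payments", "database"}, parent_id="P1"),
    Problem("P3", "Slow cache failover", {"cache", "project-x"}, parent_id="P1"),
    Problem("P4", "Missing alarm on queue backlog", {"monitoring"}, parent_id="P3"),
]
actions = [
    ActionItem("A1", "Add connection pool monitoring", {"P2", "P4"}),
    ActionItem("A2", "Tune failover timeout", {"P3"}),
]

print(problems_with_tag(problems, "project-x"))
print(sub_problems(problems, "P1"))
print([a.action_id for a in actions_by_reach(actions)])
```

Because tags are plain sets, a record can be related to a project, a domain and several stakeholders at once without changing a category list, and the parent links let a principal problem be reviewed together with everything beneath it.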

Advice to organizations that are planning to start the implementation of the problem management process

For organizations starting to implement the problem management process, my advice is: don’t take all the process activities from the ITIL book and start blindly implementing them. This is not the way to start the implementation of this process or any other. Problem management success depends mostly on a specific mindset, and in an already established organization it may take years for the right mindset to be universally accepted. The formal problem management process should initially be mostly invisible to all stakeholders outside the Problem Management Team, to avoid the natural psychological tendency to resist change.

It is essential to allocate dedicated resources to problem management (Playtech assigned a dedicated person to problem management in 2007; any problem management activities prior to that were ad hoc and inconsistent). The problem management unit should start by performing root cause analysis on, and removing the root causes of, the current major incidents that have the highest financial and reputational impact on the organization. If such incidents are being closely monitored by senior management and key stakeholders, solving them can earn problem management the credit it needs to get attention and resources for solving problems elsewhere. Secondly, problem management should look at the most obvious recurring alarm and incident trends that result in high support and maintenance costs. Resolving such problems gains the trust of the support and operational teams, whose workload is reduced and who are then more willing to contribute and cooperate in future root cause analysis. The final problem review before closure is an often neglected task, but to improve the process it is essential to assess whether the given problem was handled efficiently and to give feedback about the solution to all relevant parties. Proactive problem management and KPIs are not essential to start with; Problem Managers should concentrate on activities with the highest exposure and clear value.

In summary

There will definitely be setbacks in problem management. To make a real difference with this process and increase its maturity over time, it needs at least three things: a strong and assertive leader who is persistent in advocating problem management; a continuous improvement mindset throughout the organization; and the ability to find a way forward from dead-end situations with out-of-the-box thinking. When there is no such leader, involving external problem management experts may also help as a temporary measure to get the focus back on the most important activities. However, this measure is not sufficient in the long term, as the problem management process constantly needs to evolve with its organization and adjust to significant operational changes to remain fit for purpose and relevant.

You can download Tõnu’s presentation in full here.

itSMF Estonia Conference Round-up

Beautiful Estonia

On Wednesday 11th December, in a very cold and snowy Tallinn, President of itSMF Estonia, Kaimar Karu, kicked off the annual itSMF Estonia conference by introducing all of the speakers and encouraging delegates to ask questions of them throughout the day.

Kaimar had managed once again to raise attendance of the conference (by 10%), with representation from 10 different countries, and with a very good female representation in the audience too.

Delivering Service Operations at Mega-Scale – Alan Levin, Microsoft

The first speaker was Alan Levin of Microsoft, whose presentation talked through how Microsoft deals with its vast number of servers and how the ability to self-heal is built into all Microsoft products.

On the subject of Event Management Alan spoke about ensuring that alarms are routed to the correct people and how, in your business, any opportunity you have to reduce alerts should be taken.

Enabling Value by Process – Viktor Petermann, Swedbank

Viktor opened his presentation by saying that 4 out of 5 improvement processes fail because people are not robots. You cannot just expect them to know what you want and how you want things to work.

He continued by saying that having the right culture, processes and learning from relevant experiences will enable you to do the right things the right way.

Viktor warned that, like quitting smoking, change will not happen unless you really want it to. Before embarking on any change, make sure that you are willing to give it 100%.

Benchmarking and BI, Sat Navs for Service Desks – Oded Moshe, SysAid Technologies Ltd.

After having to rest his voice for 24 hours due to contracting the dreaded man-flu, Oded still managed to show how to use benchmarking to improve your service desk.

His presentation contained useful guidance on what areas to look at and how to benchmark yourself against them.

He also explained how you can use SysAid and its community to gather global service desk metrics to measure yourself against.

Presentation words of wisdom from Oded: Don’t become fixated with metrics and benchmarking as they are not the only way to measure.

Service-Based Public Sector – Janek Rozov, Ministry of Economic Affairs and Communications

In contrast to the other presentations, “Service-Based Public Sector” was presented in Estonian. Although I do not speak Estonian, I could tell how passionate Janek was about the subject, and it was one of the most talked-about presentations that evening in the bar.

The presentation covered how the Ministry of Economic Affairs and Communications is using ICT to fulfill its vision of supporting Estonians as much as possible while they exercise their rights, but bothering them as little as possible in the process. Perhaps we could pay for Janek to spend some time with the UK Government in the hope that some of this common sense might rub off?

If you would like to know more about Estonian ICT success in the public sector you can read Janek’s pre-conference article “Standardizing the delivery of public services”.

Service Desk 2.0 – Aale Roos, Pohjoisviitta Oy

Aale spoke at length about how service desks and the mentality of “break fix” are old-fashioned and flawed. He described how the service desk needs to be brought kicking and screaming into the 21st century, concentrating on proactive measures and outcomes.

He continued to say that ITIL has been outdated for over a decade and that unlearning ITIL and moving to a “Standard + Case” approach is the way of the future.

Networking

There was lots of opportunity for networking across the event, and at lunch I got the opportunity to speak to a few of the delegates and presenters to find out what they thought of the conference.

Quote from Oded Moshe:

I think the first session by Alan Levin from Microsoft was a great chance for us all to see the insides of one of the largest operational support organizations in the world! They are in charge of providing more than 200 cloud business services to more than 1 billion people with the help of more than 1 million servers. So Problem Management, Incidents, Monitoring – everything is on a HUGE scale – it is easy to understand why you must have your service processes properly tuned otherwise you are in a master-mess…

Peter Hepworth – CEO of AXELOS, Kaimar Karu – President of itSMF Estonia and Patrick Bolger – Chief Evangelist at Hornbill Service Management

Industry Leaders Agree IT is Revolting – Patrick Bolger, Hornbill Service Management

Adapt or die was the message in Patrick’s session with references to high street names that didn’t and paid the price.

Comparing how we in IT think we are viewed and how the business actually views us was sobering but mentions of SM Congress and Arch SM show that the industry is ready to change and we are not doing this alone.

Problem & Knowledge, The Missing Link – Barclay Rae, Barclay Rae Consulting

Presenting on the missing links in ITSM, Barclay hammered home why Problem and Knowledge Management are so fundamentally important.

Using his ITSM Goodness model, Barclay showed how to move away from the process silos we so often find ourselves in and which processes to group together for maximum effectiveness, i.e. Incident, Problem, Change.

Barclay also held well-attended workshops pre-conference in conjunction with itSMF Estonia.

DevOps, Shattering the Barriers – Kaimar Karu, Mindbridge   

Kaimar’s message is unorthodox:  Don’t play it safe, try to break things, don’t mask fragility and plan for failure, for this is the road to increased quality and innovation.

He advised that we must not forget that developers are human and not unapproachable cowboys riding round on horses shooting code. Get to know them over a drink so that everyone can relax and say what’s on their mind without fear of repercussion.

But most of all remember that “Sh*t happens”.  Stop the blame, it doesn’t help…EVER.

Problem Management Challenges and Critical Success Factors – Tõnu Vahtra, Playtech

The penultimate session of the day was from Tõnu on how Playtech are working through Problem Management and the issues they have encountered.

The major difficulty Tõnu has found is the lack of practical information on how to actually do Problem Management, and Playtech have found themselves having to teach themselves, learning from their own mistakes as they go.

It was a very useful case study with helpful pointers to information and literature, such as Apollo Root Cause Analysis by Dean L. Gano, for others struggling with Problem Management.

The Future for ITIL – Peter Hepworth, AXELOS followed by Forum

Following on from the publication of AXELOS’ roadmap, and the announcement that they would be partnering with itSMF International, Peter talked through the progress AXELOS has made and its hopes for the future.

The forum was well attended and many useful suggestions were made for ways that ITIL and PRINCE2 could be improved.

You can learn more about AXELOS’ plans by reading our interview with Peter.

My thoughts

Considering the cost of a ticket to the conference, I wasn’t expecting the content and presentations to be at the very high level they were. I haven’t yet attended any of the other non-UK itSMF conferences, but the bar has now been set incredibly high.

My main observation from the conference and the discussions that took place afterwards is that the majority of delegates know how very important Problem Management is, but are still struggling with implementation and making it work. In the AXELOS workshop the main feedback seemed to be the need for ITIL to cut down on the number of processes available as standard and concentrate on the core areas that the majority of organizations have, or are trying to put in place.

Well done to Kaimar and team for the fantastic job and thank you for the wonderful hospitality. In addition to the conference, I particularly enjoyed the entertainment on the Tuesday evening, when some of the organisers, speakers, delegates and penguins ventured out in the snow for some sightseeing and a truly delicious meal at a little restaurant called Leib in the Old Town.

I highly recommend that anyone attend the itSMF Estonia 2014 conference next December. With flights from most places in Europe at less than £150, a hotel/venue that is less than £100 per night, and an amazing ticket price of less than £40, it is extremely good value for money. With outstanding content (90% in English), brilliant networking opportunities and excellent hospitality, it’s too good an event to miss. I certainly look forward to being there again.

As a final note, thank you to itSMF Estonia for having us involved as the Official Media Partner. We are hoping to work with other international itSMF chapters in 2014, as well as on other worldwide ITSM events. Watch this space 🙂