Do you really think you can ‘Run your data center from an iPhone’?

Article from InformationWeek:

The title implies you can run your data center from a mobile device. What the article actually describes is a vendor tool that offers a simple, convenient way to restart services or servers ‘remotely’ from a mobile device, giving you access to all the systems running in your data center. Although it covers just Microsoft at the moment, the vendor apparently envisions going much further into other technologies.

When I use the term ‘remotely’, I’m talking about an individual who is not at the location where they normally perform their work. We have had engineers supporting data centers from remote locations for years. However, they are generally at a desk and have access to various monitors and systems. So when we say ‘remotely’ here, I’m picturing that Sr. Engineer sitting with their iPad at their child’s soccer game.

Actually running a data center has a multitude of aspects that can’t be performed remotely (staffing, vendor coordination, equipment issues, etc.). However, this is a ‘cool’ aspect of technology that implies you can do ‘some’ things from wherever you are, and I agree there is definitely some benefit in being able to see and execute certain things remotely.

As an aside, in my experience, I am always pushing to have staff work very hard when they are on the clock, and when it’s personal time I like them to focus on their families and personal lives. When there is a critical issue, yes, we need to engage staff and possibly have them leave their family barbecue. But work-life balance needs to be in your planning, and if it is not, your people are prone to make mistakes.

For example, picture your Sr. Engineer at their child’s soccer game. There is 3:48 left in the game, the team is down by 1, and their child is driving down the field to score when an urgent alert arrives. Because they aren’t concentrating on the task as they normally would, they immediately restart a server instead of just a particular service, which makes the situation worse. This is life, and we need to build processes that enable people to be away from work. Again, if it’s an urgent situation, having people talk it through while focusing on the task at hand is important for reaching the right resolution.

I see greater value in jumping on a conference call and talking through the situation when people are remote, because they may not be fully focused until you get them on the phone. You’d be surprised how many outages occur because someone is hurrying to get back to the game or to a family commitment, so they just ‘do it’, ‘push the button’, ‘initiate the upgrade’, etc. without taking the time to validate.

On the other hand, if the solution to a particular problem is simply to restart a service or server, then build that into your automation scheme and send an informational alert that it occurred, so the engineer can follow up at their convenience: after the soccer game or when they get back to their desk.
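The automate-and-inform idea above can be sketched roughly as follows. This is a minimal illustration, not a real tool: `Alert`, `remediate`, and the `restart` and `notify` callables are hypothetical names, and the choice to escalate to an urgent alert when the automatic restart fails is my own assumption rather than something stated above.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class Alert:
    severity: str  # "info" or "urgent" (hypothetical severity levels)
    message: str


def remediate(
    service: str,
    restart: Callable[[str], bool],
    notify: Callable[[Alert], None],
) -> bool:
    """Try the known fix automatically and tell the engineer afterwards.

    `restart` is whatever your automation uses to restart the service and
    must return True on success. `notify` delivers the alert (email, pager,
    chat). Both are injected so no particular product is assumed.
    """
    if restart(service):
        # The known fix worked: informational only, no one leaves the game.
        notify(Alert("info", f"{service} was restarted automatically; "
                             "follow up at your convenience."))
        return True
    # Assumption: a failed automatic fix is when a person is really needed.
    notify(Alert("urgent", f"Automatic restart of {service} failed; "
                           "please join the conference call."))
    return False


if __name__ == "__main__":
    # Stand-ins for a real restart command and a real alerting channel.
    remediate("print-spooler", lambda name: True, lambda alert: print(alert))
```

The point of the design is that the routine case produces a message that can wait, and only the exception interrupts someone’s personal time.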

Your thoughts?

An excellent vehicle to stimulate improvement in the organization is something I call an Event Review. The objective is to extract learnings and your own best practices from the good or bad things that have happened, and then share those learnings with everyone for the benefit of the organization.

The ‘Event’ could be anything: a project, an outage, a deployment, etc. Most important is the environment or culture you establish. You need to reinforce the benefits of the learnings. When the event is a positive one, people like to be recognized, but do that live and as quickly as possible; don’t wait for an event review.

The ‘Review’ itself should be held as soon as possible after the event has occurred. In the case of an outage, I recommend within 48 hours. The key to timeliness is capturing the important details while they are still fresh in people’s minds; the longer you wait, the more is lost.

You may find you don’t have enough time in the day to perform all the reviews you’d like so you’ll need to prioritize.  Focus on what is necessary for your organization at ‘this time’.  Adjust as you mature and as time permits.  Start slow and make each one meaningful. 

Many will be wary that your event review is a witch hunt, and it will take time and patience to build the trust you need within your staff. To reinforce that trust, you must separate the event review from any management issues you may need to deal with, such as people not following policy, and you must respect the process. The focus is on ‘what’ went wrong, ‘what’ went right and ‘what’ can be done to improve or repeat the success. The ‘what’ is important; avoid the ‘who’ during the event review, especially when you are reviewing an outage or event that had a negative impact on the business or the organization!

Assign a facilitator to lead the sessions, generally someone from your problem management staff or management team. Invite anyone who had direct involvement in the event. That could include IT staff as well as non-IT staff.

The facilitator should prepare for the session by assembling as much information as possible.  This could include the timeline, decision points, start time, end time, etc.  This will help keep the discussion on topic.

There are some people who should not be invited to the event review, primarily the ‘brass’ of the organization. If you really want staff to speak up, then the ‘safe environment’ should be free of senior management. Even the best of the senior staff seem to want to chime in or overreact, and that sends unwanted signals to the participants. If the objective of senior management is to provide praise, then they should do it elsewhere, not in the event review!

There is often extreme pressure from senior staff members, who may even demand to be in attendance. If that occurs, you have an opportunity to clarify the objective of the event review and separate out any objectives the senior staff member may have. You can give them a ‘separate’ session to review the output of the event review, while working to keep the event review itself for just those who were directly involved. At the same time, respect the needs of your senior management; it’s the ‘wants’ that you need to manage. And yes, I have been in that situation many times, almost fearing for my own job because I would not allow my boss to attend an event review. I also remember a number of times being asked for a name because someone’s head had to roll. Again, we need to separate the management issues from the event review process.

Begin by documenting the timeline of actions related to the event. What steps were taken leading up to the event? Was the plan followed? What validation occurred? Literally document all activities leading up to the event, during the event, and in the steps that closed the event. In the case of outages, time is often spent during the restoral of service trying to re-engineer or fix something else, which delays the restoral. Flush those activities out during the review.

Once you have the timeline documented, focus on 3 key things: 1) Do we know ‘what’ the root cause of the outage was? 2) ‘What’ can be done to prevent the outage from happening again? 3) If it does happen again, ‘what’ can be done to resolve it quicker?

After you have documented these areas, capture specific action items for each one, with specific timeframes and owners, and put a process in place to follow up on them. If you leave the event review without action items, you will not see the improvements. The action items must be meaningful and have a documented benefit. It is better to have fewer, quality action items that can be achieved. Giving someone an unreasonable action item that can’t be completed will mean you’ll get a lot of bad reports on status. Follow up on and close out all your action items; it will reflect negatively on your management team if action items linger without closure.
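As a rough sketch of the follow-up process described above, each action item can be recorded with an owner, a due date and its documented benefit, and the open items past their date can be listed at each follow-up. The names `ActionItem` and `overdue`, and the sample data, are hypothetical illustrations, not part of any particular tracking tool.

```python
from dataclasses import dataclass
from datetime import date
from typing import List


@dataclass
class ActionItem:
    description: str
    owner: str      # one named owner per item
    due: date       # the specific timeframe agreed in the review
    benefit: str    # the documented benefit that makes the item meaningful
    closed: bool = False


def overdue(items: List[ActionItem], today: date) -> List[ActionItem]:
    """Return the open action items whose due date has already passed."""
    return [item for item in items if not item.closed and item.due < today]


if __name__ == "__main__":
    # Invented example data for illustration only.
    items = [
        ActionItem("Add restart step to runbook", "Pat", date(2024, 3, 1),
                   "Faster restoral if the outage recurs"),
        ActionItem("Alert on disk usage", "Sam", date(2024, 3, 15),
                   "Prevents the outage from recurring", closed=True),
    ]
    for item in overdue(items, today=date(2024, 3, 10)):
        print(f"{item.owner}: {item.description} (due {item.due})")
```

Reviewing this short list on a regular schedule is one simple way to make sure items are actually closed out rather than carried forward indefinitely.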

A final item to cover in the agenda is key learnings from the event. This is best done at the end of the session. Open this topic with an overall summary of the event based on the prior agenda items, and then step back and open up the discussion for observations from the group. This dialogue is often the most beneficial part, as it can provide great insight into the group and what they really learned from the event. Don’t interrupt! Let the participants speak and engage each other; just observe, and then document the summary of the key learnings!

Always stick to the agenda and keep the discussion focused at the right level. Try to keep the session crisp, at 1 to 1½ hours. Separate meetings can be used for further analysis, if required, or for designing solutions to re-engineer something. That activity should not be done in the event review; rather, it can be assigned as an action item.

I can’t say enough about the importance of building the foundation of the event review around the ‘what’ rather than the ‘who’. When you use this review to zero in on the ‘who done it’, you will lose the openness and candor so important for continual process improvement. Following up with individuals is necessary; just be careful not to use the event review process as the vehicle. Otherwise your ‘event review’ will be perceived as a visit to the principal’s office, and that means people stop talking and only tell you what they think you need to know.

Another negative effect arises when people feel they will get in trouble if they are mentioned in the event review: they will spend more time pointing fingers and doing what they can to ensure the finger isn’t pointed at them. That is another unproductive activity that brings negativity into your environment, and it must be replaced with a true focus on the business!

The event review is truly a powerful tool that can help you mature your organization. To do that, you must prioritize the long-term benefits over the short-term witch hunts!

If you are the person who gets the calls from the business, the CIO, the CEO or any other senior executive when something isn’t working, you are going to want a solid change management process backing you up. Although all processes are important, what differentiates a mediocre IT organization from a very efficient and effective one is the maturity and efficiency of its change management process.

I want to drill down in detail on this topic because it represents the health and strength of your organization. This is all about respect: respect for peers, the company, the environment and the leadership. You will not always be present when people need to act. How they act is a direct reflection on the leadership of the organization. The goal is to heighten the level of respect so that when mistakes happen, they are just mistakes.

Each change faces these questions:  Why should we make the change?  Who needs to know?  Who’s impacted?  What’s the benefit?  What’s the risk?  What’s the cost?  When should we do it?  Who should do it?  What precautions should we take?  What training is required?  How do we support the change?  What if this doesn’t go as planned?   When the people in the organization maintain a high level of respect, they will ensure that those questions get answered for you.

Another key aspect beyond respect is that the process owner of change management needs to maintain the vision of change management: the primary objective is to address the needs of the business, helping it be responsive to the market and increase revenue and profit. Changes will need to be made to update tools, update applications, implement new applications, update equipment, test some assumptions, try new things, etc. The change management process is there to help make these changes in a way that maximizes availability during the transition, gets the work done expeditiously and ensures the knowledge exists to support the change afterward.

From a simplified view, these are the basic steps to change management:

1. Understand the business value of the change and collaborate appropriately to obtain buy-in

2. Identify who is impacted by the change and ensure appropriate coordination occurs

3. Schedule when the change will go in

4. ‘Plan’ out how the change will be deployed and what the back-out plans are in case the change doesn’t go right

5. Test the change

6. Deploy the change

7. Validate that the change works as intended

8. Update appropriate documentation/system configuration/users

Those 8 steps are simple enough and, at times, they can all be completed within a very short time. Some steps can be omitted, as in the case of simply replacing a redundant hard drive in a rack. However, always use discretion as you determine when to skip steps and which ones. The reason to skip a step is that it adds no value, not that you don’t have time to perform it.
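One lightweight way to hold yourself to that rule is to require a written reason for every skipped step. The sketch below is only an illustration under that assumption: the short step labels, the name `unaccounted_steps` and the sample reason are all hypothetical, and a reason still has to be judged by a person to confirm it is about value rather than time.

```python
from typing import Dict, List, Set

# Short, hypothetical labels for the eight steps listed above.
STEPS: List[str] = [
    "business value",
    "impact and coordination",
    "schedule",
    "plan and back-out",
    "test",
    "deploy",
    "validate",
    "update documentation",
]


def unaccounted_steps(completed: Set[str], skipped: Dict[str, str]) -> List[str]:
    """Return steps that were neither completed nor skipped with a reason.

    `skipped` maps a step label to the written reason it adds no value for
    this change. A blank reason does not count as a justification.
    """
    justified = {step for step, reason in skipped.items() if reason.strip()}
    return [s for s in STEPS if s not in completed and s not in justified]


if __name__ == "__main__":
    # Replacing a redundant hard drive: the test step is skipped on purpose.
    done = set(STEPS) - {"test"}
    reasons = {"test": "Hot swap of a redundant drive; a separate test adds no value"}
    print(unaccounted_steps(done, reasons))
```

A change whose list comes back empty has every step either done or consciously waived, which keeps the process simple without letting steps disappear silently.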

Often I see environments where changes are being made in production and the only people aware of those changes are the ones making them. This becomes apparent when outages occur and you eventually identify that a change was the problem. It can happen when you have an overdesigned process, or an infrastructure group that is not responsive, or when ‘change management’ has been distributed to various groups.

When you find yourself in the situation where not everyone is following your change process, stop and simplify. It’s more important to have a simplified process that everyone uses, so that all are aware of the changes being made, than it is to have a perfectly documented, perfectly engineered process with numerous checkpoints that no one is following!