By Terry Vergon
Director, Engineering and Maintenance
5 Paths to Preventing Downtime.
Having worked in mission critical facilities for 30-plus years, I have seen my fair share of human-caused downtime. I’ve studied and investigated many incidents and can safely say most of these fall into five major groupings. And I know they are all preventable.
Errors in communications.
Communication is tough, and verbal communication is even tougher. If you ask the developers working on Siri, Dragon or any of the other speech recognition software, they’ll tell you how difficult it is. Pronunciations, vernacular, local slang and the vast variety of meanings all make the spoken word complex, and at times confusing and misunderstood.
My own experience began when I was in the Navy and reported for duty on my first submarine. It took me at least a couple of weeks to understand the announcements made over the ship’s PA system. Acronyms, local designations and rapid delivery made them confusing and hard to follow. I quickly noticed that for those who had been on board for some time, the announcements became white noise that blended into the background, something that could be dangerous, especially in an emergency.
As an example, how many times have you listened to the radio and heard the lyrics to a song, singing them over and over again in your head, only to one day actually read them and realize they are entirely different from what you had heard? Our minds play tricks on us, and often we simply hear what we expect or want to hear.
For more on this, check out Karen Chensausky’s research Hearing What We Want to Hear in the MIT Technology Review. Now couple that with unclear communications, such as using letters like “C”, “B”, and “D” within spoken operational orders, and you start to appreciate the complexities that we introduce into our communications. How we communicate can add risk to our operations.
How to prevent: Develop, train, implement, and enforce a formalized spoken communication protocol with mandatory repeat-backs. Eliminate confusing designations; use a phonetic alphabet for your site’s letter designations. Provide all personnel with a list of authorized abbreviations for use at the site. Leaders must enforce this policy and practice, and set the example for it.
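For sites that generate their spoken designations from an equipment list, the phonetic-alphabet and repeat-back practices can be sketched in a few lines of Python. This is only an illustration: the helper names `spoken_form` and `repeat_back_matches` and the equipment tag "UPS-2B" are hypothetical, and the word list is the standard ICAO/NATO spelling alphabet rather than anything site-specific.

```python
# ICAO/NATO spelling alphabet, used here as the site's phonetic alphabet (assumption).
PHONETIC = {
    "A": "Alfa", "B": "Bravo", "C": "Charlie", "D": "Delta", "E": "Echo",
    "F": "Foxtrot", "G": "Golf", "H": "Hotel", "I": "India", "J": "Juliett",
    "K": "Kilo", "L": "Lima", "M": "Mike", "N": "November", "O": "Oscar",
    "P": "Papa", "Q": "Quebec", "R": "Romeo", "S": "Sierra", "T": "Tango",
    "U": "Uniform", "V": "Victor", "W": "Whiskey", "X": "X-ray",
    "Y": "Yankee", "Z": "Zulu",
}


def spoken_form(designation: str) -> str:
    """Spell a written designation the way it should be spoken.

    Letters become phonetic words, digits are kept, and separators
    such as "-" or spaces are dropped.
    """
    words = []
    for ch in designation:
        if ch.isalpha():
            words.append(PHONETIC.get(ch.upper(), ch))
        elif ch.isdigit():
            words.append(ch)
    return " ".join(words)


def repeat_back_matches(order: str, repeat_back: str) -> bool:
    """True when the repeat-back names the same designation as the order."""
    return spoken_form(order) == spoken_form(repeat_back)


# "UPS-2B" is a made-up equipment tag for illustration.
print(spoken_form("UPS-2B"))                    # Uniform Papa Sierra 2 Bravo
print(repeat_back_matches("UPS-2B", "UPS-2D"))  # False
```

Spoken aloud, "Bravo" and "Delta" are far harder to confuse than "B" and "D", which is the point of the protocol; the comparison function simply mirrors what the person giving the order does when listening to the repeat-back.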
Inattention to details.
Inattention can be caused by many things: fatigue, emotional state, intoxication, medical condition, local distractions (sounds, sights, other personnel, etc.). Ever been driving late at night, felt your head snap up, and realized that you just took a “nap” at 70 mph? Has your mind ever wandered during a conversation to the point where you literally can’t remember what was just said and have to ask the speaker to repeat it?
Our minds naturally wander. It’s what we do. Our brains process information much faster than it is being received. This gives the brain extra time to access memories, process relationships, and try to make sense of what it just received. Sometimes this processing activity can take control and we “daydream” or “lose focus.” Whatever you call it, it can cause the brain to focus on something other than that which is critical for proper operation, communication, or observation. However momentary it may be, it can create real problems.
How to prevent: Require that the supervisor for that shift/period assess each member of the operational staff for fitness for duty. This doesn’t need to be a formal interview, but a lot can be discovered by seeing each person and just asking a few questions. Having each person provide a turnover status for each of their areas at a pre-shift turnover meeting could satisfy this. Send home anyone who isn’t ready for the rigors of critical operations.
For activities that are critical to the safety and continued operation of the site, make it a practice that those activities must be accomplished by two people. This practice is used by the military, nuclear, airlines, and other mission critical environments. Having a second person checking actions and reading the procedure can prevent mistakes and make the activity interactive, reducing the risk of inattention.
Documentation errors.
Ever follow your vehicle’s GPS to a dead end or someplace where you cannot get to your destination because the roads have changed? You have been a victim of a documentation error. It’s the same for operations staff who use outdated or incorrect documentation for the activity being performed. Using a drawing that has not been updated since the last system upgrade, or using a procedure that is out of date or simply doesn’t work, are examples of documentation errors.
How to prevent: Develop and implement a formalized documentation control program. Do not allow operation of your plant with anything other than controlled documents. Implement a formalized process to validate your procedures. Ensure that each procedure provides sufficient information, in the correct sequence, for the operations team to successfully complete the activity. I recommend formal engineering development of all critical activity procedures and processes.
Labeling errors.
Imagine being in a large steel tank, with pipes and literally hundreds of valves. Then imagine water gushing into that tank. Now imagine you have about 2 minutes to discover which valve is the water valve that shuts off the flow of water into the tank before you drown (some of you will recognize this scenario from submarine school training). Oh, by the way, none of the valves are labeled or color coded. It would have been nice to find the valve that said “water shut-off valve.” In training, they never make it that simple.
In critical environments, when we are asked to perform an activity, it is vital to know that we are operating the correct valves and switches in accordance with procedure. The procedures need to have valve/switch designations that match exactly the labeling in the plant. I have seen a simple cable labeling error shut down an operating nuclear power plant.
How to prevent: Implement a plant-standardized labeling and color-coding program. If you have a procedure to operate it, it needs to be labeled or coded as it appears in the procedure. A great practice is to provide a schematic of electrical circuits or flow path on the equipment itself to aid in operator understanding. Properly done, every switch, valve, or operator will be uniquely labeled and easily understood.
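Where procedures and the plant's label register are both available electronically, the two requirements above (procedure designations match the labels exactly, and every label is unique) can be checked automatically. The following is a minimal sketch under that assumption; the function names and the valve and switch tags are hypothetical and would be replaced by your site's own data.

```python
from collections import Counter


def unlabeled_designations(procedure_steps, plant_labels):
    """Return (step, designation) pairs whose designation has no exactly
    matching label in the plant's label register."""
    labels = set(plant_labels)
    return [(step, tag) for step, tag in procedure_steps if tag not in labels]


def duplicate_labels(plant_labels):
    """Return labels that appear more than once, i.e. are not unique."""
    counts = Counter(plant_labels)
    return sorted(tag for tag, n in counts.items() if n > 1)


# Made-up label register and procedure for illustration.
plant_labels = ["CWV-101", "CWV-102", "SW-A1", "CWV-102"]
procedure_steps = [(1, "CWV-101"), (2, "CWV-12"), (3, "SW-A1")]

print(unlabeled_designations(procedure_steps, plant_labels))  # [(2, 'CWV-12')]
print(duplicate_labels(plant_labels))                         # ['CWV-102']
```

The match is deliberately exact: a procedure that says "CWV-12" when the valve is tagged "CWV-102" is flagged rather than guessed at, because a near match is exactly the kind of error an operator under pressure will not catch.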
Lack of system/process understanding.
While working at a laboratory, I observed a lab technician recount a standard sample several times. She stated that she sometimes had to recount the standard as many as seven to ten times to get an “acceptable” reading from the radiation measuring device. It was about then that I administratively shut down the lab. The lab technician was invalidating a statistical process meant to verify that the radiation measuring device was operating correctly! The ramification was that the lab was potentially releasing radioactive materials to the general public. You can imagine the response that this caused. Every sample that this machine was used on had to be re-analyzed, the public was notified, and the incident literally made the evening’s national news, all because of a technician’s lack of understanding of the process.
You can have operators following procedures verbatim, but if they don’t understand the expected system responses, they can misinterpret what is happening, with resultant downtime or worse. Incidents that to some degree fall into this category are Three Mile Island, Bhopal, and the Challenger disaster.
How to prevent: Training, training, training. This is not an easy solution to implement with constrained budgets, limited training resources, and limited time, but there is no other way to fix this. There are some methods to stretch your training resources, but it must be done one way or another. The training program needs to include some form of refresher or recertification process, along with lessons learned from plant operational experience.
I hope that this article provides some insight into human-caused downtime incidents. The prevention methods that I have listed have been used for years and proven in many mission critical environments to prevent human-caused incidents. I hope that you can use some of this in your facilities, and I’m always open to new ideas on how to prevent human-caused incidents.