On Thursday, August 14, 2003, the largest electrical outage in North American history completed its cascade at approximately 4:13 p.m. Eastern Daylight Time (the reported time of the blackout varies by a minute or two, depending on the source). It spread from its local origin in the assets of FirstEnergy Corporation in northern Ohio westward to Michigan, then eastward to New York and northward to Ontario, Canada. (1-2) FirstEnergy Corporation in August 2003 was the fifth largest electric utility in the United States, serving 4.4 million users in a 36,100 square mile service territory covering parts of Ohio, Pennsylvania, and New Jersey. It operated (in August 2003) 11,502 miles of transmission lines, and had 84 ties with other electric systems in the North American power grid. It comprised seven operating companies, including Ohio Edison, Toledo Edison, The Illuminating Company, and Penn Power. (1)

As a result of the electric failure cascade caused by the inability of FirstEnergy Corporation’s “Energy Management System” control center to respond to problems on the transmission lines in its footprint, a total of 508 generating units at 265 electric power plants in North America precipitously shut down. (3)
The following cities lost electric power: Greater New York City (21,100,000 people), Toronto (5,600,000), Greater Detroit (5,400,000), Cleveland (2,900,000), Ottawa (780,000 of 1,120,000), Greater Buffalo City (1,100,000), Rochester (1,050,000), Hamilton (680,000), London (Canada, 350,000), Toledo (310,000), and Windsor (208,000) for a total around 50,000,000 people without electric power. The electrical outage caused widespread failure of the water infrastructure (electric well pumps failed)—also a critical, networked infrastructure, in cities such as Detroit and Cleveland.

As a result of the outage, New York City reported 60 “serious” fires and 3,000 fire calls (think candles for lighting), 800 elevator rescues, and 80,000 calls to 911 (double the average). In Toronto, there were 1,484 fire calls, and 110 elevator rescues. About 400 flights were canceled in North America on Friday, August 15, 2003, as a result of the outage. (4)
The number of fatalities for this technological disaster was low, as follows: in Ottawa, 1 pedestrian struck by a car and 1 fire victim; in Connecticut, 1 fatality, cause not identified; in NYC, 5 deaths, 2 from carbon monoxide, 2 in a fire, and 1 from a fall from a roof. (4) The economic cost of the blackout was calculated at $6 billion, according to one estimate. (4)
Power began to come back on in isolated areas on the evening of August 14, 2003, but New York waited two days until Saturday, August 16, 2003, for full power restoration. The average time to restart a nuclear power plant is about 36 hours.
The 2003 North American power outage is a prime example of hidden failure in a critical networked infrastructure, as described below.
1. Characteristics of Critical Networked Infrastructure and Hidden Failures
The electrical power systems in the United States are the most complex systems ever built by human beings. These systems are an example of “networked infrastructures” because they form a complex interconnected system that stretches over a large geographical area to reach (in principle) every economic entity and household in the geographical region. They are critical because without them, society as currently known and understood, cannot operate. Other critical networked infrastructures include water, banking, computers, and telecommunications. Their functions in a society are similar to those of arteries and veins that branch out through the human body to nourish every cell with vital nutrients, according to one metaphor. (5)
Networked systems bring strength to critical infrastructure because when one element of the network degrades, in theory other elements can pick up the slack during repair of the degraded element. The goal is to do this in a way that users do not even know there was a problem. In addition, networked systems tend to be more economical because in theory they reduce the need for extensive and expensive redundancies for various elements in the system; in this way, they also produce economies of scale for producers and ultimately users.
The integrity of critical networked infrastructures faces risks, however, for at least three reasons.
1) The networked systems are huge—either continental (e.g., North American 2003 electric power outage) or even global in size.
2) A local disturbance may cascade into a wide-system failure both within and across infrastructures (e.g., problems in the transmission lines in the footprint of, and in the main control room at, FirstEnergy Corporation in Ohio).
3) Critical networked infrastructures are increasingly operated at the limit of their capacity. (Thursday, August 14, 2003 was a hot day and air conditioner use levels were high.)
The majority of blackouts, according to the North American Electric Reliability Council (NERC), including the 2003 North American blackout, are due to misoperation of the protection systems, which are called “hidden failures”. Hidden failures are defined as “hardware or software failures that are only exposed when a subsystem is highly stressed,” which was exactly the case at FirstEnergy Corporation where the blackout began on a hot summer’s day, August 14, 2003, as described further below. (5)
The hidden failures in the hardware and software of the computers and servers at FirstEnergy Corporation led to a “blindfolding” of system operators who lost their “situational awareness”--the degree of accuracy by which one’s perception of his or her current environment mirrors reality. (6) To make matters worse, a psychological denial process set in by which the operators at FirstEnergy Corporation were unable to receive the input of operators calling from other nodes in the grid who DID have accurate situational awareness as they witnessed abnormalities in the grid, from their perspectives.
2. Hidden Failures in Computer Alarm System that Began the Outage
Shortly after 2:14 p.m. on August 14, 2003, the alarm and logging system in the FirstEnergy Corporation Energy Management System control center failed and was not restored until after the blackout. This center was charged with monitoring the operation and reliability of the FirstEnergy Corporation control area, and was managed by a director of transmission operation services. Two groups of operators reported to the director. The first group was responsible for real-time operations and the second was responsible for transmission operations planning support (they sat in a room across the hall from the control room where they performed day-ahead studies). This information is contained in the excellent NERC document titled: “Technical Analysis of the August 14, 2003, Blackout.” (1)

The real-time operations group at the FirstEnergy control area was divided into two areas: the control area operators and the transmission operators. Each area had two positions that were staffed 24 hours/day. In addition, a supervisor was present with responsibility for both areas 24 hours/day (split into three shifts). The transmission operators were in the main control room; the control area operators were in a separate room. (1)
3. Alarm Processor Failure at 2:14 p.m. EDT, 8/14/03
The alarm system failure occurred in the control room, which apparently was a room separate from the main control room. The purpose of the alarm was “to provide audible and visual indications when a significant piece of equipment changed from an acceptable to problematic status.” (7) The alarm system essentially “stalled” while processing an alarm event. “With the software unable to complete that alarm event and move to the next one, the alarm processor buffer filled and eventually overflowed.” (7) After 2:14 p.m., August 14, 2003, the FirstEnergy Corporation control computer displays did not receive any further alarms, nor were any alarms being printed or posted on the Energy Management System’s alarm logging facilities. (7) “A period of more than 90 minutes elapsed before the operators began to suspect a loss of the alarm processor, a period in which, on a typical day, scores of routine alarms would be expected to print to the alarm logger.” (8) And guess what else! FirstEnergy personnel admitted that the alarm processing application had failed on several occasions prior to August 14, 2003! But apparently this hot August day was the first time it really locked up, due to code errors in the XA21 system in use. (8a)
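The stall-then-overflow behavior quoted above can be illustrated with a toy model. This is only a sketch under stated assumptions: the class name `AlarmProcessor`, the buffer capacity, and the alarm strings are all hypothetical, and nothing here reflects the actual internals of the FirstEnergy software.

```python
from collections import deque


class AlarmProcessor:
    """Toy alarm processor with a bounded input buffer (hypothetical model)."""

    def __init__(self, capacity):
        self.buffer = deque()
        self.capacity = capacity
        self.stalled = False
        self.delivered = []  # alarms that reached the operator display or log
        self.dropped = 0     # alarms lost because the buffer was already full

    def submit(self, alarm):
        """Queue an incoming alarm; count it as dropped if the buffer is full."""
        if len(self.buffer) >= self.capacity:
            self.dropped += 1
            return False
        self.buffer.append(alarm)
        return True

    def process_one(self):
        """Deliver the oldest queued alarm unless the processor is stalled."""
        if self.stalled or not self.buffer:
            return None
        alarm = self.buffer.popleft()
        self.delivered.append(alarm)
        return alarm


proc = AlarmProcessor(capacity=3)
proc.submit("breaker 12 open")
proc.process_one()       # healthy: the alarm reaches the operators
proc.stalled = True      # the processor hangs on one event
for i in range(5):
    proc.submit(f"line alarm {i}")
    proc.process_one()   # does nothing while stalled

print(proc.delivered, len(proc.buffer), proc.dropped)
```

In this toy run only the first alarm is delivered. The next three fill the buffer and the last two are lost, and from the operators' side the result looks exactly like a quiet system.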
The operators in the control room relied heavily on the alarm processor for their situational awareness, since they did not have a dynamic map board for large-scale visualization. They did not know their alarm processor had failed. Thus, they were NOT prompted to manually monitor and more closely interpret their SCADA system. SCADA, the acronym for “Supervisory Control and Data Acquisition”, refers to a “large-scale distributed measurement and control system used to monitor or control chemical, physical or transport processes.” Furthermore, it “usually refers to a central system that monitors and controls a complete site. The bulk of the site control is actually performed automatically by a Remote Terminal Unit or by a Programmable Logic Controller. Host control functions are almost always restricted to basic site over-ride or supervisory level capability.” (9)
Without an effective Energy Management System, the only remaining ways to monitor system conditions would have been through telephone calls and direct analog telemetry, according to the NERC report. (8) The FirstEnergy Corporation control room operators, however, did not realize that the alarm processing on the Energy Management System was not working and, subsequently, did not monitor other available telemetry that showed that the system was changing. When the control room operators began receiving calls from field operations’ workers in neighboring systems, who were detecting changes in the system, they were unreceptive to the input because their data input looked good. For the next hour and a half, operators blissfully went about their business while their system was degrading around them, despite receiving clues via phone calls from neighboring utilities, such as the huge American Electric Power (AEP) utility based in Columbus, Ohio, which operated the control area in Ohio just south of the FirstEnergy Corporation control area. (10)
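One way to think about the missing safeguard is a watchdog that treats prolonged alarm silence as suspicious in itself, since routine alarms are expected throughout a normal day. The sketch below is an illustration only: the class name `AlarmSilenceWatchdog`, the 15-minute threshold, and the use of minutes after midnight as a clock are assumptions made for the example, not details from the NERC report.

```python
class AlarmSilenceWatchdog:
    """Flags the alarm system as suspect when no alarm arrives for too long.

    Times are plain numbers (here: minutes after midnight). The threshold is a
    hypothetical tuning value, not a figure from the NERC report.
    """

    def __init__(self, max_silence_min):
        self.max_silence_min = max_silence_min
        self.last_alarm_min = None

    def record_alarm(self, now_min):
        """Call whenever any alarm, routine or serious, is delivered."""
        self.last_alarm_min = now_min

    def is_suspicious(self, now_min):
        """True if the silence since the last alarm exceeds the threshold."""
        if self.last_alarm_min is None:
            return False  # nothing seen yet, so no baseline to compare against
        return now_min - self.last_alarm_min > self.max_silence_min


watchdog = AlarmSilenceWatchdog(max_silence_min=15)
watchdog.record_alarm(14 * 60 + 14)              # last alarm delivered at 14:14
print(watchdog.is_suspicious(14 * 60 + 27))      # 13 minutes of silence
print(watchdog.is_suspicious(14 * 60 + 30))      # 16 minutes of silence
```

With these assumed numbers the watchdog would have raised a flag at about 14:30, long before the 90 minutes that actually elapsed. A real design would also need the watchdog to run independently of the system it is watching.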
4. Problems with the Backup Server?!
The FirstEnergy Corporation’s Energy Management System included several nodes that performed the advanced applications of the system. Any one of them could host all the functions, but the “normal FirstEnergy system configuration was to have several host subsets of applications, with one server remaining in a ‘hot-standby’ mode as a back up. At 14:41 [2:41 p.m. EDT], the primary server hosting the EMS [Energy Management System] alarm processing application failed, due either to the stalling of the alarm application, the ‘queuing’ to the remote [control] terminals, or some combination of the two.” (11) These remote control terminals were located in FirstEnergy substations, where the data feeding into those terminals started queuing and overloading the terminals’ buffers. An alert technician in the field noticed that the terminal at his substation was not working and called the main control room at 14:46 [2:46 p.m.] to report this. In addition, as each terminal failed, an automatic page was sent to FirstEnergy computer support staff.
When the primary server hosting the EMS alarm processing application failed, “the alarm system application and all other EMS software running on the first server automatically transferred (‘failed over’) onto the back-up server.” (10) This transfer had been (appropriately) pre-programmed. There was one problem, however. “Because the alarm application moved intact onto the back-up while still stalled and ineffective, the back-up server failed 13 minutes later, at 14:54 [2:54 p.m. EDT]. Accordingly, all of the EMS applications on these two servers stopped running.” (10)
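The failover problem described above, in which a stalled application moves "intact" to the standby, can be sketched in a few lines. Everything here is hypothetical: the `Server` class, the `app_state` dictionary, and the `restart_clean` option are illustrative names, and the sketch does not claim that the real system offered such an option.

```python
import copy


class Server:
    """Minimal stand-in for an EMS host (hypothetical)."""

    def __init__(self, name):
        self.name = name
        self.app_state = None


def fail_over(primary, backup, restart_clean):
    """Move the alarm application from primary to backup.

    restart_clean=False copies the application state as-is, including any
    stalled condition. restart_clean=True starts the application fresh on the
    backup, at the cost of losing whatever was queued on the primary.
    """
    if restart_clean:
        backup.app_state = {"stalled": False, "queue": []}
    else:
        backup.app_state = copy.deepcopy(primary.app_state)
    return backup


primary = Server("primary")
primary.app_state = {"stalled": True, "queue": ["alarm 1", "alarm 2"]}

intact = fail_over(primary, Server("backup-a"), restart_clean=False)
fresh = fail_over(primary, Server("backup-b"), restart_clean=True)
print(intact.app_state["stalled"], fresh.app_state["stalled"])
```

The intact transfer preserves the queued alarms but also preserves the fault, which is the pattern the NERC report describes. The clean start clears the fault but discards the queue, so neither choice is free.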
When the second server failed at 14:54 [2:54 p.m.], the computer support staff first became aware that all of the functions normally hosted by server H4 had failed. They did not think of the possibility that the alarm processor had failed 40 minutes earlier, even though they should have been aware that concurrent loss of the two servers would mean the loss of alarm processing on the Energy Management System.

5. When Two Servers in the Energy Management System Fail, What Happens?
The simultaneous loss of two Energy Management System servers “apparently caused several new problems for the FirstEnergy Energy Management System…. A concurrent absence of these servers significantly slow[ed] down the rate at which the Energy Management System refreshed displays on operators’ computer consoles. Thus, at times on August 14, operator screen refresh rates, normally one-to-three seconds, slowed to as long as 59 seconds per screen. Since FirstEnergy operators [had] numerous information screen options, and one or more screens [were] commonly ‘nested’ as sub-screens from one or more top level screens, the operators’ primary tool for observing system conditions slowed to a frustrating crawl. This situation likely occurred between 14:54 and 15:08 [2:54 p.m. to 3:08 p.m.], when both servers failed, and again between 15:46 and 15:59 [3:46 p.m. to 3:59 p.m.], while FirstEnergy computer support personnel attempted a ‘warm reboot’ of both servers to remedy the alarm problem.” (11)
6. Computer Support Staff Get Active
When the first server froze, the computer support staff received an auto-page. When the back-up server failed, a second auto-page went to the computer staff. They did not, however, communicate the loss of alarm functionality to the FirstEnergy system operators, nor did they have a procedure to do so, according to the NERC report. They began work to fix the servers at 14:54 [2:54 p.m.], and completed the primary server restart via a “warm reboot” at 15:08 [3:08 p.m.]. The computer support staff did not notify the control room that they were rebooting.
Diagnostics were performed during the warm reboot that verified that the computer and all expected processes were running. “Accordingly, the FirstEnergy computer support staff believed that they had successfully restarted the node and all the processes it was hosting. However, although the server and its applications were again running, the alarm system remained frozen and non-functional, even on the restarted computer. The computer support staff did not confirm with the control room operators that the alarm system was again working properly.” (8)
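The gap between "the processes are running" and "the alarm function works" is the difference between a liveness check and an end-to-end functional check. A minimal sketch follows, assuming a hypothetical `AlarmApp` class and a synthetic test alarm; none of these names come from the NERC report.

```python
class AlarmApp:
    """Toy alarm application: the process can be running yet still stalled."""

    def __init__(self, running=True, stalled=False):
        self.running = running
        self.stalled = stalled
        self.display = []  # what the operators would actually see

    def handle(self, alarm):
        """Show the alarm only if the app is running and not stalled."""
        if self.running and not self.stalled:
            self.display.append(alarm)


def liveness_check(app):
    """What the reboot diagnostics verified: is the process up?"""
    return app.running


def end_to_end_check(app, test_alarm="SYNTHETIC-TEST-ALARM"):
    """Inject a test alarm and confirm it actually reaches the display."""
    before = len(app.display)
    app.handle(test_alarm)
    return len(app.display) > before


restarted = AlarmApp(running=True, stalled=True)  # state after the warm reboot
print(liveness_check(restarted))     # process is up
print(end_to_end_check(restarted))   # but no alarm gets through
```

In this sketch the restarted application passes the liveness check and fails the end-to-end check. Confirming with the control room that alarms were arriving again would have been the human equivalent of the second test.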
7. What’s the Difference between a Warm Reboot and a Cold Reboot?
A warm reboot means only the problematic node is shut down and restarted. A cold reboot is one in which all nodes (e.g., all computers, consoles) of the system are shut down and then restarted. “A cold reboot takes significantly longer to perform than a warm one. Also, during a cold reboot,” according to the NERC report, “much more of the system is unavailable for use by the control room operators for visibility or control over the power system. Warm reboots are not uncommon, whereas cold reboots are rare. The cold reboot undertaken early August 15 [the day after the outage] corrected the alarm processing problem.” (8)
8. Meanwhile at the Computer Consoles in the FirstEnergy Control Room…
Recall that the alarms froze at 2:14 p.m. Not until 3:45 p.m. did the FirstEnergy operators begin to realize the trouble their system was in. What trouble was their underlying system (the transmission lines) in? A lot of trouble!
9. Trouble in the Field: Early “Eastlake 5” Generator Problems
At 1:31 p.m. on Thursday, August 14, 2003, the FirstEnergy Corporation’s Eastlake 5 generating unit, located in northern Ohio along the southern shore of Lake Erie, tripped offline due to an exciter failure while an operator was making voltage adjustments. There is a record of a conversation between the Eastlake 5 operator and the Energy Management System operator at 1:16 p.m. on August 14, as follows:
EMS Operator: “Hey, do you think you could help out the 345 voltage a little?”
Eastlake 5 Operator: “Buddy, I am—yeah, I’ll push it to my max max. You’re only going to get a little bit.”
EMS Operator: “That’s okay, that’s all I can ask.” (12)
At 1:16 p.m., efforts of the Eastlake 5 operator to give more voltage (the “max max”) resulted in a reactive output increase lasting about seven minutes before records show that the automatic voltage regulator tripped due to exciter failure. Trouble then ensued with a pump valve at the plant, so the operator could not reset the trip. Thus the Eastlake 5 generating unit was out of service for the time being.
The outcome desired by the FirstEnergy operators in the control room—to increase voltage to better meet load requirements to the Cleveland-Akron service area—did not occur. Instead, there was at first a DECREASE in reactive support to the Cleveland-Akron area. After Eastlake 5 tripped, flows caused by replacement power transfers and the associated reactive power to support these additional imports into the area INCREASED, contributing to higher (transmission) line loadings on the paths into the Cleveland area. (13)
Eastlake 5’s tripping was an electrically significant step in the sequence of events, although it was not the cause of the blackout, according to the NERC report. (14) However, when it tripped, it set up the situation in which the next transmission line trip somewhere on the system COULD put the system at grave risk. Identifying that kind of exposure is the purpose of contingency analysis, which involves analyzing the ability of the system to withstand the next worst contingency event without exceeding emergency ratings. Evidently, the FirstEnergy operators did not routinely conduct such studies on shift. In particular, the operators did not use contingency analysis to evaluate the loss of Eastlake 5 at 1:31 p.m. to determine whether the loss of another line or generating unit would put their system at risk. Guess what happened?
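The idea of checking the "next worst contingency" can be sketched with a drastically simplified model. Real contingency analysis solves a power flow over the whole network; the sketch below assumes only a handful of parallel lines feeding one load area, with flow dividing in proportion to each line's susceptance (1/x). The line names, reactances, and ratings are hypothetical values chosen for illustration.

```python
def parallel_flows(total_mw, lines):
    """Split a transfer across parallel lines in proportion to 1/x."""
    total_b = sum(1.0 / ln["x"] for ln in lines.values())
    return {
        name: total_mw * (1.0 / ln["x"]) / total_b
        for name, ln in lines.items()
    }


def n_minus_1_violations(total_mw, lines):
    """For each single-line outage, list remaining lines over emergency rating."""
    violations = {}
    for outaged in lines:
        remaining = {n: ln for n, ln in lines.items() if n != outaged}
        flows = parallel_flows(total_mw, remaining)
        over = sorted(
            n for n, mw in flows.items() if mw > remaining[n]["emergency_mw"]
        )
        if over:
            violations[outaged] = over
    return violations


# Three identical hypothetical lines, each with a 500 MW emergency rating.
lines = {
    "A": {"x": 0.1, "emergency_mw": 500.0},
    "B": {"x": 0.1, "emergency_mw": 500.0},
    "C": {"x": 0.1, "emergency_mw": 500.0},
}

print(n_minus_1_violations(900.0, lines))   # survivors carry 450 MW each
print(n_minus_1_violations(1100.0, lines))  # survivors carry 550 MW each
```

At the lower transfer the system survives the loss of any one line. At the higher transfer every single-line outage overloads the two survivors, which is the condition an operator would want to know about before the next line trips.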
10. The Cascade Begins…
Recall that the alarm processor in the FirstEnergy control room failed at 2:14 p.m. and caused an acute loss of situational awareness of system conditions by the FirstEnergy operators. At 2:27 p.m. a 345-kilovolt tie between FirstEnergy Corporation and American Electric Power (AEP), the huge utility south of FirstEnergy (see #10 in notes below), opened and reclosed. AEP operators called FirstEnergy operators to confirm the operation (opening and reclosing of the huge line) and FirstEnergy operators denied that their system had a problem. The reader now knows why. This phone call from AEP operators to FirstEnergy operators was the first proof of loss of situational awareness.
Then, between 3:05 p.m. and 3:42 p.m., three (3) separate FirstEnergy 345-kV transmission lines supplying the Cleveland-Akron area tripped and locked out because the lines contacted overgrown trees within FirstEnergy Corporation’s right of way! (15) The FirstEnergy operators were unaware of these three rather large problems in their system until 3:45 p.m. Indeed, had they performed contingency analyses and had they been aware that their Chamberlin-Harding 345-kV line was down at 3:05 p.m., they would have realized that the “next worst contingency problem” had indeed arrived.
Unfortunately for North America, when the three FirstEnergy 345-kV lines failed, all of their power flowed to 16 much smaller 138-kV lines, which then overloaded and tripped over 30 minutes from 3:39 p.m. to 4:09 p.m. in what has been described as a “cascading failure of the 138-kV system in northern Ohio.” (15) Several of these 138-kV line trips were due to the heavily loaded lines sagging into vegetation, distribution wires, and other underlying objects, according to the NERC report. (15)
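The mechanism of a cascade, where each trip pushes more load onto the survivors, can be shown with a small simulation. This is a sketch under strong assumptions: the lines are treated as identical parallel paths that share the transfer equally, all overloaded lines trip together in each round, and the line names and ratings are hypothetical. It does not model the actual 345-kV and 138-kV network in northern Ohio.

```python
def simulate_cascade(total_mw, ratings_mw, first_outage):
    """Trip overloaded lines round by round until the survivors can cope.

    Returns (tripped lines in order, surviving lines). An empty survivor list
    means no path remains to carry the transfer.
    """
    in_service = {n: r for n, r in ratings_mw.items() if n != first_outage}
    tripped = [first_outage]
    while in_service:
        share = total_mw / len(in_service)  # equal split across survivors
        overloaded = sorted(n for n, r in in_service.items() if share > r)
        if not overloaded:
            break  # every remaining line is within its rating
        for name in overloaded:
            del in_service[name]
            tripped.append(name)
    return tripped, sorted(in_service)


ratings = {"L1": 400.0, "L2": 350.0, "L3": 300.0, "L4": 300.0}

print(simulate_cascade(800.0, ratings, "L1"))   # contained after one outage
print(simulate_cascade(1000.0, ratings, "L1"))  # every line is lost
```

With an 800 MW transfer the three survivors carry about 267 MW each and the disturbance stops. With 1,000 MW the same initial outage overloads L3 and L4, their loss overloads L2, and nothing is left in service. The difference between the two outcomes is only how close to its limits the system was running before the first trip.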
Then yet another 345-kV line (the Sammis-Star line) overloaded because of the tripping of all the 138-kV lines. When the heavily overloaded 345-kV Sammis-Star line tripped at 4:05:57 p.m., “major and unsustainable burdens on other lines” resulted, “first causing a ‘domino-like’ sequence of line outages westward and northward across Ohio and into Michigan, and then eastward, splitting New York from Pennsylvania and New Jersey.” (16)
Since the FirstEnergy operators had no situational awareness, they were in no position to quickly shed the load needed to prevent the blackout (1,500 to 2,500 megawatts in the Cleveland-Akron area). As it turned out, investigators learned that FirstEnergy had not provided its operators with “the capability to manually or automatically shed that amount of load in the Cleveland area in a matter of minutes, nor were procedures in place for such an action.” (16)
11. When Hell Broke Loose in FirstEnergy Control Room
Documented phone calls began streaming into the FirstEnergy control room from various sources beginning at 3:35 p.m. and are listed in the NERC report for those interested in reading about them (p. 47). At 3:48 p.m., a FirstEnergy operator first grasped that an emergency situation had developed. At 3:59 p.m., the 138-kV lines began tripping (see above), then the Sammis-Star line tripped (4:05:57), and the “conditions were set for an uncontrolled cascade of line failures that would separate the northeastern United States and eastern Canada from the rest of the Eastern Interconnection, then a breakup and collapse of much of that newly formed island… No events, actions, or failures to take action after the Sammis-Star trip were deemed to have caused the blackout.” By 4:13 p.m. (EDT), August 14, 2003, more than 508 generating units at 265 power plants had been lost, and about 50 million people were without electric power. Additional information on the cascade and restoration of the system is beyond the scope of this report, but is available in the NERC report for interested readers. (1)
12. Conclusions of the NERC Report
The Thursday, August 14, 2003, North American power outage was in many ways similar to earlier blackouts, including the 1965 power outage discussed elsewhere, which prompted the formation of the NERC in 1968. (17) Common factors included: “conductor contacts with trees, inability of system operators to visualize events on the system, failure to operate within known safe limits, ineffective operational communications and coordination, inadequate training of operators to recognize and respond to system emergencies, and inadequate reactive power resources.” (18)
NERC investigators drew several conclusions, including the following.
- Several entities violated NERC operating policies and planning standards, which at the time were voluntary, and those violations contributed directly to the start of the cascading blackout.
- The approach used to monitor and ensure compliance with NERC and regional reliability standards was inadequate to identify and resolve specific compliance violations before those violations led to a cascading blackout.
- Available system protection technologies were not consistently applied to optimize the ability to slow or stop an uncontrolled cascading failure of a power system.
- There was a failure to manage vegetation, train operators well, and provide adequate tools so that operators could visualize system conditions.
In addition, investigators stated that “[t]he causes of the blackout did not result from inanimate events, such as ‘the alarm processor failed’ or ‘a tree contacted a power line.’ Rather, the causes of the blackout were rooted in deficiencies resulting from decisions, actions, and the failure to act of the individuals, groups, and organizations involved. These causes were preventable prior to August 14 and are correctable.”
13. Official Causes of the Blackout
1. FirstEnergy operators lacked situational awareness of line outages and degraded conditions on the FirstEnergy system.
a. FirstEnergy had no alarm failure detection system.
b. FirstEnergy computer support staff did not effectively communicate the loss of alarm functionality to the FirstEnergy operators after the alarm processor failed at 2:14 p.m., nor did they have a formal procedure to do so.
c. FirstEnergy computer support staff did not fully test the functionality of applications, including the alarm processor, after a server failover and restore.
d. FirstEnergy operators did not have an effective alternative to easily visualize the overall conditions of the system once the alarm processor failed.
e. FirstEnergy did not have an effective contingency analysis capability cycling periodically on-line and did not have a practice of running contingency analysis manually as an effective alternative for identifying contingency limit violations.
2. FirstEnergy did not effectively manage vegetation in its transmission rights of way.
3. Reliability coordinators (not covered in this Biot) did not provide effective diagnostic support.
4. The NERC programs did not identify and resolve specific compliance violations before those violations led to a cascading blackout.
Please see pp. 98-118 in the NERC report for many pages of additional deficiencies and recommendations. (1)
14. Hidden Failures in Critical Networked Infrastructures Redux
The 2003 North American power outage is an excellent example of a hidden failure that manifested itself when electric power systems were stressed due to a hot day in August in the Midwest. The hidden failure was the software code error that froze the alarm processor. However, that one glitch led to a cascade of computer misoperations AND management errors that manifested themselves as operators within the system lost their situational awareness. On this day, loss of situational awareness intersected real problems on the system, and resulted in a technological disaster.
In situations like these, people IN the system often receive the blame for what went wrong with the system. In truth, the system AS DESIGNED was at fault. The organization and its management and leadership are responsible for the design of the system. Thus, the organization is accountable for the performance.
Since the 2003 power outage, the US Congress has passed legislation (US Energy Policy Act of 2005) that has resulted in the designation of NERC as the standards-based Electric Reliability Organization (ERO), which now has the authority to enforce compliance with standards among certain electric utilities, including FirstEnergy Corporation. These standards and NERC’s performance are overseen by the Federal Energy Regulatory Commission (FERC). (19)
Sources:
1. North American Electric Reliability Council: “Technical Analysis of the August 14, 2003, Blackout”, July 13, 2004, pp. 7-8. Available online at: ftp://www.nerc.com/pub/sys/all_updl/docs/blackout/NERC_Final_Blackout_Report_07_13_04.pdf; accessed August 23, 2006.
2. For a good aggregation of official documents relating to the 2003 electric power outage, please see Harvard Electricity Policy Group webpage at: http://www.ksg.harvard.edu/hepg/Blackout.htm; accessed August 23, 2006.
3. NERC, p. 58. See also: Canadian Broadcasting Corporation News, “Blackout by the Numbers” August 15, 2003, updated November 14, 2003. Available online at: http://www.cbc.ca/news/background/poweroutage/numbers.html; accessed August 23, 2006.
4. 2003 North American blackout. Available online at: http://en.wikipedia.org/wiki/2003_North_America_blackout; accessed August 23, 2006.
5. See: SEMP Biot #112: “What Is Hidden Failure in Critical Infrastructure?” (September 6, 2004) at: http://www.semp.us/biots/biot_112.html; accessed August 23, 2006 and Hazard Risk Unit publications of the World Bank: “Mitigating the Vulnerability of Critical Infrastructure in Developing Countries” by Lamine Mili (2002). Full text available at: http://www.worldbank.org/hazards/files/conference_papers/mili.pdf, unfortunately no longer online at this URL.
6. Definition of situational awareness from the US Navy at: http://wwwnt.cnet.navy.mil/crm/crm/stand_mat/seven_skills/SA.asp; accessed August 24, 2006.
7. North American Electric Reliability Council, p. 32.
8. NERC report, p. 34.
8a. NERC report, p. 38.
9. SCADA at: http://en.wikipedia.org/wiki/SCADA; accessed August 24, 2006.
10. At the time of the 2003 outage, AEP owned and operated more than 80 generating stations with more than 42,000 megawatts of generating capacity in the US and international markets. In August 2003, it was one of the largest electric utilities in the US with more than 5,000,000 customers in portions of Arkansas, Indiana, Kentucky, Louisiana, Michigan, Ohio, Oklahoma, Tennessee, Texas, Virginia, and West Virginia. At the time of the 2003 electric outage, it operated about 39,000 miles of electric transmission lines. This information was taken from the NERC report (see #1 above), p. 35.
11. NERC, p. 33.
12. NERC, p. 29.
13. For instance, at 3:00 p.m., FirstEnergy load was approximately 12,080 megawatts, and FirstEnergy was importing about 2,575 megawatts, or 21% of the total load. With imports this high, FirstEnergy reactive power demands, already high due to the increasing air-conditioning loads that afternoon, were using up nearly all available reactive resources. See: NERC, p. 30.
14. NERC, p. 30.
15. NERC, p. 27.
16. NERC, p. 28.
17. SEMP Biot #387: “The Monster 1965 Northeast United States Blackout: Beginning of Addressing the Nation’s Electric Power Reliability Issues” (August 12, 2006). Available online at: http://www.semp.us/biots/biot_387.html; accessed August 24, 2006.
18. NERC, p. 94.
19. See press statement “NERC Approved as the United States Electric Reliability Organization” at: ftp://www.nerc.com/pub/sys/all_updl/docs/pressrel/07-20-06-NERC-Named-ERO.pdf; accessed August 25, 2006.
Additional reading:
1. See: “Emergency Preparedness in Electric Utilities: An Interview with Commonwealth Edison’s Robert Plant” in Securitas 4:2 (Mar-Apr 2005). Scroll down to article. Available online at: http://www.semp.us/securitas/mar_may05.html; accessed August 24, 2006.