Emergency Shutdown Incident Handling - IT Information Room Defense Battle


Power Outage/Failure - IT Information Room's Worst Nightmare
In modern IT information room operations, power outages and other unexpected situations often lead to severe crises, not only threatening the safety of equipment but also potentially causing unforeseeable impacts on business operations.
Facing such challenges, a solution that can quickly and effectively respond to emergencies is crucial.

Beyond monitoring the status of various equipment, the WATCHDOG system provides a "One-Key Shutdown" function. In an emergency such as a power failure, this function lets users shut down hundreds of servers running different operating systems with a single key press, protecting the equipment and the integrity of the data and thereby minimizing potential losses and risks.


Common Causes of Power Outages
➢Power grid interruption
➢Circuit failure
➢Power equipment failure
➢Scheduled maintenance
➢Power outage drills

The Key Role of UPS and Generators
Most information rooms are equipped with uninterruptible power supplies (UPS) and generators to cope with power outages.
However, due to geographical factors, some rooms cannot install generators. Under normal circumstances, a UPS can provide 20 to 30 minutes of power support.
The generator is operated by the building management unit, but its coordination with the UPS is seldom drilled, so its effectiveness can only be verified during an actual power outage.

Emergency Shutdown Strategies and Steps
In emergencies such as power grid interruptions or fires, shutdown becomes urgent, and the time available is limited to the 20 to 30 minutes that the UPS can support. Planning the emergency shutdown process in advance is therefore crucial.
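
To see how tight this window can be, the arithmetic can be sketched in Python. This is only an illustration: the batch durations and the safety margin are made-up values, not WATCHDOG defaults or measurements, and the function name is hypothetical.

```python
def fits_ups_window(batch_minutes, ups_minutes=20, margin_minutes=5):
    """Return (fits, total) for a list of sequential shutdown batches.

    ups_minutes defaults to the pessimistic end of the 20 to 30 minute range.
    margin_minutes is an assumed reserve for detecting the outage and deciding.
    """
    total = sum(batch_minutes)
    return total <= ups_minutes - margin_minutes, total


# Hypothetical plan: three sequential batches taking 4, 5 and 8 minutes.
fits, total = fits_ups_window([4, 5, 8])
print(fits, total)
```

With these assumed numbers the plan needs 17 minutes but only 15 are available after the margin, so it does not fit a 20-minute UPS. It would fit if the UPS held for the full 30 minutes.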

Classification and Handling of Information Equipment
➢Devices that can be directly powered off, such as switches.
➢Devices that require a normal shutdown process, such as servers running an operating system (Windows, Linux, etc.).
➢Devices that require pre-shutdown or command procedures, such as VMware hosts (VMHost).
➢Devices that should shut down last, either automatically once the UPS is fully drained or after all other devices have shut down, such as storage systems.

A system that is not shut down through its normal procedure may run into serious problems at the next startup. Most experienced system engineers have been through this, usually because there was no alternative at the time.

Planning 【Emergency Shutdown】 Strategies and Steps
In modern information centers, the interdependence between servers and virtual hosts is extremely complex, which makes it a major challenge to restart smoothly after a safe shutdown while preserving those dependencies. It is therefore essential to define a complete shutdown standard operating procedure (SOP) precisely, so that the system can be restored to its original structure in the shortest possible time.

The Importance of Normal Procedure Shutdown
In emergencies, it is crucial to shut each system down through its normal procedure, as appropriate for its operating system. Doing so ensures that the system can return to its initial state and restart without obstacles, and avoids damage to the system's data structures.

Segmented Shutdown by Importance Level of Servers
In the WATCHDOG system, monitored servers can be divided into levels A, B, and C according to their importance. Users define these levels to suit their own environment; once defined, servers can be grouped by level for batch shutdown, so that more important servers keep running longer.

➢Divide servers into A, B, and C levels based on their importance.
➢Create emergency shutdown groups and shut them down in stages by importance level, starting with the least important servers.
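
The grouping logic can be sketched in Python as follows. This is not how WATCHDOG stores or processes its levels. The function name, the data layout, and the host names are all assumptions made for illustration.

```python
from collections import defaultdict

# Least important level first, so level A keeps running the longest.
SHUTDOWN_ORDER = ["C", "B", "A"]


def build_shutdown_batches(servers):
    """Group (name, level) pairs into batches ordered C, then B, then A."""
    by_level = defaultdict(list)
    for name, level in servers:
        if level not in SHUTDOWN_ORDER:
            raise ValueError(f"unknown importance level {level!r} for {name}")
        by_level[level].append(name)
    # Skip levels that have no servers assigned.
    return [(level, by_level[level]) for level in SHUTDOWN_ORDER if by_level[level]]


# Hypothetical inventory; the host names are placeholders.
inventory = [("db01", "A"), ("web01", "B"), ("test01", "C"), ("web02", "B")]
batches = build_shutdown_batches(inventory)
for level, names in batches:
    print(level, names)
```

Rejecting unknown levels up front is a deliberate choice in this sketch: a server with a mistyped level would otherwise be silently left out of every batch.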

Shutdown Command Execution Methods
1. Direct execution of preset commands by the Watchdog system:
Various shutdown commands are available, including socket commands, batch commands, and PowerCLI commands, and they can be mixed and matched according to actual needs. For example, the "socket command" is a fast and commonly used shutdown method that issues instructions through the 【Command Gateway】 to perform a standard shutdown according to preset instructions.

2. Executing shutdown commands through proxy hosts:
A host that executes remote shutdown operations for groups of application servers is called a 【Proxy Host】. Watchdog drives it through the 【Command Gateway】 to carry out the proxy shutdown.
➥Using batch commands; only Microsoft Windows systems are suitable as the 【Shutdown Proxy Host】
➥Using PowerCLI to shut down the VMware system according to the VC list; only Microsoft Windows systems are suitable as the 【Shutdown Proxy Host】

Shutdown Action Execution Specifications
➣Each shutdown action can be composed independently, combining multiple "Command Gateways" into one group action.
➣The execution order can be arranged, and a delay can be set before execution.
➣Each "Command Gateway" can issue shutdown commands to up to 256 hosts.
➣Each "Shutdown Action" can issue shutdown commands to more than 1,000 hosts at the same time.
➣Customizable 【Emergency Shutdown】 policies can be defined to follow the preset standard procedures.
For example, if the shutdown order is divided into two batches, a host assigned to the second batch is shut down in the second round.
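
The 256-host limit per "Command Gateway" implies that a large shutdown action must be spread over several gateways. A minimal Python sketch of that split follows. The function name and host names are hypothetical, and the way WATCHDOG actually assigns hosts to gateways is not described here.

```python
MAX_HOSTS_PER_GATEWAY = 256  # limit stated in the specifications above


def assign_to_gateways(hosts, limit=MAX_HOSTS_PER_GATEWAY):
    """Split hosts into consecutive chunks, one chunk per Command Gateway."""
    return [hosts[i:i + limit] for i in range(0, len(hosts), limit)]


hosts = [f"host{n:04d}" for n in range(1000)]  # placeholder host names
chunks = assign_to_gateways(hosts)
print(len(chunks), [len(c) for c in chunks])  # prints: 4 [256, 256, 256, 232]
```

Under this simple split, a shutdown action covering 1,000 hosts needs at least four gateways.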

Special Considerations
➣Special command actions to be executed before shutdown.
➣Consider the dependency and shutdown sequence between systems (such as DB and AP).
➣Pay special attention to handling vMotion and Cluster/HA migration issues.
➣When shutting down, consider the attachment relationship between VMGuest and VMHost. Do not shut down the VMHost before VMGuest is shut down.
➣When using PowerCLI to shut down the VMware system according to the VC list, pay special attention to the priority and 【Cluster/HA】 issues.
➣When using the 【Batch Command】 method to shut down, pay special attention to the AD/DC priority issue.
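
The ordering constraints above (application servers before their database, VMGuests before their VMHost) amount to a dependency graph, and a valid shutdown sequence is a topological order of that graph. The sketch below uses Python's standard `graphlib` module (Python 3.9+). The host names and the dictionary layout are hypothetical and say nothing about how WATCHDOG represents dependencies.

```python
from graphlib import TopologicalSorter

# Each key maps to the systems that must be shut down BEFORE it.
must_stop_first = {
    "db01": {"ap01", "ap02"},                # AP servers stop before their DB
    "vmhost01": {"vmguest01", "vmguest02"},  # guests stop before their host
}

# static_order() yields every node after all of its listed predecessors.
order = list(TopologicalSorter(must_stop_first).static_order())
print(order)
```

If the constraints contradict each other (for example, two systems that each require the other to stop first), `static_order()` raises `graphlib.CycleError`, which is a useful check to run when the SOP is defined rather than during an outage.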


Considerations for Executing Emergency Shutdown Strategies
When facing a situation that requires an 【Emergency Shutdown】, several key factors need to be carefully considered to ensure the process can proceed smoothly and efficiently.

Factors to Consider
➢Whether staff are on duty during off-hours or typhoons, and whether they are able to carry out the operation.
➢Whether the emergency shutdown can proceed smoothly, for example without delays caused by forgotten passwords.
➢Clear emergency shutdown contact steps and control methods.
➢Establish safety measures to prevent accidental triggering of emergency shutdown.
➢Consider the installation location if using a 【Physical Button】.

An emergency shutdown with the Watchdog system can be triggered in the following ways:
➢Operate Watchdog's menu through the browser interface.
➢Use the browser interface to operate the preset 【Virtual Button】 of Watchdog.
➢Establish a hardware-based 【Physical Button】 module for 【One-Key Shutdown】.

Using a 【Physical Button】 is the simplest and fastest method: it allows a shutdown to be executed quickly in an emergency, minimizing losses and risks.

Advantages and Disadvantages of Each Solution
Understanding the pros and cons of each option is crucial when considering different solutions for executing an 【Emergency Shutdown】.

Using the Browser Interface to Operate Watchdog's Menu



Advantages:
➢Preset function, no devices or costs required, immediately available.
Disadvantages:
➢Common issues include forgotten passwords, difficulty finding the right function, forgotten procedures, uncertainty about whether the procedure is being executed correctly, no experienced engineer on site, human error during handover, and a less tangible feel than a physical button.

Using the Browser Interface to Operate the Preset 【Virtual Button】 of Watchdog


Advantages:
➢Preset function, no devices or costs required, immediately available.
➢Simulated physical button, more realistic execution.
➢Convenient to operate using mobile devices.
Disadvantages:
➢Common issues include forgotten passwords, difficulty finding the right function, forgotten procedures, uncertainty about whether the procedure is being executed correctly, no experienced engineer on site, human error during handover, the higher risk of using mobile devices, a greater chance of pressing the wrong button, and a less tangible feel than a physical button.

Establish a Hardware-Based 【Physical Button】 Module for 【One-Key Shutdown】
Advantages:
➢Simple and quick, easy to operate.
➢Supports scheduled drills and demonstrations of the 【One-Key Shutdown】 mechanism.
➢In an emergency, any on-duty staff member can execute the shutdown immediately once authorized by phone.
➢Physical buttons provide a more realistic feel during execution.
➢Physical devices are easier to operate and to show to others.
➢Having physical devices is more effective when demonstrating.
➢Adding sound broadcasting and warning lights provides a more integrated execution process.
➢Adding a real-time shutdown monitoring screen shows the shutdown status of each host immediately.
➢In addition to the above advantages, it works very well during drills and demonstrations.
Disadvantages:
➢Requires additional equipment and costs, such as a small PLC, a button set, and minor wiring.
➢Cannot be operated remotely (but can be assisted with 【Virtual Button】).


If You Want to Use 【Physical Button】 for One-Key Shutdown
The 【One-Key Shutdown】 mechanism, which uses a 【Physical Button】 to execute the standard emergency shutdown procedure, is an effective solution when a comprehensive emergency shutdown is required.
The system or on-duty staff only need to trigger the emergency button at that moment to shut down all servers according to the normal procedure.

In designing this solution, we considered a wide range of practical scenarios to ensure that the system is not only powerful but also easy to use.
To help organizations that want to adopt the 【One-Key Shutdown】 function avoid problems during initial use, we have drawn on common issues encountered in practice and propose the following directions for a smooth and efficient transition.

Factors to Consider When Establishing a 【Physical Button】 Emergency Shutdown System:
Auxiliary Equipment:
➢Physical button control platform and button module
➢Small PLC
➢Sound broadcasting (sound effectors, amplifiers, broadcast speakers)
➢Emergency situation warning lights

Location of the Button Control Platform:
➢Avoid placing it inside the computer room, so that it remains operable during a fire
➢Consider the flow during drills and demonstrations
➢Choose a location with camera monitoring

Safety Control Measures:
➢Use safety facilities of the button control set, such as key locks, button panel covers, sealing tapes
➢Establish marking gateways and execution control points to ensure the accuracy of execution procedures
➢Perform execution control confirmation via SMS commands

Regular Inspection of Shutdown Facilities:
➢Conduct packet testing on PLCs
➢Conduct connection and system testing on proxy shutdown hosts
➢Conduct packet testing and IPMI information testing on application server host groups

Shutdown Monitoring Graphic Interface:
➢Design an intuitive monitoring interface, including:
➥DI PLC packet testing
➥DI Trap alert testing
➥Command gateway alert testing
➥IP port socket connection testing
➥Proxy execution: packet testing of hosts and shutdown lists
➥VMHost/VMGuest host packet testing
➥IPMI power status
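
As an illustration of the IP port (socket connection) test listed above, a reachability check can be sketched with Python's standard `socket` module. This is a generic TCP check under assumed names. It is not the WATCHDOG implementation, and the ports to test depend on how the shutdown commands are delivered at your site.

```python
import socket


def port_is_reachable(host, port, timeout=2.0):
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and name resolution failures.
        return False


def check_targets(targets, timeout=2.0):
    """Map each (host, port) pair to its reachability result."""
    return {(host, port): port_is_reachable(host, port, timeout)
            for host, port in targets}
```

Running `check_targets` on a schedule against the proxy shutdown hosts and application servers gives early warning that a shutdown command would not get through, before an actual outage makes that obvious.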