11. Starting the Deployment of Automated IT Operation Systems


Setting goals and creating phased plans are key steps in the deployment of an automated IT operation system. Here is a detailed explanation of the construction goals and phases:

Construction Goals and Phases

Initial Construction Phase
Goal Setting: The initial phase aims to establish a comprehensive automated IT operation system. This process requires a significant amount of time and effort, involving extensive expertise and collaboration across different departments.
Obstacles and Challenges: In addition to technical preparations, internal communication and coordination within the organization are also important challenges. Resistance from different teams may be encountered during this phase.

Execution Content
➣ Basic Setup: Includes preparing the data center facilities, initial deployment of the monitoring system, etc.
➣ Advanced Setup: Adding more features and monitoring items on the basic infrastructure.
➣ Emergency Incident Handling: Establishing response procedures for emergencies.
➣ Network Architecture Restructuring: Adjusting the existing network architecture based on operation needs.
➣ Operation Process Establishment: Formulating standard operating procedures for daily operations.
➣ Performance Display and Regular Drills: Demonstrating the system's performance through real cases and rehearsing procedures in regular drills.

Three Phases of Execution Progress

Phase 1 - Basic Setup
➣ Establishing Monitoring List: Covers all network devices in the IDC data center.
➣ Setting Alarm Thresholds: Initially using system default values as alarm thresholds.

Phase 2 - Adjustment and Supplementary Setup
➣ Adjusting Alarm Thresholds: Adjustments based on actual operation conditions and equipment characteristics.
Example: On a Windows host running Microsoft Exchange, memory usage of 91% is considered normal, so the alarm threshold can be raised to 95%.
➣ Enhancing Monitoring Items: Adding more monitoring items as needed.
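
The move from Phase 1 defaults to Phase 2 per-host adjustments can be sketched as a default table plus overrides. This is a minimal illustration, not the Watchdog configuration format: the 90% defaults, the host name "exchange01", and the function names are all assumptions.

```python
from typing import Dict

# Phase 1: system-wide default alarm thresholds in percent (illustrative values).
DEFAULT_THRESHOLDS: Dict[str, float] = {"cpu": 90.0, "memory": 90.0}

# Phase 2: per-host overrides tuned from observed behaviour.
# "exchange01" is a hypothetical host name.
HOST_OVERRIDES: Dict[str, Dict[str, float]] = {
    "exchange01": {"memory": 95.0},
}


def threshold_for(host: str, metric: str) -> float:
    """Return the host override for this metric if one exists, else the default."""
    return HOST_OVERRIDES.get(host, {}).get(metric, DEFAULT_THRESHOLDS[metric])


def should_alarm(host: str, metric: str, value: float) -> bool:
    """True when the measured value reaches or exceeds the effective threshold."""
    return value >= threshold_for(host, metric)
```

With these values, 91% memory usage raises an alarm on an ordinary host but not on the Exchange host, which matches the example above.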

Phase 3 - Management and System Establishment
➣ Adapting to New Operation Models: Automated systems may change traditional operation management methods.
➣ Pressure and Challenges: Real-time notifications and extensive automated monitoring may pressure operation personnel.
➣ Rapid Decision Making and Problem Solving: Establishing effective problem reporting, handling, and recording procedures.
➣ Personnel and Management System: Re-planning personnel arrangements and management systems according to different periods and conditions.

For details, see the [Construction Progress Table].

Before starting the deployment of the automated IT operation system, sufficient preparation is necessary to ensure a smooth process and effectively reduce construction costs and time. Here are some key preparation items:

Preparations for the Construction of an Automated IT Operation System
Server Hosts
➣ Confirm the IP list, hardware brand, and operating system (e.g., Windows, Linux, AIX).
➣ Prepare installation accounts, either local accounts or AD accounts. AD accounts must have permissions equivalent to an Administrator.
➣ Classify server hosts by importance level (Class A, Class B, Class C) and onboard them in that order.
➣ Assign localized aliases to server hosts for graphical monitoring and alarm notifications.
➣ Ensure the antivirus software adds the monitoring system to its program whitelist.
➣ Open ports 5000-5012 on local and network firewalls. If traffic passes through NAT, configure the corresponding IP and port mappings.
➣ Define special alarm notification lists.
➣ Install the Watchdog Client system.
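
Before installation, the firewall item in the list above can be spot-checked with a short script. The sketch below assumes the ports are TCP, which the checklist does not state, and the function name is hypothetical. A port only reports as open if something is already listening on it, so a failure can mean either a blocked port or a service that is not yet running.

```python
import socket
from typing import Dict, Iterable

# Port range named in the checklist: 5000-5012 inclusive.
WATCHDOG_PORTS = range(5000, 5013)


def check_tcp_ports(host: str, ports: Iterable[int], timeout: float = 1.0) -> Dict[int, bool]:
    """Try a TCP connection to each port and record whether it succeeded."""
    results: Dict[int, bool] = {}
    for port in ports:
        try:
            with socket.create_connection((host, port), timeout=timeout):
                results[port] = True
        except OSError:
            # Refused, timed out, or unreachable: treat as not open.
            results[port] = False
    return results
```

Calling check_tcp_ports with a server IP and WATCHDOG_PORTS returns a mapping from each port to True or False.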

Network Devices
➣ Covers switches (L2, L3), routers, firewalls, UTM appliances, load balancers, etc.
➣ Prepare access data for network devices, which are commonly reached through SNMP and the CLI.
➣ Required data includes the device IP list, hardware brand, SNMP community string (e.g., public), whether SNMP is enabled, and the CLI account and password.
➣ Classify network devices according to their importance level (Class A, Class B, Class C).
➣ Record the rental speed and port number location of dedicated lines or GSN lines.
➣ Prepare CLI commands for configuration file backups.
➣ Assign localized aliases to network devices for graphical monitoring and alarm notifications.
➣ Define special alarm notification lists.
➣ Confirm that any intermediate firewall allows UDP ports 161 and 162 (SNMP queries and traps).
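
The preparation data for each network device can be collected in one record and checked for gaps before onboarding. The sketch below is only an illustration: the NetworkDevice fields and the missing_items helper are hypothetical names, and passwords are deliberately left out of the record.

```python
import ipaddress
from dataclasses import dataclass
from typing import List


@dataclass
class NetworkDevice:
    ip: str
    brand: str
    importance: str           # "A", "B" or "C"
    alias: str                # localized display name
    snmp_enabled: bool
    snmp_community: str = ""
    cli_account: str = ""


def missing_items(dev: NetworkDevice) -> List[str]:
    """List the preparation items that are still missing or invalid for one device."""
    problems: List[str] = []
    try:
        ipaddress.ip_address(dev.ip)
    except ValueError:
        problems.append("invalid IP address")
    if dev.importance not in ("A", "B", "C"):
        problems.append("importance class must be A, B or C")
    if not dev.alias:
        problems.append("alias")
    if not dev.snmp_enabled:
        problems.append("SNMP not enabled")
    elif not dev.snmp_community:
        problems.append("SNMP community string")
    if not dev.cli_account:
        problems.append("CLI account")
    return problems
```

An empty result means the device record covers every item this sketch checks.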

Server Hosts - Hardware [IPMI/iLO/IMM/iDRAC]
IPMI, iLO, IMM, and iDRAC are management systems that monitor and manage hardware such as the motherboard, power supplies, temperature sensors, and fan status. They are configured at the BIOS level.
➣ Device IP list
➣ Hardware brand
➣ Dedicated management network segment
➣ Required network cabling
➣ Dedicated switch [Switch]
➣ Management IP, user, and password set in the BIOS
➣ UDP port 623 opened on any intermediate firewall

PDU Power Strips
PDU power strips are smart, network-managed power strips installed in racks and common in data centers. Their power usage status can be read and analyzed through the SNMP protocol.
➣ Device IP list
➣ Hardware brand
➣ Dedicated management network segment
➣ Required network cabling
➣ Dedicated switch [Switch]
➣ SNMP ports (UDP 161 and 162) opened on any intermediate firewall

Auxiliary Environment Monitoring
Auxiliary environment monitoring integrates with the existing environmental control system and covers graphical control, alarm notifications, and related items. Because the integration capability of existing environmental control systems varies, the integration method depends on what the system can do.
➣ The existing environmental control system can output status data
Its data can be brought into the Watchdog system through [Server Host - Event Alarm] and [Server Host - Event Data] for graphical monitoring, alarm notifications, and other functions.
➣ The existing environmental control system cannot output status data
Its monitoring screen can be captured and archived at regular intervals using [Snapshot] and integrated into the Watchdog system for graphical monitoring.
➣ The existing environmental control system cannot be expanded with additional temperature and humidity sensors
The Watchdog system's [Analog Input AI - Temperature and Humidity] or [Digital Input/Output DI, DO] detection system can be used to fill the gap.

Emergency Shutdown - Phase 2
The second phase of emergency shutdown uses a [Physical Button] to trigger [One-Button Shutdown]. Emergency shutdown is reserved for special emergencies (e.g., power outage, generator failure, insufficient UPS power, fire alarm). If the [Physical Button] cannot be used, the [Virtual Button] can trigger [One-Button Shutdown] instead.
The benefits of using [Physical Button] and [One-Button Shutdown]:
➣ In an emergency, any staff member on duty can execute it immediately after telephone authorization, without needing system permissions or specialist knowledge.
➣ When time is limited, it can be executed without a login or password.
➣ Shutdown follows the standard procedure.
➣ Shutdown is fast because instructions are issued to hosts simultaneously.
➣ If there is a fire in the computer room and personnel cannot work inside, the shutdown can still be triggered from outside.
➣ Physical devices are easy to operate, demonstrate, and use.

The following items need to be prepared:
➣ Small PLC
➣ Button module
➣ Audio broadcast
➣ Warning lights
➣ Shutdown list, divided into Class A, Class B, and Class C and executed in the order C -> B -> A
➣ Standard execution procedure
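
The C -> B -> A ordering, combined with issuing instructions simultaneously, can be sketched as follows. This is an assumption-laden illustration rather than the Watchdog implementation: it assumes hosts within one class are shut down in parallel and that the next class starts only after the previous one finishes. The shutdown_host callback is a hypothetical stand-in for whatever actually powers a host down.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, Dict, List

SHUTDOWN_ORDER = ["C", "B", "A"]  # least important hosts go down first


def shutdown_in_order(
    hosts_by_class: Dict[str, List[str]],
    shutdown_host: Callable[[str], None],
) -> List[str]:
    """Shut down class C, then B, then A; hosts within a class run in parallel.

    Returns the class labels in the order they were processed.
    """
    processed: List[str] = []
    for cls in SHUTDOWN_ORDER:
        hosts = hosts_by_class.get(cls, [])
        if hosts:
            with ThreadPoolExecutor(max_workers=len(hosts)) as pool:
                # list() waits for every host and re-raises any worker exception.
                list(pool.map(shutdown_host, hosts))
        processed.append(cls)
    return processed
```

Passing a callback that only logs the host name gives a safe way to rehearse the sequence during drills.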

Network Cameras
Network cameras are used mainly for [Snapshot Recording], which suits recording computer room access or other important locations. Photo records supplement the existing sign-in register and make access control clearer. The advantage of [Snapshot] is that the files take little storage space and records are easy to trace.

The following items need to be prepared:
➣ Network cameras with CGI function
➣ Device IP list
➣ Hardware brand
➣ Sensors [e.g., infrared sensors, reed switches]
➣ Small PLC
➣ Lighting fixtures

Special Attention Items after Automating the IT Operation System
As enterprises move to automated operation, they will inevitably encounter new challenges and need to adjust. To help them transition smoothly and get the most from the automated system, the following items deserve special attention once the automated IT operation system is built and in use:

Network Devices
➣ Pay attention to whether there is a high concentration of traffic at the ports.
➣ Confirm whether VM hosts are concentrated on a single port.
➣ Check whether there is a speed drop at the ports.
➣ Evaluate whether the load balancer architecture is overly concentrated on specific servers.

Server Host System
➣ Ensure the accuracy of the server host system time.
➣ Optimize and adjust the operating system and application system.
➣ Optimize the storage space configuration of the storage system.
➣ Rank server anomaly alarms every month and resolve the most frequent problems first.
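
The monthly alarm ranking can be computed from any export of alarm records. The sketch below assumes each record is a (date, host name) pair, which is a hypothetical format and not a Watchdog export schema.

```python
from collections import Counter
from datetime import date
from typing import Iterable, List, Tuple


def monthly_alarm_ranking(
    alarms: Iterable[Tuple[date, str]],
    year: int,
    month: int,
    top_n: int = 10,
) -> List[Tuple[str, int]]:
    """Count alarms per host within one month and return the top_n hosts, most alarms first."""
    in_month = (host for day, host in alarms if day.year == year and day.month == month)
    return Counter(in_month).most_common(top_n)
```

The hosts at the top of the list are the first candidates for problem resolution.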

VM Servers (VMware)
➣ Monitor the allocation and usage of CPU and memory resources.
➣ Check the resource allocation and performance of guest hosts.
➣ Confirm that VMware Tools is installed on each guest host and check each guest's power on/off status.

Data and Message External Integration
➣ Utilize data and messages collected by the Watchdog system for external integration into other systems.
➣ Integrate into the operation management platform, push system, or other system integration hosts.

Management and System Establishment
➣ Establish a reboot mechanism.
➣ Arrange real-time monitoring screen display.
➣ Group monitored servers by the person responsible for each system.
➣ Establish real-time monitoring incident handling procedures.
➣ Set up customer service center incident handling procedures.

Establish Reverse Monitoring System
➣ Establish reverse network detection points at each site, each pointing at the servers that users rely on, and continuously collect network quality data from the user's end to the server end.
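
One simple way to collect network quality data from a user-side detection point is to time a TCP connection to the server. This is a minimal sketch under that assumption; the source does not say which measurement the Watchdog system uses, and the function name is hypothetical.

```python
import socket
import time
from typing import Optional


def tcp_connect_ms(host: str, port: int, timeout: float = 2.0) -> Optional[float]:
    """Return the time in milliseconds to open a TCP connection, or None on failure."""
    start = time.perf_counter()
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return (time.perf_counter() - start) * 1000.0
    except OSError:
        return None
```

Run on a schedule from each site, the results give a per-site history of reachability and latency as users experience them.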

Operational Performance (KPI) after Automating the IT Operation System
Once the automated IT operation system is built, the following key performance indicators (KPIs) can be used to monitor and evaluate system performance and status:

Server Hosts
➣ Total number of hosts
➣ Total number of normal hosts
➣ Total number of abnormal hosts
➣ Total number of suspended hosts
➣ Total number of maintenance hosts
➣ Total number of disconnected hosts
➣ Total number of hosts with CPU usage greater than 90%
➣ Total number of hosts with Memory usage greater than 90%
➣ Total number of hosts with virtual memory (SWAP) usage greater than 90%
➣ Total number of Class A / Class B / Class C hosts
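
The server host indicators above are simple counts over the current host list. The sketch below shows one way to derive them; the HostSample record, its field names, and the status labels are hypothetical, and only the 90% limit comes from the list.

```python
from dataclasses import dataclass
from typing import Dict, List

STATUSES = ("normal", "abnormal", "suspended", "maintenance", "disconnected")


@dataclass
class HostSample:
    name: str
    status: str        # one of STATUSES
    importance: str    # "A", "B" or "C"
    cpu_pct: float
    mem_pct: float
    swap_pct: float


def host_kpis(hosts: List[HostSample], limit: float = 90.0) -> Dict[str, int]:
    """Count hosts by status, by resource usage above the limit, and by importance class."""
    kpis: Dict[str, int] = {"total": len(hosts)}
    for status in STATUSES:
        kpis[status] = sum(1 for h in hosts if h.status == status)
    kpis["cpu_over_limit"] = sum(1 for h in hosts if h.cpu_pct > limit)
    kpis["mem_over_limit"] = sum(1 for h in hosts if h.mem_pct > limit)
    kpis["swap_over_limit"] = sum(1 for h in hosts if h.swap_pct > limit)
    for cls in ("A", "B", "C"):
        kpis["class_" + cls] = sum(1 for h in hosts if h.importance == cls)
    return kpis
```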

Virtual Hosts (VMHost)
➣ Total number of hosts
➣ Total number of normal hosts
➣ Total number of abnormal hosts
➣ Total number of suspended hosts

Packet Testing
➣ Total number
➣ Total normal
➣ Total abnormal
➣ Total loss greater than 50%
➣ Total response time within 1ms - Excellent
➣ Total response time within 5ms - Good
➣ Total response time within 10ms - Fair
➣ Total response time within 50ms - Poor
➣ Total response time within 100ms - Very Poor
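
The response time bands above can be applied to individual measurements with a small lookup. The sketch assumes each sample falls into the tightest band it fits and that the band limits are inclusive. The "Unrated" label for anything above 100ms is an assumption, since the list does not name that case.

```python
from bisect import bisect_left

# Upper limits in milliseconds and their ratings, taken from the list above.
BOUNDS_MS = [1, 5, 10, 50, 100]
RATINGS = ["Excellent", "Good", "Fair", "Poor", "Very Poor"]


def rate_response(ms: float) -> str:
    """Map a response time in milliseconds to the tightest band that contains it."""
    idx = bisect_left(BOUNDS_MS, ms)
    return RATINGS[idx] if idx < len(RATINGS) else "Unrated"
```

Counting the ratings over all packet tests yields the totals listed for each band.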

IP Communication Ports
➣ Total number
➣ Total normal
➣ Total abnormal

Website Monitoring
➣ Total number
➣ Total normal
➣ Total abnormal

These indicators, drawn from the Watchdog detection items, are crucial for evaluating and improving the efficiency, reliability, and performance of the IT operation system. They help the operation team identify issues promptly, optimize resource allocation, and improve system stability.


