Watchdog IT Operations Monitoring System

6. Key Points for Deploying an Automated IT Operations System

In today's rapidly changing technological environment, deploying an automated IT operations system has become a key strategy for companies to maintain their competitiveness.
This chapter will explore how to effectively automate an operations system, ensuring that the management of IT data centers can accurately and quickly provide the necessary information services to end-users.
From detailed monitoring specifications for equipment to cross-departmental debugging, and from unified management systems to simplified technical tools, this chapter analyzes the key concepts and implementation details of an automated operations system.
We will further define the concept of "survival indicators," explore information collection strategies, and discuss how to establish a rigorous anomaly event alert mechanism and enhance operational efficiency through the integration of related information.

Important Concepts

➣Recognize that the primary principle of IT data center operations management is to accurately and quickly provide information services to end-users.

➣Define detailed monitoring specifications according to each device's purpose and characteristics.

➣Identify the survival indicators for each device's service purpose.

➣Monitor broadly, covering every system operation item that could become a hazard.

➣Establish a rigorous anomaly event alert mechanism.

➣Present real-time information using the concept of a war room.

➣Integrate related data and structures of equipment and systems.

➣Implement cross-departmental or cross-equipment debugging mechanisms.

➣Unify management systems and standard procedures.

➣Keep the operations system simple to use as a technical tool.

➣Assist new IT management personnel in quickly understanding the settings and applications of equipment.

➣Assist IT management personnel in professional education and training.

➣Establish standard procedures for automated operations to reduce the impact of staff changes.

Definition of Survival Indicators

➣Servers and operating systems are the environment in which application systems reside. Each server has its main application purpose, and understanding this main application purpose is key to knowing the monitoring focus of that server.

➣Whether a server is operating properly cannot be determined from its performance figures alone (e.g., one server's CPU usage is 10% and another's is 90%, yet neither figure shows whether the service is working).

➣Every piece of IT equipment is put into service for a primary purpose. Whether that "service purpose" is being fulfilled is its most important survival indicator.

➣When equipment can no longer fulfill its "service purpose," it is useless. This is exactly what a server's survival indicator measures.

➣Every server that is set up must have one or more important "service purposes," along with the related systems that support them.

➣Example 1: On a DNS server, the DNS Service is not started, so port 53 is unreachable -> DNS server failure.

➣Example 2: The main "survival indicators" of a web server system include the web service (Apache, IIS), the Java middleware, the database application system, and the DNS server.

➣When the main survival indicator of a server's "service purpose" fails, the server is useless even if everything else about it is healthy (e.g., CPU, memory, and disk space are fine and the network is fast).

➣In addition to the main function survival indicators, there are other operational indicators, including system performance operation indicators, important program indicators, and hardware operation indicators.
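
The DNS example above can be sketched as a survival-indicator check that is evaluated separately from performance figures. This is only a minimal sketch under stated assumptions, not the monitoring system's own implementation: the function names `port_reachable` and `server_status`, the status strings, and the 80% CPU threshold are all hypothetical, and the probe covers TCP only (DNS also answers over UDP on port 53, which would need a real query to verify).

```python
import socket


def port_reachable(host: str, port: int, timeout: float = 2.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Refused, timed out, unreachable, or name resolution failed.
        return False


def server_status(survival_ok: bool, cpu_percent: float) -> str:
    """Rank the survival indicator above performance figures.

    A server whose service purpose is lost is reported as failed,
    no matter how healthy its CPU looks.
    """
    if not survival_ok:
        return "FAILED"
    # Hypothetical threshold, used only for illustration.
    return "DEGRADED" if cpu_percent > 80.0 else "OK"
```

With these helpers, a DNS server whose port 53 does not answer is reported as "FAILED" even at 10% CPU usage, while a busy but reachable server is only "DEGRADED".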

The Purpose and Myths of Information Collection

Generating and collecting large amounts of data is easy, but most of that data is only useful for searches and investigations after an event, not for telling whether the system is currently in a normal or abnormal state.
As a result, system personnel can spend a great deal of time interpreting the data.

The main purposes of information collection include:

➣Make the operation and settings of IT equipment transparent
➣Provide the data or behavior records used to judge anomalies and raise alerts
➣Provide reference data for capacity expansion or equipment replacement
➣Real-time data - real-time war room information for immediate handling
➣Short-term data - supports comparison of performance and status, using the point in time of an event for advanced debugging
➣Long-term data - advanced debugging and big data analysis

Difficulties in Defining Information Collection Include:

➣Information collection of a single device - relatively simple
A single SNMP command (about 60 characters) can retrieve more than 1,000 items of information.

➣Integrated related information collection - relatively difficult
This means integrating the relationships and dependencies between different devices, linking information items that may be related, and classifying and aggregating the values in a way that matches how users work. For example, when viewing a server's information, the user should also see which Switch port the server is connected to, that port's traffic and speed, and the device's location map and photos, together with a traffic comparison across the Switch ports.
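
The server-to-Switch-port relationship described above can be sketched as a small data model. This is an illustrative sketch only: the class names `Server` and `SwitchPort`, the `integrated_view` function, and every field name are assumptions chosen for the example, not structures taken from the monitoring system.

```python
from dataclasses import dataclass
from typing import Dict, Tuple

PortKey = Tuple[str, str]  # (switch name, port name)


@dataclass(frozen=True)
class SwitchPort:
    speed_mbps: int
    traffic_mbps: float


@dataclass(frozen=True)
class Server:
    name: str
    location: str
    uplink: PortKey  # the Switch port this server is cabled to


def integrated_view(server: Server,
                    ports: Dict[PortKey, SwitchPort]) -> Dict[str, object]:
    """Combine a server's own record with data about its Switch port."""
    switch, port_name = server.uplink
    view: Dict[str, object] = {
        "server": server.name,
        "location": server.location,
        "switch": switch,
        "port": port_name,
    }
    port = ports.get(server.uplink)
    if port is not None:
        # Port data comes from a different device than the server itself.
        view["speed_mbps"] = port.speed_mbps
        view["traffic_mbps"] = port.traffic_mbps
    return view
```

The dependency is stored once, as the `uplink` key, so any screen that shows the server can pull in the port's traffic and speed without the operator having to look it up on the Switch.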

Defining the Rigor of the Alert Mechanism

➣Defining anomalies with single values - relatively simple
Only the single value obtained from the monitoring system is used. For example, an alert is triggered when the CPU load exceeds 80%. If the server's CPU load reaches 81% for 2 seconds every 5 minutes, an alert message is generated every 5 minutes. Likewise, if a backup runs from 10 PM to 11 PM and pushes the CPU load to 90%, alerts are generated even though the load is expected.

➣Defining anomalies with multiple conditions - relatively difficult
Extending the example above: (A) the CPU load exceeds 80%, (B) on 3 consecutive checks, (C) taken at 5-minute intervals.
A, B, and C must all be met at the same time before an alert message is generated.
In addition, no alert messages are generated from 10 PM to 11 PM.
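
The multi-condition rule above can be sketched as a small stateful check. This is a hedged sketch, not the product's alert engine: the class name `CpuAlertRule` and its parameters are hypothetical, the caller is assumed to feed it one sample every 5 minutes (condition C), and the quiet window is assumed not to cross midnight.

```python
from datetime import datetime, time


class CpuAlertRule:
    """Alert only after several consecutive high samples outside a quiet window."""

    def __init__(self, threshold: float = 80.0, required_hits: int = 3,
                 quiet_start: time = time(22, 0),
                 quiet_end: time = time(23, 0)) -> None:
        self.threshold = threshold
        self.required_hits = required_hits
        self.quiet_start = quiet_start
        self.quiet_end = quiet_end
        self._hits = 0  # consecutive samples above the threshold

    def observe(self, sampled_at: datetime, cpu_percent: float) -> bool:
        """Record one sample; return True if an alert should be raised."""
        if self.quiet_start <= sampled_at.time() < self.quiet_end:
            # Scheduled backup window: ignore the load and start over.
            self._hits = 0
            return False
        if cpu_percent > self.threshold:
            self._hits += 1   # condition A holds for this sample
        else:
            self._hits = 0    # the streak is broken
        return self._hits >= self.required_hits  # condition B
```

Under this rule a single 81% spike never raises an alert, and a 90% load during the 10 PM backup is ignored. Returning True on every sample while the streak continues is a design choice; a real system might instead alert once and then wait for recovery.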

Major Events Occur Every Day

Suppose each device experiences one serious event per year. For a single device this looks like a rare, occasional event, but with more than 300 devices it means that a serious event occurs somewhere almost every day. An automated IT operations system therefore cannot exclude occasional events from monitoring.
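
The arithmetic behind this claim can be checked with a short calculation. The sketch below assumes, purely for illustration, that events are independent between devices and equally likely on any day of the year; the function names are hypothetical.

```python
def expected_events_per_day(devices: int,
                            events_per_device_per_year: float = 1.0) -> float:
    """Average number of serious events per day across the whole fleet."""
    return devices * events_per_device_per_year / 365.0


def prob_event_on_given_day(devices: int,
                            events_per_device_per_year: float = 1.0) -> float:
    """Chance that at least one device has a serious event on a given day.

    Assumes independent devices and a uniform daily probability.
    """
    p_single = events_per_device_per_year / 365.0
    return 1.0 - (1.0 - p_single) ** devices
```

For 300 devices this gives an average of about 0.82 serious events per day and, under the independence assumption, a chance of roughly 56% that any given day has at least one. At 365 devices the average reaches exactly one event per day.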

Here's an example:

When users retrieve large amounts of data, they find the speed very slow, and system personnel begin to investigate:

1. The user's computer's CPU, memory, network card, hard disk I/O, firewall, and software system all seem fine.

2. The database server's CPU, memory, network card, hard disk I/O, firewall, and software system all seem fine.

3. The network connection and ping response are normal.

4. Ask the network administrator to check the Switch settings, and the network administrator responds that they are normal.

5. Ask the database software vendor to check, and they respond that everything is normal.

After an entire day spent going back and forth between system domains and departments, every check comes back normal, and all they can do is ask the user to try again.
One day, the cause was finally found: a bad contact point on the network wiring panel was making the network slow down repeatedly on its own. Small transfers were unaffected, but users noticed the slowdown when handling large amounts of data.

Only comprehensive monitoring can cut across professional fields and departments, integrating related information horizontally so that problems can be tracked effectively. It also prevents departments or vendors from shifting responsibility to one another, for example the application group insisting that it is a network problem.
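
One way such a fault might surface in monitoring data is sketched below. The source does not say which measurement would have exposed the bad contact, so this sketch assumes the link repeatedly renegotiated to a lower speed and that the negotiated speed of each Switch port is sampled periodically. The function names, the port labels, and the `min_downshifts` threshold are all hypothetical.

```python
from typing import Dict, Iterable, List


def count_downshifts(speeds_mbps: Iterable[int]) -> int:
    """Count how often the speed drops between consecutive samples."""
    drops = 0
    previous = None
    for speed in speeds_mbps:
        if previous is not None and speed < previous:
            drops += 1
        previous = speed
    return drops


def suspect_ports(history: Dict[str, List[int]],
                  min_downshifts: int = 2) -> List[str]:
    """Return ports whose link speed dropped repeatedly, sorted by name."""
    return sorted(
        port for port, speeds in history.items()
        if count_downshifts(speeds) >= min_downshifts
    )
```

A port whose samples read 1000, 100, 1000, 100 would be flagged, while a port that stays at 1000 would not. The point is that the check uses network-side data while investigating an application-side complaint, which is the kind of horizontal correlation the example calls for.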

Summary:

In exploring the key points for deploying an automated IT operations system, we have seen how to provide accurate and rapid information services, how to monitor equipment according to its purpose and characteristics, how to establish survival indicators, and why integration and debugging across departments matter.
By defining "survival indicators," discussing the strategies and challenges of information collection, and establishing an effective anomaly alert mechanism, this chapter emphasizes the necessity of comprehensive monitoring and shows how to track and solve problems effectively without departments shifting responsibility to one another.
The establishment of an automated operations system not only improves system reliability and efficiency but also ensures that companies can continue to provide high-quality services in the face of increasing equipment numbers and complexity.
Through the discussion in this chapter, we recognize that only through continuous technological innovation and strategic implementation can companies maintain their competitiveness in a rapidly changing technological environment.
