Watchdog IT Operation and Maintenance Monitoring System

10. Preparations Before Establishing IT Operation Management

Before establishing a dedicated IT operation management system, preparation needs to be considered comprehensively from several perspectives.
Take, for example, an information center with 500 servers, 250 network devices, 30 virtual machines (VMs), and 60 racks. Its preparation can be viewed from the perspectives of system operation engineers, management supervisors, and policy-level supervisors.

➣ System Operation Engineer's Perspective
Responsibilities: Responsible for building, setting up, executing, troubleshooting, responding to inquiries, regular updates, and fault repairs.
Technical Preparation: Ensure that all servers, network devices, and virtual machines are technically prepared, including hardware configuration, software installation, and network layout optimization.

➣ Management Supervisor's Perspective
Responsibilities: Responsible for coordination, integration management, general technology, system management, incident handling, process establishment and execution, technical support, and troubleshooting.
Process Establishment: Develop effective workflows and communication mechanisms so that issues arising during operations can be responded to and resolved quickly.

➣ Policy-Level Supervisor's Perspective
Responsibilities: Responsible for creating policies, evaluating and verifying benefits, establishing management systems, and bearing overall accountability.
Long-term Planning: Set long-term goals and policies for IT operations from a macro perspective and evaluate their impact on the organization's overall operations.

The following are the main considerations and steps for establishing an IT operation management system:

Pre-planning: Plan for the various situations that may affect IT operations in order to improve their availability rate.
Testing and Data Collection: Use detailed testing and data collection to confirm that the basic elements required for normal system operation are in place.

Project Personnel Configuration
➣ IT manager in charge at the owning unit
➣ Operation manager or project manager (PM)
➣ Senior technical personnel (covering systems, network management, and applications)
➣ Technical personnel from the original manufacturer

Monitoring Equipment Scope Definition
➣ Inventory all equipment items to be monitored, such as servers, switches, etc.
➣ Confirm the management components that must be present on the monitored equipment, such as IPMI/iLO.
➣ Coordinate the necessary settings and information, such as enabling SNMP and configuring NetFlow.
➣ Application system integration mechanisms, including abnormal message forwarding, data detection and analysis, etc.
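
The scope definition above can be sketched as a small inventory check. This is only an illustration, not part of Watchdog: the `MonitoredDevice` fields, the `readiness_gaps` helper, the device names, and the policy itself (servers need an IPMI/iLO controller, switches and routers need SNMP and NetFlow) are all assumptions chosen for the example.

```python
from dataclasses import dataclass


@dataclass
class MonitoredDevice:
    """One inventory entry. All field names are hypothetical."""
    name: str
    kind: str                      # "server", "switch", "router", or "vm"
    has_bmc: bool = False          # IPMI/iLO management controller present
    snmp_enabled: bool = False
    netflow_enabled: bool = False


def readiness_gaps(device: MonitoredDevice) -> list[str]:
    """Return the settings still missing before the device can be monitored.

    The rules below are an assumed policy, not a Watchdog requirement.
    """
    gaps = []
    if device.kind == "server" and not device.has_bmc:
        gaps.append("no IPMI/iLO management controller")
    if device.kind in ("switch", "router"):
        if not device.snmp_enabled:
            gaps.append("SNMP not enabled")
        if not device.netflow_enabled:
            gaps.append("NetFlow not configured")
    return gaps


# Hypothetical inventory entries for illustration only.
inventory = [
    MonitoredDevice("web01", "server", has_bmc=True, snmp_enabled=True),
    MonitoredDevice("db01", "server"),
    MonitoredDevice("sw-core", "switch", snmp_enabled=True),
]

# Keep only the devices that still need coordination work.
report = {d.name: readiness_gaps(d) for d in inventory if readiness_gaps(d)}
for name, gaps in report.items():
    print(f"{name}: {', '.join(gaps)}")
```

Running a check like this over the full inventory gives the project team a concrete list of settings to coordinate with each equipment owner before monitoring starts.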

Understand Information Center Resources and Architecture
➣ Network architecture diagram, including infrastructure, subnet control architecture, etc.
➣ Application deployment on servers, including importance classification, survival (liveness) indicators, etc.

Common Issues After Automated Detection Is Introduced
➣ Definition of monitoring items and abnormal thresholds.
➣ Efficiency, rigor, and sensitivity of alarm issuance.
➣ Alarm notification channels and lists.
➣ Establishment of emergency response plans.
➣ Identification of key integration points and the information associated with them.
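
The first three items (thresholds, alarm sensitivity, and notification lists) can be illustrated with a minimal sketch. It assumes one simple way of tuning sensitivity: an alarm is raised only after a metric exceeds its limit for a number of consecutive samples. The names `ThresholdRule` and `AlarmEvaluator`, the 90% limit, and the notification address are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class ThresholdRule:
    """An abnormal threshold for one monitoring item (hypothetical schema)."""
    metric: str
    limit: float          # values above this count as abnormal
    consecutive: int      # abnormal samples in a row required to alarm
    notify: tuple = ()    # notification list for this rule


class AlarmEvaluator:
    """Tracks consecutive breaches so a single spike does not raise an alarm."""

    def __init__(self, rule: ThresholdRule):
        self.rule = rule
        self.breaches = 0

    def observe(self, value: float) -> bool:
        """Feed one sample; return True only on the sample that triggers the alarm."""
        if value > self.rule.limit:
            self.breaches += 1
        else:
            self.breaches = 0  # a normal sample ends the episode
        return self.breaches == self.rule.consecutive


# Assumed rule: CPU above 90% for 3 samples in a row.
rule = ThresholdRule("cpu_percent", limit=90.0, consecutive=3,
                     notify=("oncall@example.com",))
evaluator = AlarmEvaluator(rule)

samples = [95.0, 70.0, 92.0, 93.0, 96.0, 97.0]
fired = [evaluator.observe(v) for v in samples]
print(fired)  # → [False, False, False, False, True, False]
```

A larger `consecutive` value makes alarms less sensitive but more rigorous; a value of 1 alarms on every first breach. Which trade-off is appropriate depends on the importance classification of the monitored system.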

Design Monitoring Goals Based on Different Levels of Equipment and Systems:
Server Monitoring Items
Server monitoring refers to real-time monitoring and management of the operating status of servers (whether physical or virtual) to ensure the stable operation and high performance of the system. Server monitoring items can be subdivided from multiple levels:
➣ Hardware Layer - Main Operation: Monitor CPU usage, RAM, disk space, and temperature to detect hardware failures or performance bottlenecks in time.
➣ Operating System Layer - Main Performance: Detect the running status of the operating system, including process management, system load, logged-in users, etc., to ensure the operating system runs properly.
➣ Network Connection Layer - Survival (Liveness) Indicators: Confirm that the server is reachable through ping tests and other mechanisms, and monitor whether its network connections remain alive.
➣ Network Connection Layer - Busy Indicators: Analyze network traffic and bandwidth usage to identify network bottlenecks or abnormal traffic.
➣ Network Connection Layer - Quality Detection: Detect network delay, packet loss rate, etc., to evaluate network quality.
➣ Application System Layer - Activation Execution: Monitor the running status of key application services to ensure services start and run normally.
➣ Application System Layer - Auxiliary Monitoring: Monitor performance indicators of applications, such as response time, transaction volume, etc.
➣ Application System Layer - Message Communication: Monitor application system logs and alarms to ensure abnormalities are identified and reported in time.
➣ Application System Layer - Advanced Usage: Analyze the usage patterns of application systems in depth to optimize system configuration and resource allocation.
➣ Information Security Layer - Potential Crises: Monitor security threats, such as unauthorized access, malware, etc., to protect the server from attacks.
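
A few of the layers above can be sketched with the Python standard library alone. This is a simplified illustration under stated assumptions: a TCP connection attempt stands in for the ping test (ICMP ping usually needs elevated privileges), and the function names, default timeout, and sample values are hypothetical rather than part of any product.

```python
import shutil
import socket


def disk_used_percent(path: str = "/") -> float:
    """Hardware layer: percentage of disk space used on the volume holding path."""
    usage = shutil.disk_usage(path)
    return 100.0 * usage.used / usage.total


def tcp_reachable(host: str, port: int, timeout: float = 2.0) -> bool:
    """Survival indicator: True if a TCP connection to host:port succeeds.

    This is a substitute for ping, not an ICMP echo test.
    """
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


def summarize_probes(rtts_ms: list) -> tuple:
    """Quality detection: return (packet loss %, average delay in ms).

    rtts_ms holds one round-trip time per probe, or None for a lost probe.
    The average is None when every probe was lost.
    """
    if not rtts_ms:
        raise ValueError("at least one probe result is required")
    answered = [r for r in rtts_ms if r is not None]
    loss = 100.0 * (len(rtts_ms) - len(answered)) / len(rtts_ms)
    average = sum(answered) / len(answered) if answered else None
    return loss, average


# Made-up probe results: four probes, one lost.
print(summarize_probes([10.0, None, 30.0, 20.0]))  # → (25.0, 20.0)
print(f"disk used: {disk_used_percent('.'):.1f}%")
```

In practice these values would be collected on a schedule and compared against the abnormal thresholds defined for each monitoring item.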

Network Equipment (Switches/Routers) Monitoring Items
➣ Traffic Information
➣ Status Information
➣ Configuration Information
➣ Security Detection
➣ Cascade Architecture

Central Monitoring Items
Central monitoring refers to monitoring the health status of the entire network and system from a centralized location to quickly identify and resolve issues.
➣ Network Connection: Centrally monitor the connection status of all servers and network devices to ensure smooth network connectivity.
➣ Forwarding Mechanism: Monitor the efficiency and stability of data transmission and information forwarding to ensure information is transmitted correctly and quickly.
➣ Specific Information: Collect and analyze information for specific monitoring needs, such as performance indicators of specific applications.
➣ Application System Integration: Integrate different application systems and services into a single monitoring platform, providing a unified monitoring view.
➣ Network Security: Centrally monitor the security status of the network and system, including intrusion detection, abnormal traffic analysis, etc.
➣ System and Event Logs: Collect and analyze system logs and event logs centrally for fault diagnosis and performance analysis.
➣ Auxiliary Environment Monitoring: Monitor environmental conditions of the server room, such as temperature, humidity, and power status.
➣ Emergency Response: Establish emergency incident response mechanisms, including automatic alarms, incident handling procedures, and emergency communication plans.
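
The idea of a unified monitoring view can be sketched as a roll-up that keeps the worst severity reported in each category. This is a minimal illustration, assuming a three-level severity scale; the `SEVERITY` ranking, the `roll_up` function, and the sample events are hypothetical.

```python
# Assumed severity scale, ordered from least to most serious.
SEVERITY = {"ok": 0, "warning": 1, "critical": 2}


def roll_up(events):
    """Reduce (source, category, severity) events to the worst severity per category."""
    worst = {}
    for _source, category, severity in events:
        current = worst.get(category, "ok")
        if SEVERITY[severity] > SEVERITY[current]:
            current = severity
        worst[category] = current
    return worst


# Made-up events from different monitored sources.
events = [
    ("web01", "network", "ok"),
    ("sw-core", "network", "critical"),
    ("room-a", "environment", "warning"),
    ("db01", "logs", "ok"),
]

print(roll_up(events))
# → {'network': 'critical', 'environment': 'warning', 'logs': 'ok'}
```

A summary like this lets an operator see at a glance which category needs attention, while the individual events remain available for diagnosis and for triggering the emergency response procedures.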
