Pioneering time-series anomaly detection for computational clusters

We are developing a new machine learning-based solution for time-series anomaly detection in computational clusters. The solution trains models on real-world time-series datasets, with the goal of identifying anomalies with high accuracy and a low false-alarm rate while enabling auto-healing capabilities. It has the potential to help businesses improve the reliability, performance, and security of their computational clusters by preventing downtime.

Benefits of machine learning anomaly detection

Enhanced Performance Monitoring

Enables quick real-time response for optimal performance

Improved Security

Actively detects and mitigates breaches and threats

Cost Efficiency

Prevents costly failures through early anomaly detection


Scalability

Adapts to growth, ensuring consistent performance

Data-Driven Insights

Provides valuable insights to guide strategic planning and decisions

Proactive Maintenance

Predictive analysis reduces unexpected failures and maintenance costs

Customization and Flexibility

Offers tailored solutions to meet specific needs and requirements

Environmental Sustainability

Enhances energy efficiency, contributing to sustainable operations

Collaborative Innovation

Fosters collaboration to drive innovation and develop new solutions


Time-series data of computational clusters

Time-series data of computational clusters, such as those in Kubernetes environments, capture a dynamic stream of information detailing the performance metrics, resource utilization, and operational behaviors of the cluster. These data sequences provide valuable insights into cluster activities over time, revealing patterns, trends, and irregularities that might otherwise go unnoticed.

Anomaly detection systems may harness the power of machine learning to thoroughly analyze complex time-series data, swiftly identifying outliers and enabling proactive responses to potential issues. This way, Ops teams can make informed decisions and act before problems escalate, optimize resource allocation, ensure high availability, and keep computational clusters running without interruption.

Anomaly detection in time-series data is typically formulated as the identification of outlier data points that deviate from the expected or normal signal. Here are some examples of outlier types:

1. Point anomaly

A single data point that differs significantly from the surrounding data points, often simply called an "outlier".

2. Contextual anomaly

Subsequences of a time series that might not appear unusual in isolation, but become anomalous when their pattern is unexpected compared to historical data.

3. Collective anomaly

A group of data points, often spanning multiple variables, that appear unremarkable individually but collectively indicate something unusual in the dataset.

4. Concept drift

A gradual and consistent shift toward a new state; the drift itself may be unexpected and therefore an anomaly worth detecting.

5. Change point

An abrupt shift in the series, appearing as a sudden and unusual step in the data. Detectors flag such changes and keep the flag raised until a new normal pattern emerges.

6. Seasonality

Characterized by a regular, repetitive pattern that recurs at a consistent time interval. Sales data, for instance, commonly peaks during holiday seasons and specific times of the year; a detector must account for such expected cycles so that seasonal peaks are not flagged as anomalies.
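
As a minimal illustration of the first outlier type above, a rolling z-score can flag point anomalies. This is a hypothetical sketch using only the Python standard library; the function name and window size are illustrative, not part of the described solution:

```python
from statistics import mean, stdev

def zscore_anomalies(series, window=10, threshold=3.0):
    """Flag indices whose value deviates more than `threshold`
    standard deviations from the preceding window's mean."""
    anomalies = []
    for i in range(window, len(series)):
        past = series[i - window:i]
        mu, sigma = mean(past), stdev(past)
        if sigma > 0 and abs(series[i] - mu) / sigma > threshold:
            anomalies.append(i)
    return anomalies

# A CPU-usage series with one spike: only the spike at index 10 is flagged.
cpu = [50, 51, 49, 50, 52, 50, 51, 49, 50, 51, 98, 50, 49]
print(zscore_anomalies(cpu))  # -> [10]
```

Note that once the spike enters the window, it inflates the standard deviation, so subsequent normal values are not flagged; more robust baselines (e.g. median-based) reduce this masking effect.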


Outlier detection

Computational clusters, defined as groups of machines and accompanying resources that act as a single system to enable high availability, are complex by nature. That complexity, however, also yields numerous measurable signals to track for outlier detection. Time-series data that does not follow expected behavior is classified as an outlier and may be grouped and analyzed depending on context. Spikes or sustained periods of high CPU, memory, disk, or network usage may be suspicious when compared with the rest of the series.

In Kubernetes clusters, observations can be extended to account for the specifics of the platform. Monitoring increases or decreases in entities such as pods or events provides additional insight into overall system health.


Machine Learning powered actions

Incorporating auto-healing capabilities into the anomaly detection system enhances its overall effectiveness and makes the computational cluster environment more resilient. An auto-healing mechanism that takes corrective action in response to detected anomalies becomes a pivotal asset. This involves establishing pre-defined protocols and actions tailored to distinct anomaly types, such as allocating supplementary resources to an underperforming node or restarting a malfunctioning service.

Taking this a step further, machine learning models may predict the potential impact and outcomes of different corrective actions based on historical data and cluster behavior. Where seamless *Ops integration is required, the system autonomously triggers the appropriate action once an anomaly is detected and validated, minimizing the need for manual intervention.

Moreover, integrating machine learning with auto-scaling mechanisms enables dynamic resource adjustment that optimizes resource allocation.

By integrating auto-healing capabilities with the anomaly detection system, cluster issues can be addressed proactively, reducing downtime and enhancing the overall stability and reliability of the computational environment. This comprehensive approach not only identifies anomalies but also responds immediately to mitigate their impact.
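
The protocol-per-anomaly-type idea can be sketched as a simple dispatch table. This is a hypothetical illustration: the anomaly type names and actions are invented, and a real integration would call a cluster API (for example the Kubernetes API) rather than return strings:

```python
# Hypothetical corrective actions keyed by anomaly type.
def scale_up(node):
    return f"allocating extra resources to {node}"

def restart_service(node):
    return f"restarting service on {node}"

PROTOCOLS = {
    "resource_saturation": scale_up,
    "service_malfunction": restart_service,
}

def auto_heal(anomaly_type, node):
    """Look up the pre-defined protocol for a validated anomaly;
    unknown types fall through to human escalation."""
    action = PROTOCOLS.get(anomaly_type)
    if action is None:
        return f"no protocol for {anomaly_type}; escalating to on-call"
    return action(node)

print(auto_heal("resource_saturation", "node-7"))
```

Keeping the protocols in a table makes it easy to add new anomaly-to-action mappings without touching the detection logic.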


Specificity vs flexibility

Anomaly detection of computational clusters can be achieved by defining a rule set or applying machine learning techniques. Both approaches have their advantages and disadvantages, with the following major differences:

  • Rule sets are simple and transparent, but they can be brittle and inflexible. They may not be able to detect complex or evolving anomalies.
  • Machine learning models are more adaptable and scalable, and they can uncover complex and evolving anomalies. However, they can be more difficult to understand and interpret.

Our research team is striving to leverage the strengths of machine learning to provide a comprehensive anomaly detection system that adapts to the dynamic nature of modern cluster environments.

1. Nature of Detection

Rule-based anomaly detection relies on predefined thresholds and patterns to flag anomalies based on explicit rules and conditions.

Machine learning employs algorithms to learn patterns from historical data and adaptively identifies anomalies based on learned patterns, accommodating dynamic and evolving cluster behavior.

2. Adaptability

Rule-based approaches can struggle to adapt to evolving cluster dynamics, as manual adjustments are required to accommodate new behaviors.

Machine learning models continuously learn from new data, automatically adapting to changes in cluster behavior without manual intervention.

3. Complexity Handling

Rule-based systems may struggle with complex and subtle anomalies that do not fit predefined rules, leading to missed detections.

Machine learning models can capture complex relationships in data, detecting anomalies that might not be captured by simple rules.

4. Scalability

Rule-based approaches can become unwieldy and difficult to manage as clusters grow, requiring the creation and maintenance of numerous rules.

Machine learning scales efficiently as the cluster size increases, accommodating larger datasets without a proportional increase in rule complexity.

5. False Positives

Rule-based systems may generate false positives if rules are too strict, leading to unnecessary alerts and resource allocation.

Machine learning models can learn the normal variations and reduce false positives by considering the broader context of the data.

6. Anomaly Detection Precision

Rule-based methods might miss subtle anomalies that fall outside the scope of predefined rules.

Machine learning models, trained on diverse data, can capture a wider range of anomalies with higher precision.

7. Human Intervention

Rule-based systems often require manual intervention to update rules and adapt to new anomalies, leading to delays and potential oversights.

Machine learning reduces the need for frequent manual adjustments, streamlining the detection process and enabling faster response times.

8. Unsupervised Discovery

Rule-based systems rely on pre-existing knowledge of potential anomalies and patterns, limiting their ability to discover new, previously unknown anomalies.

Machine learning models have the potential to uncover novel anomalies through unsupervised learning, identifying patterns not covered by explicit rules.
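The contrast between a fixed rule and a learned baseline can be shown in a few lines. This is a hypothetical sketch (the 80% limit and the history values are invented): the fixed rule fires on a cluster that routinely runs hot, while a threshold learned from that cluster's own history does not:

```python
from statistics import mean, stdev

def rule_based(value, limit=80.0):
    """Fixed rule: alert whenever the metric exceeds a hand-picked limit."""
    return value > limit

def learned(value, history, k=3.0):
    """Learned baseline: alert when the value deviates more than
    k standard deviations from historically observed behavior."""
    mu, sigma = mean(history), stdev(history)
    return abs(value - mu) > k * sigma

# This cluster normally runs at ~84% CPU, so 85% is routine here.
history = [82, 85, 84, 83, 86, 84, 85, 83]
print(rule_based(85))         # True: the fixed limit fires a false positive
print(learned(85, history))   # False: 85% is normal for this cluster
```

The same learned check would still catch a genuine excursion (say, 99%) because it lies far outside the historical spread.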

Anomaly detection in practice

Assuming that a pre-trained machine learning model is available, anomaly detection is an ongoing process that requires continuous monitoring, validation, and improvement.

After the machine learning model is trained, the main focus is on:

  • Maintaining and improving its accuracy
  • Adapting to changes in the cluster's behavior
  • Ensuring that the anomaly detection process is aligned with the ever-changing needs of the IT environment and the business

Data Collection and Preprocessing

Continue collecting new time-series data to keep the model updated with the latest cluster behavior.


Online Learning and Adaptation

Implement mechanisms for the model to adapt to changes in cluster behavior over time using new data.
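
One common way to let a statistical baseline adapt incrementally, shown here as a hypothetical sketch rather than the solution's actual mechanism, is Welford's online mean/variance update, which folds in each new observation without retraining from scratch:

```python
class OnlineBaseline:
    """Welford's incremental mean/variance: the baseline adapts
    to each new observation in O(1) time and memory."""
    def __init__(self):
        self.n = 0
        self.mean = 0.0
        self.m2 = 0.0  # sum of squared deviations

    def update(self, x):
        self.n += 1
        delta = x - self.mean
        self.mean += delta / self.n
        self.m2 += delta * (x - self.mean)

    @property
    def std(self):
        return (self.m2 / (self.n - 1)) ** 0.5 if self.n > 1 else 0.0

baseline = OnlineBaseline()
for x in [10, 12, 11, 13, 11, 12]:
    baseline.update(x)
print(round(baseline.mean, 2))  # -> 11.5
```

The running mean and standard deviation can then feed the same deviation checks used at training time, so the notion of "normal" drifts with the cluster.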


Threshold Adjustment

Regularly review and adjust anomaly score thresholds based on model performance and changing business needs.
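
A simple, hypothetical way to review a threshold (the percentile value and score list are illustrative) is to derive it from the empirical distribution of recent anomaly scores, so that only roughly the top fraction of scores would trigger alerts:

```python
def percentile_threshold(scores, pct=0.99):
    """Set the alert threshold at the given percentile of recent
    anomaly scores, so roughly (1 - pct) of scores would alert."""
    ordered = sorted(scores)
    idx = min(int(pct * len(ordered)), len(ordered) - 1)
    return ordered[idx]

recent_scores = [0.1, 0.2, 0.15, 0.3, 0.25, 0.9, 0.2, 0.1, 0.22, 0.18]
print(percentile_threshold(recent_scores, pct=0.8))  # -> 0.3
```

Recomputing this periodically over a sliding window of scores keeps the alert volume roughly constant even as the score distribution shifts with cluster behavior.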


Real-Time Monitoring

Integrate the trained model into the real-time pipeline to process incoming data and identify cluster anomalies.


Alert Management and Reporting

Maintain the alerting system for detected anomalies and improve reporting and visualization capabilities.


Human-in-the-Loop Verification

Keep the mechanism for human verification active, refining accuracy and reducing false positives.


Feedback Loop and Model Improvement

Continuously gather feedback from analysts to fine-tune the model and address shortcomings.


Scalability and Resource Management

Ensure that the detection system remains scalable and resource-efficient as the cluster evolves.


Security and Privacy

Maintain and update security measures to safeguard sensitive data during the anomaly detection process.