Projects

DeepRoute: Experimenting with Large-scale Reinforcement Learning and SDN on Chameleon Testbed and Mininet

As the number of internet users and connected devices continues to multiply, driven by big data and cloud applications, network traffic is growing at an exponential rate. Wide area networks (WANs), in particular, are witnessing very large traffic spikes caused by large file transfers that occupy network links for anywhere from a few minutes to hours, and there is a need to develop innovative ways to manage flows in real time.

In this work, we develop a reinforcement learning approach, in particular the Upper Confidence Bound (UCB) algorithm, to learn optimal paths and reroute traffic to improve network utilization. We present throughput and flow diversion results using Mininet and demo the technique on the Chameleon testbed (using its Bring-Your-Own-Controller [BYOC] functionality). This work is an initial implementation towards DeepRoute, which combines deep reinforcement learning algorithms with SDN controllers to create and route traffic using deployed OpenFlow switches.

Demo showing the DeepRoute agent exploring and exploiting the different network paths from LBNL to KANS and selecting the optimal path given traffic conditions such as bandwidth, throughput, and latency. The DeepRoute agent has a global view of all locations (nodes), where each network location corresponds to a node, edges represent the connectivity between nodes, and edge weights are the distances. Paper1, Paper2, Talk.
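
A minimal sketch of how UCB-style path selection over candidate paths might look. The path names, the reward definition (normalized observed throughput), and the exploration constant are illustrative assumptions, not the DeepRoute implementation itself.

```python
import math
import random

class UCBPathSelector:
    """UCB1-style selection over candidate network paths (illustrative sketch)."""

    def __init__(self, paths, c=2.0):
        self.paths = list(paths)          # e.g. candidate LBNL -> KANS paths
        self.c = c                        # exploration constant (assumed value)
        self.counts = {p: 0 for p in self.paths}
        self.mean_reward = {p: 0.0 for p in self.paths}
        self.t = 0

    def select(self):
        self.t += 1
        # Try every path once before applying the UCB rule.
        for p in self.paths:
            if self.counts[p] == 0:
                return p
        def ucb(p):
            bonus = self.c * math.sqrt(math.log(self.t) / self.counts[p])
            return self.mean_reward[p] + bonus
        return max(self.paths, key=ucb)

    def update(self, path, reward):
        # Reward could be normalized observed throughput on the chosen path.
        self.counts[path] += 1
        n = self.counts[path]
        self.mean_reward[path] += (reward - self.mean_reward[path]) / n

# Toy usage: rewards drawn from hypothetical per-path throughput distributions.
selector = UCBPathSelector(["path_A", "path_B", "path_C"])
true_quality = {"path_A": 0.6, "path_B": 0.8, "path_C": 0.4}
for _ in range(200):
    p = selector.select()
    selector.update(p, random.gauss(true_quality[p], 0.1))
print(max(selector.mean_reward, key=selector.mean_reward.get))  # usually "path_B"
```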

DynamicDeepFlow: An Approach for Identifying Changes in Network Traffic Flow Using Unsupervised Clustering

Understanding flow changes in network traffic has great importance in designing and building robust networking infrastructure. Recent efforts from industry and academia have led to the development of monitoring tools that are capable of collecting real-time flow data, predicting future traffic patterns, and mirroring packet headers. These monitoring tools, however, require offline analysis of the data to distinguish big from small flows and to recognize congestion hot spots in the network, which remains an unfilled gap in research.

In this study, we propose an innovative unsupervised clustering approach, DynamicDeepFlow, for network traffic pattern clustering. DynamicDeepFlow can recognize unseen network traffic patterns based on analysis of rapid flow changes in the historical data. The proposed method consists of a deep learning model, a variational autoencoder, and a shallow learning model, k-means++. The variational autoencoder is used to compress and extract the most useful features from the flow inputs. The compressed and extracted features then serve as inputs to k-means++, which explores the structure hidden in these features and uses them to cluster the network traffic patterns. To the best of our knowledge, this is one of the first attempts to apply a real-time network clustering approach to monitor network operations. Real-world network flow data from the Energy Sciences Network (a network serving the U.S. Department of Energy to support U.S. scientific research) was utilized to verify the performance of the proposed approach in network traffic pattern clustering. The verification results show that the proposed method is able to distinguish anomalous network traffic patterns from normal patterns, and thereby trigger an anomaly flag. Paper, BEST PAPER AWARD.
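
A compact sketch of the two-stage pipeline described above: a variational autoencoder compresses per-interval flow features, and k-means++ clusters the latent codes. Layer sizes, the number of clusters, and the use of PyTorch plus scikit-learn are illustrative assumptions, not the paper's exact configuration.

```python
import torch
import torch.nn as nn
from sklearn.cluster import KMeans

class FlowVAE(nn.Module):
    def __init__(self, n_features=24, latent_dim=4):
        super().__init__()
        self.encoder = nn.Sequential(nn.Linear(n_features, 16), nn.ReLU())
        self.mu = nn.Linear(16, latent_dim)        # mean of latent Gaussian
        self.logvar = nn.Linear(16, latent_dim)    # log-variance of latent Gaussian
        self.decoder = nn.Sequential(nn.Linear(latent_dim, 16), nn.ReLU(),
                                     nn.Linear(16, n_features))

    def forward(self, x):
        h = self.encoder(x)
        mu, logvar = self.mu(h), self.logvar(h)
        z = mu + torch.randn_like(mu) * torch.exp(0.5 * logvar)  # reparameterization
        return self.decoder(z), mu, logvar

def vae_loss(recon, x, mu, logvar):
    # Reconstruction error plus KL divergence to the unit Gaussian prior.
    rec = nn.functional.mse_loss(recon, x, reduction="sum")
    kld = -0.5 * torch.sum(1 + logvar - mu.pow(2) - logvar.exp())
    return rec + kld

def cluster_flows(flows, n_clusters=3, epochs=50):
    """flows: (n_samples, n_features) tensor of per-interval traffic statistics."""
    vae = FlowVAE(n_features=flows.shape[1])
    opt = torch.optim.Adam(vae.parameters(), lr=1e-3)
    for _ in range(epochs):
        recon, mu, logvar = vae(flows)
        loss = vae_loss(recon, flows, mu, logvar)
        opt.zero_grad()
        loss.backward()
        opt.step()
    with torch.no_grad():
        latent = vae.mu(vae.encoder(flows)).numpy()   # compressed flow features
    # k-means++ initialisation is scikit-learn's default.
    return KMeans(n_clusters=n_clusters, init="k-means++", n_init=10).fit_predict(latent)

# Toy usage on synthetic per-interval flow statistics.
flows = torch.randn(200, 24)
labels = cluster_flows(flows)
print(labels[:10])
```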

NetPredict: Real-time Deep Learning Predictions for Network Traffic Forecasting

Predicting network traffic in advance can impact how engineers configure and utilize network bandwidth to manage available capacity, prevent congestion and better use network resources.

However, in large wide area networks (WANs), network traffic is generated in immense volumes and patterns constantly change with user behavior, causing prediction models to lose accuracy over time.

In this work, we demonstrate NetPredict, a real-time data framework for deploying and testing multiple ML models for network traffic prediction, focused on hourly, weekly, and monthly forecasts. NetPredict allows users to deploy their deep learning or statistical prediction models to predict link bandwidth and measure how well they perform in real time. Through an interactive interface, engineers can select the least-used routing paths to prevent congestion. NetPredict also includes a trust dashboard that shows how well ML models perform against real-time network traffic.

NetPredict is designed to deploy and test multiple prediction models, statistical and machine learning, and show how well each model performs with real-time data. This helps build trust in the predictions and allows users and engineers to switch between models for their decision-making.
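
A minimal sketch of the multi-model idea described above: several link-bandwidth predictors are registered side by side and scored against observed traffic, the kind of number a trust dashboard could display. The registry class, predictor functions, and MAPE as the score are assumptions for illustration, not NetPredict's actual code.

```python
import numpy as np

def mape(y_true, y_pred):
    """Mean absolute percentage error over one forecast horizon."""
    y_true, y_pred = np.asarray(y_true, float), np.asarray(y_pred, float)
    return float(np.mean(np.abs((y_true - y_pred) / y_true)) * 100)

class ModelRegistry:
    """Holds several link-bandwidth predictors and scores them against observed traffic."""

    def __init__(self):
        self.models = {}      # name -> callable(history) -> forecast array

    def register(self, name, predict_fn):
        self.models[name] = predict_fn

    def compare(self, history, observed_next):
        """Return {model name: MAPE} for a single forecast horizon."""
        return {name: mape(observed_next, predict_fn(history))
                for name, predict_fn in self.models.items()}

# Toy usage with two hypothetical predictors over hourly link utilization (Gbps).
registry = ModelRegistry()
registry.register("naive_last_value", lambda h: np.repeat(h[-1], 3))
registry.register("moving_average", lambda h: np.repeat(np.mean(h[-6:]), 3))
history = np.array([4.1, 4.3, 4.0, 4.6, 4.8, 5.0, 5.2, 5.1])
observed_next = np.array([5.3, 5.4, 5.2])
print(registry.compare(history, observed_next))
```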

NetPredict demo showing the graphical user interface. A source and destination are selected, and the optimal network path is calculated considering traffic conditions such as bandwidth, throughput, and latency. NetPredict focuses on hourly, weekly, and monthly predictions, and it allows users to deploy their deep learning or statistical prediction models to predict link bandwidth and measure how well they perform in real time. Link.

Deep Reinforcement Learning based Control for two-dimensional Coherent Laser beam Combining

High-power lasers are indispensable in advanced particle accelerators and in many other applications, both in industrial initiatives and in fundamental research. The past decade has seen rapid development of coherent beam combining (CBC) of fiber lasers, demonstrating great potential for breaking through the power limitation of a single laser beam while maintaining good beam quality; however, controlling the phase of a coherently combined laser beam, like optimal tuning of particle accelerators, remains a very challenging problem. There has been increasing interest in machine learning techniques, especially deep reinforcement learning, in both the particle accelerator community and academia, where researchers have proposed approaches for attaining an optimal working point and recovering performance after the machine drifts. For instance, a common reference beam can be used for phase detection, where each input beam interferes with a fraction of the reference in free space to measure the phase errors. Stochastic parallel gradient descent (SPGD) is another technique used for phase control, in which an optimum set of phase values is found by random search. However, all of these techniques come with intrinsic limitations. Traditional coherent beam combining stabilization methods rely exclusively on reacting to observed data, and improving the latency of the feedback loop while controlling the beam intensity remains a challenging task.

In this work, we demonstrate deep reinforcement learning based phase stabilization control for two-dimensional filled-aperture diffractive coherent combining. Various approaches were tested on simulation and experimental data for combining 3 × 3 beams using pattern recognition.

Demo showing a snapshot of the real setup and how stability is achieved, keeping the center beam intensity highest. In the real system, for example, the noise bandwidth is about 100 Hz, the sampling rate is in the kHz range, and there are hundreds of these laser beams, so fast feedback is needed to control and stabilize the system against unwanted noise in real time. The main objective is to always keep the center beam intensity highest irrespective of environmental perturbations. Paper1, Paper2, Paper3.
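
A toy, gym-style environment sketch of the control problem described above: observations carry the combined-beam center intensity, actions are per-beam phase corrections on the 3 × 3 array, and the reward is the normalized center-beam intensity. The idealized phase-sum physics, noise model, and dimensions are assumptions; this is not the experimental setup or the trained DRL agent.

```python
import numpy as np

class BeamCombiningEnv:
    """Toy 3x3 coherent-combining environment (idealized, for illustration only)."""

    def __init__(self, n_beams=9, noise_std=0.05, seed=0):
        self.n_beams = n_beams
        self.noise_std = noise_std            # stand-in for the ~100 Hz phase noise
        self.rng = np.random.default_rng(seed)
        self.phases = None

    def reset(self):
        self.phases = self.rng.uniform(-np.pi, np.pi, self.n_beams)
        return self._observe()

    def step(self, action):
        """action: per-beam phase correction (radians), e.g. from a DRL policy."""
        self.phases = self.phases + action
        self.phases += self.rng.normal(0.0, self.noise_std, self.n_beams)  # drift
        obs = self._observe()
        reward = obs[0]                        # normalized center-beam intensity
        return obs, reward, False, {}

    def _observe(self):
        # Center-beam intensity is maximal when all phases are aligned.
        field = np.exp(1j * self.phases).sum()
        center_intensity = np.abs(field) ** 2 / self.n_beams ** 2
        return np.concatenate(([center_intensity], np.cos(self.phases), np.sin(self.phases)))

# Toy usage: a hand-coded controller that nudges every phase toward the mean phase.
env = BeamCombiningEnv()
obs = env.reset()
for _ in range(200):
    mean_phase = np.angle(np.exp(1j * env.phases).mean())
    action = 0.5 * np.angle(np.exp(1j * (mean_phase - env.phases)))
    obs, reward, done, info = env.step(action)
print(round(float(reward), 3))   # approaches 1.0 as the beams phase-lock
```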

Dynamic Graph Neural Network for Traffic Forecasting in Wide Area Networks

Wide area networking infrastructures (WANs), particularly science and research WANs, are the backbone for moving large volumes of scientific data between experimental facilities and data centers. With demands growing at exponential rates, these networks are struggling to cope with large data volumes, real-time responses, and overall network performance. Network operators are increasingly looking for innovative ways to manage the limited underlying network resources. Forecasting network traffic is a critical capability for proactive resource management, congestion mitigation, and dedicated transfer provisioning. To this end, we propose a non-autoregressive, graph-based neural network for multistep network traffic forecasting. Specifically, we develop a dynamic variant of diffusion convolutional recurrent neural networks to forecast traffic in research WANs. We evaluate the efficacy of our approach on real traffic from ESnet, the U.S. Department of Energy's dedicated science network. Our results show that compared to classical forecasting methods, our approach explicitly learns the dynamic nature of spatiotemporal traffic patterns, showing significant improvements in forecasting accuracy. Our technique can surpass existing statistical and deep learning approaches by achieving ~20% mean absolute percentage error for multiple hours of forecasts despite dynamic network traffic settings.

Dynamic-DCRNN (D-DCRNN) model architecture. It takes an adjacency matrix computed from the current state of the network traffic among the sites of the WAN topology, together with the traffic as a time series at each node of the graph. The encoder-decoder deep neural network is used to forecast the network traffic for multiple time steps. Paper.
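
A small sketch of the kind of inputs the caption describes: a dynamic adjacency matrix derived from recent site-to-site traffic volumes, plus per-node sliding windows for an encoder-decoder forecaster. The normalization scheme, window lengths, and synthetic data are assumptions, not the paper's exact preprocessing.

```python
import numpy as np

def dynamic_adjacency(traffic_matrix):
    """Build a row-normalized adjacency matrix from recent site-to-site traffic.

    traffic_matrix: (n_sites, n_sites) array, entry [i, j] = bytes sent i -> j
    in the current window. Normalization choice is an illustrative assumption.
    """
    A = traffic_matrix.astype(float)
    A = A + A.T                                   # treat links as undirected
    np.fill_diagonal(A, 0.0)
    row_sums = A.sum(axis=1, keepdims=True)
    return np.divide(A, row_sums, out=np.zeros_like(A), where=row_sums > 0)

def make_windows(node_series, n_in=12, n_out=4):
    """Slice per-node traffic series into encoder inputs and decoder targets.

    node_series: (n_steps, n_sites) array of traffic per node over time.
    Returns X: (n_samples, n_in, n_sites) and Y: (n_samples, n_out, n_sites).
    """
    X, Y = [], []
    for t in range(len(node_series) - n_in - n_out + 1):
        X.append(node_series[t:t + n_in])
        Y.append(node_series[t + n_in:t + n_in + n_out])
    return np.stack(X), np.stack(Y)

# Toy usage with 5 sites and 100 time steps of synthetic traffic.
rng = np.random.default_rng(1)
traffic = rng.gamma(2.0, 3.0, size=(5, 5))
series = rng.gamma(2.0, 3.0, size=(100, 5))
A = dynamic_adjacency(traffic)
X, Y = make_windows(series)
print(A.shape, X.shape, Y.shape)    # (5, 5) (85, 12, 5) (85, 4, 5)
```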

NetGraf: An End-to-End Learning Network Monitoring Service

Network monitoring services are of enormous importance to ensure optimal performance is being delivered and to help determine failing services. Particularly for large data transfers, checking key performance indicators like throughput, packet loss, and latency can make or break experiment results. However, network monitoring tools are very diverse in the metrics they collect and depend on the devices installed. Additionally, there are limited tools that can learn and determine the cause of degraded performance. This paper presents NetGraf, a novel end-to-end learning monitoring system that utilizes current monitoring tools, merges multiple data sources into one dashboard for ease of use, and provides machine learning libraries to analyze the data and perform real-time anomaly detection. Using a database backend, NetGraf can learn performance trends and show users whether network performance has degraded. We demonstrate how NetGraf can easily be deployed through automation services and linked to multiple monitoring sources to collect data. Through its machine learning component and the merging of various data sources, NetGraf aims to fulfill the need for holistic learning network telemetry monitoring. To the best of our knowledge, this is the first end-to-end learning monitoring service. We demonstrate its use on two network setups to showcase its impact.

The NetGraf architecture comprises the Network, Data Aggregation, Machine Learning, and Visualization modules. Paper, Code, Slides.
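
A minimal sketch of the kind of anomaly detection the Machine Learning module could apply to merged metrics (throughput, latency, packet loss) from the Data Aggregation module. The use of an Isolation Forest, the feature columns, and the thresholds are assumptions, not NetGraf's actual machine learning library.

```python
import numpy as np
from sklearn.ensemble import IsolationForest

def flag_degraded_intervals(metrics, contamination=0.05):
    """Flag intervals whose merged metrics look anomalous.

    metrics: (n_intervals, n_features) array, e.g. columns for throughput (Gbps),
    latency (ms), and packet loss (%) pulled from the aggregated monitoring sources.
    Returns a boolean mask, True where the interval is flagged as degraded.
    """
    model = IsolationForest(contamination=contamination, random_state=0)
    labels = model.fit_predict(metrics)          # -1 = anomaly, 1 = normal
    return labels == -1

# Toy usage: mostly healthy intervals, plus a few with low throughput and high loss.
rng = np.random.default_rng(2)
healthy = np.column_stack([rng.normal(9.0, 0.3, 95),     # throughput
                           rng.normal(12.0, 1.0, 95),    # latency
                           rng.normal(0.01, 0.005, 95)]) # packet loss
degraded = np.array([[2.0, 45.0, 1.5]] * 5)
metrics = np.vstack([healthy, degraded])
mask = flag_degraded_intervals(metrics)
print(mask[-5:])   # the injected degraded intervals should be flagged
```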

Net-Preflight Check: Using File transfer to Measure Network Performance before Large Data Transfers

During bulk data transfers for exascale scientific applications, measuring the available throughput is very useful for route selection in high-speed networks, quality-of-service verification, and traffic engineering. Recent years have seen a surge in available-throughput estimation tools, especially in Research and Education (R&E) networks. Some tools have been proposed and evaluated in simulation and over a limited number of Internet paths. However, there is still significant uncertainty in the performance and flexibility of these tools at scale. Furthermore, a primary shortcoming of some existing tools is the lack of a network performance history, a memory that stores previous configurations and network measurements.


This work introduces Net-Preflight, a simple, lightweight end-to-end tool that measures available throughput, performs traceroute, and maintains a measurement memory, in contrast to existing tools like iperf. Our tool focuses on throughput measurement, flexibility, a retentive memory, security, and performance. We conduct experiments between multiple Data Transfer Node (DTN) setups in isolated and public network environments to measure how throughput measurements fare in the two domains. In all scenarios, Net-Preflight produces results comparable to established tools and hence positions itself as a complementary tool for situations where the deployment of iperf or perfSONAR is not possible. In addition, Net-Preflight features a retentive memory to easily compare past and present measurements. Our analysis reveals that using socket and file transfer protocols performs well for initial measurements, and also indicates that parallel TCP streams are equivalent to using a large maximum segment size on a single connection in the absence of congestion. Here, we lay the foundation for a new monitoring system for DTN bulk transfers that targets end users who require optimum network performance.

Throughput measurement for large data transfer over an isolated network (CHI@UC to CHI@TACC) and a public network (NERSC DTN to CHI@UC DTN).
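
A minimal illustration of socket-based throughput probing in the spirit of the tool described above: a sender streams a fixed payload and the receiver reports MB/s. The port, payload size, loopback endpoints, and single-stream design are assumptions for the sketch; this is not Net-Preflight's code.

```python
import socket
import threading
import time

PAYLOAD_MB = 64                   # assumed probe size
CHUNK = 64 * 1024
HOST, PORT = "127.0.0.1", 5201    # loopback demo; a real probe targets a remote DTN

def receiver(results):
    srv = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    srv.bind((HOST, PORT))
    srv.listen(1)
    conn, _ = srv.accept()
    received, start = 0, time.perf_counter()
    while True:
        data = conn.recv(CHUNK)
        if not data:
            break
        received += len(data)
    elapsed = time.perf_counter() - start
    results["throughput_MBps"] = received / (1024 * 1024) / elapsed
    conn.close()
    srv.close()

def sender():
    cli = socket.create_connection((HOST, PORT))
    payload = b"\x00" * CHUNK
    for _ in range((PAYLOAD_MB * 1024 * 1024) // CHUNK):
        cli.sendall(payload)
    cli.close()

results = {}
rx = threading.Thread(target=receiver, args=(results,))
rx.start()
time.sleep(0.2)                   # give the receiver time to start listening
sender()
rx.join()
print(f"measured throughput: {results['throughput_MBps']:.1f} MB/s")
```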

DeepRoute Mininet Implementation: Herding Elephant and Mice Flows with Reinforcement Learning

Wide area networks are built with enough resilience and flexibility to offer many paths between multiple pairs of end hosts. To prevent congestion, current practice involves numerous tweaks to routing tables to optimize path computation, such as diverting flows to alternate paths or load balancing. However, this process is slow, costly, and requires difficult online decision-making about appropriate settings such as flow arrival rate, workload, and the current network environment. Inspired by recent advances in using AI to manage resources, we present DeepRoute, a model-less reinforcement learning approach that translates the path computation problem into a learning problem. Learning from the network environment, DeepRoute learns strategies to manage arriving elephant and mice flows to improve the average path utilization in the network. Compared to other strategies, such as prioritizing certain flows or making random decisions, DeepRoute is shown to improve average network path utilization to 30% and potentially reduce possible congestion across the whole network.


This work presents results in simulation and also shows how DeepRoute can be demonstrated with a Mininet implementation.

Topology used in the simulation model (bu = bandwidth units occupied, tu = time units occupied), and Q-values for the local and global levels. Paper.
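
A tabular Q-learning sketch of the flow-placement idea above: the state combines coarse path-utilization buckets with the arriving flow type (elephant or mice), the action assigns the flow to a path, and the reward favors keeping the chosen path lightly loaded. The discretization, loads, and hyperparameters are assumptions, not the paper's simulation model.

```python
import random
from collections import defaultdict

ACTIONS = [0, 1, 2]                 # candidate paths
ALPHA, GAMMA, EPS = 0.1, 0.9, 0.1   # learning rate, discount, exploration (assumed)

Q = defaultdict(float)              # Q[(state, action)] -> value

def discretize(utilizations, flow_type):
    """State = coarse utilization bucket per path plus the arriving flow type."""
    buckets = tuple(min(int(u * 4), 3) for u in utilizations)   # 4 buckets per path
    return buckets + (flow_type,)

def choose_action(state):
    if random.random() < EPS:
        return random.choice(ACTIONS)
    return max(ACTIONS, key=lambda a: Q[(state, a)])

def update(state, action, reward, next_state):
    best_next = max(Q[(next_state, a)] for a in ACTIONS)
    Q[(state, action)] += ALPHA * (reward + GAMMA * best_next - Q[(state, action)])

# Toy episode loop: elephants load a path heavily, mice lightly; the reward is a
# stand-in for keeping average path utilization balanced across the network.
utils = [0.0, 0.0, 0.0]
for step in range(5000):
    flow_type = "elephant" if random.random() < 0.2 else "mice"
    load = 0.3 if flow_type == "elephant" else 0.02
    state = discretize(utils, flow_type)
    action = choose_action(state)
    utils[action] = min(utils[action] + load, 1.0)
    reward = 1.0 - utils[action]
    utils = [max(u - 0.05, 0.0) for u in utils]          # flows complete over time
    update(state, action, reward, discretize(utils, flow_type))
```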

Predicting WAN traffic volumes using Fourier and multivariate SARIMA approach

Network traffic has been a critical issue that has attracted massive attention in network operations research and industry. This paper tackles the need to understand traffic patterns across a high-speed network topology by developing multivariate statistical models that incorporate seasonality, peak frequencies, and link relationships to improve future predictions.


Using Fourier transforms to extract seasons and peak frequencies, we perform seasonality tests and ARIMA measures to determine optimal parameters to use in our model. Our study shows that network traffic is non-stationary and possesses seasonality, making SARIMA the most suitable approach. Furthermore, based on network traces collected from multiple time zones of the R&E WAN, our results indicate an improved prediction accuracy of 93.7% from the multivariate model with better RMSE and smaller confidence intervals. Our work provides critical insights into studying network traffic and prediction methods necessary for future research in network capacity management.

Studying network traffic as a time-series problem: FFT is used to extract frequency patterns across the traffic traces and to validate the variability among the traces used in this study. The FFT represents a signal as a finite sum of sine waves and extracts its symmetric frequencies; a variant of the discrete Fourier transform (DFT) is used to extract the positive frequencies in the dataset. Paper.
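
A short sketch of the two steps described above: use the FFT to find the dominant seasonal period in a traffic trace, then fit a SARIMA model with that seasonality. The synthetic hourly trace and the (p, d, q)(P, D, Q, s) orders are assumptions, not the paper's fitted parameters.

```python
import numpy as np
from statsmodels.tsa.statespace.sarimax import SARIMAX

def dominant_period(series):
    """Use the FFT to find the strongest positive frequency and return its period."""
    detrended = series - series.mean()
    spectrum = np.abs(np.fft.rfft(detrended))
    freqs = np.fft.rfftfreq(len(series), d=1.0)     # d=1.0 -> one sample per hour
    peak = np.argmax(spectrum[1:]) + 1              # skip the zero-frequency bin
    return int(round(1.0 / freqs[peak]))

# Synthetic hourly traffic with a daily (24 h) season, standing in for a WAN trace.
rng = np.random.default_rng(3)
t = np.arange(24 * 21)
traffic = 10 + 3 * np.sin(2 * np.pi * t / 24) + rng.normal(0, 0.5, t.size)

season = dominant_period(traffic)                   # expected: 24
model = SARIMAX(traffic, order=(1, 0, 1), seasonal_order=(1, 1, 1, season))
fitted = model.fit(disp=False)
forecast = fitted.forecast(steps=24)                # next day of hourly predictions
print(season, forecast[:3])
```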

Failure prediction using machine learning in a virtualised HPC system and application

Failure is an increasingly important issue in high performance computing and cloud systems. As large-scale systems continue to grow in scale and complexity, mitigating the impact of failure and providing accurate predictions with sufficient lead time remains a challenging research problem. Traditional fault-tolerance strategies such as regular checkpointing and replication are not adequate because of the emerging complexities of high performance computing systems. This makes an effective and proactive failure management approach, aimed at minimizing the effect of failure within the system, all the more important. With the advent of machine learning techniques, the ability to learn from past information to predict future patterns of behaviour makes it possible to predict potential system failures more accurately.


In this work, we explore the predictive abilities of machine learning by applying a number of algorithms to improve the accuracy of failure prediction. We have developed a failure prediction model using time series and machine learning, and performed comparison-based tests on the prediction accuracy. The primary algorithms we considered are the support vector machine (SVM), random forest (RF), k-nearest neighbors (KNN), classification and regression trees (CART), and linear discriminant analysis (LDA). Experimental results indicate that our model using SVM achieves an average failure prediction accuracy of 90%, outperforming the other algorithms. This finding implies that our method can effectively predict possible future system and application failures within the system.

Proposed system model and architecture. Failure prediction using machine learning in a virtualised HPC system and application. Paper
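
A minimal sketch of the algorithm comparison described above, evaluating SVM, RF, KNN, CART, and LDA with cross-validation. The synthetic dataset stands in for the node and application health features; feature counts and class balance are assumptions, not the NERSC data.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for node/application health features labelled fail / no-fail.
X, y = make_classification(n_samples=1000, n_features=12, n_informative=6,
                           weights=[0.8, 0.2], random_state=0)

models = {
    "SVM": SVC(),
    "RF": RandomForestClassifier(n_estimators=100, random_state=0),
    "KNN": KNeighborsClassifier(),
    "CART": DecisionTreeClassifier(random_state=0),
    "LDA": LinearDiscriminantAnalysis(),
}

for name, model in models.items():
    scores = cross_val_score(model, X, y, cv=5, scoring="accuracy")
    print(f"{name}: {scores.mean():.3f} +/- {scores.std():.3f}")
```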

Failover strategy for fault tolerance in cloud computing environment

Cloud fault tolerance is an important issue for cloud computing platforms and applications. In the event of an unexpected system failure or malfunction, a robust fault-tolerant design may allow the cloud to continue functioning correctly, possibly at a reduced level, instead of failing completely. To ensure high availability of critical cloud services, application execution, and hardware performance, various fault-tolerant techniques exist for building self-autonomous cloud systems. In comparison with current approaches, this paper proposes a more robust and reliable architecture that uses an optimal checkpointing strategy to ensure high system availability and reduced task service finish time. Using pass rates and virtualized mechanisms, the proposed smart failover strategy (SFS) scheme uses components such as a cloud fault manager, cloud controller, cloud load balancer, and a selection mechanism, providing fault tolerance via redundancy, optimized selection, and checkpointing.


In our approach, the cloud fault manager repairs faults generated before the task time deadline is reached, blocking unrecoverable faulty nodes as well as their virtual nodes. The scheme is also able to remove temporary software faults from recoverable faulty nodes, thereby making them available for future requests. We argue that the proposed SFS algorithm makes the system highly fault tolerant by considering both forward and backward recovery using diverse software tools. Compared with existing approaches, preliminary experiments with the SFS algorithm indicate an increase in pass rates and a consequent decrease in failure rates, showing overall good performance in task allocation. We present these results using experimental validation tools, in comparison with other techniques, laying a foundation for a fully fault-tolerant infrastructure-as-a-service cloud environment.

Fault tolerance architecture. The fault tolerance model (FTm) of a cloud computing system is represented as a finite ordered list, a sequence of five elements. Paper.
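
A toy sketch of the checkpoint-and-failover idea behind the scheme above: task progress is checkpointed periodically, and on a simulated node failure the task resumes from its last checkpoint on another node rather than restarting from scratch. The checkpoint interval, failure probability, and task model are assumptions, not the SFS algorithm itself.

```python
import random

CHECKPOINT_INTERVAL = 10          # steps between checkpoints (assumed)
FAILURE_PROB = 0.02               # per-step node failure probability (assumed)

def run_task(total_steps=100, nodes=("node-1", "node-2", "node-3")):
    checkpoints = {}              # would live in durable storage in a real system
    node = nodes[0]
    step = 0
    while step < total_steps:
        if random.random() < FAILURE_PROB:
            # Node failure: block the faulty node and resume from the last
            # checkpoint on a healthy node instead of rerunning the whole task.
            step = checkpoints.get("step", 0)
            healthy = [n for n in nodes if n != node]
            node = random.choice(healthy)
            continue
        step += 1                 # one unit of task work
        if step % CHECKPOINT_INTERVAL == 0:
            checkpoints["step"] = step
    return node, step

print(run_task())
```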

Time-series prediction for HPC faults on NERSC

Developed an efficient failure prediction data-driven model using machine learning and conducted an investigation into the trend of the failed components in a high-performance infrastructure.
Studied and developed an effective failure prediction model focusing on high-performance and cloud data centres, using in-production data from the National Energy Research Scientific Computing Center (NERSC).
Developed a time series model for failure prediction, then tested and validated the prediction accuracy of the developed model using data collected for storage, networking, and computational machines at NERSC over a five-year period.

Analysis of a Google Trace Dataset Collected from a 12k-Node Cluster (Borg Cluster Management System)

Analysed a publicly available Google dataset containing workload and scheduler events emitted by the Borg cluster management system in a cluster of over 12,000 nodes during a one-month period. Studied how machines in the cluster are managed and how jobs are scheduled and processed. Investigated how the cluster resources are utilized, especially the amount of useful, wasted, and idle resources.
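
A small sketch of one way to split resources into useful, wasted, and idle shares from request and usage records. The column names and the single flat table are simplifying assumptions; the real Google trace spreads this information across separate task-event and task-usage tables with different field names.

```python
import pandas as pd

# Hypothetical, simplified columns standing in for the trace's task records.
usage = pd.DataFrame({
    "job_id":        [1, 1, 2, 2, 3],
    "cpu_requested": [0.50, 0.50, 0.25, 0.25, 1.00],
    "cpu_used":      [0.30, 0.45, 0.05, 0.10, 0.90],
    "final_status":  ["finish", "finish", "kill", "kill", "finish"],
})

# "Useful" work: CPU consumed by tasks that eventually finished.
# "Wasted" work: CPU consumed by tasks that were killed or failed.
# "Idle": requested but never used capacity.
useful = usage.loc[usage.final_status == "finish", "cpu_used"].sum()
wasted = usage.loc[usage.final_status != "finish", "cpu_used"].sum()
idle = (usage.cpu_requested - usage.cpu_used).clip(lower=0).sum()

total_requested = usage.cpu_requested.sum()
print(f"useful: {useful / total_requested:.1%}, "
      f"wasted: {wasted / total_requested:.1%}, "
      f"idle: {idle / total_requested:.1%}")
```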

Network Security Assessment and Evaluation (BradStack)

Conducted a detailed security analysis of the BradStack environment and addressed the discovered vulnerabilities, ensuring the entire infrastructure is secure. Conducted a comparative security analysis of an enterprise LAN setup in the BradStack environment against a physical enterprise LAN setup. This was followed by a web application vulnerability assessment, as well as a comparison of two existing vulnerability assessment tools against BradStack. A comprehensive penetration test of the OpenStack core service APIs was also conducted to discover vulnerabilities.

BradStack: Building OpenStack from the Ground Up

Deployed a customized multi-node OpenStack installation comprising Control, Compute, and Network nodes, where each node was configured with two network interfaces: one for the external network and one for the management network used for node connectivity. Each node was prepared and configured with the following services (Keystone, Glance, Cinder, Nova, Neutron), and several experiments were conducted (penetration testing, Wireshark captures, and network analysis using a packet analyser). Deployed the BradStack cloud infrastructure using the developed OpenStack installation toolkit.