Cloud Computing Task Scheduler with Deep Reinforcement Learning via Graph and Attention Mechanisms

Abstract

Given the ever-increasing production of data on the Internet and the growing volume of data to be processed, approaches relying on distributed computer systems are becoming increasingly popular. The spread of Internet of Things (IoT) networks and the high volume of data they generate further underline the fundamental need for technologies that increase data processing speed in distributed systems. Responding to all user requests as quickly and optimally as possible has been a major concern for researchers in recent years. One way to achieve this efficiency is an appropriate scheduler that assigns suitable hardware and a proper execution order to each incoming request. Without a suitable scheduler, a cloud system loses numerous requests, while inefficient allocation slows processing, increases costs, wastes energy, and harms the environment. A suitable scheduler therefore not only significantly improves the system's performance but also extends the lifespan of the network's hardware and reduces energy consumption.

Most existing studies in this field pursue objectives such as faster response times, lower user costs, reduced energy consumption, load balancing across resources, higher request throughput, and greater availability in order to build a more efficient scheduler. To this end, greedy, heuristic, meta-heuristic, machine-learning, deep-learning, and reinforcement-learning algorithms have all been applied.

In the present study, a new approach based on deep reinforcement learning is introduced, which utilizes graphs to improve its performance. Since processing graph data requires graph neural networks, the following sections address graph mining in addition to deep reinforcement learning. Moreover, initial studies in cloud resource management highlighted the need for attention-based networks; transformer models are therefore used as the decision-making component of the reinforcement learning model, adding the ability to discover relationships between submitted requests, their execution order, and the available resources. The present study also includes evaluations, implementations, and proposals for existing cloud system simulators, which are discussed further below.

Summary

Cloud computing systems deliver distributed services at both the hardware and software levels. Today they operate under four paradigms: Hardware as a Service (HaaS), Infrastructure as a Service (IaaS), Platform as a Service (PaaS), and Software as a Service (SaaS). Each cloud system falls under one of these paradigms and relies on the resources available in the cloud to provide services. Typically, a cloud environment consists of multiple data centers, each containing several virtual machines; cloud environments can also include storage components. Cloud environments come in three categories: public, private, and hybrid, where the hybrid category provides services spanning both the public and private domains.

Although the concept of cloud computing is not new in the computing world, the need for this type of computing keeps growing. The high cost of hardware maintenance and repair, the faster growth of computational needs relative to hardware capabilities, the large volume of data produced in various scientific and technological fields, and the proliferation of big data have all created a high demand for cloud systems. Traditional database approaches have migrated to distributed systems to keep pace with the speed of data production. Moreover, the strong capabilities of cloud and edge computing in providing services across all areas of computing, together with their constant availability, have attracted a large number of users and led to widespread service provision.

In cloud computing, users rent the necessary resources and send their requests to the system. User requests are fulfilled by allocating a virtual machine in the cloud infrastructure; system resources are assigned to these virtual machines for the duration of execution and released once the work completes.

Allocating a virtual machine means deciding which host a request is sent to and with what amount of resources, so as to optimize the system's goals. This is one of the main challenges faced daily in cloud computing.

User requests (jobs) are divided into multiple tasks, and each request is modeled as a directed acyclic graph (DAG), where nodes represent tasks and edges represent dependencies between tasks within a request. Outgoing edges represent a node's output data and incoming edges its input data; a task therefore cannot execute until all required data from the previous stage are available. Requests typically have very large and complex graphs. In cloud environments, job scheduling refers to the process and strategy of allocating and releasing resources for the submitted requests in a specific order.
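To make the dependency semantics concrete, the following sketch (with hypothetical task names, not taken from any specific workload) encodes a small job DAG and derives which tasks are runnable from data availability alone:

```python
# Illustrative sketch: a job as a directed acyclic graph, where a task
# becomes runnable only once all of its predecessors (i.e., the tasks
# producing its input data) have completed. Task names are hypothetical.

job_dag = {
    # task_id: set of predecessor task_ids whose outputs it consumes
    "extract": set(),
    "clean":   {"extract"},
    "train":   {"clean"},
    "report":  {"clean", "train"},
}

def runnable_tasks(dag, completed):
    """Return tasks whose input data are fully available."""
    return [t for t, deps in dag.items()
            if t not in completed and deps <= completed]

completed = {"extract"}
print(runnable_tasks(job_dag, completed))  # -> ['clean']
```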

The rapid growth of high-capacity cloud environments on the one hand, and the increasing demand for these environments on the other, make implementing a suitable job-scheduling mechanism highly complex. Finding an optimal solution becomes very challenging when the number of requests is high and the cloud is large. Inefficient job scheduling wastes time and infrastructure and significantly lowers system performance. The scheduling problem is NP-hard: exact approaches such as dynamic programming become intractable as the problem grows, so in practice only approximate solutions are feasible.

Given the points above, three topics arise when discussing the scheduling problem. The first is resource allocation, in which the goal is to assign heterogeneous, distributed resources to workload tasks so as to achieve the optimization goals. The second is the proper ordering of tasks on a resource: incoming tasks have different execution times and instruction counts, and placing them in the right order to maximize resource utilization while reducing delay and energy consumption has a significant impact on scheduling quality. The third is the ability to consider the DAG efficiently. Most prior research has chosen only one or two of these topics, as studying all three together is very challenging. Due to this complexity, the initial focus of this research is on the first and third topics, i.e., solving the problem so that resource allocation is performed properly and the task graph is taken into account; a simple baseline for the first topic is sketched below.
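As a point of reference for the first topic, the sketch below implements a minimal earliest-available-machine greedy allocator. All attributes (MIPS ratings, instruction counts) are illustrative assumptions, and the baseline deliberately ignores the task graph, which is precisely the gap the proposed method targets:

```python
# Minimal greedy baseline (hypothetical attributes): assign each ready task
# to the virtual machine that becomes idle earliest. No DAG structure or
# inter-task relationship is considered here.

from dataclasses import dataclass

@dataclass
class VM:
    vm_id: int
    mips: float            # processing speed (million instructions per second)
    free_at: float = 0.0   # simulated time at which the VM becomes idle

def greedy_assign(tasks, vms):
    """tasks: list of (task_id, instruction_count); returns (task, vm, start) triples."""
    schedule = []
    for task_id, instructions in tasks:
        vm = min(vms, key=lambda v: v.free_at)       # earliest-available machine
        start = vm.free_at
        vm.free_at = start + instructions / vm.mips  # occupy the machine
        schedule.append((task_id, vm.vm_id, start))
    return schedule

vms = [VM(0, mips=1000.0), VM(1, mips=2000.0)]
print(greedy_assign([("t1", 4000), ("t2", 2000), ("t3", 1000)], vms))
```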

Deep reinforcement learning is an approach capable of processing and tackling large-scale problems. However, it needs a structure that can handle the large amount of information covering the different facets of the problem, such as requests, tasks, virtual machines, and physical machines, in order to achieve multi-objective optimization and address the open issues in cloud resource management. The ideas and innovations in deep reinforcement learning help in solving such problems, so from the very beginning of approaching cloud resource management and scheduler design, the decision was made to focus specifically on deep reinforcement learning. The next question was how to structure it to best address the scheduling problem. Finding connections between tasks, or between tasks and virtual machines, appeared helpful. On the one hand, treating tasks as a simple queue either ignores many priorities or consumes a large amount of computation and time to respect them; on the other hand, treating whole requests as a queue does not expose the relationships between requests. Based on these observations, building a graph structure, applying graph neural networks, and then using a transformer network to consider the entities jointly appears to outperform existing methods. Since a graph has a relatively open structure, it can accommodate a large volume of information, and a transformer network lets the decision-making process consider all pending requests at once. The expressive power of graphs allows node information to be represented as vectors, giving the information a complete machine-readable meaning and enabling the model to understand connections between nodes.
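The following PyTorch sketch illustrates the intended shape of such a decision model; the dimensions, layer counts, and input encoding are illustrative assumptions rather than the thesis's exact configuration:

```python
# Architectural sketch (assumed shapes, not the thesis's configuration):
# task/VM feature vectors are embedded, a transformer encoder lets every
# token attend to every other one, and a linear head scores each
# (task, machine) placement.

import torch
import torch.nn as nn

class AttentionScheduler(nn.Module):
    def __init__(self, feat_dim=16, d_model=64, n_heads=4, n_layers=2):
        super().__init__()
        self.embed = nn.Linear(feat_dim, d_model)         # node features -> tokens
        layer = nn.TransformerEncoderLayer(d_model, n_heads, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, n_layers)
        self.head = nn.Linear(d_model, 1)                 # one score per token

    def forward(self, pair_feats):
        # pair_feats: (batch, n_tasks * n_vms, feat_dim), each token a
        # concatenation of one task's and one VM's features
        tokens = self.encoder(self.embed(pair_feats))
        return self.head(tokens).squeeze(-1)              # placement logits

model = AttentionScheduler()
logits = model(torch.randn(1, 5 * 4, 16))                 # 5 tasks x 4 VMs
action = logits.argmax(dim=-1)                            # greedy placement choice
print(logits.shape, action)
```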

Considering the points raised so far, this thesis set out to explore a three-stage model in which each part performs its function separately and the output of each stage feeds the others: simulation and graph creation, graph representation, and decision-making with transformers trained by deep reinforcement learning.

In the graph creation phase, the goal is to build hyper-graphs that are updated based on new requests or on the decisions made in the decision-making step. In the second phase, a representation is computed for the graph nodes, which makes the graph more effective as input to the transformer models. Finally, in the third phase, decision-making is carried out with the help of transformers. The output of the third phase is fed back into the cloud simulator, and the existing hyper-graphs are then updated; a high-level sketch of this loop follows.
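Every function name in the sketch below is a hypothetical placeholder, not an actual simulator or thesis API; it only shows how the three phases feed one another:

```python
# High-level sketch of the three-phase loop (all names are placeholders):
# (1) the simulator emits new requests and the hyper-graph is updated,
# (2) a graph neural network embeds the nodes into vectors,
# (3) a transformer policy picks placements, whose outcome feeds back
#     into the simulator before the next iteration.

def scheduling_loop(simulator, gnn_embedder, transformer_policy, n_steps):
    hyper_graph = simulator.initial_graph()              # phase 1: graph creation
    for _ in range(n_steps):
        node_vectors = gnn_embedder(hyper_graph)         # phase 2: representation
        placements = transformer_policy(node_vectors)    # phase 3: decision-making
        feedback = simulator.apply(placements)           # execute in the cloud sim
        hyper_graph = simulator.update_graph(hyper_graph, feedback)
```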

Since the inputs of the third phase are the vectors produced by the graph-node representation, a deep model capable of learning the relationships between nodes was needed, so a transformer model is used for decision-making. During implementation we encountered several constraints, including limited time and hardware. Despite significant progress on the required implementations, changes had to be made to solve the problem more effectively: in the final implementation, the graph and graph-representation stage was omitted to reduce the complexity of the problem and focus the research more efficiently. However, since a significant portion of the conducted research concerns the first proposal and its implementation, Chapter 3 addresses that topic. Additionally, to test transformer models combined with deep reinforcement learning on a similar problem, a complete study was conducted on applying them to the multidimensional multiple knapsack problem.
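For completeness, a generic policy-gradient (REINFORCE) training step for such a transformer policy is sketched below; the text above does not fix the exact training algorithm, so this is one plausible formulation, with `env` standing in for either the cloud simulator or a knapsack environment:

```python
# Generic REINFORCE sketch (one plausible training scheme, not necessarily
# the thesis's exact algorithm). Assumes policy(state) returns 1-D action
# logits and env follows a reset()/step() protocol; both are placeholders.

import torch

def train_step(policy, optimizer, env, gamma=0.99):
    state, done = env.reset(), False
    log_probs, rewards = [], []
    while not done:
        logits = policy(state)                          # 1-D logits over actions
        dist = torch.distributions.Categorical(logits=logits)
        action = dist.sample()                          # stochastic placement choice
        state, reward, done = env.step(action.item())
        log_probs.append(dist.log_prob(action))
        rewards.append(reward)

    # discounted returns, computed backwards over the episode
    returns, g = [], 0.0
    for r in reversed(rewards):
        g = r + gamma * g
        returns.insert(0, g)
    returns = torch.tensor(returns)

    loss = -(torch.stack(log_probs) * returns).sum()    # REINFORCE objective
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```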

Results

For the experimental phase, the proposed algorithm was compared against several algorithms introduced and implemented by Tuli et al. (2021). The performance of the proposed algorithm, called TRLScheduler, can be observed in the following graphs. For a fairer comparison, each algorithm was executed three times, and the results of all three runs are shown. The experiments were conducted with ten machines. Some algorithms, such as GOBI and the genetic algorithm, can only run when the number of available containers equals the number of available machines, which is a limitation in real-world conditions. Therefore, the proposed model was trained on twenty containers to demonstrate that it has no such limitation, and it was executed with both ten and twenty containers during the experiments for better comparability. Naturally, since the model was trained on twenty containers, its performance drops when restricted to ten containers, which is clearly visible in the graph of the number of processed containers.

As the graphs show, the proposed algorithm outperforms all the other algorithms in terms of average response time and average migration time. In the graph of the number of processed containers, the proposed algorithm also outperforms the others in the twenty-container case.