Machine Learning Aided Air Traffic Flow Analysis Based on Aviation Big Data

Abstract: Timely and efficient air traffic flow management (ATFM) is a key issue in future dense air traffic. The emerging demand for unmanned aerial vehicles and general aviation aircraft aggravates the burden on ATFM. Thanks to the advanced automatic dependent surveillance-broadcast (ADS-B) technique, aerial vehicles can be tracked and monitored accurately and in real time, which makes it possible to establish a more intelligent ATFM architecture. In this paper, we first build an aviation big data platform from distributed ADS-B ground stations and the ADS-B messages they collect. By exploring the constructed dataset and mapping the extracted information to the routes, the air traffic flow between different cities can be counted and predicted, where the prediction task is implemented with two machine learning methods. Experimental results on real-world data demonstrate that the proposed traffic flow prediction model adopting long short-term memory (LSTM) achieves better performance, especially when abnormal factors in traffic control are considered.


I. INTRODUCTION
Due to the unprecedented development of the civil aviation industry, efficient and comfortable air travel has become the preferred choice for more and more passengers. The rising demand for air travel and the accompanying rapid growth in flights will inevitably lead to air traffic congestion in limited airspace, which will burden air traffic management (ATM) and pose formidable challenges to air traffic surveillance systems. Therefore, it is urgent to establish a more intelligent and high-performance air traffic surveillance system, in which real-time and high-precision air traffic flow management (ATFM) is a key issue [1]. Traditional air traffic surveillance techniques such as primary surveillance radar (PSR) and secondary surveillance radar (SSR) need to evolve, as their relatively low tracking accuracy cannot satisfy the needs of the future crowded airspace [2]. The emergence of automatic dependent surveillance-broadcast (ADS-B) has brought tremendous changes to traditional radar-based surveillance technologies [3], [4]. ADS-B is considered a crucial part of the next generation of air traffic surveillance systems. It can be used in many fields such as target recognition [5], multilateration (MLAT), and ATFM, the focus of this study [6]. The main work of ATFM is to help the air control department evaluate, in a timely and appropriate manner, whether the air traffic is close to the upper limit, so as to keep the air traffic at a reasonable level and to maximize the utilization of the airspace capacity [7].
As the prerequisite for subsequent processing, data analysis, and visualization, reliable and accurate air route flow information plays a crucial role in the ATFM system [1]. Until now, most air traffic controllers and relevant departments have relied on traditional technologies such as radar, which is more expensive and less accurate in locating aircraft. Besides, two-dimensional radar screens cannot characterize the air route flow information well. The accuracy and comprehensiveness of the ADS-B message make it a high-quality data source for air route flow statistics. Using the millions of ADS-B data items continuously uploaded to our aviation big data platform, we can compute the air route flow between any two airports over an arbitrary period and carry out some visualization tasks. Furthermore, a precise flow prediction model can help the air traffic control departments understand the likely trend of the flow in the near future. Thus, the surveillance system can support more reasonable scheduling strategies. Flow prediction can be implemented using the historical flow data obtained from previous statistics tasks. Conventional prediction methods for time series analysis include the Kalman filter-based method [8], the auto-regressive moving average (ARMA) method [9], the auto-regressive integrated moving average (ARIMA) method [10], etc. However, the conventional methods are weak at capturing complex nonlinear time-dependent features, especially when dealing with large-volume time series data covering high-dimensional features.
Nowadays, machine learning methods are widely applied in time series analysis because of their great power in searching complex structures and matching inner relationships between objects. For example, a multilayer perceptron neural network was applied to wind speed prediction for intelligent wind power generation [11]. Many machine learning based methods have also been applied to heterogeneous network traffic control [12]-[15], radio resource assignment techniques [16]-[19], and physical-layer wireless techniques [20]-[22]. A model combining stacked auto-encoders and neural networks was employed in passenger flow prediction [23]. This paper studies the statistics and prediction of the air route traffic flow based on large-volume real ADS-B messages. The flow prediction task is implemented with two machine learning methods: support vector regression (SVR) [24] and long short-term memory (LSTM) [25]. We find that both the SVR-based and LSTM-based prediction models can adequately predict the air route flow, and that better performance can be obtained by the LSTM-based model using a large-volume dataset. The main contributions of this paper include the following points.
• A geometrical model for air traffic flow statistics is constructed, which can be used for statistics and visualization tasks at different granularities.

A. ADS-B Based Aviation Big Data Platform
Over the last decade, the number of aircraft has grown rapidly in limited airspace, which has inevitably caused air congestion. Thus, the requirement of ensuring the safety and efficiency of air transportation has brought new challenges to the ATM [26], [27]. Aircraft collisions may occur in crowded airspace, which poses serious safety concerns to millions of air passengers. The next generation of ATM systems equipped with the ADS-B platform can handle the air traffic congestion problem more efficiently, and can thus provide safer air traffic networks [28], [29].
The ADS-B technology is an aircraft operation surveillance technology based on ground-air and air-air communication data links and the global navigation satellite system (GNSS). The ADS-B OUT subsystem broadcasts the positional information obtained from the GNSS and other basic information (e.g., ICAO number) of each flight, while the ADS-B IN subsystem provides operation support and enhanced situational awareness, such as conflict warning information, collision avoidance strategy, and meteorological information [30], [31]. Fig. 1 illustrates our practical aviation big data platform based on the ADS-B system. As shown in Fig. 1, the positional information obtained from the GNSS and other situational information of each aircraft are broadcast by the ADS-B OUT transmitters. The surveillance stations deployed on the ground and the aircraft equipped with ADS-B IN subsystems can receive these ADS-B messages over the 1090 ES data link [28]. The ground surveillance center consists of three parts: the central cloud server for storing ADS-B data, the data processing center for data mining, and the data visualization devices for characterizing the obtained information.
As shown in Fig. 2, visualization tasks can be accomplished after a series of processes in our data processing center. The visualization includes maps, aircraft trajectories, and real-time flight information. The aviation big data platform based on the ADS-B system can play an important role in establishing a more advanced ATM system.

B. Key Issues in ATM and ATFM
The main aim of the ATM is to make full use of the existing airspace and routes, and to ensure the safety and efficiency of the flights. Thus, the ATM consists of three main tasks: air traffic control (ATC), airspace management (ASM), and ATFM [32]. The three branches of the ATM are shown in Fig. 3.
ATC: It is defined as a service provided by the air traffic control department. The service includes flight operation advisory service, aircraft information service, and warning service (e.g., aircraft collision warning). Additionally, aircraft surveillance and the enforcement of relevant rules such as separation rules are also within the scope of the ATC.
ASM: It mainly manages and optimizes the given airspace conditions in order to maximize the use of the airspace. The work of ASM includes airspace classification, airspace use assessment, airspace optimization, and the setting of no-fly zones [33]. The narrower separation standard enabled by more accurate localization and tracking techniques can improve the capacity of the limited airspace.
ATFM: It is defined as the regulation of the number of aircraft in a certain airspace. Because of the strict separation distance required between any two aircraft, reasonable supervision and regulation guarantee efficient utilization of the available airspace. The tasks of the ATFM include air traffic flow information service, management of certain airspace (such as air routes and airport sectors), and air traffic flow optimization [34].
The original form of the ATM was based on a procedural control system in which aircraft had to strictly follow the flight schedules. With the introduction of radar technology (i.e., PSR and SSR), it evolved into a radar-control based system that immensely improved the flexibility of the aircraft and the control system. However, with the vigorous development of the civil aviation industry, the traditional radar-control based ATM system cannot meet the requirements of the increasingly crowded airspace.
The continuous decline in the punctuality rate of airports degrades the air travel experience, and flight delays and cancellations have resulted in huge economic losses. Solving these problems requires a more accurate and intelligent ATM system. Air traffic flow statistics and prediction, the two main parts of this work based on the aviation big data platform, play a significant role in improving the accuracy and intelligence of the ATFM, respectively. The real-time messages provided by our ADS-B based aviation big data platform are an accurate data source for the air traffic flow statistics task, which is helpful for optimizing the flight scheduling strategy.
Furthermore, by applying deep learning in our aviation big data platform, we are able to obtain a more accurate prediction model based on the statistical data of the air traffic flow. Thus, the statistical and predicted information can support more intelligent flight scheduling strategies and ATM in the future.

III. AIR TRAFFIC FLOW STATISTIC BASED ON ADS-B MESSAGES
The air traffic flow is defined as the number of aircraft in a specific airspace within a certain time period. Although the total air traffic flow is relatively low in China, the distribution of the air traffic is extremely uneven due to regional differences in development. The air traffic mainly concentrates in relatively developed regions and some tourist cities. Therefore, we can expect much more congestion and delay in the airspace of these cities with the continuous growth of flights. The congestion and delay will cost a lot of manpower and resources.
The ADS-B messages collected by our aviation big data platform provide accurate and comprehensive general information, including ICAO number, position, velocity, and angle information. As the aircraft equipped with ADS-B transmitters automatically broadcast the ADS-B messages, there are sufficient data over time to facilitate the air traffic flow statistics task.
The nearly one million ADS-B data items collected by the ground stations we have deployed support the subsequent data processing and analyses, and thus help capture the features and trends of the air traffic. Fig. 4 shows the main workflow of the air traffic flow statistics task, which can be divided into four modules, namely the ADS-B receiver module, the pre-processing module, the statistic module, and the display module. First, we collect the data using the distributed ADS-B receivers and store them in the data center. In the pre-processing module, we extract data from the data center and apply a series of operations: data cleaning, information extraction, outlier processing, chronological ordering, data partitioning, and data indexing. The processed data are then stored back in the data center for subsequent operations in the statistic module; these operations include data extraction, coordinate transformation, air route generation, and positional and existential verification. Finally, we obtain the air route traffic information and render it in the display module.

A. Pre-processing Module
To create a usable dataset for our further study, several data pre-processing methods have been applied to the obtained large-volume ADS-B messages. The processes are as follows. First, we divide the collected ADS-B messages by date and flight number, so that we obtain data arrays of different flights for every single day. Each item of the data arrays contains the complete ADS-B information, including ICAO identity number, latitude, longitude, speed, etc. To eliminate the impact of invalid data, we also filter out abnormal data caused by possible unstable reception, for example, garbled characters and coordinate values exceeding the actual range. We also remove all zero values existing in the ADS-B messages. Then we extract the information needed for flow statistics, including flight numbers and latitude and longitude information. The truncated data items are sorted by time. Thus, the data items are indexed by time, date, and flight number. After that, they are stored in the data center for subsequent processing.
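The pre-processing steps described above can be sketched as follows. This is a minimal illustration rather than the platform's actual implementation: the record layout (dictionaries with the hypothetical keys `time`, `date`, `flight`, `lat`, and `lon`) and the function names are assumptions made for the example.

```python
from collections import defaultdict

# Assumed field names; the real ADS-B record layout is not given in the text.
REQUIRED_FIELDS = ("time", "date", "flight", "lat", "lon")


def is_valid(msg):
    """Reject garbled, zero-valued, or out-of-range coordinates."""
    try:
        lat, lon = float(msg["lat"]), float(msg["lon"])
    except (KeyError, TypeError, ValueError):
        return False  # missing or garbled coordinates
    if lat == 0.0 or lon == 0.0:
        return False  # zero values are removed, as described in the text
    return -90.0 <= lat <= 90.0 and -180.0 <= lon <= 180.0


def preprocess(messages):
    """Filter, truncate, group by (date, flight), and sort each group by time."""
    groups = defaultdict(list)
    for msg in messages:
        if not is_valid(msg) or any(k not in msg for k in REQUIRED_FIELDS):
            continue
        item = {k: msg[k] for k in REQUIRED_FIELDS}  # keep only needed fields
        groups[(item["date"], item["flight"])].append(item)
    for items in groups.values():
        items.sort(key=lambda m: m["time"])  # chronological order
    return dict(groups)
```

Grouping by `(date, flight)` mirrors the partitioning by date and flight number, and sorting inside each group gives the chronological order used by the later statistic step.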

B. Statistic Module
To obtain comprehensive air route information and to enrich the dataset for a better air traffic prediction task, we define each air route as a corridor with a width of 10 km that connects one specified airport to another, as shown in Fig. 5. The corridor is assumed to extend vertically from mean sea level to unlimited altitude. Thus, we can calculate the number of flights within a specific air route over a certain time period, namely, the statistical traffic flow of the route. The specific steps of the statistics task are as follows.
Initially, we determine the evaluation area of a specific air route by its range of longitude, latitude, and altitude. These parameters are defined by the positions of the two ends of the air route, and the resulting ranges are used in the subsequent calculation and verification.
Next, we retrieve the data for a certain period from the data center and divide them into hourly intervals, so that reasonable comparisons can be made between different time intervals on a single air route.
Finally, we traverse each hourly data slice to count the flights in this air route. The ICAO numbers of the flights appearing in the air route are collected in a set called flight-in-route. Before an ICAO number is added to this set, two verification steps are performed: existential verification, which checks whether this flight has already been added, and positional verification, which checks whether the position of this flight is within the air route.
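The counting procedure above can be sketched as follows. This is a minimal illustration under stated assumptions: the corridor test uses a flat-earth (equirectangular) projection and a point-to-segment distance with a half-width of 5 km, which is one possible way to realize the positional verification, and the message tuple layout `(hour, icao, lat, lon)` is hypothetical.

```python
import math

EARTH_RADIUS_KM = 6371.0
CORRIDOR_WIDTH_KM = 10.0  # corridor width stated in the text


def to_local_km(lat, lon, ref_lat, ref_lon):
    """Project (lat, lon) to planar km offsets from a reference point."""
    x = math.radians(lon - ref_lon) * math.cos(math.radians(ref_lat)) * EARTH_RADIUS_KM
    y = math.radians(lat - ref_lat) * EARTH_RADIUS_KM
    return x, y


def in_corridor(lat, lon, start, end, width_km=CORRIDOR_WIDTH_KM):
    """Positional verification: is the point within width_km / 2 of the route segment?"""
    ref_lat, ref_lon = start
    px, py = to_local_km(lat, lon, ref_lat, ref_lon)
    ex, ey = to_local_km(end[0], end[1], ref_lat, ref_lon)
    seg_len_sq = ex * ex + ey * ey
    if seg_len_sq == 0.0:
        return math.hypot(px, py) <= width_km / 2
    # Clamp the projection of the point onto the segment to [0, 1].
    u = max(0.0, min(1.0, (px * ex + py * ey) / seg_len_sq))
    return math.hypot(px - u * ex, py - u * ey) <= width_km / 2


def hourly_flow(messages, start, end):
    """Count distinct ICAO numbers inside the corridor for each hour.

    messages: iterable of (hour, icao, lat, lon) tuples (assumed layout).
    start, end: (lat, lon) of the two airports defining the route.
    """
    flights_in_route = {}  # hour -> set of ICAO numbers
    for hour, icao, lat, lon in messages:
        seen = flights_in_route.setdefault(hour, set())
        if icao in seen:  # existential verification
            continue
        if in_corridor(lat, lon, start, end):  # positional verification
            seen.add(icao)
    return {hour: len(icaos) for hour, icaos in flights_in_route.items()}
```

No altitude check appears because the corridor is assumed to cover all altitudes. The planar approximation is only reasonable for short routes; a great-circle cross-track distance would be a more careful choice for long ones.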

IV. AIR TRAFFIC FLOW PREDICTION METHODS
Prediction is an important task in the field of time series analysis. Traditional time series prediction models can be divided into two categories: prediction methods based on mathematical statistics, such as ARIMA [35], and nonlinear prediction methods, such as the chaotic model [36].
With the development of machine learning, a series of artificial intelligence algorithms such as neural networks [37] and the support vector machine (SVM) [38] have been widely used in time series prediction. These modern algorithms show superior performance in exploring data with hidden features.
In this section, we propose two prediction models based on the SVR and the LSTM, respectively, which are trained on the massive ADS-B data obtained from our aviation big data platform.

A. Features Selection
The input dataset plays a significant role in any machine learning based predictor, and the performance of the prediction model is largely determined by the selection of the data features. According to the related work, the selected features mainly include time-series features such as date and time [39], [40].
As the air route flow is a time series, we consider the basic time-related features alongside some factors that may affect the air route flow, and form an input vector $x$ containing these features. We first define two vectors denoting two types of features, namely the time vector $t$ and the impact vector $p$, which are written as

$$t = [t_1, t_2, t_3], \quad p = [p_1, p_2, p_3], \tag{1}$$

where $t_1$, $t_2$, and $t_3$ denote the hour of day, the day of week, and the day of month, respectively, and $p_1$, $p_2$, and $p_3$ denote the holiday index, season index, and average flow, respectively. Hence, the input vector $x$ can be denoted as

$$x = [f, r, t, p],$$

where $f$ denotes the flight number within the air route, $r$ denotes the air route, and $t$ and $p$ are the time vector and impact vector, respectively. Generally, it is hard to extract the complex nonlinear relationship between the air route flow and the above features by using traditional time series prediction models, let alone to deal with the big dataset. Therefore, it is necessary to apply artificial intelligence algorithms to make better use of the massive data and obtain more accurate prediction results.
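As an illustration of how such an input vector might be assembled, the sketch below builds the time features (hour of day, day of week, day of month) and the impact features (holiday index, season index, average flow) from a timestamp. The function names, the integer route identifier, the season encoding, and the reading of $f$ as a flight count are all assumptions made for the example, not details given in the paper.

```python
from datetime import datetime


def season_index(month):
    """Map a month to 0..3 (winter, spring, summer, autumn); an assumed encoding."""
    return (month % 12) // 3


def build_input(flow, route_id, timestamp, is_holiday, avg_flow):
    """Return the flattened input vector [f, r, t1, t2, t3, p1, p2, p3]."""
    t = [timestamp.hour, timestamp.weekday(), timestamp.day]          # time vector
    p = [int(is_holiday), season_index(timestamp.month), avg_flow]    # impact vector
    return [flow, route_id] + t + p


# Example: 10:00 on 1 May 2019 (a Wednesday and a public holiday in China).
x = build_input(12, 3, datetime(2019, 5, 1, 10), True, 15.5)
```

Here `weekday()` counts Monday as 0, so the Wednesday in the example is encoded as 2.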

B. SVR-based Flow Prediction
The SVR is an extension of the SVM, and it provides efficient solutions for regression problems. In summary, the SVR converts a nonlinear problem into a linear problem in a high-dimensional space, and calculates the complex features of the high-dimensional space through kernel functions [41], [42]. For a pair $(x, y)$, traditional regression models typically calculate the loss by computing the difference between the predicted value $f(x)$ and the true value $y$, and the loss is zero only if they are identical. In contrast, the SVR assumes that we can tolerate a threshold $\epsilon$ as the maximum discrepancy between them [43]. With the regression function $f(x) = w^{T}\phi(x) + b$ and $m$ training samples, the SVR problem can be formulated as follows:

$$\min_{w, b} \; \frac{1}{2}\|w\|^{2} + C \sum_{i=1}^{m} l_{\epsilon}\big(f(x_i) - y_i\big),$$

where $C$ denotes a regularization constant, and $l_{\epsilon}$ is the $\epsilon$-insensitive loss function expressed as

$$l_{\epsilon}(z) = \begin{cases} 0, & |z| \le \epsilon, \\ |z| - \epsilon, & \text{otherwise}. \end{cases}$$

Thus, the loss is calculated only when the absolute value of the difference between $f(x)$ and $y$ is greater than $\epsilon$. This is equivalent to constructing a band with a width of $2\epsilon$ centered on $f(x)$. If a sample falls into this band, its regression result is regarded as correct.
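The tolerance band can be made concrete with a short sketch of the ε-insensitive loss and the regularized objective it enters. The function names and the numbers in the example are illustrative assumptions; the predictions are passed in directly, so no kernel or solver is involved.

```python
def epsilon_insensitive_loss(prediction, target, epsilon):
    """Zero inside the band |prediction - target| <= epsilon, linear outside it."""
    return max(0.0, abs(prediction - target) - epsilon)


def svr_objective(weights, predictions, targets, C, epsilon):
    """0.5 * ||w||^2 plus C times the summed epsilon-insensitive loss."""
    regularizer = 0.5 * sum(w * w for w in weights)
    data_term = sum(
        epsilon_insensitive_loss(p, y, epsilon)
        for p, y in zip(predictions, targets)
    )
    return regularizer + C * data_term


# A prediction 0.4 away from the target is tolerated when epsilon = 0.5,
# while one 2.0 away is penalized by 2.0 - 0.5 = 1.5.
inside = epsilon_insensitive_loss(10.0, 10.4, 0.5)
outside = epsilon_insensitive_loss(10.0, 12.0, 0.5)
```

A larger `C` puts more weight on samples outside the band, while a larger `epsilon` widens the band and ignores more small deviations.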

C. LSTM-based Flow Prediction
The LSTM evolved from the recurrent neural network (RNN) [44], and it is suitable for processing and predicting events with relatively long intervals and delays in time series [45]. One of its advantages is that it can avoid the vanishing gradient problem of traditional recurrent neural networks. The LSTM shows great power in the fields of natural language processing, target recognition, and sound detection [46]-[48]. Additionally, the LSTM cells contain a unique structure called the gated neuron. This structure allows both short-term and long-term memories to be captured, which makes the LSTM suitable for time sequence prediction tasks.
The structure of the LSTM cell can be explained by four gates. The output of the LSTM network can be calculated by the following functions [49], [50]:

$$i_t = \sigma\big(W_i [h_{t-1}, x^{(t)}] + b_i\big),$$
$$f_t = \sigma\big(W_f [h_{t-1}, x^{(t)}] + b_f\big),$$
$$c_t = \tanh\big(W_c [h_{t-1}, x^{(t)}] + b_c\big),$$
$$o_t = \sigma\big(W_o [h_{t-1}, x^{(t)}] + b_o\big),$$
$$s_t = f_t \odot s_{t-1} + i_t \odot c_t,$$
$$h_t = o_t \odot \tanh(s_t),$$

where $x^{(t)}$ is the input of the model at time $t$, and $W$ and $b$ denote weight matrices and bias vectors, respectively. $i_t$, $f_t$, $c_t$, and $o_t$ denote the four different gates, namely the input gate, forget gate, candidate gate, and output gate. $s_t$ denotes the cell state, $\sigma$ is the sigmoid function, and $\odot$ is the element-wise product. $h_t$ denotes the RNN hidden layer state, with $h = [h_1, h_2, \ldots, h_t]$. Fig. 6 shows the architecture of the LSTM-based model, which includes three layers: the LSTM layer, the fully connected layer, and the dropout layer. The LSTM layer captures the time correlation among the air traffic states at different times. The number of time steps in our experiment is set to 24, which means that the LSTM layer unfolds into 24 layers in the time domain. The dropout layer deactivates neurons with a certain probability, which can improve the generalization ability of the network. The fully connected layer reshapes the output into the expected form, and its activation function is the rectified linear unit (ReLU).
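To make the gate mechanics concrete, the following sketch implements one step of a standard LSTM cell with a single hidden unit and a scalar input, then unrolls it over a sequence. It follows the widely used textbook formulation, not necessarily the exact variant or notation of [49], [50]; the parameter layout and the name `s` for the cell state are assumptions made for the example.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def lstm_step(x, h_prev, s_prev, params):
    """One time step of a single-unit LSTM cell.

    params maps each gate name ('i', 'f', 'c', 'o') to a tuple
    (w_h, w_x, b): weight on h_prev, weight on x, and bias.
    """
    def pre_activation(gate):
        w_h, w_x, b = params[gate]
        return w_h * h_prev + w_x * x + b

    i = sigmoid(pre_activation("i"))    # input gate
    f = sigmoid(pre_activation("f"))    # forget gate
    c = math.tanh(pre_activation("c"))  # candidate value
    o = sigmoid(pre_activation("o"))    # output gate
    s = f * s_prev + i * c              # new cell state
    h = o * math.tanh(s)                # new hidden state
    return h, s


def run_sequence(xs, params):
    """Unroll the cell over a sequence, starting from zero states."""
    h, s = 0.0, 0.0
    for x in xs:
        h, s = lstm_step(x, h, s, params)
    return h
```

With 24 hourly inputs, `run_sequence` applies the same cell 24 times, which corresponds to the unfolding over 24 time steps described above. A practical model would use vector-valued states and a deep learning library instead of scalar arithmetic.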
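This work addresses a regression problem for time series. Therefore, we compare the true statistical flow data $y$ with the predicted value $\tilde{y}$ by calculating the root mean squared error (RMSE) and the mean absolute error (MAE). For $n$ samples, these metrics are defined as follows:

$$\mathrm{RMSE} = \sqrt{\frac{1}{n} \sum_{i=1}^{n} (y_i - \tilde{y}_i)^2}, \qquad \mathrm{MAE} = \frac{1}{n} \sum_{i=1}^{n} |y_i - \tilde{y}_i|.$$

Here, smaller values of these metrics mean better performance of the proposed prediction models.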
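The two metrics can be computed with a few lines of code. The sketch below uses plain Python lists; the function names and the sample values are illustrative assumptions.

```python
import math


def rmse(y_true, y_pred):
    """Root mean squared error between two equal-length sequences."""
    n = len(y_true)
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(y_true, y_pred)) / n)


def mae(y_true, y_pred):
    """Mean absolute error between two equal-length sequences."""
    n = len(y_true)
    return sum(abs(a - b) for a, b in zip(y_true, y_pred)) / n


# One large error (4) among otherwise perfect predictions.
y_true = [10, 20, 30, 40]
y_pred = [10, 20, 30, 44]
scores = (rmse(y_true, y_pred), mae(y_true, y_pred))
```

In this example the RMSE is twice the MAE, because squaring weights the single large residual more heavily. That sensitivity is relevant when abnormal traffic produces occasional large residuals.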

A. Setup
The omnidirectional antenna in each ADS-B ground station can receive ADS-B signals broadcast by every flight equipped with an ADS-B transmitter. Each antenna covers an area of up to 300 km in diameter. We have deployed 14 ADS-B ground stations in Northeast China, North China, and East China, and the number is increasing. We have collected all the ADS-B data from December 2018 to May 2019, with a total storage of 75 GB. Based on the approximately one million items of ADS-B data in this six-month period, we have computed traffic flow statistics for 240 routes, each defined by two airports. To describe and analyze the results without loss of generality, we selected several typical busy air routes from different regions, i.e., the Shanghai-Tianjin route, the Beijing-Wuhan route, the Nanjing-Beijing route, and the Guangzhou-Shanghai route.

B. Results of Statistical Air Route Flow
The statistical hourly number of flights inside the Beijing-Wuhan route is shown in Fig. 7. The left panel shows the hourly flow on a certain day in May, and the right panel shows the overall traffic flow trend in May. Beijing and Wuhan are among the most prosperous cities in China, and the air route connecting the two cities is expected to be busy.
To make comprehensive analyses and reasonable comparisons, we choose another three air routes, i.e., the Shanghai-Tianjin route, the Nanjing-Beijing route, and the Guangzhou-Shanghai route. The statistical results of the traffic flow are depicted in Fig. 8. As the figure shows, the air traffic in the four air routes follows a similar periodic pattern, and the statistical flows of all these routes show obvious peaks and valleys.
In the early hours of the morning, the route traffic is very sparse. It begins to rise at 5:00 am, reaches the peak at about 10:00 am, and maintains a peak flow for about ten hours. The flow begins to decrease gradually from 9:00 pm. Because they connect major cities in the north and south, the Nanjing-Beijing and Shanghai-Tianjin routes both have heavy traffic, with median peak traffic flows reaching 323 and 278, respectively.

C. Results of Air Traffic Flow Prediction
The dataset for the two proposed predictors (SVR-based and LSTM-based) is generated from the hourly flow information of the 240 routes from November 2018 to May 2019. Because abnormal data conditions exist, any traffic flow value that far exceeds the historical average (above 1000) is regarded as invalid and is replaced by the historical average value for that hour. The inputs of the predictors are the following selected features:
• Hour of day, day of the week, day of the month, and season of the year.
• Whether the day is a statutory holiday.
• Historical average traffic flow.
The performance of the SVR-based predictor can be greatly influenced by the selection of the kernel function. There are many kinds of kernel functions, e.g., the linear function, the polynomial function, and the radial basis function (RBF) [41]. We compare the models using the three kernel functions on the same air route dataset to select an appropriate kernel function. Based on the tested case, we choose the RBF as the kernel function because of its outstanding efficiency in the prediction task compared with the other kernel functions.
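The invalid-value replacement rule described above can be sketched as follows. The record layout `(hour_of_day, flow)` and the function names are assumptions, as is the choice to compute each hourly average from valid values only; the threshold of 1000 is taken from the text.

```python
from collections import defaultdict

INVALID_THRESHOLD = 1000  # flows above this value are treated as invalid


def hourly_averages(records, threshold=INVALID_THRESHOLD):
    """Average flow per hour of day, computed from valid values only."""
    totals = defaultdict(float)
    counts = defaultdict(int)
    for hour, flow in records:
        if flow <= threshold:
            totals[hour] += flow
            counts[hour] += 1
    return {hour: totals[hour] / counts[hour] for hour in counts}


def replace_invalid(records, threshold=INVALID_THRESHOLD):
    """Replace invalid flows by the historical average for the same hour.

    Assumes every hour has at least one valid value; otherwise a KeyError
    is raised when the missing average is looked up.
    """
    averages = hourly_averages(records, threshold)
    return [
        (hour, flow if flow <= threshold else averages[hour])
        for hour, flow in records
    ]
```

Excluding invalid values from the average keeps a single extreme reading from distorting the value that replaces it.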
Then two parameters are considered. The penalty parameter C determines the size of the weight vector, and the parameter gamma of the RBF determines the width of the RBF corresponding to each support vector and thus affects the generalization performance. To obtain an appropriate parameter combination, a grid search has been implemented, and the best combination is selected according to the RMSE score. The combinations of the parameters are shown in Table I. For the LSTM-based predictor, the data were divided into training and testing datasets with a ratio of 3:1. Many parameters affect the performance of the LSTM-based model and therefore need to be determined carefully. The parameters and options considered for the LSTM-based model are listed in Table II. To find the best combination of parameters, we train the model with different hyper-parameters using a grid search, with the aim of obtaining the lowest RMSE score.
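A grid search of this kind can be sketched generically as below. The candidate values for C and gamma are hypothetical placeholders, not the values in Table I, and the toy scoring function stands in for training a model and returning its RMSE.

```python
from itertools import product


def grid_search(param_grid, score_fn):
    """Evaluate every parameter combination and keep the lowest score.

    param_grid: dict mapping a parameter name to a list of candidate values.
    score_fn: callable taking a dict of parameters and returning a score
              to minimize (for example, a validation RMSE).
    """
    names = sorted(param_grid)
    best_params, best_score = None, float("inf")
    for values in product(*(param_grid[name] for name in names)):
        params = dict(zip(names, values))
        score = score_fn(params)
        if score < best_score:
            best_params, best_score = params, score
    return best_params, best_score


# Hypothetical grid and a toy score with its minimum at C=10, gamma=0.01.
grid = {"C": [1, 10, 100], "gamma": [0.001, 0.01, 0.1]}
best_params, best_score = grid_search(
    grid, lambda p: abs(p["C"] - 10) + abs(p["gamma"] - 0.01)
)
```

The same helper applies to the LSTM hyper-parameters by extending the grid, although the number of combinations grows multiplicatively with each added parameter.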

D. Analysis
When processing a large amount of air route traffic flow data, the SVR-based predictor takes more time than the LSTM-based predictor. The RMSE scores of the SVR-based and LSTM-based predictors are 38.31 and 29.73, respectively. The results of the two prediction models are depicted in Fig. 9 and Fig. 10, respectively. In the two figures, the top sub-figures present the predicted air flow information, where the blue, orange, and green lines are the true air flow values, validated values, and tested values, respectively.
Traffic restrictions caused by some activities may introduce abnormal factors into the air traffic. The residual plots (bottom of Fig. 9 and Fig. 10) show the prediction ability when abnormal traffic flow is considered, and we can see that the LSTM-based model shows superior performance in mitigating the large residual values caused by the abnormal factors. In addition, the residual distributions are shown in Table III. In detail, for the SVR-based prediction model, 62.79%, 23.14%, 11.23%, and 2.84% of the residuals fall into the respective intervals. For the LSTM-based model, the proportions are 77.06%, 12.55%, 8.24%, and 2.15%, respectively. These results also demonstrate the better performance of the LSTM-based model in predicting the air traffic flow.
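A residual distribution of this kind can be computed by binning the absolute residuals, as in the sketch below. The interval edges and sample values are hypothetical placeholders, since the boundaries used in Table III are not stated in the text.

```python
def residual_distribution(y_true, y_pred, edges):
    """Percentage of absolute residuals in each interval.

    edges: increasing list [e1, e2, ...] defining the intervals
           [0, e1), [e1, e2), ..., [e_last, infinity).
    """
    counts = [0] * (len(edges) + 1)
    for a, b in zip(y_true, y_pred):
        residual = abs(a - b)
        index = sum(residual >= e for e in edges)  # edges at or below the residual
        counts[index] += 1
    n = len(y_true)
    return [100.0 * c / n for c in counts]


# Hypothetical edges and values: one residual falls in each interval.
shares = residual_distribution([0, 0, 0, 0], [5, 15, 25, 100], [10, 20, 30])
```

A larger share in the first interval and a smaller share in the last one indicate that more predictions stay close to the true flow.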

VI. CONCLUSION AND FUTURE WORK
In this paper, we proposed a model for calculating the air traffic flow using the large-volume ADS-B data obtained by our aviation big data platform. Based on this model, we calculated the flow information of more than 200 air routes and carried out some visualization tasks in support of ATFM. Moreover, two prediction models applying SVR and LSTM were proposed to facilitate timely surveillance and optimization of the traffic flow. Owing to the obvious peaks and valleys of the flow variation, we found that the SVR-based predictor can accurately predict the flow because of its advantage in handling data with a recognizable trend. The LSTM-based predictor provides improved performance over the SVR-based one in the tested case, and it is inferred that the LSTM-based predictor can better deal with abnormal points of the flow variation.
Several directions in the field of ATFM remain for future work. For instance, the air traffic statistics and prediction tasks can be extended to airports and control sectors. Features concerning abnormal changes of the flow can be investigated. Additionally, the advantages of different models can be combined to seek better prediction accuracy.