[AI Security Papers] 12. How to Write the Experimental Evaluation of an English Paper, with Selected Sentence Excerpts (Part 1): IDS as an Example

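The previous post described, from a personal perspective, how to write the model design section of an English paper. This post covers the experimental evaluation section (experimental evaluation or experimental study), mainly using intrusion detection systems (IDS) as the example; a detailed comparative analysis follows in the next post. My English is weak, so I can only improve it slowly through the most basic method. These are also my personal study notes, shared in the hope that readers will offer criticism and corrections. I hope this post helps you. The authors of these papers are truly worth learning from, and I bow to them. Fighting!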

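Most of the papers selected here were published in the last three years in CCF A venues or journals ranked Q2 or above, especially top conferences and top journals. Of course, my ability is limited, and I can only work from my own level and from what I have actually read. I hope to keep improving, and every part will be updated over time. Perhaps in 10 or 20 years I will be able to share my own view on how to write papers; for now this is mainly study and note-taking, and I stress again that these are notes. I ask the experts for their forbearance.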

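Contents:

I. How to Write the Experimental Evaluation
  1. Overall Paper Structure and Writing the Experiments
  2. Writing the Experimental Evaluation
  3. Writing the Discussion
  4. Writing the Experimental Evaluation: A Personal View
  5. Additional Notes on Overall Structure
II. Sentences from the Experimental Evaluation of IDS Papers
  Part 1: Introduction
  Part 2: Dataset Description
  Part 3: Evaluation Metrics
  Part 4: Experimental Environment
III. Experimental Figures and Tables
IV. Summary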


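I. How to Write the Experimental Evaluation

How to write a paper varies from person to person; I am only sharing my own views, and suggestions are welcome. Still, most people would agree on one thing: keep reading the latest and the classic papers in your field. Once you know the related literature inside out, you are a step closer to writing your first English paper.

In experiment design, the key is how to use the experiments to persuade reviewers to accept your innovations and to show the value of your paper. Good figures and tables express the idea of the paper better, so we need to learn from excellent papers; a surprising experiment is often the key to a paper's success. Note that security papers are past the stage of merely comparing precision, recall, and F-score (PRF): the experiments must support the framework of the whole paper. Reading a lot and writing a lot is the basic practice. Let us encourage each other!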

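1. Overall Paper Structure and Writing the Experiments

This part reviews and draws on Professor Zhou's doctoral course; many thanks for sharing it. There are two typical paper structures (The typical “anatomy” of a paper), as follows:

Format 1: theoretical research

- Title and authors
- Abstract
- Introduction
- Related Work (may be placed later)
- Materials and Methods
- Results
- Acknowledgements
- References

Format 2: systems research

- Title and authors
- Abstract
- Introduction
- Related Work (may be placed later)
- System Model
- Mathematics and algorithms
- Experiments
- Acknowledgements
- References

Introduction to the experimental evaluation (Evaluation)

- Many papers validate their methods empirically.
- When you are new to a field, check carefully how this work is usually done.
- It also helps to note the datasets and code that were used, because you may use them yourself in the future.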


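2. Writing the Experimental Evaluation

This part is mainly drawn from Professor Yi Li's book 《學術寫作原來是這樣》; I will share my own thoughts afterwards. The details are as follows:

Like the methods, the results are a relatively easy part to write: the content is simply which analyses you ran on the data you collected. For relatively simple results (three analyses or fewer), just write them up step by step; with a background in the literature of your field, this should not be hard. The difficulty lies in results from complex data, for example ten analyses with seven or eight figures. In that case, how the results are organized becomes very important. Professor Yi recommends the conclusion-driven method from the article 《10條簡單規則》 (Ten Simple Rules).

While processing the data, sort out and summarize your main findings, and use these findings as the outline (subheadings) for organizing the results, rather than following the traditional order in which the data were processed. Taking one of the author's own published papers as an example, they used this method to organize the results section under four subheadings, each listing the corresponding analyses and results.

(1) Sampling optimality may increase or decrease with autistic traits in different conditions
(2) Bimodal decision times suggest two consecutive decision processes
(3) Sampling is controlled by cost and evidence in two separate stages
(4) Autistic traits influence the strategic diversity of sampling decisions

If some other result does not fit under any conclusion, it is not important and contributes nothing to the conclusions of the paper; decisively dropping it (or moving it to the supplementary material) is the wise choice.

In addition, the same result can be presented in different ways, and you can choose the presentation according to your research purpose. A problem I often see when revising students' papers is an odd presentation that highlights unimportant results. For example:

For presenting results, figures are especially important: a good figure is worth a thousand words. But I am no expert in making figures; if you need guidance, I suggest reading 《10個簡單規則,創造更優圖形》 (Ten Simple Rules for Better Figures), which gives comprehensive and useful guidance on how to make a good figure.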


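3. Writing the Discussion

This part is also mainly drawn from Professor Yi Li's book 《學術寫作原來是這樣》; I will share my own thoughts afterwards. The details are as follows:

The discussion is a real headache. First, how to write it. Earlier I stressed the benefits of starting from an outline, which is a top-down approach: you fix the topic sentences while outlining and then settle the rest. The other approach is bottom-up: write a free first draft starting from your notes, then sort and group the notes and distill topic sentences from them. Professor Yi usually mixes the two: first generalize from scattered points (take notes on viewpoints in the literature when writing the introduction, and on findings in the results when writing the discussion), then organize them into an outline, and then write from the outline.

For example, I took notes on the discussion section of one paper, sorted and grouped these points, and combined them with the three research questions raised in the introduction to form the following discussion outline:

(1) Summary of the main findings
(2) Distrust and deception learning in ASD
(3) Anthropomorphic thinking of robot and distrust
(4) Human-robot vs. interpersonal interactions
(5) Limitations
(6) Conclusions

In paragraphs (1) to (4) of the discussion, first summarize your most important findings, and remember to revisit the experimental predictions stated in the introduction and say whether the results match them. Then review whether previous studies agree with your findings; if they do not, discuss possible reasons (differences in sampling, experimental methods, and so on).

Also note that many students focus the discussion on results that disagree with previous work and on their own limitations. These need to be written, but the most important thing is to highlight the contribution of your own study.

The most common problem in a discussion is restating the results in different words. The discussion is in fact a chance to organize and interpret the findings at a higher level. More importantly, you need to state your contribution explicitly and further emphasize the importance, significance, and novelty of the study. So do not stop at a matter-of-fact description of the results. After reading the results, readers readily ask “so what”: “Yes, you found these things, but what of it?”

At this point, the most important thing is to tell readers the implications of the study: what your findings show, which problem they help us understand better, what new solution they offer for an unsolved problem, and what new mechanism they reveal. This is also the part that most affects whether a manuscript is accepted, so be sure to spend the most time and effort on it.

The conclusion of the “robot” paper mentioned above serves as an example of how to summarize and elevate your own conclusions.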


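4. Writing the Experimental Evaluation: A Personal View

First, be clear about the purpose of the experimental section: through a detailed and accurate description of the datasets, the environment, and the experiments, others should be able to reproduce the whole experimental process, and readers and reviewers should be convinced that the method is scientifically sound and that the results are accurate and valid.

The main components are: research questions, datasets (open-source | self-built), data preprocessing, feature extraction, baseline experiments, comparison experiments, statistical analysis of results, presentation of the experiments (figures and tables), explanation of the results, and support for the conclusions and the method.

If the experiments reveal some interesting conclusions, that is excellent. If the paper poses a new problem with a corresponding solution (strong novelty), the experiments need to support the corresponding contributions or system and persuade the reviewers. If neither is possible, make the experiments as detailed as you can and use comparison experiments (against baselines) to reinforce your claims and your method.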

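Do not simply compare accuracy and recall; every result should be explained in light of the research background and the main point of the paper. An open-source dataset is best; if you use your own dataset, consider releasing it. What matters is persuading the reviewers to accept your work. The description of the experimental procedure is also very important, including the figures and tables, the findings, and concise descriptions (stating the precision).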

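As for tense, since the experimental process has already taken place, the past tense is generally used, though the present tense also appears. Most journals recommend the passive voice for describing the experimental process, but some encourage the active voice; before submitting, check the “Instructions to Authors” or similar guidance on the journal's homepage. Keep going, everyone!

Below, drawing on Professor Zhou's doctoral English course, I summarize how to express the experimental section.

Ten key points for figures and tables (10 key points)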

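1. Captions should convey the content of the figure or table as clearly as possible.
2. A figure caption generally goes below the figure.
3. A table caption generally goes above the table.
4. A figure showing the overall distribution or trend of the data need not be large.
5. A figure showing subtle differences between methods must not be too small.
6. When several figures are placed side by side and are comparable, their value ranges should be the same to ease comparison.
7. When the results differ little from the baseline in absolute value, use a table with bold type.
8. When the results differ greatly from the baseline in absolute value, a bar chart or line chart has more visual impact.
9. Line charts need suitable colors and markers; choose colors with black-and-white printing in mind.
10. Markers in line charts should be chosen deliberately. For example, when comparing the four methods A, A+, B, and B+, the markers of A and A+ should correspond (e.g., a filled circle and a hollow circle), and the markers of B and B+ should correspond (e.g., a filled triangle and a hollow triangle).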

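Captions should convey the content of the figure or table as clearly as possible.

A figure caption generally goes below the figure; a table caption generally goes above the table; a figure showing the overall distribution or trend of the data need not be large; a figure showing subtle differences between methods must not be too small.

When several figures are placed side by side and are comparable, the x/y-axis ranges of the figures should be the same to ease comparison.

When the results differ little from the baseline in absolute value, use a table with bold type; when they differ greatly, a bar chart or line chart has more visual impact.

Line charts need suitable colors and markers, and colors should be chosen with black-and-white printing in mind; markers should be chosen deliberately, for example when comparing the four methods A, A+, B, and B+.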

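5. Additional Notes on Overall Structure

In addition, here are a few supplementary points on the overall structure of the model design and on writing details (quoted from Professor Zhou's doctoral course, from which I benefited greatly):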

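II. Sentences from the Experimental Evaluation of IDS Papers

Part 1: Introduction

This part serves as the lead-in to the experimental evaluation and usually states which parts the experiments consist of. Some papers instead give the subsection headings of the experiments directly and omit this part.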

In this section, we employ four datasets and experimentally evaluate four aspects of WATSON: 1) the explicability of inferred event semantics; 2) the accuracy of behavior abstraction; 3) the overall experience and manual workload reduction in attack investigation; and 4) the performance overhead.

Jun Zeng, et al. WATSON: Abstracting Behaviors from Audit Logs via Aggregation of Contextual Semantics. NDSS.

In this section, we prototype Whisper and evaluate its performance by using 42 real-world attacks. In particular, the experiments will answer the three questions:

(1) If Whisper achieves higher detection accuracy than the state-of-the-art method? (Section 6.3)
(2) If Whisper is robust to detect attacks even if attackers try to evade the detection of Whisper by leveraging the benign traffic? (Section 6.4)
(3) If Whisper achieves high detection throughput and low detection latency? (Section 6.5)

Chuanpu Fu, et al. Realtime Robust Malicious Traffic Detection via Frequency Domain Analysis. CCS.

We first describe the testbed and data sets we use in the experiment. Then we evaluate the system by comparing it with other classical intrusion detection systems on a series of critical axes such as detection rate, false alarm rate, detection time, query time and storage overhead.

Yulai Xie, et al. Pagoda: A Hybrid Approach to Enable Efficient Real-Time Provenance Based Intrusion Detection in Big Data Environments. TDSC.

In this section, we evaluate our approach with the following major goals:

- Demonstrating the intrusion detection effectiveness of vNIDS. We run our virtualized NIDS and compare its detection results with those generated by Bro NIDS based on multiple real-world traffic traces (Figure 4).
- Evaluating the performance overhead of detection state sharing among instances in different scenarios: 1) without detection state sharing; 2) sharing all detection states; and 3) only sharing global detection states. The results are shown in Figure 5. The statistics of global states, local states, and forward statements are shown in Table 2.
- Demonstrating the flexibility of vNIDS regarding placement location. In particular, we quantify the communication overhead between virtualized NIDS instances across different data centers that are geographically distributed (Figure 8).

Hongda Li, et al. vNIDS: Towards Elastic Security with Safe and Efficient Virtualization of Network Intrusion Detection Systems. CCS.

In this section, we start analyzing the MUD profile of real consumer IoT devices that we have generated, and highlight attack types that can be prevented. Then, we will use traces collected in our lab, when we launched a number of volumetric attacks to four of IoT devices, to show how our system can detect these attacks using off-the-shelf IDS in an operational environment.

Ayyoob Hamza, et al. Combining MUD Policies with SDN for IoT Intrusion Detection. IoT S&P.

In this section, we present the implementation of BiDLSTM and discuss the experimental findings. We compare the model’s performance with state-of-the-art methods trained and tested on the same dataset (i.e., the NSL-KDD dataset). Also, we present a comparison of results with some recently published methods on the NSL-KDD dataset.

Yakubu Imrana, et al. A bidirectional LSTM deep learning approach for intrusion detection. Expert Systems With Applications.

In this section, we performed two major experiments (named Experiment1 and Experiment2) to explore the performance of disagreement-based semi-supervised learning and our DAS-CIDS in the aspects of detection performance and alarm filtration. In this work, we use the WEKA platform (WEKA) to help extract various classifiers like J48 and Random Forest to avoid implementation variations, which is an open-source software providing a set of machine learning algorithms.

Wenjuan Li, et al. Enhancing collaborative intrusion detection via disagreement-based semi-supervised learning in IoT environments. Journal of Network and Computer Applications.

In this experimental study, we exhibit the impact of the proposed methodology and select the informative features subset from the given intrusion dataset, that can classify the network traffics into normal or attacks for the intrusion detection. Two diagnostic studies were conducted to verify the impact of the proposed method, such as precision-recall analysis and ROC-AUC analysis, which is helpful in the analysis of probabilistic prediction for binary and multi-class classification problems. The main objectives of these experiments are summarized below,

- To design and develop a univariate ensemble feature selection approach to identify the valuable reduced feature set from the given Intrusion datasets.
- To improve the classification efficiency using the majority voting ensemble method which may effectively classify the network traffics as normal and attack data.
- To evaluate this proposed work on three different intrusion datasets, namely Honeypot real-time datasets, KDD and Kyoto.

S. Krishnaveni, et al. Efficient feature selection and classification through ensemble method for network intrusion detection on cloud computing. Cluster Computing.


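Part 2: Dataset Description

This part introduces the experimental datasets, usually covering their composition and feature distribution; a table makes the description more effective. If public datasets exist (common in AI work), compare across several datasets and against classic methods or baselines. If you use your own dataset, releasing it is recommended, though comparison experiments are then harder; the key is how to convince reviewers to trust your dataset.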

Datasets. The datasets used in our experiments are shown in Table 4. We use three recent datasets from the WIDE MAWI Gigabit backbone network [69]. In the training phase, we use 20% benign traffic to train the machine learning algorithms. We use the first 20% packets in MAWI 2020.06.10 dataset to calculate the encoding vector via solving the SMT problem (see Section 4.2). Meanwhile, we replay four groups of malicious traffic combined with the benign traffic on the testbed:

- Traditional DoS and Scanning Attacks. We select five active attacks from the Kitsune [42] and a UDP DoS attack trace [7] to measure the accuracy of detecting high-rate malicious flow. To further evaluate Whisper, we collect new malicious traffic datasets on WAN including Multi-Stage TCP Attacks, Stealthy TCP Attacks, and Evasion Attacks.
- Multi-Stage TCP Attacks. TCP side-channel attacks exploit the protocol implementations and hijack TCP connections by generating forged probing packets. Normally, TCP side-channel attacks have several stages, e.g., active connection finding, sequence number guessing, and acknowledgement number guessing. We implement two recent TCP side-channel attacks [10, 17], which have different numbers of attack stages. Moreover, we collect another multi-stage attack, i.e., TLS padding oracle attack [67].
- Stealthy TCP Attacks. The low-rate TCP DoS attacks generate low-rate burst traffic to trick TCP congestion control algorithms and slow down their sending rates [25, 32, 33]. Low-rate TCP DoS attacks are more stealthy than flooding based DoS attacks. We construct the low-rate TCP DoS attacks with different sending rates. Moreover, we replay other low-rate attacks, e.g., stealthy vulnerabilities scanning [38].
- Evasion Attacks. We use evasion attack datasets to evaluate the robustness of Whisper. Attackers can inject noise packets (i.e., benign packets of network applications) into malicious traffic to evade detection [19]. For example, an attacker can generate benign TLS traffic so that the attacker sends malicious SSL renegotiation messages and the benign TLS packets simultaneously. Basing on the typical attacks above, we adjust the ratio of malicious packets and benign packets, i.e., the ratio of 1:1, 1:2, 1:4, and 1:8, and the types of benign traffic to generate 28 datasets. For comparison, we replay the evasion attack datasets with the same background traffic in Table 4.

NSL-KDD: We use the internet traffic dataset, NSL-KDD [45] (also used in AE attacks in IDS [9], but [9] does not consider problem-space validity), for our evaluation. In NSL-KDD, each sample contains four groups of entries including Intrinsic Characteristics, Content Characteristics, Time-based Characteristics, and Host-based Characteristics. There are four categories of intrusion: DoS, Probing, Remote-to-Local (R2L), and User-to-Root (U2R) of which each contains more attack sub-categories. There are 24 sub-categories of attacks in the training set and 38 sub-categories of attacks are in test set (i.e., 14 sub-categories of attacks are unseen in the training set). There are 125,973 training records and 22,544 testing records. In our experiments, we only show the evaluations on an IDS model for discriminating DoS attacks from normal traffic since the results for the other three attacks are similar. The total number of entries for each record is 41 (in problem-space) which are further processed into 121 numerical features as an input-space (feature-space) vector.

MNIST: We also evaluate our approach on an image dataset, MNIST [46], to demonstrate its applicability. The images in MNIST are handwritten digits from 0 to 9. The corresponding digit of an image is used as its label. Each class has 6,000 training samples and 1,000 test samples. Therefore, the whole MNIST dataset has 60,000 training samples and 10,000 test samples. All the images have the same size of 28 × 28 and are in grey-level.

Currently, there are only a few public datasets available for intrusion detection evaluation. Among these datasets, the KDD Cup 99 dataset, NSL-KDD dataset and Kyoto 2006+ dataset have been commonly used in the literature to assess the performance of IDSes. According to the review by Tsai et al. [43], the majority of the IDS experiments were performed on the KDD Cup 99 datasets. In addition, these datasets have different data sizes and various numbers of features which provide comprehensive tests in validating feature selection methods. Therefore, in order to facilitate a fair and rational comparison with other state-of-the-art detection approaches, we have selected these three datasets to evaluate the performance of our detection system.

The KDD Cup 99 dataset is one of the most popular and comprehensive intrusion detection datasets and is widely applied to evaluate the performance of intrusion detection systems [43]. It consists of five different classes, which are normal and four types of attack (i.e., DoS, Probe, U2R and R2L). It contains training data with approximately five million connection records and test data with about two million connection records. Each record in these datasets is labeled as either normal or an attack, and it has 41 different quantitative and qualitative features.

The NSL-KDD is a new revised version of the KDD Cup 99 that has been proposed by Tavallaee et al. in [24]. This dataset addresses some problems included in the KDD Cup 99 dataset such as a huge number of redundant records in KDD Cup 99 data. As in the case of the KDD Cup 99 dataset, each record in the NSL-KDD dataset is composed of 41 different quantitative and qualitative features.

We evaluate WATSON on four datasets: a benign dataset, a malicious dataset, a background dataset, and the DARPA TRACE dataset. The first three datasets are collected from SSH sessions on five enterprise servers running Ubuntu 16.04 (64-bit). The last dataset is collected on a network of hosts running Ubuntu 14.04 (64-bit). The audit log source is Linux Audit [9].

In the benign dataset, four users independently complete seven daily tasks, as described in Table I. Each user performs a task 150 times in 150 sessions. In total, we collect 17 (expected to be 4×7 = 28) classes of benign behaviors because different users may conduct the same operations to accomplish tasks. Note that there are user-specific artifacts, like launched commands, between each time the task is performed. For our benign dataset, there are 55,296,982 audit events, which make up 4,200 benign sessions.

In the malicious dataset, following the procedure found in previous works [2], [10], [30], [53], [57], [82], we simulate eight attacks from real-world scenarios as shown in Table II. Each attack is carefully performed ten times by two security engineers on the enterprise servers. In order to incorporate the impact of typical noisy enterprise environments [53], [57], we continuously execute extensive ordinary user behaviors and underlying system activities in parallel to the attacks. For our malicious dataset, there are 37,229,686 audit events, which make up 80 malicious sessions.

In the background dataset, we record behaviors of developers and administrators on the servers for two weeks. To ensure the correctness of evaluation, we manually analyze these sessions and only incorporate sessions without behaviors in Table I and Table II into the dataset. For our background dataset, there are 183,336,624 audit events, which make up 1,000 background sessions.

In general, our experimental behaviors for abstraction are comprehensive as compared to behaviors in real-world systems. Particularly, the benign behaviors are designed based upon basic system activities [84] claimed to have drawn attention in cybersecurity study; the malicious behaviors are either selected from typical attack scenarios in previous work or generated by a red team with expertise in instrumenting and collecting data for attack investigation.

The datasets applied in this proposed work are the following: (1) Real-time Honeypot Dataset (2) Kyoto 2006+ Dataset and (3) NSL-KDD.

- Real-time honeypot dataset. In this research work, honeypots were set up on the AWS public cloud. The real-time data was collected during the period August 19th, 2018 to September 19th, 2018. And then, log data was collected for further analysis, resulting in over 5,195,499 attacker’s log entries. The proposed Honeynet system demonstrates the system configuration of container-based honeypots that can investigate and discover the attacks on a cloud system [27].
- NSL-KDD dataset. The KDD Cup99 dataset is a popular dataset used for network-based intrusion detection. It has the drawback of several redundant records, which will affect the effectiveness of the evaluated systems. Pervez et al. [28] have presented a new built form of KDD99 named as NSLKDD for overcoming these issues. The KDDTrain+ and KDDTest+ sets of NSL KDD dataset have approximately 125,973 and 22,544 connection records correspondingly. Similar to KDD99, each record in this data is unique as it is labeled with attack or normal based on the 41-feature set. NSL-KDD dataset includes the same four categories of attacks as the original KDD-99 Dataset.
- Kyoto dataset. Kyoto dataset was anticipated by Song et al. [29]. This dataset is created on real-time 3 years of network traffic data from regular servers and honeypots. The data was utilized for further analysis of approximately 257,673 records. Each connection in the dataset was seen to be unique with 24 features and 14 statistical features from KDD Cup 99 dataset, also 10 features from their networks, were extracted by the authors. The statistics of the intrusion datasets were utilized for the experiments is displayed in Table 1.

This section discusses the three intrusion detection datasets that have been used in this paper for experimentation purposes. This includes NSL-KDD, CIDDS-001, and CICIDS2017 datasets.

The NSL-KDD (Network Socket Layer – Knowledge Discovery in Databases) dataset was developed in 2009 as the successor of the KDD 1999 dataset [46]. The NSL-KDD dataset overcame the drawbacks of the KDD dataset by removing several redundant and duplicate samples in training and testing datasets. It was created to maximize prediction difficulty, and this characteristic makes it a preferred choice by researchers even today [47]. NSL-KDD consists of separate training and testing datasets containing network traffic samples represented by 41 attributes. Each instance has a label corresponding to the normal class or one of the 22 attack types. These attack types are grouped into four major attack classes, namely Denial of Service (DoS), Probe, Remote to Local (R2L), and User to Root (U2R). Table 3 shows the number of samples present in various classes of the NSL-KDD dataset. The uneven distribution of samples in different classes of this dataset makes it an appropriate choice for testing the proposed LIO-IDS.

The CICIDS2017 dataset was developed by Sharafaldin et al. [49] by generating and capturing network traffic for a duration of five days. The dataset consists of normal traffic samples and traffic samples generated from fourteen different types of attacks. The authors utilized the B-profile system to imitate benign human activities on the web and generate normal traffic from HTTP, HTTPS, FTP, and SSH protocols. Different categories of attacks were generated using various tools available on the Internet. The original CICIDS2017 dataset consists of eight CSV files containing 22,73,097 normal samples and 5,57,646 attack samples. Each traffic sample consists of 80 features that were captured using the CICFlowMeter tool. Due to the huge size of the original dataset, a subset of the CICIDS2017 dataset was selected for experimentation in this paper. The details of the selected subsets have been shown in Table 5.

The intrusion detection datasets selected in this paper consist of categorical as well as numerical attribute values. To bring these values in a uniform format, dataset pre-processing was performed on both of them. This process has been explained in the following sub-section.

The NSL-KDD dataset (Tavallaee et al., 2009; UNB, 2009) is one of the bench-marked datasets for evaluating Intrusion Detection Systems (IDS). It is an enhanced form of the KDDCup ’99 dataset (Dua & Graff, 2017). The dataset comprises a training set (KDDTrain+) with 125,973 traffic samples and two separate test sets (i.e., KDDTest+ and KDDTest-21). The KDDTest+ has 22,544 traffic samples, and the KDDTest-21 has 11,850 samples. Additionally, to make the intrusion detection more realistic, the test datasets include many attacks that do not appear in the training set (see Table 2). Thus, adding to the 22 types of attacks in the training set, 17 more different attack types exist in the test set.

The NSL-KDD dataset contains 41 features, including 3 non-numeric (i.e., protocol_type, service and flag) and 38 numeric features, as shown in Table 1. It has a class label grouped into two categories (anomaly and normal) for binary classification. For multi-class classification, we group the label into five categories (i.e., Normal, Denial of Service (DoS), User-to-Root (U2R), Remote-to-Local (R2L), and Probe). Table 3 gives a summary of the number of traffic records in the NSL-KDD dataset.
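
The label grouping described in this excerpt can be sketched in a few lines of Python. This is only an illustration: the attack-to-category mapping below is the commonly used grouping of the 22 training-set attack types and is my assumption, not something taken from the quoted paper, and the function names are hypothetical.

```python
# Sketch: grouping raw NSL-KDD labels for binary and five-class tasks.
# ASSUMPTION: this mapping is the commonly used grouping of the 22
# training-set attack types; it is not taken from the quoted paper.
ATTACK_CATEGORY = {
    "DoS": ["back", "land", "neptune", "pod", "smurf", "teardrop"],
    "Probe": ["ipsweep", "nmap", "portsweep", "satan"],
    "R2L": ["ftp_write", "guess_passwd", "imap", "multihop",
            "phf", "spy", "warezclient", "warezmaster"],
    "U2R": ["buffer_overflow", "loadmodule", "perl", "rootkit"],
}

# Invert the table so that each raw label points to its category.
LABEL_TO_CATEGORY = {
    name: category
    for category, names in ATTACK_CATEGORY.items()
    for name in names
}


def to_multiclass(label):
    """Map a raw label to Normal, DoS, Probe, R2L, or U2R."""
    label = label.strip().lower()
    if label == "normal":
        return "Normal"
    # Attack types that only occur in the test sets are not listed here.
    return LABEL_TO_CATEGORY.get(label, "Unknown")


def to_binary(label):
    """Map a raw label to 'normal' or 'anomaly'."""
    return "normal" if label.strip().lower() == "normal" else "anomaly"
```

For instance, `to_multiclass("neptune")` gives `"DoS"` and `to_binary("neptune")` gives `"anomaly"`. The extra attack types that appear only in the test sets would need their own entries before this sketch could be used on KDDTest+.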

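Part 3: Evaluation Metrics

This part introduces the evaluation metrics of the model. Common ones include accuracy, recall, precision, F-score, and false alarm rate.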

(1) Chuanpu Fu, et al. Realtime Robust Malicious Traffic Detection via Frequency Domain Analysis. CCS.

Metrics. We use the following metrics to evaluate the detection accuracy: (i) true-positive rates (TPR), (ii) false-positive rates (FPR), (iii) the area under ROC curve (AUC), (iv) equal error rates (EER). Moreover, we measure the throughput and processing latency to demonstrate that Whisper achieves realtime detection.
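
To make the four accuracy metrics in this excerpt concrete, here is a minimal pure-Python sketch that sweeps a threshold over anomaly scores to obtain (FPR, TPR) points, then derives AUC with the trapezoidal rule and approximates EER as the point where FPR and FNR are closest. The function names and the convention that a higher score means "more suspicious" are my assumptions; this is not the evaluation code of the quoted paper.

```python
def roc_points(labels, scores):
    """Return (FPR, TPR) points from sweeping a threshold over the scores.

    labels: 1 = malicious, 0 = benign.
    scores: higher means more suspicious (an assumption of this sketch).
    """
    pos = sum(labels)
    neg = len(labels) - pos
    if pos == 0 or neg == 0:
        raise ValueError("both classes must be present")
    ranked = sorted(zip(scores, labels), key=lambda pair: -pair[0])
    points = [(0.0, 0.0)]
    tp = fp = 0
    i = 0
    while i < len(ranked):
        threshold = ranked[i][0]
        # Consume every sample tied at this score before emitting a point.
        while i < len(ranked) and ranked[i][0] == threshold:
            if ranked[i][1] == 1:
                tp += 1
            else:
                fp += 1
            i += 1
        points.append((fp / neg, tp / pos))
    return points


def auc(points):
    """Area under the ROC curve by the trapezoidal rule."""
    area = 0.0
    for (x0, y0), (x1, y1) in zip(points, points[1:]):
        area += (x1 - x0) * (y0 + y1) / 2
    return area


def eer(points):
    """Approximate equal error rate: the point where FPR is closest to FNR."""
    fpr, tpr = min(points, key=lambda p: abs(p[0] - (1 - p[1])))
    return (fpr + (1 - tpr)) / 2
```

With only a handful of samples the EER is a coarse approximation, because the curve has few points; on real traffic traces the sweep produces a much denser curve.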

(2) Yakubu Imrana, et al. A bidirectional LSTM deep learning approach for intrusion detection. Expert Systems With Applications.

(3) Neha Gupta, et al. LIO-IDS: Handling class imbalance using LSTM and improved one-vs-one technique in intrusion detection system. Computer Networks.

(4) S. Krishnaveni, et al. Efficient feature selection and classification through ensemble method for network intrusion detection on cloud computing. Cluster Computing.

(5) Mohammed A. Ambusaidi, et al. Building an Intrusion Detection System Using a Filter-Based Feature Selection Algorithm. IEEE TRANSACTIONS ON COMPUTERS.

(6) Congyuan Xu, et al. A Method of Few-Shot Network Intrusion Detection Based on Meta-Learning Framework. IEEE TIFS.

(7) Ning Wang, et al. MANDA: On Adversarial Example Detection for Network Intrusion Detection System. IEEE INFOCOM.

(8) Vipin Kumar Kukkala, et al. INDRA: Intrusion Detection Using Recurrent Autoencoders in Automotive Embedded Systems. IEEE TRANSACTIONS ON COMPUTER-AIDED DESIGN OF INTEGRATED CIRCUITS AND SYSTEMS.

The confusion matrix used to evaluate the algorithms is as follows:
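
As a rough sketch, the four cells of a binary confusion matrix and the metrics usually derived from them (accuracy, precision, recall, F1, false alarm rate) can be computed as below. The function names and the choice of `"attack"` as the positive class are assumptions made for illustration.

```python
from collections import Counter


def confusion_counts(y_true, y_pred, positive="attack"):
    """Count TP, FP, FN, TN, treating `positive` as the attack class."""
    cells = Counter()
    for truth, pred in zip(y_true, y_pred):
        if truth == positive:
            cells["TP" if pred == positive else "FN"] += 1
        else:
            cells["FP" if pred == positive else "TN"] += 1
    return cells["TP"], cells["FP"], cells["FN"], cells["TN"]


def safe_div(numerator, denominator):
    """Return 0.0 instead of raising when a denominator is zero."""
    return numerator / denominator if denominator else 0.0


def metrics(tp, fp, fn, tn):
    """Derive the usual IDS metrics from the four confusion-matrix cells."""
    precision = safe_div(tp, tp + fp)
    recall = safe_div(tp, tp + fn)  # also called detection rate or TPR
    return {
        "accuracy": safe_div(tp + tn, tp + fp + fn + tn),
        "precision": precision,
        "recall": recall,
        "f1": safe_div(2 * precision * recall, precision + recall),
        "false_alarm_rate": safe_div(fp, fp + tn),  # FPR
    }
```

For example, with 3 attacks of which 2 are detected and 5 normal records of which 1 is misflagged, the cells are TP=2, FP=1, FN=1, TN=4, giving an accuracy of 0.75 and a false alarm rate of 0.2.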

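Part 4: Experimental Environment

Here I group together content from the Experiment Setup or Implementation sections. It mainly covers the baselines or the experimental environment, as well as the model hyperparameters, data collection methods, and so on. Some papers also state the assumptions made in the experiments.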

We prototype Whisper using C/C++ (GCC version 5.4.0) and Python (version 3.8.0) with more than 3,500 lines of code (LOC). The source code of Whisper can be found in [21].

- High Speed Packet Parser Module. We leverage Intel Data Plane Development Kit (DPDK) version 18.11.10 LTS [26] to implement the data plane functions and ensure high performance packet parsing in high throughput networks. We bind the threads of Whisper on physical cores using DPDK APIs to reduce the cost of context switching in CPUs. As discussed in Section 4.1, we parse the three per-packet features, i.e., lengths, timestamps, and protocol types.
- Frequency Domain Feature Extraction Module. We leverage PyTorch [52] (version 1.6.0) to implement matrix transforms (e.g., encoding and Discrete Fourier Transformation) of origin per-packet features and auto-encoders in baseline methods.
- Statistical Clustering Module. We leverage K-Means as the clustering algorithm with the mlpack implementation (version 3.4.0) [44] to cluster the frequency domain features.
- Automatic Parameter Selection. We use Z3 SMT solver (version 4.5.1) [40] to solve the SMT problem in Section 4.2, i.e., determining the encoding vector in Whisper.
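
To give a feel for what "encoding and Discrete Fourier Transformation" of per-packet features might look like, here is a minimal standard-library sketch. It is not Whisper's implementation (which uses PyTorch): the weighted-sum encoding, the weight values, and the function names are assumptions for illustration only, and segmentation and later processing steps are omitted.

```python
import cmath


def encode_packets(packets, weights):
    """Collapse each packet's feature tuple into one scalar by a weighted sum.

    `weights` stands in for the encoding vector; its values here are
    hypothetical and would normally be chosen by the parameter selection step.
    """
    return [sum(w * f for w, f in zip(weights, packet)) for packet in packets]


def dft_modulus(signal):
    """Return the modulus of each DFT coefficient of a real-valued sequence."""
    n = len(signal)
    moduli = []
    for k in range(n):
        coefficient = sum(
            x * cmath.exp(-2j * cmath.pi * k * t / n)
            for t, x in enumerate(signal)
        )
        moduli.append(abs(coefficient))
    return moduli


# Usage: each packet is (length in bytes, timestamp in seconds, protocol code).
packets = [(100, 0.00, 6), (1500, 0.01, 6), (100, 0.02, 6), (1500, 0.03, 6)]
encoded = encode_packets(packets, weights=(1.0, 2.0, 0.5))
spectrum = dft_modulus(encoded)
```

The direct summation costs O(n^2) per sequence; a real system would use an FFT routine instead, but the quadratic version keeps the definition visible.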

Moreover, we implement a traffic generating tool using Intel DPDK to replay malicious traffic and benign traffic simultaneously. The hyper-parameters used in Whisper are shown in Table 3.

Baselines. To measure the improvements achieved by Whisper, we establish three baselines:

- Packet-level Detection. We use the state-of-the-art machine learning based detection method, Kitsune [42]. It extracts per-packet features via flow state variables and feeds the features to auto-encoders. We use the open source Kitsune implementation [41] and run the system with the same hardware as Whisper.
- Flow-level Statistics Clustering (FSC). As far as we know, there is no flow-level malicious traffic detection method that achieves task agnostic detection. Thus, we establish 17 flow-level statistics according to the existing studies [4, 5, 30, 37, 43, 77] including the maximum, minimum, variance, mean, range of the per-packet features in Whisper, flow durations, and flow byte counts. We perform a normalization for the flow-level statistics. For a fair comparison, we use the same clustering algorithm to Whisper.
- Flow-level Frequency Domain Features with Auto-Encoder (FAE). We use the same frequency domain features as Whisper and an auto-encoder model with 128 hidden states and Sigmoid activation function, which is similar to the auto-encoder model used in Kitsune. For the training of the auto-encoder, we use the Adam optimizer and set the batch size as 128, the training epoch as 200, the learning rate as 0.01.
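
One plausible reading of the "17 flow-level statistics" in the FSC baseline is five statistics (maximum, minimum, variance, mean, range) for each of the three per-packet features, plus flow duration and flow byte count, i.e., 5 x 3 + 2 = 17. The sketch below follows that reading; the exact layout, the use of population variance, the min-max normalization, and all names are my assumptions rather than details given in the paper.

```python
from statistics import mean, pvariance

# ASSUMPTION: the three per-packet features used by the sketch.
FEATURES = ("length", "timestamp", "protocol")


def flow_statistics(packets):
    """Return 17 statistics for one flow.

    packets: list of dicts with keys 'length' (bytes), 'timestamp' (seconds),
    and 'protocol' (numeric code).
    """
    stats = []
    for name in FEATURES:
        values = [p[name] for p in packets]
        stats += [
            max(values),
            min(values),
            pvariance(values),
            mean(values),
            max(values) - min(values),  # range
        ]
    timestamps = [p["timestamp"] for p in packets]
    stats.append(max(timestamps) - min(timestamps))  # flow duration
    stats.append(sum(p["length"] for p in packets))  # flow byte count
    return stats


def min_max_normalize(rows):
    """Scale every column of a list of equal-length rows into [0, 1]."""
    columns = list(zip(*rows))
    lows = [min(c) for c in columns]
    highs = [max(c) for c in columns]
    return [
        [(v - lo) / (hi - lo) if hi > lo else 0.0
         for v, lo, hi in zip(row, lows, highs)]
        for row in rows
    ]
```

The normalized rows would then be handed to the clustering step, so that features with large numeric ranges (such as byte counts) do not dominate the distance computation.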

Testbed. We conduct the Whisper, FSC, and FAE experiments on a testbed built on a DELL server with two Intel Xeon E5645 CPUs (2 × 12 cores), Ubuntu 16.04 (Linux 4.15.0 LTS), 24GB memory, one Intel 10 Gbps NIC with two ports that supports DPDK, and Intel 850nm SFP+ laser ports for optical fiber connections. We configure 8GB huge page memory for DPDK (4GB/NUMA Node). We bind 8 physical cores for 8 NIC RX queues to extract per-packet features and the other 8 cores for Whisper analysis threads, which extract the frequency domain features of traffic and perform statistical clustering. In summary, we use 17 of 24 cores to enable Whisper.

Note that, since Kitsune cannot handle high-rate traffic, we evaluate it with offline experiments on the same testbed. We deploy DPDK traffic generators on the other two servers with similar configurations. The reason why we use two traffic generators is that the throughput of Whisper exceeds the physical limit of 10 Gbps NIC, i.e., 13.22 Gbps. We connect two flow generators with optical fibers to generate high speed traffic.

We have implemented a prototype of Hawkware on a Raspberry Pi 3 Model B+ board which has a 1.4 GHz quad-core ARM Cortex-A53 processor with 1 GB RAM as it resembles many ARM-based IoT devices. We bound Hawkware to a single core for its computation with a 32 bit Linux OS.

We incorporated Tshark, a network packet capturing and analyzing tool, in implementing PA and used ftrace, an event tracing framework available in Linux kernels, in SCL. FP and HC are implemented in Python. Hawknet is trained offline on a separate server and then deployed on devices to perform detection. Hawknet and its training code are implemented with TensorFlow, which is one of the most popular frameworks for machine learning.

However, directly deploying this model strains IoT devices. In order to mitigate this issue, we first leveraged ARM’s NEON SIMD instructions to accommodate the high degree of parallelism inherent in Hawknet. Unfortunately, due to the high memory pressure in ANN computation for loading its weight values, utilizing NEON alone still falls short of making Hawknet efficient enough for IoT devices. Therefore, in addition, we capitalized on ANN weight quantization [15], compressing the vector values of Hawknet from 32-bit floating point numbers to 8-bit fixed point numbers. The compressed model of Hawknet, generated by employing TensorFlow Lite, only occupies 60KB. The learning rate was set to 0.001, which is a standard starting point for typical deep learning. The number of parameters in each layer of Hawknet is set as follows: NBA’s embedding layer, encoding layer, decoding layer and reconstruction layer each respectively have 297, 3840, 567 and 297 parameters, DBA’s embedding layer, LSTM layer and softmax layer each have 3160, 840 and 3476 parameters and there are 210 parameters for CC.
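
The idea of compressing 32-bit floating point weights to 8-bit values can be illustrated with a simple symmetric linear quantizer. This is a generic sketch and an assumption on my part: it is not the scheme TensorFlow Lite applies, and the function names are hypothetical.

```python
def quantize_int8(weights):
    """Map float weights to integers in [-127, 127] with one shared scale."""
    max_abs = max(abs(w) for w in weights)
    scale = max_abs / 127 if max_abs else 1.0
    quantized = [max(-127, min(127, round(w / scale))) for w in weights]
    return quantized, scale


def dequantize(quantized, scale):
    """Recover approximate float weights from the 8-bit integers."""
    return [q * scale for q in quantized]


# Usage: each weight now needs 1 byte instead of 4, at the cost of a
# rounding error of at most half a quantization step.
weights = [0.5, -1.0, 0.25, 0.0]
q, scale = quantize_int8(weights)
restored = dequantize(q, scale)
```

Storing one byte per weight instead of four is where the roughly fourfold reduction in weight storage comes from; the rounding error per weight is bounded by `scale / 2`.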

We implemented the problem-space attacks and MANDA in TensorFlow. We ran all the experiments on a server equipped with an Intel Core i7-8700K CPU 3.70GHz×12, a GeForce RTX 2080 Ti GPU, and Ubuntu 18.04.3 LTS. The IDS model is a multi-layer perceptron (MLP) composed of one input layer, one hidden layer with 50 neurons and one output layer. For completeness, we also implemented other models for IDS including Logistic Regression (LGR), K-Nearest Neighbors (KNN), Naive Bayes classifier for multivariate Bernoulli (BNB), Decision Tree Classifier (DTC) and Support Vector Machine (SVM) from scikit-learn library [47]. We implement four AE attacks including FGSM, BIM, CW (the L2-norm version) and JSMA (cf. Section III-C) and adapt the first three to problem-space of IDS. In each experiment, we generate AEs on the test samples that are correctly classified by the IDS model. Note here that we do not generate AEs for misclassified test samples. Next, we combine the successful AEs and the same number of clean data points (randomly selected) together as a mixed dataset, on which we run all detection algorithms. The benchmark for comparison is Artifact [17], the same as in [14], [44]. Artifact is proposed by Feinman et al. in [17] and becomes one of the state-of-the-art AE detection schemes. Different from MANDA, Artifact uses kernel density estimation (KDE) and Bayesian neural network uncertainty as two criteria to detect AEs.
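
The evaluation protocol in this excerpt (attack only correctly classified samples, keep the successful adversarial examples, then mix them with an equal number of randomly chosen clean samples) can be sketched as follows. The `predict` and `attack` callables are hypothetical placeholders for the IDS model and an AE attack; nothing here comes from the MANDA code.

```python
import random


def successful_adversarial_examples(samples, labels, predict, attack):
    """Attack only correctly classified samples and keep the successes.

    predict(x) -> predicted label; attack(x, y) -> perturbed sample.
    Both callables are placeholders supplied by the caller.
    """
    successes = []
    for x, y in zip(samples, labels):
        if predict(x) != y:
            continue  # misclassified samples are not attacked
        x_adv = attack(x, y)
        if predict(x_adv) != y:
            successes.append(x_adv)  # the attack flipped the prediction
    return successes


def build_mixed_dataset(adversarial, clean_pool, seed=0):
    """Mix adversarial examples with the same number of random clean samples.

    Returns (sample, is_adversarial) pairs in shuffled order.
    """
    rng = random.Random(seed)
    clean = rng.sample(clean_pool, len(adversarial))
    mixed = [(x, 1) for x in adversarial] + [(x, 0) for x in clean]
    rng.shuffle(mixed)
    return mixed
```

Balancing the two classes this way means a detector that flags everything (or nothing) scores only 50% accuracy on the mixed set, which makes detection results easier to interpret.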

On the MNIST dataset, we use a convolutional neural network (CNN) rather than the above MLP as the target model for AE attacks. The CNN model comprises 4 convolutional layers with ReLU activation, followed by 2 fully-connected layers.

In this section, we present the implementation of BiDLSTM and discuss the experimental findings. We compare the model’s performance with state-of-the-art methods trained and tested on the same dataset (i.e., the NSL-KDD dataset). Also, we present a comparison of results with some recently published methods on the NSL-KDD dataset.

The proposed model is a bidirectional LSTM implemented in Python using TensorFlow and Keras. The Adaptive Moment Estimation (Adam) algorithm is the optimizer used to update the model’s weights with a learning rate of 0.001. The loss functions used are the binary cross-entropy for binary classification and the categorical cross-entropy for multi-class classification. As shown in Fig. 3, the model starts by mapping inputs to their representations using an embedding layer. It then feeds the embeddings to the LSTM layers with two processing directions: the first in the forward direction and the other in the reversed direction. The LSTM outputs are then fed to fully connected layers with the rectified linear unit (ReLU) as an activation function. Ideally, the fully connected layers learn and compile the data extracted by the LSTM layers to form a final output that passes through an output layer for classification. Finally, we apply a dropout probability of 0.2 to the layers to ensure that our model does not over-fit the data. Table 4 displays a summary of the proposed model architecture.

The model’s performance is validated using a stratified K-fold cross-validation method with K set to 10. Stratification ensures that every fold preserves the class proportions of the full dataset. The process first shuffles the dataset and then splits it into K groups. The model is then fitted on K-1 folds (10-1 = 9 folds) and validated on the one remaining fold. This process repeats until every fold has served as the validation set once. We record each fold’s scores as depicted in Fig. 4 and then take the mean of these scores as the model’s performance.
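The stratified K-fold procedure in this excerpt can be sketched in plain Python as below. It is a simplified illustration rather than the paper's implementation (in practice a library routine such as scikit-learn's `StratifiedKFold` would normally be used); `score_fn` is a hypothetical callback that trains on the training indices and returns a score on the validation indices, and the 80/20 label split is invented.

```python
import random
from collections import defaultdict


def stratified_k_fold(labels, k=10, seed=0):
    """Yield (train_idx, val_idx) pairs whose class proportions mirror the full dataset."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)
    folds = [[] for _ in range(k)]
    for indices in by_class.values():
        rng.shuffle(indices)
        # Deal each class round-robin so every fold receives its share of that class.
        for pos, idx in enumerate(indices):
            folds[pos % k].append(idx)
    for i in range(k):
        val_idx = sorted(folds[i])
        train_idx = sorted(idx for j, fold in enumerate(folds) if j != i for idx in fold)
        yield train_idx, val_idx


def cross_validate(labels, score_fn, k=10):
    """Average the per-fold scores returned by score_fn(train_idx, val_idx)."""
    scores = [score_fn(train, val) for train, val in stratified_k_fold(labels, k)]
    return sum(scores) / len(scores)


# Toy labels: 80 normal samples (0) and 20 attack samples (1).
labels = [0] * 80 + [1] * 20
splits = list(stratified_k_fold(labels, k=10))
```

With these toy labels every validation fold contains 8 normal and 2 attack samples, the same 4:1 ratio as the whole dataset, and each sample is used for validation exactly once.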

To evaluate the performance of the INDRA framework, we first present an analysis for the selection of IT. Using the derived IT, we contrast it against two variants of the same framework: 1) INDRA-LED and 2) INDRA-LD. The former removes the linear layer before the output, essentially leaving the GRU to decode the context vector. The term LED implies (L) linear layer, (E) encoder GRU and (D) decoder GRU. The second variant replaces the GRU and the linear layer at the decoder with a series of linear layers (LD implies linear decoder). These experiments were conducted to test the importance of different layers in the network. However, the encoder end of the network is not changed because we require a sequence model to generate an encoding of the time-series data. We explored other variants as well, but they are not included in the discussion as their performance was poor compared to the LED and LD variants.

Subsequently, we compare the best variant of our framework with three prior works: 1) predictor LSTM (PLSTM [25]); 2) replicator neural network (RepNet [26]); and 3) CANet [23]. The first comparison work (PLSTM) uses an LSTM-based network that is trained to predict the signal values in the next message transmission. PLSTM achieves this by taking the 64-b CAN message payload as the input, and learns to predict the signal at a bit-level granularity by minimizing the prediction loss. A log loss or binary cross-entropy loss function is used to monitor the bit level deviations between the real next signal values and the predicted next signal values, and the gradient of this loss function is computed using backpropagation to update the weights in the network.
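The bit-level log loss mentioned for PLSTM can be illustrated with a small sketch. This is the textbook binary cross-entropy averaged over the payload bits, not code from the cited work; the alternating 64-bit payload and the predicted probabilities are invented for the example.

```python
import math


def bitwise_log_loss(true_bits, predicted_probs, eps=1e-12):
    """Mean binary cross-entropy between true bits and predicted bit probabilities."""
    total = 0.0
    for bit, prob in zip(true_bits, predicted_probs):
        # Clip to avoid log(0) when a prediction saturates at 0 or 1.
        p = min(max(prob, eps), 1.0 - eps)
        total += -(bit * math.log(p) + (1 - bit) * math.log(1.0 - p))
    return total / len(true_bits)


# Hypothetical 64-bit payload with alternating bits.
payload = [i % 2 for i in range(64)]
confident = [0.9 if bit else 0.1 for bit in payload]  # mostly correct predictions
uninformed = [0.5] * 64                               # no knowledge of the next bits

loss_confident = bitwise_log_loss(payload, confident)
loss_uninformed = bitwise_log_loss(payload, uninformed)
```

An uninformed predictor scores ln 2 per bit, so a trained network is useful only when its loss on normal traffic falls well below that level; large per-bit deviations are then a natural anomaly signal.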

Our experimental process consisted of three phases. Phase 1 was related to (i) live sample data collection of smart home behaviour (in terms of the data sources monitored) when not under attack and (ii) execution of each attack vector. This phase comprised two different types of experiments: one where users were present during data collection and another where no users were present in the household. Phase 2 was related to the adaptation of the offline reinforcement learning anomaly detection. Phase 3 was related to live monitoring of attack detection using the RL-optimised MAGPIE configuration.

Table III provides statistics about the live capture sample dataset for normal and attack execution experiments. Some attack vectors (WiFi de-authentication and ZigBee jamming) were observed to have a persistent effect on specific device behaviour, such as total connectivity loss to the WiFi network or disconnection of ZigBee nodes from the PAN, even after the attack had stopped. To ensure that persistent symptoms of one experiment did not interfere with another, after each attack execution, we reconnected affected devices and nodes to their respective networks and tested the automation rules to ensure that the smart home had returned to a known good state. For phase 1, each attack vector was executed independently so that normal and attack data samples were equally distributed with respect to the amount of time the smart home was monitored by MAGPIE under normal conditions and during attack execution. This process ensured that the captured dataset had a balanced set of normal and attack samples for testing. All live sample collection experiments were conducted on the training data for phase 2 reinforcement learning adaptation of MAGPIE’s anomaly models, whereas phase 3 consisted of executing live attack vectors against the MAGPIE prototype in a real-time monitoring state with the optimised anomaly model configuration. During the experiment, the users interacted with the smart home according to their normal routine. This activity generated a dataset that represented natural smart home user behaviour. Table X shows the different types of interactions performed by the users.

In all experiments, the value of MI is estimated using the estimator proposed by Kraskov et al. [33] (discussed in Section 3.1). To select the best value of k used in the estimator for the approach of k-nearest neighbour, several experiments with different values for k are conducted. Through the experiments, we have found that the best estimated value of MI was achieved when k = 6, which is the same as the value suggested in [33]. In addition, the control parameter b for MIFS algorithm is varied in the range of [0,1], which is the range suggested in [11] and [34], with a step size of 0.1. The optimal value of b that gives the best accuracy rate is selected for a comparison with the proposed approach.

Empirical evidence shows that 0.3 is the best value for b in the three datasets, so we included the results with this optimal b value for comparison. We have also included the results with the value of b equal to 1, which is the same as the value applied in [34]. The reason of choosing different values of b is to test all possibilities of the feature rankings since the best value is undefined for the given problem. The experimental results of different values of b indicate that when the value is closer to 1 the MIFS algorithm assigns larger weights to the redundant features. In other words, the algorithm places more emphasis on the relation between input features rather than between input features and the class and vice versa.
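The role of the control parameter b can be illustrated with a sketch of the greedy MIFS criterion, in which each candidate feature f is scored as I(C; f) minus b times the sum of I(f; s) over the already selected features s. The mutual-information values below are invented purely for illustration, and `mifs_select` is a hypothetical helper, not code from the cited papers.

```python
def mifs_select(relevance, redundancy, b, n_select):
    """Greedy MIFS: repeatedly pick the feature maximising I(C;f) - b * sum_s I(f;s)."""
    selected = []
    remaining = set(relevance)
    while remaining and len(selected) < n_select:
        def score(f):
            # Redundancy penalty against every feature chosen so far.
            penalty = sum(redundancy[frozenset((f, s))] for s in selected)
            return relevance[f] - b * penalty
        best = max(sorted(remaining), key=score)
        selected.append(best)
        remaining.remove(best)
    return selected


# Made-up MI values: f1 and f2 are both relevant but highly redundant with each other.
relevance = {"f1": 0.90, "f2": 0.85, "f3": 0.30}
redundancy = {
    frozenset(("f1", "f2")): 0.80,
    frozenset(("f1", "f3")): 0.05,
    frozenset(("f2", "f3")): 0.05,
}

# Sweep b over [0, 1] with a step size of 0.1, as in the excerpt.
sweep = {i / 10: mifs_select(relevance, redundancy, i / 10, 2) for i in range(11)}
```

With these toy numbers, b = 0.3 keeps the redundant but relevant pair f1 and f2, while b = 1 penalises redundancy so heavily that the weakly relevant f3 replaces f2. This mirrors the observation that values of b close to 1 emphasise the relation between input features over their relation to the class.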

Based on the above findings, to demonstrate the superiority of the proposed feature selection algorithm, five LSSVM-IDSs are built based on all features and the features that are chosen using four different feature selection algorithms (i.e., the proposed FMIFS, MIFS (b = 0.3), MIFS (b = 1), FLCFS), respectively, with k = 6. Three different datasets, namely KDD Cup 99 [41], NSL-KDD [24] and Kyoto 2006+ dataset [25], are used to evaluate the performance of these IDSs. The experimental results of the LSSVM-IDS based on FMIFS are compared with the results using the other four LSSVM-IDSs and several other state-of-the-art IDSs.

For the experiments on the Kyoto 2006+ dataset, the data of 27, 28, 29, 30 and 31 August 2009 are selected, which contain the latest updated data. For the experiments on each dataset, 152,460 samples are randomly selected. A 10-fold cross-validation is used to evaluate the detection performance of the proposed LSSVM-IDS. In addition, in order to make a comparison with the detection system proposed in [20], the same sets of data captured from 1st to 3rd November 2007 are chosen for evaluation too. The comparison results are shown in Table 6.

This experiment uses the benchmark dataset NSL-KDD and a standard dataset released in 2014 in the field of intrusion detection to evaluate the performance of our model. These public datasets have been pre-processed by common means and are well organized. Employing public datasets can, on the one hand, effectively reduce the impact of different datasets on the experimental results and, on the other hand, enhance the experiment’s reproducibility. The NSL-KDD dataset is collected in the US Air Force network environment, including various user types and network traffic. The original file contains more than 5 million records, including normal traffic and four major attack types (i.e., DoS, Probe, U2R, and R2L). Our experiments use 10 percent of the sample data as the main experimental data. In order to further show the performance of our model in different network environments, we also use the standard dataset released by the Critical Infrastructure Protection Center of Mississippi State University in 2014 to evaluate our model. The dataset contains data on network attacks against two control systems: Gas and Water. The experimental environment is a PC with a 64-bit Windows 7 system, an i7-6700 3.4 GHz CPU and 8 GB RAM, with Python and the scikit-learn machine learning library as the programming language and tools.

Note: the next article will cover experimental analysis, experimental comparison, and discussion.


III. Experimental Figures and Tables

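The next article covers the experimental comparison part in detail; here are some papers whose figures and tables are particularly well done.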

(1) Ryan Heartfield, et al. Self-Configurable Cyber-Physical Intrusion Detection for Smart Homes Using Reinforcement Learning. TIFS.


(2) Congyuan Xu, et al. A Method of Few-Shot Network Intrusion Detection Based on Meta-Learning Framework. TIFS.


(4) Zhenlong Xiao, et al. Anomalous IoT Sensor Data Detection: An Efficient Approach Enabled by Nonlinear Frequency-Domain Graph Analysis. IOTJ.


(5) Hongda Li, et al. vNIDS: Towards Elastic Security with Safe and Efficient Virtualization of Network Intrusion Detection Systems. CCS.


(6) Ron Bitton, et al. A Machine Learning-Based Intrusion Detection System for Securing Remote Desktop Connections to Electronic Flight Bag Servers. TDSC.


(10) Jun Zeng, et al. WATSON: Abstracting Behaviors from Audit Logs via Aggregation of Contextual Semantics. NDSS.


(11) Sunwoo Ahn, et al. Hawkware: Network Intrusion Detection based on Behavior Analysis with ANNs on an IoT Device. IEEE DAC.


IV. Summary

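That is all for this article; I hope it helps. My English is quite poor and my paper-writing skills are limited, so please forgive and point out anything that is badly written. Discussion is welcome, and I sincerely recommend reading the original papers.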
