Phishing attacks detection using genetic programming with features selection - Tuan Anh Pham

Phishing is a real danger on the Internet today, so fighting phishing attacks is important. In this paper, we propose a solution to this problem that applies Genetic Programming (GP) combined with feature selection methods to detect phishing. We conduct experiments on a data set containing both phishing and legitimate websites collected from the Internet, then compare the performance of GP with several other machine learning methods. The results show that GP gives the best solution to the phishing detection problem among the methods compared.


Phạm Tuấn Anh et al., Tạp chí KHOA HỌC & CÔNG NGHỆ 122(08): 21 - 26

PHISHING ATTACKS DETECTION USING GENETIC PROGRAMMING WITH FEATURES SELECTION

Tuan Anh Pham1, Thi Huong Chu2, Hoang Quan Nguyen2, Quang Uy Nguyen2, Xuan Hoai Nguyen3, Van Truong Nguyen4
1Centre of IT, Military Academy of Logistics, Vietnam, 2The Faculty of Information Technology, Le Quy Don University, Vietnam, 3IT R&D Center, Hanoi University, Vietnam, 4College of Education, TNU, Vietnam

SUMMARY

Phishing is a real threat on the Internet nowadays. Therefore, fighting against phishing attacks is of great importance. In this paper, we propose a solution to this problem by applying Genetic Programming with feature selection methods to the phishing detection problem. We conducted experiments on a data set including both phishing and legitimate sites collected from the Internet. We compared the performance of Genetic Programming with a number of other machine learning techniques, and the results showed that Genetic Programming produced the best solutions to the phishing detection problem.

Keywords: Genetic Programming, Phishing Attack, Machine Learning

INTRODUCTION

Genetic Programming (GP) [2] is an evolutionary algorithm that aims to provide solutions to a user-defined task in the form of computer programs. Since its introduction, GP has been applied to many practical problems [2]. GP has also been used as a learning tool for solving some problems in network security [3]. However, to the best of our knowledge, there has not been any published work on the use of GP for learning to detect phishing web sites except our preliminary work in [4].

In the field of network security, phishing is one of the main threats on the Internet nowadays [5]. Phishing attackers attempt to acquire confidential information such as usernames, passwords, and credit card details by masquerading as a trustworthy entity in an online communication [5]. Because they are simple to carry out, phishing attacks are very common.
According to a report released by the American security firm RSA, there were approximately 33,000 phishing attacks globally each month in 2012, leading to a loss of $687 million [1]. Therefore, detecting and eliminating phishing attacks is very important for both organizations and individuals. One popular solution, widely used in web browsers, is to integrate a blacklist of known phishing sites. However, this solution cannot detect a new attack if the database is out of date, so it is not effective when a large number of phishing attacks are carried out every day.

In recent research [4], Pham et al. proposed a solution to this problem by applying Genetic Programming to the phishing detection problem. The results showed that GP outperformed some other machine learning methods on this important problem. However, the research in [4] has some drawbacks:
1) The data set for training and testing was rather small. Therefore, the models created from this data set may not generalize well in a real environment.
2) More importantly, the number of features used in [4] was limited. Moreover, some features may not be relevant for distinguishing between phishing and legitimate sites. This may hinder the performance of machine learning methods in solving this problem.

In this paper, we extend the work in [4] in several ways. The main contributions of this paper are:
1) We enlarged both the training and the testing data sets by collecting more phishing and legitimate sites from the Internet.
2) We enriched the feature set by adding some intuitive features which may be beneficial for discriminating between normal and phishing sites.
3) We used a feature selection method to eliminate some irrelevant features, which helps to improve the performance of Genetic Programming.

The remainder of the paper is organized as follows.
In the next section, we briefly review some previous research on detecting phishing attacks. In Section III we present our method using GP for solving the phishing detection problem. It is followed by a section detailing our experimental settings. The experimental results are shown and discussed in Section V. The last section concludes the paper and highlights some potential future work.

RELATED WORKS

Since phishing attacks are very common, a number of anti-phishing solutions have been proposed to date. Some methods aim to solve the phishing problem at the email level by preventing users from visiting the phishing sites. That is, emails containing links to phishing sites are filtered before they can reach the potential victims. These techniques are closely related to anti-spam research and have been used by both Microsoft [6] and Yahoo [7].

Other solutions attempt to protect valuable information from being exposed to the phishers by replacing passwords with site-specific tokens, or by using novel authentication mechanisms. These methods have been used in some popular anti-phishing tools such as PwdHash and AntiPhish. PwdHash [8] creates a domain-specific password that is rendered useless if it is submitted to another domain (e.g., a password for www.gmail.com will be different if submitted to www.attacker.com). AntiPhish [9] takes a different approach by keeping track of where confidential information such as a password is being submitted. That is, if it detects that a password is being entered into a form on an untrusted web site, a warning is generated and the current operation is canceled.

In this paper, we focus on approaches that only use the information available from the URL and the page's source code. Currently, there are two main approaches of this kind for identifying phishing pages: those based on URL blacklists, and those based on the properties of the page and (sometimes) the URL.
A more detailed description of these methods can be found in [4].

METHODS

This section presents the methods used in this paper. The way to extract the features for each web site is presented first. The method for feature selection is discussed after that. Finally, the GP system for phishing detection is described.

Features Extraction

The first step in using GP to tackle the phishing detection problem is feature extraction and selection. The extracted features must contain information that helps to distinguish phishing from legitimate sites. In this paper, we extend the feature set in [4] by adding some more features that are based on the URL of the sites. In total, eighteen features are used in this paper, including twelve content-based features that were used in [4] and six new URL-based features. These six URL-based features are taken from [10] and are described as follows.
• URL1: number of '@' in URL (X13).
• URL2: number of '-' in URL (X14).
• URL3: number of '.' in URL (X15).
• URL4: number of '.' in URL (X16).
• URL5: 1 if URL contains the word 'ebayisapi', otherwise 0 (X17).
• URL6: 1 if URL contains the word 'banking', otherwise 0 (X18).

Features Selection

Feature selection is the process of choosing a subset of features relevant to a particular application [11]. A number of feature selection methods have been proposed for machine learning algorithms [12]. Among them, statistics-based methods have shown good performance on a number of problems [12]. In this paper, we use mutual information (MI) as the feature selection criterion. Mutual information is a basic concept in information theory. It is a measure of general interdependence between random variables [12].
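Returning to the six URL-based features X13 to X18 listed under Features Extraction, the following is a minimal sketch of how they might be computed from a URL string. The function name `url_features` and the case-insensitive keyword matching are assumptions made for illustration; the paper does not describe its implementation. The definitions of X15 and X16 are reproduced exactly as printed, even though they are identical.

```python
def url_features(url: str) -> dict:
    """Compute the URL-based features X13..X18 as listed in the text."""
    lowered = url.lower()  # assumption: keyword matching ignores case
    return {
        "X13": url.count("@"),   # URL1: number of '@'
        "X14": url.count("-"),   # URL2: number of '-'
        "X15": url.count("."),   # URL3: number of '.'
        # URL4 is printed with the same definition as URL3 in the source;
        # it is reproduced as printed rather than guessed.
        "X16": url.count("."),
        "X17": 1 if "ebayisapi" in lowered else 0,  # URL5
        "X18": 1 if "banking" in lowered else 0,    # URL6
    }


# Hypothetical URL, used only to exercise the function.
example = url_features("http://secure-banking.example.com/ebayisapi@x")
```

Counts such as these are raw integers; the paper later normalizes all feature values before training.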
Specifically, given two random variables X and Y, the mutual information I(X;Y) is defined as follows:

I(X;Y) = H(X) + H(Y) - H(X,Y)    (1)

where H(·) is the entropy of a random variable, which measures the uncertainty associated with it, and H(X,Y) is the joint entropy of X and Y. If X is a discrete random variable, H(X) is defined as follows:

H(X) = -\sum_{x} P(X) \log_2 P(X)    (2)

Calculating the mutual information between two random variables exactly is not a straightforward task, so this value often has to be estimated. In this paper, we estimate MI using the histogram approach [12]. According to this method, the probability density function of each variable is approximated using a histogram. Then, the MI can be calculated according to the following equation:

I(X;Y) = \sum_{x} \sum_{y} P(X,Y) \log_2 \frac{P(X,Y)}{P(X)\,P(Y)}    (3)

where the summations are calculated over the appropriately discretized values of the random variables X and Y. For each histogram bin, the joint probability distribution P(X,Y) is estimated by counting the number of cases that fall into that bin and dividing that number by the total number of cases. The same technique is applied for the histogram approximation of the marginal distributions P(X) and P(Y). Choosing an appropriate number of bins is a crucial issue. In this paper, we follow [19] in choosing the number of bins based on the Gaussianity rule. With Gaussian data, the proper number of bins is log2 N + 1, where N is the number of samples.

System Description

The evolutionary learning process of GP for solving the problem of phishing detection is divided into two stages: training and testing. The objective of the training stage is to evolve a model (the classifier) that can classify a site as either phishing or legitimate based on its feature values. In the testing stage, the learnt model is used to make predictions on unseen data. The accuracy of these predictions is used as an indicator of the quality (effectiveness) of the model.
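To make the histogram estimate of mutual information from the Features Selection subsection concrete, here is a minimal sketch in plain Python. It assumes equal-width bins over the observed range of the feature, rounds log2 N down before adding 1, and treats the class label as already discrete. These choices, and all function names, are illustrative assumptions rather than details given in the paper.

```python
import math
from collections import Counter


def num_bins(n_samples: int) -> int:
    """Number of bins from the log2(N) + 1 rule (rounded down, an assumption)."""
    return int(math.log2(n_samples)) + 1


def to_bin(value: float, lo: float, hi: float, bins: int) -> int:
    """Map a value in [lo, hi] to an equal-width bin index in 0..bins-1."""
    if hi == lo:
        return 0
    idx = int((value - lo) / (hi - lo) * bins)
    return min(idx, bins - 1)  # the maximum value falls in the last bin


def mutual_information(xs, ys) -> float:
    """Histogram estimate of I(X;Y) in bits for feature values xs and labels ys."""
    n = len(xs)
    bins = num_bins(n)
    lo, hi = min(xs), max(xs)
    binned = [to_bin(x, lo, hi, bins) for x in xs]
    joint = Counter(zip(binned, ys))  # counts per (bin, label) cell
    px = Counter(binned)              # marginal counts of the feature
    py = Counter(ys)                  # marginal counts of the label
    mi = 0.0
    for (b, y), count in joint.items():
        p_xy = count / n
        mi += p_xy * math.log2(p_xy / ((px[b] / n) * (py[y] / n)))
    return mi
```

Empty cells are skipped automatically because only observed (bin, label) pairs appear in the counter, which matches the usual convention that 0 log 0 contributes nothing to the sum.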
In the training stage, a set of training sites (both phishing and benign) with their labels (either phishing or normal) is provided. The feature extraction process is called to convert every site to a feature vector. This vector then serves as the input for an individual in GP, and the output of the individual is a real value. If this real value is greater than zero, the site is tagged as phishing; otherwise it is considered benign. The next step in the training process is to measure the fitness of an individual in GP. In this paper, we use a simple fitness measure: the percentage of sites in the training set that are correctly classified. This fitness, though it may not be a good indicator if the data is imbalanced, is an intuitive measure of the overall quality of a model.

EXPERIMENTAL SETTINGS

This section outlines the settings used in our experiments. First, we present the way that data was collected for training and testing the systems. After that, the GP configurations for the experiments are described.

Data Collection

The data used for training and testing the system in this paper was collected from both phishing sites and legitimate sites on the Internet. The process is similar to that in [4] except that the number of pages is larger. In this paper, we collected 3528 phishing pages and 3965 normal pages. From the data set, eighteen properties of each page were extracted to create the set of feature vectors. We retained only one feature vector where there was duplication in the data set. Moreover, if a feature vector was present in both the phishing data and the legitimate data, this vector was removed. As a result, 1800 feature vectors for phishing and 1200 feature vectors for legitimate data were retrieved. In total, we obtained 1800 + 1200 = 3000 feature vectors of both phishing and legitimate sites.
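The two cleaning rules just described (keep one copy of duplicated vectors, and discard any vector that occurs under both labels) can be sketched as below. The function name `clean_vectors` and the use of sorted tuples for the output are illustrative assumptions; the paper does not say how the step was implemented.

```python
def clean_vectors(phishing, legitimate):
    """Deduplicate each class, then drop vectors that appear in both classes."""
    phish_set = set(map(tuple, phishing))    # one copy per distinct vector
    legit_set = set(map(tuple, legitimate))
    conflicting = phish_set & legit_set      # same features, contradictory labels
    phish_clean = sorted(phish_set - conflicting)
    legit_clean = sorted(legit_set - conflicting)
    return phish_clean, legit_clean


# Toy vectors, invented purely to exercise the function.
phish, legit = clean_vectors(
    [[1, 0], [1, 0], [0, 1], [2, 2]],
    [[0, 1], [3, 3], [3, 3]],
)
```

Removing vectors that carry both labels keeps the training signal consistent, at the cost of shrinking the data set, which is in line with the drop from 7493 collected pages to 3000 vectors reported above.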
These vectors were mixed and divided into two sets: one for training (1000 samples) and the other for testing (the rest). Finally, feature values were normalized to the range between 0 and 1, and the vectors extracted from phishing pages were labeled 1, while the others were labeled 0.

GP Parameters Settings

To tackle a problem with GP, several elements need to be specified beforehand. These elements often depend on the problem and the experience of practitioners. The first and most important element is the fitness function. As mentioned above, in this paper we use the percentage of correct classifications as the fitness measure for each individual in the population. Other factors that strongly affect the performance of GP are the sets of non-terminals and terminals. The terminal set includes 18 variables (X1, X2, ..., X18) representing the 18 features extracted from the sites. The non-terminal set includes 5 functions (+, -, *, /, iff). Here, we used the protected version of division (/), meaning that if the denominator is zero, the returned value is 1. Other evolutionary parameters are kept the same as in [2].

We divided our experiments into three sets. In the first, we repeated the experiments in [4], meaning that we used only the twelve features from X1 to X12. However, the data sets for both training and testing in this experiment are much larger than those in [4]. We used 1000 samples for training and 2000 for testing (compared with only 516 training and 288 testing samples in [4]). The objective of this experiment is to see whether the performance of GP is maintained on a larger data set.

In the second set, we aimed to examine the impact of enriching the feature set on the performance of GP in the phishing detection problem. As in the experiment in [4], we also compared the performance of GP with several well-known machine learning techniques, including Support Vector Machines, Artificial Neural Networks and Bayesian Networks.
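The protected division, the sign-based decision rule, and the accuracy-based fitness described under GP Parameters Settings and System Description can be sketched as follows. A GP individual is represented here simply as a Python callable over a feature vector, which is an assumption for illustration; the `iff` function is left out because the text does not define its semantics, and `toy_individual` is a made-up expression, not a model from the paper.

```python
def protected_div(a: float, b: float) -> float:
    """Protected division: return 1 when the denominator is zero."""
    return 1.0 if b == 0 else a / b


def classify(individual, features) -> int:
    """Label a site 1 (phishing) if the individual's output is > 0, else 0."""
    return 1 if individual(features) > 0 else 0


def fitness(individual, samples) -> float:
    """Percentage of (features, label) samples that are classified correctly."""
    correct = sum(1 for x, label in samples if classify(individual, x) == label)
    return 100.0 * correct / len(samples)


def toy_individual(x):
    """Hypothetical evolved expression over two features."""
    return protected_div(x[0], x[1]) - 0.5


# Invented samples: (feature vector, label), with label 1 meaning phishing.
toy_samples = [([1.0, 1.0], 1), ([0.2, 1.0], 0), ([0.0, 0.0], 1), ([0.1, 1.0], 1)]
score = fitness(toy_individual, toy_samples)
```

Because the fitness is plain accuracy, a model that always predicts the majority class can still score well on skewed data, which is the limitation the paper itself acknowledges.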
In the third set, we investigated the impact of a feature selection scheme based on mutual information on the performance of all tested machine learning methods. This experimental set aims to see whether using the feature selection method helps to remove some irrelevant features and leads to better performance of the learning methods. The details of these experiments are presented in the following section.

RESULTS AND DISCUSSION

To determine the quality of the models produced by GP, at the end of each run we selected the best-of-the-run individual (the individual with the best fitness on the training set in the entire run). This model was then evaluated on the testing set, and its result on the testing set is taken as the prediction performance of the model. To apply other machine learning techniques to the problem, we used their implementations in Weka. We compare the results produced by these methods with the results obtained by GP.

The percentage of correct predictions of these methods in the three experiments (Exp) is presented in Table 1. In this table, GP denotes the results produced by Genetic Programming, SVM is short for Support Vector Machine, and ANN stands for Artificial Neural Network. It should be noted that greater values are better.

It can be seen that the results in Table 1 are consistent with the results in [4]. They confirm that the best model produced by GP is also the best model among all models produced by all learning systems. Overall, the prediction accuracy of the GP-learnt model is 71.6% in the first experiment. The values for the other methods range from 54.3% to 68.2%, with the lowest value obtained by SVM and the highest obtained by ANN.

Table 1. The Percentage of Correct Prediction

Exp     GP      SVM     ANN     BayesNet
Exp1    71.6    54.3    68.2    63.6
Exp2    76.3    56.5    74.2    73.1
Exp3    78.8    58.1    73.2    73.6

The second experimental set aimed to test whether adding more features (based on the URL) to the feature set yields better performance of these learning methods on this problem. The results of the second experiment are presented in the second row of Table 1. It can be seen that by enriching the feature set, the performance of all learning methods was improved. The most remarkable improvement is achieved with ANN and BayesNet: the accuracy of these two methods increased to 74.2% and 73.1%, respectively. The performance of SVM was also enhanced, from 54.3% to 56.5%. More importantly, the performance of GP also improved, and it still obtained the best results amongst all tested techniques, at 76.3% with this feature set. In general, the results of this experiment show the beneficial effect of adding some URL-based features to the feature set in this problem.

The results in the second experimental set show that enriching the feature set helps to improve the performance of learning algorithms in the phishing detection problem. However, this larger feature set may also contain some irrelevant features that might hinder the performance of GP and other learning methods. Therefore, the third experimental set aims to examine whether using the feature selection method based on mutual information helps to eliminate irrelevant features and leads to better performance. We first calculated the mutual information between each feature and the label over the whole data set (including both the training and the testing set). After that, we sorted the features in ascending order of their mutual information with the label. We omitted X8, X17 and X18 from the feature set because they are only loosely related to the label, and we repeated the above experiments with the new feature set.
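The selection step described above (rank features by their mutual information with the label, then omit the weakest ones) can be sketched as below. The scores in `hypothetical_scores` are invented solely to exercise the function and are not values reported in the paper; the function name and the choice to pass the number of features to drop as a parameter are likewise assumptions.

```python
def drop_least_informative(mi_scores: dict, n_drop: int) -> list:
    """Rank features by MI with the label (ascending) and drop the first n_drop."""
    ranked = sorted(mi_scores, key=mi_scores.get)  # weakest features first
    dropped = set(ranked[:n_drop])
    # Keep the surviving features in their original order.
    return [name for name in mi_scores if name not in dropped]


# Made-up MI scores for a handful of features (NOT from the paper).
hypothetical_scores = {"X1": 0.30, "X8": 0.01, "X13": 0.12, "X17": 0.00, "X18": 0.02}
kept = drop_least_informative(hypothetical_scores, n_drop=3)
```

With these invented scores, the three lowest-ranked features are X17, X8 and X18, so only X1 and X13 are kept. In practice the cut-off could also be expressed as a threshold on the MI value rather than a fixed count; the text does not state which criterion was used.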
The results are given in row 3 of Table 1. It can be seen from these results that by using the feature selection technique to eliminate some irrelevant features (X8, X17 and X18 in this paper), we can achieve better performance for GP. While the performance of the other learning algorithms is mostly the same as in the second experimental set, the performance of GP keeps improving, and it obtains the best result of all experiments at 78.8%.

Overall, the experiments in this paper show the ability of GP to tackle the phishing detection problem. If we enrich the feature set and use feature selection to eliminate irrelevant features, we can achieve a rather good result of close to 80% correct predictions. Compared to the best result in [4], at only about 70%, this is a significant improvement.

CONCLUSIONS AND FUTURE WORK

In this paper, we conducted a more thorough investigation of the use of Genetic Programming (GP) for solving the problem of detecting phishing attacks. We extended the work in [4] by enriching the feature set and using a feature selection scheme to eliminate some irrelevant features. We compared the results produced by GP with three other machine learning techniques (SVM, ANN, Bayesian Networks). The results show that GP is capable of producing prediction models (classifiers) that are more accurate than those of the other machine learning techniques. This result encourages us to integrate GP with blacklist-based browsers to improve their ability to detect phishing attacks.

In the future, we plan to extend the work in this paper in a number of ways. First, we want to continue enriching the feature set to see if this helps to further improve the results. Second, we want to give GP more computational time (by increasing the population size) to see if it can help GP to find better models.
Last but not least, we want to make a more thorough analysis of the obtained models to get a better understanding of the factors that affect the prediction accuracy.

REFERENCES

1. RSA, "Phishing in season: A look at online fraud in 2012," 2012.
2. R. Poli, W. Langdon, and N. McPhee, A Field Guide to Genetic Programming. 2008.
3. S. Mabu, C. Chen, N. Lu, K. Shimada, and K. Hirasawa, "An intrusion-detection model based on fuzzy class-association-rule mining using genetic network programming," IEEE Trans. on Systems, Man, and Cybernetics, Part C, 41 (2011), 130-139.
4. P. T. Anh, N. Q. Uy, and N. X. Hoai, "Phishing attacks detection using genetic programming," in The 5th Inter. Conf. on Knowledge and Systems Eng., KSE, 2013.
5. C. Ludl, S. McAllister, E. Kirda, and C. Kruegel, "On the effectiveness of techniques to detect phishing sites," in DIMVA. Springer, 2007.
6. Microsoft, "Sender ID home page," Website, 2007.
7. Yahoo, "Yahoo! antispam resource center," 2007.
8. B. Ross, C. Jackson, N. Miyake, D. Boneh, and J. C. Mitchell, "Stronger password authentication using browser extensions," in Proc. of the 14th USENIX Security Symposium. USENIX, Aug. 2005.
9. H. Liu and H. Motoda, Feature Selection for Knowledge Discovery and Data Mining. Kluwer Academic Publishers, 1998.
10. M. Verleysen, F. Rossi, and D. François, "Advances in feature selection with mutual information," 2009.
11. Scrapy, "Scrapy: web crawling framework," Org.

SUMMARY

DETECTING PHISHING ATTACKS USING GENETIC PROGRAMMING AND FEATURE SELECTION

Phạm Tuấn Anh1, Chu Thị Hường2, Nguyễn Hoàng Quân2, Nguyễn Quang Uy2, Nguyễn Xuân Hoài3, Nguyễn Văn Trường4*
1Military Academy of Logistics, 2Le Quy Don Technical University, 3Hanoi University, 4College of Education, Thai Nguyen University

Phishing is a real danger on the Internet today, so fighting phishing attacks is important. In this paper, we propose a solution to this problem that applies Genetic Programming (GP) combined with feature selection methods to detect phishing. We conduct experiments on a data set containing both phishing and legitimate websites collected from the Internet, then compare the performance of GP with several other machine learning methods. The results show that GP gives the best solution to the phishing detection problem among the methods compared.

Keywords: genetic programming, phishing attack, machine learning

Received: 29/4/2014; Reviewed: 13/5/2014; Accepted for publication: 25/8/2014
Scientific reviewer: Dr. Vũ Việt Vũ, Thai Nguyen University of Technology, TNU

* Tel: 0915 016063, Email: nvtruongtn@gmail.com
