Phạm Tuấn Anh et al., Tạp chí KHOA HỌC & CÔNG NGHỆ (Journal of Science & Technology) 122(08): 21 - 26
PHISHING ATTACKS DETECTION USING GENETIC PROGRAMMING
WITH FEATURES SELECTION
Tuan Anh Pham¹, Thi Huong Chu², Hoang Quan Nguyen²,
Quang Uy Nguyen², Xuan Hoai Nguyen³, Van Truong Nguyen⁴
¹Centre of IT, Military Academy of Logistics, Vietnam,
²The Faculty of Information Technology, Le Quy Don University, Vietnam,
³IT R&D Center, Hanoi University, Vietnam,
⁴College of Education, TNU, Vietnam
SUMMARY
Phishing is a real threat on the Internet today, so fighting phishing attacks is of great
importance. In this paper, we propose a solution that applies Genetic Programming (GP) combined
with feature selection to the phishing detection problem. We conducted experiments on a data
set of both phishing and legitimate sites collected from the Internet. We compared the
performance of GP with that of several other machine learning techniques, and the results
showed that GP produced the best solutions to the phishing detection problem.
Keywords: Genetic Programming, Phishing Attack, Machine Learning
INTRODUCTION*
Genetic Programming (GP) [2] is an evolutionary algorithm that aims to provide solutions to a
user-defined task in the form of computer programs. Since its introduction, GP has been applied
to many practical problems [2]. GP has also been used as a learning tool for some problems in
network security [3]. However, to the best of our knowledge, there has been no published work
on using GP to learn to detect phishing web sites other than our preliminary work in [4].
In the field of network security, phishing is one of the main threats on the Internet today
[5]. Phishing attackers attempt to acquire confidential information such as usernames,
passwords, and credit card details by masquerading as a trustworthy entity in an online
communication [5]. Because they are simple to carry out, phishing attacks are very common.
According to a report released by the American security firm RSA, there were approximately
33,000 phishing attacks globally each month in 2012, leading to a loss of $687 million [1].
Therefore, detecting and eliminating phishing attacks is very important not only for
organizations but also for individuals. One popular solution, used by most web browsers, is to
integrate a blacklist of known phishing sites into the browser. However, this solution cannot
detect a new attack if the blacklist database is out of date, so it is not effective when a
large number of phishing attacks are carried out every day.
* Tel: 0915 016063, Email: nvtruongtn@gmail.com
In recent work [4], Pham et al. proposed a solution to this problem by applying Genetic
Programming to phishing detection. The results showed that GP outperforms some other machine
learning methods on this important problem. However, the research in [4] has two drawbacks.
1) The data set used for training and testing was rather small. Therefore, the models built
from this data set may not generalize well to a real environment.
2) More importantly, the number of features used in [4] is limited, and some of those features
may not be relevant for distinguishing between phishing and legitimate sites. This may hinder
the performance of machine learning methods on this problem.
In this paper, we extend the work in [4] in several ways. The main contributions of this
paper are:
1) We enlarged both the training and testing data sets by collecting more phishing and
legitimate sites from the Internet.
2) We enriched the feature set by adding some intuitive features that may help discriminate
between normal and phishing sites.
3) We used a feature selection method to eliminate some irrelevant features, which helps to
improve the performance of Genetic Programming.
The remainder of the paper is organized as follows. The next section briefly reviews previous
research on detecting phishing attacks. The Methods section presents our GP-based method for
the phishing detection problem, followed by a section detailing our experimental settings. The
experimental results are then presented and discussed. The last section concludes the paper
and highlights some potential future work.
RELATED WORKS
Since phishing attacks are very common, a number of anti-phishing solutions have been
proposed to date. Some methods address the phishing problem at the email level by preventing
users from visiting phishing sites. That is, emails that point to phishing sites are filtered
before they can reach potential victims. These techniques are closely related to anti-spam
research and have been used by both Microsoft [6] and Yahoo [7]. Other solutions attempt to
protect valuable information from being exposed to phishers by replacing passwords with
site-specific tokens, or by using novel authentication mechanisms. These methods have been
used in some popular anti-phishing tools such as PwdHash and AntiPhish.
PwdHash [8] creates a domain-specific password that is rendered useless if it is submitted to
another domain (e.g., a password for www.gmail.com will be different if submitted to
www.attacker.com). AntiPhish [9] takes a different approach: it keeps track of where
confidential information such as a password is being submitted. If it detects that a password
is being entered into a form on an untrusted web site, it generates a warning and cancels the
current operation.
In this paper, we focus on approaches that use only the information available from the URL
and the page's source code. Currently, there are two main approaches of this kind for
identifying phishing pages: those based on URL blacklists, and those based on the properties of
the page and (sometimes) the URL. A more detailed description of these methods can be found in
[4].
METHODS
This section presents the methods used in this
paper. The way to extract the features for
each web site is presented first. The method
for features selection is discussed after that.
Finally, the GP system for phishing detection
is described.
Features Extraction
The first step in using GP to tackle the phishing detection problem is feature
extraction/selection. The extracted features must contain information that helps to
distinguish phishing sites from legitimate ones. In this paper, we extend the feature set in
[4] by adding features that are based on the URL of a site. In total, eighteen features are
used in this paper: twelve content-based features that were used in [4] and six new URL-based
features. These six URL-based features are taken from [10] and are described as follows.
• URL1: number of '@' in URL (X13).
• URL2: number of '-' in URL (X14).
• URL3: number of '.' in URL (X15).
• URL4: number of '.' in URL (X16).
• URL5: 1 if URL contains the word 'ebayisapi', otherwise 0 (X17).
• URL6: 1 if URL contains the word 'banking', otherwise 0 (X18).
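The six URL-based features listed above could be computed roughly as in the following sketch. The function name `url_features` and the case-insensitive keyword matching are assumptions made for illustration; the paper does not state how the matching was done. URL3 and URL4 are computed identically here only because the list gives them the same definition.

```python
def url_features(url: str) -> dict:
    """Return the six URL-based features (X13..X18) for one URL.

    Hypothetical helper: the name and the lower-casing step are
    illustrative assumptions, not taken from the paper.
    """
    lowered = url.lower()  # assumption: keyword checks ignore case
    return {
        "X13": url.count("@"),             # URL1: number of '@'
        "X14": url.count("-"),             # URL2: number of '-'
        "X15": url.count("."),             # URL3: number of '.'
        "X16": url.count("."),             # URL4: same definition as URL3 in the list
        "X17": int("ebayisapi" in lowered),  # URL5: keyword flag
        "X18": int("banking" in lowered),    # URL6: keyword flag
    }


if __name__ == "__main__":
    example = "http://signin.ebay.com-secure.example/eBayISAPI.dll?user@x"
    print(url_features(example))
```

Returning a dictionary keyed by the variable names X13 to X18 keeps the mapping to the terminal set explicit; a fixed-order list would work equally well.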
Features Selection
Feature selection is the process of choosing a subset of features relevant to a particular
application [11]. A number of feature selection methods have been proposed for machine
learning algorithms [12]. Among them, statistics-based methods have shown good performance on
a number of problems [12]. In this paper, we use mutual information (MI) as the feature
selection criterion.
Mutual information (MI) is a basic concept in information theory. It is a measure of the
general interdependence between random variables [12]. Specifically, given two random
variables X and Y, the mutual information I(X;Y) is defined as follows:
\[ I(X;Y) = H(X) + H(Y) - H(X,Y) \tag{1} \]
where H(·) is the entropy of a random variable and measures the uncertainty associated with
it, and H(X,Y) is the joint entropy of X and Y. If X is a discrete random variable, H(X) is
defined as follows:
\[ H(X) = -\sum_{x} P(X) \log_2 P(X) \tag{2} \]
Calculating the mutual information between two random variables exactly is not a
straightforward task, so in practice its value often has to be estimated. In this paper, we
estimate MI using the histogram approach [12]. In this method, the probability density
function of each variable is approximated using a histogram. The MI can then be calculated
according to the following equation:
\[ I(X;Y) = \sum_{x} \sum_{y} P(X,Y) \log_2 \frac{P(X,Y)}{P(X)\,P(Y)} \tag{3} \]
where the summations are taken over the appropriately discretized values of the random
variables X and Y. For each histogram bin, the joint probability distribution P(X,Y) is
estimated by counting the number of cases that fall into that bin and dividing that number by
the total number of cases. The same technique is applied for the histogram approximation of
the marginal distributions P(X) and P(Y). Choosing an appropriate number of bins is a crucial
issue. In this paper, we follow [19] in choosing the number of bins based on the Gaussianity
rule: with Gaussian data, the proper number of bins is log2 N + 1, where N is the number of
samples.
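The histogram estimate of Equation (3) can be illustrated with a short sketch. It is not the authors' implementation: the names `num_bins`, `bin_index` and `mutual_information`, the use of equal-width bins, and rounding log2 N down before adding 1 are all assumptions made for illustration.

```python
import math
from collections import Counter


def num_bins(n_samples: int) -> int:
    """Bin count from the log2(N) + 1 rule (rounded down: an assumption)."""
    return int(math.log2(n_samples)) + 1


def bin_index(value: float, low: float, high: float, bins: int) -> int:
    """Index of the equal-width bin that contains value."""
    if high == low:
        return 0
    index = int((value - low) / (high - low) * bins)
    return min(index, bins - 1)  # the maximum falls into the last bin


def mutual_information(xs, ys, bins=None) -> float:
    """Histogram estimate of I(X;Y) in bits, following Equation (3)."""
    n = len(xs)
    if bins is None:
        bins = num_bins(n)
    x_low, x_high = min(xs), max(xs)
    y_low, y_high = min(ys), max(ys)
    pairs = [
        (bin_index(x, x_low, x_high, bins), bin_index(y, y_low, y_high, bins))
        for x, y in zip(xs, ys)
    ]
    joint = Counter(pairs)                  # counts per joint bin
    count_x = Counter(i for i, _ in pairs)  # marginal counts for X
    count_y = Counter(j for _, j in pairs)  # marginal counts for Y
    mi = 0.0
    for (i, j), count in joint.items():     # empty bins contribute nothing
        p_xy = count / n
        mi += p_xy * math.log2(p_xy / ((count_x[i] / n) * (count_y[j] / n)))
    return mi


if __name__ == "__main__":
    print(mutual_information([0, 0, 1, 1], [0, 0, 1, 1], bins=2))
    print(mutual_information([0, 0, 1, 1], [0, 1, 0, 1], bins=2))
```

Only non-empty bins are visited, which avoids evaluating log2 of zero for bins that received no samples.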
System Description
The evolutionary learning process of GP for phishing detection is divided into two stages:
training and testing. The objective of the training stage is to evolve a model (the
classifier) that can classify a site as either phishing or legitimate based on its feature
values. In the testing stage, the learnt model is used to make predictions on unseen data. The
accuracy of these predictions is used as an indicator of the quality (effectiveness) of the
model.
In the training stage, a set of training sites (both phishing and benign) is provided together
with their labels (either phishing or normal). The feature extraction process is called to
convert every site into a feature vector. This vector then serves as the input to a GP
individual, whose output is a real value. If this value is greater than zero, the site is
tagged as phishing; otherwise it is considered benign.
The next step in the training process is to measure the fitness of each GP individual. In this
paper, we use a simple fitness measure: the percentage of sites in the training set that are
correctly classified. Although this measure may not be a good indicator when the data are
imbalanced, it gives an intuitive view of the overall quality of a model.
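The decision rule and fitness measure described above can be sketched as follows. A GP individual is represented here simply as a Python callable that maps a feature vector to a real value; this representation and the names `classify` and `fitness` are assumptions for illustration, not the authors' code.

```python
from typing import Callable, Sequence

# Assumption: an individual is any function from a feature vector to a real value.
Individual = Callable[[Sequence[float]], float]


def classify(individual: Individual, features: Sequence[float]) -> int:
    """Return 1 (phishing) if the individual's output is > 0, else 0 (benign)."""
    return 1 if individual(features) > 0 else 0


def fitness(individual: Individual,
            samples: Sequence[Sequence[float]],
            labels: Sequence[int]) -> float:
    """Percentage of samples whose predicted label matches the true label."""
    correct = sum(
        classify(individual, features) == label
        for features, label in zip(samples, labels)
    )
    return 100.0 * correct / len(labels)


if __name__ == "__main__":
    # Hand-written stand-in for an evolved program: X1 - 0.5
    toy = lambda x: x[0] - 0.5
    data = [[0.9], [0.1], [0.7], [0.2]]
    truth = [1, 0, 0, 0]
    print(fitness(toy, data, truth))
```

In the toy data the third sample is misclassified, so three of four predictions are correct.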
EXPERIMENTAL SETTINGS
This section outlines the settings used in our
experiments. First, we present the way that
data was collected for training and testing the
systems. After that GP configurations for the
experiments are described.
Data Collection
The data used for training and testing the system in this paper was collected from both
phishing sites and legitimate sites on the Internet. The process is similar to that in [4]
except that the number of pages is larger. In this paper, we collected 3528 phishing pages and
3965 normal pages.
From this data set, eighteen properties of each page were extracted to create the set of
feature vectors. Where a feature vector was duplicated in the data set, we retained only one
copy. Moreover, if a feature vector appeared in both the phishing data and the legitimate data,
it was removed. As a result, 1800 feature vectors for phishing sites and 1200 feature vectors
for legitimate sites were obtained, giving 1800 + 1200 = 3000 feature vectors in total. These
vectors were mixed and divided into two sets: one for training (1000 samples) and the other
for testing (the remaining 2000 samples). Finally, feature values were normalized to the range
(0, 1), and the vectors extracted from phishing pages were labeled 1, while the others were
labeled 0.
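The cleaning steps described above (removing duplicates, removing vectors that appear under both labels, normalizing, and splitting) could look like the following sketch. The function names are hypothetical, and min-max scaling is an assumption: the paper says only that values were normalized to the range (0, 1), not which method was used.

```python
import random


def deduplicate(phishing, legitimate):
    """Keep one copy of each vector and drop vectors seen under both labels."""
    phishing_set = {tuple(v) for v in phishing}
    legitimate_set = {tuple(v) for v in legitimate}
    conflicting = phishing_set & legitimate_set
    labeled = [(v, 1) for v in sorted(phishing_set - conflicting)]     # 1 = phishing
    labeled += [(v, 0) for v in sorted(legitimate_set - conflicting)]  # 0 = legitimate
    return labeled


def min_max_normalize(vectors):
    """Scale each feature column to [0, 1] (min-max scaling is an assumption)."""
    columns = list(zip(*vectors))
    lows = [min(column) for column in columns]
    highs = [max(column) for column in columns]
    return [
        [(x - lo) / (hi - lo) if hi > lo else 0.0
         for x, lo, hi in zip(vector, lows, highs)]
        for vector in vectors
    ]


def shuffle_split(samples, n_train, seed=0):
    """Mix the samples, then take the first n_train for training."""
    shuffled = list(samples)
    random.Random(seed).shuffle(shuffled)
    return shuffled[:n_train], shuffled[n_train:]


if __name__ == "__main__":
    labeled = deduplicate([[1, 2], [1, 2], [3, 4]], [[3, 4], [5, 6]])
    vectors = min_max_normalize([v for v, _ in labeled])
    train, test = shuffle_split(list(zip(vectors, [y for _, y in labeled])), 1)
    print(labeled, vectors, len(train), len(test))
```

With the paper's numbers, `shuffle_split` would be called with `n_train=1000` on the 3000 cleaned vectors, leaving 2000 for testing.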
GP Parameters Settings
To tackle a problem with GP, several elements need to be specified beforehand. These elements
often depend on the problem and the experience of the practitioner. The first important
element is the fitness function. As mentioned above, in this paper we use the percentage of
correct classifications as the fitness measure for each individual in the population. Other
factors that strongly affect the performance of GP are the sets of non-terminals and
terminals. The terminal set includes 18 variables (X1, X2, ..., X18) representing the 18
features extracted from the sites. The non-terminal set includes 5 functions (+, -, *, /,
iff). Here, we used the protected version of division (/), meaning that if the denominator is
zero, the returned value is 1. The other evolutionary parameters are kept the same as in [2].
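To make the function and terminal sets concrete, the sketch below evaluates a small expression tree over the variables X1 to X18. The nested-tuple tree encoding and the helper names are assumptions for illustration. The paper does not define `iff`; it is treated here as a three-argument conditional that returns its second argument when the first is greater than zero, which is also an assumption.

```python
import operator


def protected_div(a: float, b: float) -> float:
    """Protected division: returns 1 when the denominator is zero."""
    return 1.0 if b == 0 else a / b


def iff(condition: float, if_true: float, if_false: float) -> float:
    """Conditional function (assumed semantics: condition > 0 selects if_true)."""
    return if_true if condition > 0 else if_false


FUNCTIONS = {
    "+": operator.add,
    "-": operator.sub,
    "*": operator.mul,
    "/": protected_div,
    "iff": iff,
}


def evaluate(node, features):
    """Evaluate a tree given as nested tuples, e.g. ("+", "X1", ("/", "X2", "X3"))."""
    if isinstance(node, str):             # terminal "Xk" reads feature k (1-based)
        return features[int(node[1:]) - 1]
    if isinstance(node, (int, float)):    # numeric constant
        return float(node)
    name, *children = node                # function node: evaluate children first
    values = [evaluate(child, features) for child in children]
    return FUNCTIONS[name](*values)


if __name__ == "__main__":
    tree = ("+", "X1", ("/", "X2", "X3"))
    print(evaluate(tree, [0.5, 0.2, 0.0]))
```

Protected division keeps every randomly generated tree evaluable, which matters because GP cannot guarantee that a denominator subtree is non-zero.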
We divided our experiments into three sets. In the first, we repeated the experiments in [4],
meaning that we used only the twelve features X1 to X12. However, the data sets for both
training and testing in this experiment are much larger than those in [4]: we used 1000
samples for training and 2000 for testing (compared with only 516 training and 288 testing
samples in [4]). The objective of this experiment is to see whether the performance of GP is
maintained on a larger data set.
In the second set, we examined the impact of enriching the feature set on the performance of
GP in the phishing detection problem. As in [4], we also compared the performance of GP with
several well-known machine learning techniques: Support Vector Machines, Artificial Neural
Networks and Bayesian Networks.
In the third set, we investigated the impact of the feature selection scheme based on mutual
information on the performance of all tested machine learning methods. This experimental set
aims to see whether feature selection helps to remove irrelevant features and leads to better
performance of the learning methods. Details of these experiments are presented in the
following section.
RESULTS AND DISCUSSION
To determine the quality of the models produced by GP, at the end of each run we selected the
best-of-run individual (the individual with the best fitness on the training set over the
entire run). This model was then evaluated on the testing set, and its result on the testing
set was taken as the measure of the model's prediction quality. To apply the other machine
learning techniques to the problem, we
used their implementations in Weka. We compare the results produced by these methods with the
results obtained by GP. The percentage of correct predictions of these methods in the three
experiments (Exp) is presented in Table 1. In this table, GP denotes the results produced by
Genetic Programming, SVM stands for Support Vector Machine, and ANN stands for Artificial
Neural Network. Note that in Table 1, greater values are better.
The results in Table 1 are consistent with the results in [4]. They confirm that the best
model produced by GP is also the best among the models produced by all the learning systems.
The prediction accuracy of the GP-learnt model is 71.6% in the first experiment. The accuracy
of the other methods ranges from 54.3% to 68.2%, with the lowest value obtained by SVM and the
highest by ANN.
Table 1. Percentage of correct predictions (%)
Exp     GP      SVM     ANN     BayesNet
Exp1    71.6    54.3    68.2    63.6
Exp2    76.3    56.5    74.2    73.1
Exp3    78.8    58.1    73.2    73.6
The second experimental set aimed to test whether adding more (URL-based) features to the
feature set yields better performance of these learning methods on this problem. The results
of the second experiment are presented in the second row of Table 1 (Exp2).
It can be seen that enriching the feature set improved the performance of all the learning
methods. The most remarkable improvement was achieved with ANN and BayesNet, whose accuracy
increased to 74.2% and 73.1%, respectively. The performance of SVM was also enhanced, from
54.3% to 56.5%. More importantly, the performance of GP also improved, and it still obtained
the best results among all tested techniques, with an accuracy of 76.3% on this feature set.
In general, the results of this experiment show the beneficial effect of adding URL-based
features to the feature set for this problem.
The results of the second experimental set show that enriching the feature set helps to
improve the performance of learning algorithms in the phishing detection problem. However,
this larger feature set may also contain some irrelevant features that might hinder the
performance of GP and the other learning methods. Therefore, the third experimental set
examines whether feature selection based on mutual information helps to eliminate irrelevant
features and leads to better performance. We first calculated the mutual information between
each feature and the label over the whole data set (including both the training and testing
sets). We then sorted the features in ascending order of their mutual information with the
label. We omitted X8, X17 and X18 from the feature set because they are only loosely related
to the label, and we repeated the above experiments with the new feature set. The results are
given in the third row of Table 1 (Exp3).
These results show that by using feature selection to eliminate some irrelevant features (X8,
X17 and X18 in this paper), we can achieve better performance for GP. While the performance of
the other learning algorithms is mostly the same as in the second experimental set, the
performance of GP continues to improve, and it obtains the best result of all the experiments
at 78.8%. Overall, the experiments in this paper show the ability of GP to tackle the phishing
detection problem: if we enrich the feature set and use feature selection to eliminate
irrelevant features, we can achieve a rather good result of close to 80% correct predictions.
Compared with the best result in [4] of only about 70%, this is a significant improvement.
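The ranking step described above (score each feature by its mutual information with the label, sort in ascending order, drop the weakest) can be sketched as follows. The function names and the parameter `k` are hypothetical, and the feature values are assumed to be already discretized; the paper does not state the cut-off rule that led to omitting exactly X8, X17 and X18.

```python
import math
from collections import Counter


def discrete_mi(xs, ys) -> float:
    """Mutual information in bits between two already-discretized sequences."""
    n = len(xs)
    joint = Counter(zip(xs, ys))
    count_x = Counter(xs)
    count_y = Counter(ys)
    return sum(
        (count / n) * math.log2((count / n) / ((count_x[x] / n) * (count_y[y] / n)))
        for (x, y), count in joint.items()
    )


def rank_features(columns: dict, labels) -> list:
    """Return (name, MI) pairs sorted in ascending order of MI with the label."""
    scores = {name: discrete_mi(values, labels) for name, values in columns.items()}
    return sorted(scores.items(), key=lambda item: item[1])


def drop_weakest(columns: dict, labels, k: int) -> dict:
    """Remove the k features with the lowest MI (k is a hypothetical parameter)."""
    weakest = {name for name, _ in rank_features(columns, labels)[:k]}
    return {name: values for name, values in columns.items() if name not in weakest}


if __name__ == "__main__":
    labels = [1, 1, 0, 0]
    columns = {"A": [1, 1, 0, 0], "B": [0, 1, 0, 1], "C": [1, 1, 1, 0]}
    print(rank_features(columns, labels))
    print(sorted(drop_weakest(columns, labels, 1)))
```

In the toy data, feature B is independent of the label and is ranked first for removal, while feature A determines the label exactly and is ranked last.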
CONCLUSIONS AND FUTURE WORK
In this paper, we conducted a more thorough investigation of the use of Genetic Programming
(GP) for detecting phishing attacks. We extended the work in [4] by enriching the feature set
and using a feature selection scheme to eliminate some irrelevant features.
We compared the results produced by GP with three other machine learning techniques (SVM, ANN,
Bayesian Networks). The results show that GP is capable of producing prediction models
(classifiers) that are more accurate than those of the other machine learning techniques. This
result encourages us to integrate GP into blacklist-based browser protection to improve its
ability to detect phishing attacks.
In the future, we plan to extend the work in this paper in a number of ways. First, we want to
continue enriching the feature set to see whether this further improves the results. Second,
we want to give GP more computational time (by increasing the population size) to see whether
it can find better models. Last but not least, we want to analyze the obtained models more
thoroughly to better understand the factors that affect the prediction accuracy.
REFERENCES
1. RSA, “Phishing in season: A look at online fraud in 2012,” 2012.
2. R. Poli, W. Langdon, and N. McPhee, A Field Guide to Genetic Programming. 2008.
3. S. Mabu, C. Chen, N. Lu, K. Shimada, and K. Hirasawa, “An intrusion-detection model based
on fuzzy class-association-rule mining using genetic network programming,” IEEE Trans. on
Systems, Man, and Cybernetics, Part C, 41(2011), 130–139.
4. P. T. Anh, N. Q. Uy, and N. X. Hoai, “Phishing attacks detection using genetic
programming,” in The 5th Inter. Conf. on Knowledge and Systems Eng., KSE, 2013.
5. C. Ludl, S. McAllister, E. Kirda, and C. Kruegel, “On the effectiveness of techniques to
detect phishing sites,” in DIMVA. Springer, 2007.
6. Microsoft, “Sender ID home page,” Website, 2007.
7. Yahoo, “Yahoo! antispam resource center,” 2007.
8. B. Ross, C. Jackson, N. Miyake, D. Boneh, and J. C. Mitchell, “Stronger password
authentication using browser extensions,” in Proc. of the 14th USENIX Security Symposium.
USENIX, Aug. 2005.
9. H. Liu and H. Motoda, Feature Selection for Knowledge Discovery and Data Mining. Kluwer
Academic Publishers, 1998.
10. M. Verleysen, F. Rossi, and D. François, “Advances in feature selection with mutual
information,” 2009.
11. Scrapy, “Scrapy: web crawling framework,” Org.
ABSTRACT
PHISHING ATTACK DETECTION USING GENETIC PROGRAMMING AND FEATURE SELECTION
Phạm Tuấn Anh¹, Chu Thị Hường², Nguyễn Hoàng Quân²,
Nguyễn Quang Uy², Nguyễn Xuân Hoài³, Nguyễn Văn Trường⁴*
¹Military Academy of Logistics, ²Le Quy Don University,
³Hanoi University, ⁴College of Education, Thai Nguyen University
Phishing is a real danger on the Internet today, so the fight against phishing attacks is
important. In this paper, we propose a solution to this problem that applies Genetic
Programming (GP) combined with feature selection methods to detect phishing. We conducted
experiments on a data set containing both phishing and legitimate web sites collected from the
Internet, and then compared the performance of GP with several other machine learning methods.
The results show that GP is the best solution for the phishing detection problem.
Keywords: Genetic programming, phishing attack, machine learning
Received: 29/4/2014; Reviewed: 13/5/2014; Accepted for publication: 25/8/2014
Scientific reviewer: Dr. Vũ Việt Vũ, Thai Nguyen University of Technology, TNU
* Tel: 0915 016063, Email: nvtruongtn@gmail.com