CNN-Based System Log Classification

Convolutional neural networks (CNNs) are among the most widely used types of deep learning network today; their defining feature is that they learn to extract features from data automatically. CNNs have achieved very encouraging results in fields such as image recognition and semantic analysis, and this article presents a concrete method for classifying system logs with a CNN. By reading it you will learn:
- how to preprocess text data for CNN analysis;
- how to build and use a one-dimensional convolutional network;
- how to apply a CNN to system log classification.

System Logs

System logs are the records produced by software systems running on a computer, usually stored as plain text files. They contain a large amount of raw information about how a system runs and how it is used, and are a valuable data asset for any company. How to better analyze and mine the useful information buried in massive volumes of system logs has attracted growing attention in recent years. Large-scale log analysis is a systematic undertaking that spans everything from raw data collection to interactive visualization at the front end; here we focus on one specific application in the collection stage: using a CNN to classify log types automatically.

Problem Statement

When collecting system logs at scale, the log data usually needs to be classified, and this classification is normally configured in advance by engineers or users. In practice, however, the log type is often specified incorrectly or is simply unknown, which frequently forces the logs to be re-collected or degrades later analysis and mining. The CNN-based log type recognition described in this article is an effective remedy for such problems.

Log Classification with a CNN

The samples used in this example come from the MonitorWare Log Samples collection and cover 17 log types; to keep the example small, only 5 records are taken for each type. The data is as follows:

Log Samples---Name
Mar 29 2004 09:55:23: %PIX-6-302005: Built UDP connection for faddr 193.192.160.244/3053 gaddr 10.0.0.187/53 laddr 192.168.0.2/53---Cisco PIX
Mar 29 2004 09:55:23: %PIX-6-302006: Teardown UDP connection for faddr 193.192.160.244/3053 gaddr 10.0.0.187/53 laddr 192.168.0.2/53---Cisco PIX
Mar 29 2004 09:55:23: %PIX-6-302005: Built UDP connection for faddr 193.192.160.244/3053 gaddr 10.0.0.187/53 laddr 192.168.0.2/53---Cisco PIX
Mar 29 2004 09:55:25: %PIX-6-302005: Built UDP connection for faddr 66.196.65.40/51250 gaddr 10.0.0.187/53 laddr 192.168.0.2/53---Cisco PIX
Mar 29 2004 09:55:31: %PIX-6-302001: Built outbound TCP connection 152017 for faddr 212.56.240.37/9200 gaddr 10.0.0.187/2795 laddr 192.168.0.2/2795 ()---Cisco PIX
Sun, 2004-03-28 15:30:45 - TCP packet - Source:172.21.0.1,4662 ,LAN - Destination:80.142.227.227,4662 ,WAN [Drop] - [TCP preconnect traffic dropped]---NetGear FWG114P
Sun, 2004-03-28 15:31:34 - TCP packet - Source:82.82.111.67,4662 ,WAN - Destination:217.224.147.21,4762 ,LAN [Drop] - [TCP preconnect traffic dropped]---NetGear FWG114P
Sun, 2004-03-28 15:31:39 - TCP packet - Source:80.5.99.100,4662 ,WAN - Destination:217.224.147.21,4788 ,LAN [Drop] - [TCP preconnect traffic dropped]---NetGear FWG114P
Sun, 2004-03-28 15:36:10 - TCP packet - Source:140.112.243.228,5442 ,WAN - Destination:217.224.147.21,3283 ,LAN [Drop] - [TCP preconnect traffic dropped]---NetGear FWG114P
Sun, 2004-03-28 15:39:54 - TCP packet - Source:217.234.212.2,0 ,WAN - Destination:217.224.147.21,0 ,LAN [Drop] - [Fragment Attack]---NetGear FWG114P
Mar 12 12:00:08 server2 rcd[308]: Loaded 12 packages in 'ximian-red-carpet|351' (0.01878 seconds)---SuSE SLES 8
Mar 12 12:00:08 server2 rcd[308]: id=304 COMPLETE 'Downloading https://server2/data/red-carpet.rdf' time=0s (failed)---SuSE SLES 8
Mar 12 12:00:08 server2 rcd[308]: Unable to downloaded licenses info: Unable to authenticate - Authorization Required (https://server2/data/red-carpet.rdf)---SuSE SLES 8
Mar 12 12:10:00 server2 /USR/SBIN/CRON[6808]: (root) CMD ( /usr/lib/sa/sa1 )---SuSE SLES 8
Mar 12 12:20:00 server2 /USR/SBIN/CRON[6837]: (root) CMD ( /usr/lib/sa/sa1 )---SuSE SLES 8
Mar 12 12:01:02 server4 snort: alert_multiple_requests: ACTIVE---RedHat Enterprise Linux
Mar 12 12:01:02 server4 snort: telnet_decode arguments:---RedHat Enterprise Linux
Mar 12 12:01:02 server4 snort: snort startup succeeded---RedHat Enterprise Linux
Mar 12 12:01:02 server4 snort: Ports to decode telnet on: 21 23 25 119---RedHat Enterprise Linux
Mar 12 12:01:03 server4 snort: Snort initialization completed successfully---RedHat Enterprise Linux
Mar 10 03:19:48 server5 syslog: su : + tty?? root-informix---HP-UX B.10.20
Mar 11 03:19:54 server5 syslog: su : + tty?? root-informix---HP-UX B.10.20
Mar 12 03:19:51 server5 syslog: su : + tty?? root-informix---HP-UX B.10.20
Mar 12 09:27:20 server5 syslog: su : - ttyp1 user-informix---HP-UX B.10.20
Mar 12 09:27:35 server5 syslog: su : + ttyp1 user-informix---HP-UX B.10.20
Mar 12 08:24:51 server6 sshd[24742]: Accepted password for netscape from 111.222.333.444 port 1420 ssh2---HP UX B.11.00
Mar 12 08:25:15 server6 tftpd[24241]: Timeout (no requests in 10 minutes)---HP UX B.11.00
Mar 12 08:49:53 server6 ftpd[27281]: FTP LOGIN FROM 111.222.333.444 [111.222.333.444], netscape---HP UX B.11.00
Mar 12 09:05:22 server6 ftpd[27281]: exiting on signal 14---HP UX B.11.00
Mar 12 12:32:24 server6 sshd[11187]: Accepted password for jfalgout from 111.222.333.444 port 34138 ssh2---HP UX B.11.00
Mar 12 11:44:20 server7 ftpd[25306]: Goodbye.---HP UX B.11.11
Mar 12 11:44:35 server7 tftpd[24955]: Timeout (no requests in 10 minutes)---HP UX B.11.11
Mar 12 12:17:03 server7 sshd[26501]: pam_authenticate: error Authentication failed---HP UX B.11.11
Mar 12 12:17:03 server7 sshd[26501]: Accepted publickey for user from 111.222.333.444 port 32774 ssh2---HP UX B.11.11
Mar 12 12:34:23 server7 sshd[27393]: pam_authenticate: error Authentication failed---HP UX B.11.11
Mar 16 00:00:08 evita postfix/smtpd[1713]: connect from dialpool-210-214-5-215.maa.sify.net[210.214.5.215]---Postfix
Mar 16 00:00:09 evita postfix/smtpd[1713]: NOQUEUE: reject: RCPT from dialpool-210-214-5-215.maa.sify.net[210.214.5.215]: 554 Service unavailable; Client host [210.214.5.215] blocked using dnsbl.sorbs.net; Dynamic IP Address See: http://www.dnsbl.sorbs.net/cgi-bin/lookup?IP=210.214.5.215; from= to= proto=SMTP helo=---Postfix
Mar 16 00:00:11 evita postfix/smtpd[1713]: disconnect from dialpool-210-214-5-215.maa.sify.net[210.214.5.215]---Postfix
Mar 16 00:01:25 evita postfix/smtpd[1713]: connect from camomile.cloud9.net[168.100.1.3]---Postfix
Mar 16 00:01:28 evita postfix/smtpd[1713]: EA11834022: client=camomile.cloud9.net[168.100.1.3]---Postfix
64.242.88.10 - - [07/Mar/2004:16:05:49 -0800] "GET /twiki/bin/edit/Main/Double_bounce_sender?topicparent=Main.ConfigurationVariables HTTP/1.1" 401 12846---Apache
64.242.88.10 - - [07/Mar/2004:16:06:51 -0800] "GET /twiki/bin/rdiff/TWiki/NewUserTemplate?rev1=1.3&rev2=1.2 HTTP/1.1" 200 4523---Apache
64.242.88.10 - - [07/Mar/2004:16:10:02 -0800] "GET /mailman/listinfo/hsdivision HTTP/1.1" 200 6291---Apache
64.242.88.10 - - [07/Mar/2004:16:11:58 -0800] "GET /twiki/bin/view/TWiki/WikiSyntax HTTP/1.1" 200 7352---Apache
64.242.88.10 - - [07/Mar/2004:16:20:55 -0800] "GET /twiki/bin/view/Main/DCCAndPostFix HTTP/1.1" 200 5253---Apache
Feb 2 09:00:14 avas.example.com amavisd[11568]: Perl version 5.008001---Amavis-New
Feb 2 09:00:14 avas.example.com amavisd[11568]: Module Amavis::Conf 1.15---Amavis-New
Feb 2 09:00:14 avas.example.com amavisd[11568]: Module Archive::Tar 1.07---Amavis-New
Feb 2 09:00:14 avas.example.com amavisd[11568]: Module Archive::Zip 1.08---Amavis-New
Feb 2 09:00:14 avas.example.com amavisd[11568]: Module Compress::Zlib 1.31---Amavis-New
Mar 7 04:05:00 avas CROND[11233]: (cronjob) CMD (/usr/bin/mrtg /etc/mrtg/mrtg.cfg)---Cron Daemon
Mar 7 04:05:00 avas CROND[11234]: (mailman) CMD (/usr/local/bin/python -S /usr/local/mailman/cron/gate_news)---Cron Daemon
Mar 7 04:10:00 avas CROND[11253]: (cronjob) CMD (/usr/bin/mrtg /etc/mrtg/mrtg.cfg)---Cron Daemon
Mar 7 04:10:00 avas CROND[11254]: (cronjob) CMD (/usr/lib/sa/sa1 1 1)---Cron Daemon
Mar 7 04:10:00 avas CROND[11257]: (cronjob) CMD (/sbin/dcccollect.sh)---Cron Daemon
Mar 6 03:52:07 avas dccd[13284]: 1.2.32 database /home/dcc/dcc_db reopened with 997 MByte window---Distributed Checksum Clearinghouse Server
Mar 6 04:12:03 avas dccd[13284]: "packet length 44 too small for REPORT" sent to client 1 at 80.8.131.68,41000---Distributed Checksum Clearinghouse Server
Mar 6 19:01:12 avas dccd[13284]: no incoming flood connection from dcc1.example.no, server-ID XXXX---Distributed Checksum Clearinghouse Server
Mar 6 19:01:42 avas dccd[13284]: no outgoing flood connection to dcc1.example.no, server-ID XXXX---Distributed Checksum Clearinghouse Server
Mar 6 20:06:37 avas dccd[13284]: "packet length 44 too small for REPORT" sent to client 1 at 194.63.250.215,56007---Distributed Checksum Clearinghouse Server
Mar 12 13:23:58 avas sshd[23510]: Failed none for illegal user phil from 10.0.0.153 port 2006 ssh2---Red Hat Linux Server
Mar 12 13:23:58 avas sshd[23510]: Failed keyboard-interactive for illegal user phil from 10.0.0.153 port 2006 ssh2---Red Hat Linux Server
Mar 12 13:23:58 avas sshd[23510]: Disconnecting: Too many authentication failures for avas.cnc.bc.ca---Red Hat Linux Server
Mar 12 13:24:17 avas sshd[23522]: Could not reverse map address 10.0.0.153.---Red Hat Linux Server
Mar 12 13:24:17 avas sshd[23522]: Accepted password for tom from 10.0.0.153 port 2007 ssh2---Red Hat Linux Server
[13:35:15] [13:35:15] Scanning for directory /usr/lib/...... [13:35:15] OK. Not found.---RK Hunter
[13:35:15] [13:35:15] Scanning for directory /usr/lib/.../bkit-ssh... [13:35:15] OK. Not found.---RK Hunter
[13:35:15] [13:35:15] Scanning for directory /usr/lib/.bkit-... [13:35:15] OK. Not found.---RK Hunter
[13:35:15] [13:35:15] Scanning for directory /tmp/.bkp... [13:35:15] OK. Not found.---RK Hunter
[13:35:15] [13:35:15] *** Start scan CiNIK Worm (Slapper.B variant) ***---RK Hunter
[Sun Mar 7 05:39:40 2004] up2date new up2date run started---Up 2 Date
[Sun Mar 7 05:39:40 2004] up2date Opening rpmdb in /var/lib/rpm/ with option 0---Up 2 Date
[Sun Mar 7 05:39:40 2004] up2date Opening rpmdb in /var/lib/rpm/ with option 0---Up 2 Date
[Sun Mar 7 09:39:40 2004] up2date new up2date run started---Up 2 Date
[Sun Mar 7 09:39:40 2004] up2date Opening rpmdb in /var/lib/rpm/ with option 0---Up 2 Date
[11-29-2002 - 15:22:37] Client at 24.69.73.3: URL contains high bit character. Request will be rejected. Site Instance='1', Raw URL='/scripts/mail.exe/2001���.jpg'---URL Scan
[11-29-2002 - 15:22:47] Client at 24.69.73.3: URL contains high bit character. Request will be rejected. Site Instance='1', Raw URL='/scripts/mail.exe/2001���.jpg'---URL Scan
[11-29-2002 - 21:15:17] Client at 24.67.253.204: URL contains extension '.com', which is disallowed. Request will be rejected. Site Instance='1', Raw URL='/scripts/www.the5yearjournal.com'---URL Scan
[12-02-2002 - 09:52:33] Client at 142.27.68.15: URL contains high bit character. Request will be rejected. Site Instance='1', Raw URL='/scripts/mail.exe/2001%C2%A4%C3%AB%C2%BE%C3%A4.jpg'---URL Scan
[12-02-2002 - 09:52:43] Client at 142.27.68.15: URL contains high bit character. Request will be rejected. Site Instance='1', Raw URL='/scripts/mail.exe/2001%C2%A4%C3%AB%C2%BE%C3%A4.jpg'---URL Scan
(II) PCI: 00:00:0: chip 8086,2560 card 174b,174b rev 03 class 06,00,00 hdr 00---X Free 86
(II) PCI: 00:02:0: chip 8086,2562 card 174b,174b rev 03 class 03,00,00 hdr 00---X Free 86
(II) PCI: 00:1d:0: chip 8086,24c2 card 174b,174b rev 02 class 0c,03,00 hdr 80---X Free 86
(II) PCI: 00:1d:1: chip 8086,24c4 card 174b,174b rev 02 class 0c,03,00 hdr 00---X Free 86
(II) PCI: 00:1d:2: chip 8086,24c7 card 174b,174b rev 02 class 0c,03,00 hdr 00---X Free 86

To avoid clashes with the log text itself, this CSV uses three dashes ('---') as its column separator.
The example code is as follows:

# CNN for log classification
import numpy
import pandas
import re
from keras.models import Sequential
from keras.layers import Dense
from keras.layers import Flatten
from keras.layers.convolutional import Convolution1D
from keras.layers.convolutional import MaxPooling1D
from keras.layers.embeddings import Embedding
from keras.preprocessing import sequence
from keras.utils import np_utils

# utility functions
def str2numbers(strs):
    # Map every word of every log line to the sum of its character codes.
    res = []
    for line in strs:
        words = re.split(r"[\s/]", line)
        res.append([sum(ord(c) for c in word) for word in words])
    return res

def predictions2className(classes, predictions):
    # Translate each probability vector back to the most likely class name.
    res = []
    for prediction in predictions:
        index = numpy.argmax(prediction)
        res.append(classes[index])
    return res

# fix random seed for reproducibility
seed = 7
numpy.random.seed(seed)

# load the CSV and shuffle the rows
dataframe = pandas.read_csv('sys-log.csv', sep='---', engine='python')
dataset = dataframe.values
numpy.random.shuffle(dataset)
X_str = dataset[:, 0]
y_str = dataset[:, 1]

# 80/20 train/test split
train_size = int(len(X_str) * 0.8)
test_size = len(X_str) - train_size
X_num = str2numbers(X_str)
X_train = X_num[:train_size]
X_test = X_num[train_size:]

# one-hot encode the log-type labels
classes, y_num = numpy.unique(y_str, return_inverse=True)
y_oneHot = np_utils.to_categorical(y_num)
y_train = y_oneHot[:train_size]
y_test = y_oneHot[train_size:]

# pad every log to a fixed length of max_words integers
max_words = 100
X_train = sequence.pad_sequences(X_train, maxlen=max_words)
X_test = sequence.pad_sequences(X_test, maxlen=max_words)

# create the model
top_words = 10000
outputs = len(classes)
model = Sequential()
model.add(Embedding(top_words, 32, input_length=max_words))
model.add(Convolution1D(nb_filter=32, filter_length=3, border_mode='same', activation='relu'))
model.add(MaxPooling1D(pool_length=2))
model.add(Flatten())
model.add(Dense(250, activation='relu'))
model.add(Dense(outputs, activation='softmax'))
model.compile(loss='categorical_crossentropy', optimizer='adam', metrics=['accuracy'])
print(model.summary())

# fit the model
model.fit(X_train, y_train, validation_data=(X_test, y_test), nb_epoch=30, batch_size=1, verbose=1)

# final evaluation of the model
scores = model.evaluate(X_test, y_test, verbose=0)
print("Accuracy: %.2f%%" % (scores[1] * 100))
predictions = model.predict(X_test, verbose=0)
result = predictions2className(classes, predictions)
print("############## Test logs: ", X_str[train_size:])
print("############## Log types: ", y_str[train_size:])
print("############## Prediction types: ", result)

Before running the example, save the sample data above as "sys-log.csv" in the same directory as the code. We will now walk through the key parts of the example in detail.

dataframe = pandas.read_csv('sys-log.csv', sep='---', engine='python')

The CSV data is loaded with pandas, using the three-dash string '---' as the column separator (a separator longer than one character requires the pure-Python parsing engine, hence engine='python').
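As a quick check of the separator handling, the call above can be reproduced on an in-memory sample (the two log strings here are made up for illustration):

```python
import io
import pandas

# Hypothetical two-row sample in the same three-dash format as sys-log.csv.
csv_text = (
    "Log Samples---Name\n"
    "some firewall line---Cisco PIX\n"
    "some web server line---Apache\n"
)
# A separator longer than one character needs the pure-Python parsing engine.
df = pandas.read_csv(io.StringIO(csv_text), sep='---', engine='python')
print(df.shape)            # → (2, 2)
print(list(df.columns))    # → ['Log Samples', 'Name']
```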

numpy.random.shuffle(dataset)

The sample data file is organized by log type, so after loading we shuffle the rows into random order so that the data serves model training and testing better.

train_size = int(len(X_str) * 0.8)
test_size = len(X_str) - train_size

80% of the samples are used for training and the remaining 20% for testing.

X_num = str2numbers(X_str)

Here the log text is converted into numeric form: every word becomes an integer (see the str2numbers() function for the details). This word-to-integer conversion prepares the data for the word embedding step, which is covered in detail below.
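To make the conversion concrete, here is a self-contained version of the helper together with its output on a single made-up log fragment (the input string is illustrative only):

```python
import re

def str2numbers(strs):
    # Split each log line on whitespace and '/', then map every word to the
    # sum of its characters' code points (adjacent separators yield an empty
    # word, which sums to 0).
    res = []
    for line in strs:
        words = re.split(r"[\s/]", line)
        res.append([sum(ord(c) for c in word) for word in words])
    return res

print(str2numbers(["GET /index"]))  # → [[224, 0, 536]]
```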

classes, y_num = numpy.unique(y_str, return_inverse=True)
y_oneHot = np_utils.to_categorical(y_num)

The target log types are converted to a one-hot representation, handled by the to_categorical method of the Keras utility module np_utils.
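The same encoding can be sketched without Keras; the numpy.eye lookup below produces an identical one-hot matrix for a small, made-up label array:

```python
import numpy

# Four made-up labels drawn from three classes.
y_str = numpy.array(['Apache', 'Cisco PIX', 'Apache', 'Postfix'])
# unique() returns the sorted class names plus each row's class index.
classes, y_num = numpy.unique(y_str, return_inverse=True)
# Indexing the identity matrix by class index yields one-hot rows,
# matching what np_utils.to_categorical produces.
y_oneHot = numpy.eye(len(classes))[y_num]
print(classes)           # → ['Apache' 'Cisco PIX' 'Postfix']
print(y_oneHot.shape)    # → (4, 3)
```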

max_words = 100
X_train = sequence.pad_sequences(X_train, maxlen=max_words)
X_test = sequence.pad_sequences(X_test, maxlen=max_words)

Before modeling, every sample is normalized to a uniform length; this example uses 100, but the right value depends on the data. The pad_sequences method of the sequence module processes each item to the length given by maxlen: items shorter than maxlen are left-padded with zeros, and items longer than maxlen are truncated from the left so that only the last maxlen values are kept.
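The padding-and-truncation rule just described can be mimicked with a few lines of numpy (this re-implementation is a sketch, not the Keras source):

```python
import numpy

def pad_left(seqs, maxlen):
    # Left-pad short sequences with zeros; for long sequences keep only the
    # last maxlen values (i.e. truncate from the left), mirroring the default
    # 'pre' padding/truncating behaviour of pad_sequences.
    out = numpy.zeros((len(seqs), maxlen), dtype=int)
    for i, seq in enumerate(seqs):
        kept = seq[-maxlen:]
        out[i, maxlen - len(kept):] = kept
    return out

print(pad_left([[1, 2], [1, 2, 3, 4, 5]], 4))
# → [[0 0 1 2]
#    [2 3 4 5]]
```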

model.add(Embedding(top_words, 32, input_length=max_words))

This adds the Embedding layer. Word embedding is a vocabulary-processing technique from natural language processing (NLP) in machine learning: it maps each word to a real-valued vector so that semantically similar words end up close together in the vector space, while also keeping the resulting word matrix low-dimensional. The Embedding layer here turns each log, a sequence of 100 integers, into a 100x32 real-valued matrix that is then fed to the convolutional layer for feature extraction.
- top_words: the upper bound on the input values, also called the vocabulary size; every integer in the input must be smaller than this value;
- 32: the dimensionality of each output word vector;
- input_length: the length of each input sequence. This parameter is mandatory when a Flatten layer follows, because Flatten needs the shape information.
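At its core an embedding layer is just a table lookup. The numpy sketch below (random weights and indices, illustrative sizes) shows the shape transformation: a length-100 integer sequence becomes a 100x32 matrix of row vectors:

```python
import numpy

# Illustrative sizes matching the article's parameters.
top_words, embedding_dim, max_words = 10000, 32, 100
# The layer's trainable weight matrix: one 32-dimensional row per vocabulary index.
weights = numpy.random.rand(top_words, embedding_dim)
# A made-up log encoded as 100 integer word codes.
indices = numpy.random.randint(0, top_words, size=max_words)
# Embedding = row lookup: each index selects its row vector.
embedded = weights[indices]
print(embedded.shape)  # → (100, 32)
```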

model.add(Convolution1D(nb_filter=32, filter_length=3, border_mode='same', activation='relu'))
model.add(MaxPooling1D(pool_length=2))

The Convolution1D layer performs a one-dimensional convolution; its working principle is the same as Convolution2D's. Here the number of filters is 32, the filter size is 3, the border mode is 'same', and the activation function is relu.
After the convolutional layer we add a MaxPooling1D layer, which applies max pooling with a pool length of 2 to the feature maps extracted by the previous layer.
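For intuition, here is a single-filter, single-channel numpy sketch of these two operations (the real layer applies 32 such filters across the 32 embedding channels; the kernel and input values below are made up):

```python
import numpy

def conv1d_same(x, kernel):
    # 'same' border mode: zero-pad both ends so the output keeps the input length.
    k = len(kernel)
    xp = numpy.pad(x, k // 2)
    return numpy.array([numpy.dot(xp[i:i + k], kernel) for i in range(len(x))])

def maxpool1d(x, length=2):
    # Non-overlapping max pooling: keep the maximum of every block of `length` values.
    return numpy.array([x[i:i + length].max() for i in range(0, len(x) - length + 1, length)])

x = numpy.array([1.0, -2.0, 3.0, 0.0, 1.0, 2.0])
conv = numpy.maximum(conv1d_same(x, numpy.array([1.0, 0.0, -1.0])), 0)  # relu
y = maxpool1d(conv)
print(y)  # → [2. 2. 1.]
```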

model.add(Flatten())
model.add(Dense(250, activation='relu'))
model.add(Dense(outputs, activation='softmax'))

After the convolution and pooling stages, a Flatten layer converts the 2-D feature matrix into a 1-D vector, which is then passed to a fully connected network (a hidden layer of 250 neurons and an output layer of outputs neurons). Since the goal is to classify the input, the output layer uses the softmax activation function.
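A numpy sketch of this classification head (the weights are random and the shapes are illustrative, with 17 outputs matching the number of log types in the sample data):

```python
import numpy

def softmax(z):
    # Numerically stable softmax: non-negative outputs that sum to 1.
    e = numpy.exp(z - z.max())
    return e / e.sum()

# A made-up pooled feature map of 50 time steps x 32 filters.
features = numpy.random.rand(50, 32)
flat = features.flatten()                 # Flatten: (50, 32) -> (1600,)
W1 = numpy.random.rand(1600, 250) * 0.01  # hidden layer weights (biases omitted)
h = numpy.maximum(flat @ W1, 0)           # Dense(250) with relu
W2 = numpy.random.rand(250, 17) * 0.01    # output layer: one score per log type
probs = softmax(h @ W2)
print(probs.shape)                        # → (17,)
print(round(float(probs.sum()), 6))       # → 1.0
```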
The run output is shown below:

Epoch 28/30
68/68 [==============================] - 0s - loss: 5.9754e-04 - acc: 1.0000 - val_loss: 0.3522 - val_acc: 0.8824
Epoch 29/30
68/68 [==============================] - 0s - loss: 5.4314e-04 - acc: 1.0000 - val_loss: 0.3508 - val_acc: 0.8824
Epoch 30/30
68/68 [==============================] - 0s - loss: 5.0137e-04 - acc: 1.0000 - val_loss: 0.3527 - val_acc: 0.8824
Accuracy: 88.24%
############## Test logs: [ 'Mar 29 2004 09:55:23: %PIX-6-302005: Built UDP connection for faddr 193.192.160.244/3053 gaddr 10.0.0.187/53 laddr 192.168.0.2/53'
'64.242.88.10 - - [07/Mar/2004:16:20:55 -0800] "GET /twiki/bin/view/Main/DCCAndPostFix HTTP/1.1" 200 5253'
'Sun, 2004-03-28 15:31:39 - TCP packet - Source:80.5.99.100,4662 ,WAN - Destination:217.224.147.21,4788 ,LAN [Drop] - [TCP preconnect traffic dropped]'
'Feb 2 09:00:14 avas.example.com amavisd[11568]: Module Archive::Zip 1.08'
'(II) PCI: 00:1d:1: chip 8086,24c4 card 174b,174b rev 02 class 0c,03,00 hdr 00'
'Mar 16 00:01:28 evita postfix/smtpd[1713]: EA11834022: client=camomile.cloud9.net[168.100.1.3]'
'Sun, 2004-03-28 15:36:10 - TCP packet - Source:140.112.243.228,5442 ,WAN - Destination:217.224.147.21,3283 ,LAN [Drop] - [TCP preconnect traffic dropped]'
'64.242.88.10 - - [07/Mar/2004:16:10:02 -0800] "GET /mailman/listinfo/hsdivision HTTP/1.1" 200 6291'
'[Sun Mar 7 05:39:40 2004] up2date Opening rpmdb in /var/lib/rpm/ with option 0'
'(II) PCI: 00:00:0: chip 8086,2560 card 174b,174b rev 03 class 06,00,00 hdr 00'
'Mar 12 12:20:00 server2 /USR/SBIN/CRON[6837]: (root) CMD ( /usr/lib/sa/sa1 )'
'Mar 6 19:01:12 avas dccd[13284]: no incoming flood connection from dcc1.example.no, server-ID XXXX'
'Mar 12 09:27:20 server5 syslog: su : - ttyp1 user-informix'
'[13:35:15] [13:35:15] Scanning for directory /usr/lib/.bkit-... [13:35:15] OK. Not found.'
'Mar 12 08:24:51 server6 sshd[24742]: Accepted password for netscape from 111.222.333.444 port 1420 ssh2'
'[13:35:15] [13:35:15] Scanning for directory /tmp/.bkp... [13:35:15] OK. Not found.'
'Feb 2 09:00:14 avas.example.com amavisd[11568]: Module Archive::Tar 1.07']
############## Log types: ['Cisco PIX' 'Apache' 'NetGear FWG114P' 'Amavis-New' 'X Free 86' 'Postfix'
'NetGear FWG114P' 'Apache' 'Up 2 Date' 'X Free 86' 'SuSE SLES 8'
'Distributed Checksum Clearinghouse Server' 'HP-UX B.10.20' 'RK Hunter'
'HP UX B.11.00' 'RK Hunter' 'Amavis-New']
############## Prediction types: ['Cisco PIX', 'Apache', 'NetGear FWG114P', 'Amavis-New', 'X Free 86', 'RedHat Enterprise Linux', 'NetGear FWG114P', 'SuSE SLES 8', 'Up 2 Date', 'X Free 86', 'SuSE SLES 8', 'Distributed Checksum Clearinghouse Server', 'HP-UX B.10.20', 'RK Hunter', 'HP UX B.11.00', 'RK Hunter', 'Amavis-New']

The measured accuracy is 88.24%. Next we extend the model to see how a multi-layer convolutional architecture performs.

Multi-Layer Convolution

The extended multi-layer convolutional model is as follows:

model = Sequential()
model.add(Embedding(top_words, 32, input_length=max_words))
model.add(Convolution1D(nb_filter=32, filter_length=3, border_mode='same', activation='relu'))
model.add(MaxPooling1D(pool_length=2))
model.add(Convolution1D(nb_filter=32, filter_length=3, border_mode='same', activation='relu'))
model.add(MaxPooling1D(pool_length=2))
model.add(Flatten())
model.add(Dense(250, activation='relu'))
model.add(Dense(outputs, activation='softmax'))

On top of the previous example we add one more Convolution1D layer and one more MaxPooling1D layer.
The run output is shown below:

Epoch 18/20
68/68 [==============================] - 0s - loss: 0.0143 - acc: 1.0000 - val_loss: 0.6617 - val_acc: 0.9412
Epoch 19/20
68/68 [==============================] - 0s - loss: 0.0095 - acc: 1.0000 - val_loss: 0.6646 - val_acc: 0.8824
Epoch 20/20
68/68 [==============================] - 0s - loss: 0.0045 - acc: 1.0000 - val_loss: 0.6602 - val_acc: 0.9412
Accuracy: 94.12%
############## Test logs: [ 'Mar 29 2004 09:55:23: %PIX-6-302005: Built UDP connection for faddr 193.192.160.244/3053 gaddr 10.0.0.187/53 laddr 192.168.0.2/53'
'64.242.88.10 - - [07/Mar/2004:16:20:55 -0800] "GET /twiki/bin/view/Main/DCCAndPostFix HTTP/1.1" 200 5253'
'Sun, 2004-03-28 15:31:39 - TCP packet - Source:80.5.99.100,4662 ,WAN - Destination:217.224.147.21,4788 ,LAN [Drop] - [TCP preconnect traffic dropped]'
'Feb 2 09:00:14 avas.example.com amavisd[11568]: Module Archive::Zip 1.08'
'(II) PCI: 00:1d:1: chip 8086,24c4 card 174b,174b rev 02 class 0c,03,00 hdr 00'
'Mar 16 00:01:28 evita postfix/smtpd[1713]: EA11834022: client=camomile.cloud9.net[168.100.1.3]'
'Sun, 2004-03-28 15:36:10 - TCP packet - Source:140.112.243.228,5442 ,WAN - Destination:217.224.147.21,3283 ,LAN [Drop] - [TCP preconnect traffic dropped]'
'64.242.88.10 - - [07/Mar/2004:16:10:02 -0800] "GET /mailman/listinfo/hsdivision HTTP/1.1" 200 6291'
'[Sun Mar 7 05:39:40 2004] up2date Opening rpmdb in /var/lib/rpm/ with option 0'
'(II) PCI: 00:00:0: chip 8086,2560 card 174b,174b rev 03 class 06,00,00 hdr 00'
'Mar 12 12:20:00 server2 /USR/SBIN/CRON[6837]: (root) CMD ( /usr/lib/sa/sa1 )'
'Mar 6 19:01:12 avas dccd[13284]: no incoming flood connection from dcc1.example.no, server-ID XXXX'
'Mar 12 09:27:20 server5 syslog: su : - ttyp1 user-informix'
'[13:35:15] [13:35:15] Scanning for directory /usr/lib/.bkit-... [13:35:15] OK. Not found.'
'Mar 12 08:24:51 server6 sshd[24742]: Accepted password for netscape from 111.222.333.444 port 1420 ssh2'
'[13:35:15] [13:35:15] Scanning for directory /tmp/.bkp... [13:35:15] OK. Not found.'
'Feb 2 09:00:14 avas.example.com amavisd[11568]: Module Archive::Tar 1.07']
############## Log types: ['Cisco PIX' 'Apache' 'NetGear FWG114P' 'Amavis-New' 'X Free 86' 'Postfix'
'NetGear FWG114P' 'Apache' 'Up 2 Date' 'X Free 86' 'SuSE SLES 8'
'Distributed Checksum Clearinghouse Server' 'HP-UX B.10.20' 'RK Hunter'
'HP UX B.11.00' 'RK Hunter' 'Amavis-New']
############## Prediction types: ['Cisco PIX', 'Apache', 'NetGear FWG114P', 'Amavis-New', 'X Free 86', 'Up 2 Date', 'NetGear FWG114P', 'Apache', 'Up 2 Date', 'X Free 86', 'SuSE SLES 8', 'Distributed Checksum Clearinghouse Server', 'HP-UX B.10.20', 'RK Hunter', 'HP UX B.11.00', 'RK Hunter', 'Amavis-New']

In this run the multi-layer convolutional model achieves a somewhat higher test accuracy. Note, however, that the goal here is mainly to show how a multi-layer convolutional architecture is built; whether it actually improves on a single-layer model has to be judged problem by problem.

Summary

Having read this article, you now know how to implement CNN-based recognition and classification of system logs. The example code leaves plenty of room for improvement, and we hope you can build on it to get better results. We also hope the method described here finds its way into your own work, so that together we can show the world what deep learning can do :)

Further Reading