Python: Keras Neural Networks

The Python module Keras supports several deep learning backends, including TensorFlow, CNTK, and Theano. It gives researchers a clean, easy-to-use API and is widely used in both industry and academia.

Installing Keras

pip install keras
pip install theano

Edit the Keras configuration file ~/.keras/keras.json to set the backend to Theano:

# change backend to theano
vim ~/.keras/keras.json

Implementing a Simple Neural Network with Keras

import numpy as np
from keras.models import Sequential
from keras.layers import Dense, Activation
from keras.optimizers import SGD

# create training data with NumPy: the XOR truth table
# x is the feature matrix and y is the label vector
x_train = np.array([[0, 0],
                    [0, 1],
                    [1, 0],
                    [1, 1]])
y_train = np.array([[0],
                    [1],
                    [1],
                    [0]])

# load NN model
model = Sequential()

# specify the number of neurons in hidden layer
num_neurons = 10

# use Dense to create a fully connected feed forward network
model.add(Dense(num_neurons, input_dim=2))

# specify the activation function for the hidden layer
model.add(Activation('tanh'))

# create the output layer
model.add(Dense(1))
model.add(Activation('sigmoid'))
model.summary()

# specify the learning rate for the stochastic gradient descent algorithm
sgd = SGD(lr=0.1)

# specify the loss function
model.compile(loss='binary_crossentropy', optimizer=sgd, metrics=['accuracy'])

# train the model and learn all the parameters
model.fit(x_train, y_train, epochs=1000)
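Under the hood, the compiled 2-10-1 network is just a composition of affine maps and nonlinearities. Below is a minimal NumPy sketch of the forward pass it computes, using random, untrained weights; the `forward()` helper and weight names are illustrative, not part of the Keras API:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def forward(x, w1, b1, w2, b2):
    """One forward pass of the 2-10-1 tanh/sigmoid network above."""
    hidden = np.tanh(x @ w1 + b1)      # Dense(10) followed by tanh
    return sigmoid(hidden @ w2 + b2)   # Dense(1) followed by sigmoid

# random (untrained) parameters with the same shapes Keras would create
rng = np.random.default_rng(0)
w1 = rng.normal(size=(2, 10))
b1 = np.zeros(10)
w2 = rng.normal(size=(10, 1))
b2 = np.zeros(1)

x = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
out = forward(x, w1, b1, w2, b2)
print(out.shape)  # (4, 1), one probability per input row
```

Training (model.fit above) then amounts to adjusting w1, b1, w2, b2 by stochastic gradient descent on the binary cross-entropy loss.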

10-fold Cross-Validation

Data

The data index file data.csv contains 1,935 records. Split the data into 10 equal folds for cross-validation.

ID task file structure content language score
1 TASK1 161102007511.txt 3.0 3.0 3.0 9.0
2 TASK1 161102008210.txt 3.0 3.5 3.0 9.5
……
1935 TASK3 161102007425.txt 4.0 3.5 3.5 11

Splitting the Data

Split the data with the createFolds() function from the R caret package:

# load package caret 
library(caret)

# read in data
data <- read.csv('data.csv')

# create 10 random folds
folds <- createFolds(data$score)

# add a new column folds to the original data
data$folds <- 0

for ( i in 1:length(folds) ) {
    data$folds[folds[[i]]] <- i
}

This produces a new index file data.csv with an extra folds column:

ID task file structure content language score folds
1 TASK1 161102007511.txt 3.0 3.0 3.0 9.0 3
2 TASK1 161102008210.txt 3.0 3.5 3.0 9.5 4
……
1935 TASK3 161102007425.txt 4.0 3.5 3.5 11 8
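For readers working outside R, the same random fold assignment can be sketched in plain Python. Note that caret's createFolds() additionally balances the folds on the outcome variable (score), which this simple sketch does not attempt:

```python
import random

def create_folds(n_rows, k=10, seed=42):
    """Assign each of n_rows row indices to one of k folds of
    near-equal size, returning a 1-based fold label per row."""
    idx = list(range(n_rows))
    random.Random(seed).shuffle(idx)
    labels = [0] * n_rows
    for pos, row in enumerate(idx):
        labels[row] = pos % k + 1  # fold labels 1..k
    return labels

folds = create_folds(1935)
print(len(folds), min(folds), max(folds))  # 1935 1 10
```

Each record's fold label can then be written back as a new column, just as the R loop above does.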

Perl: The LWP Module

The following Perl script uses the LWP module to download web pages in batch:

#!/usr/bin/perl

# usage:
# put this script and the index file in the same directory
# create a directory for your files, e.g. C:/corpus
# and run the following command:
# perl downloader.pl C:/corpus 1 10
# all the files will be saved in the directory C:/corpus
# check the error_log for anything that goes wrong

use strict;
use warnings;

use File::Spec;
use LWP;

my $ua = LWP::UserAgent->new;
my $dir = shift @ARGV;
my $start_id = shift @ARGV;
my $end_id = shift @ARGV;

open my $INDEX, '<', 'index' or die "Cannot open index: $!";
open my $LOG, '>', 'error_log' or die "Cannot open error_log: $!";

while ( defined(my $row = <$INDEX>) ) {
    next if $. < $start_id + 1; # skip index header
    last if $. > $end_id + 1;
    chomp $row;
    my @records = split /\t/, $row;
    my $url = $records[3]; # get url
    my $id = $records[5]; # get id

    $url = "http://www.thesite.com/Archives/" . $url;
    my $out_fn = File::Spec->catfile($dir, sprintf("%06d", $id));
    open my $OUT, '>', $out_fn or die "Cannot open $out_fn: $!";

    print 'Downloading file ', $id, "\n";
    print 'url: ', $url, "\n";
    print 'save to: ', $out_fn, "\n";

    my $response = $ua->get($url);
    if ( $response->is_success ) {
        print $OUT $response->content;
        print 'Done!', "\n";
    } else {
        print $LOG $id, "\t", 'Error in ', $url, "\n";
        print 'Error!! Please check error_log for more information.', "\n";
    }
    close $OUT;
}

__END__
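The core of the script, turning one tab-separated index row into a download URL and a zero-padded local filename, can be sketched in Python. The column positions and the base URL follow the Perl code above (thesite.com is a placeholder):

```python
import os

BASE_URL = "http://www.thesite.com/Archives/"

def parse_index_row(row, out_dir):
    """Split one tab-separated index row into (download URL, local path),
    mirroring the field positions used in the Perl script: the relative
    URL is in column 4 and the numeric id in column 6."""
    records = row.rstrip("\n").split("\t")
    url = BASE_URL + records[3]
    out_fn = os.path.join(out_dir, "%06d" % int(records[5]))
    return url, out_fn

# a hypothetical index row with the URL and id in the expected columns
row = "a\tb\tc\tpage42.html\te\t17\n"
url, out_fn = parse_index_row(row, "corpus")
print(url)     # http://www.thesite.com/Archives/page42.html
print(out_fn)  # e.g. corpus/000017
```

The actual download step would then fetch each url (for example with urllib.request) and write the response body to out_fn, logging failures as the Perl script does.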

R: Reading External Data

R can import data in many other formats, such as SAS, Excel, and CSV. For example, a SAS file sample.sas7bdat can be imported as follows:

# install and load the *haven* package
install.packages("haven", repos='http://cran.rstudio.com/')
library(haven)

# convert SAS format to an R data frame
sample.df <- read_sas("sample.sas7bdat")

After importing, the data can be manipulated as an ordinary R data frame.

Python: Fisher's Exact Test

The Python module fisher implements Fisher's exact test. It can be used to select highly discriminative features and thereby improve the accuracy of machine-learning algorithms.

Fisher's Exact Test

Automated essay scoring models typically extract highly discriminative n-gram features to distinguish high-scoring from low-scoring essays:

                Good Essay   Bad Essay
feature A               20           2
not feature A            2          20

The contingency table above shows that essays containing feature A tend to score higher. Fisher's exact test quantifies how skewed a feature's distribution is, so that highly discriminative features can be selected by p-value:

# import the fisher module
from fisher import pvalue

# compute the two-tailed p value
fisher_val = pvalue(20, 2, 2, 20).two_tail

The corresponding Perl code, using Inline::Python:

use Inline Python => <<'END';
from fisher import pvalue

def get_pvalue(a, b, c, d):
    return pvalue(a, b, c, d).two_tail

END

my $fisher_val = get_pvalue(20, 2, 2, 20);
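As a cross-check on the fisher module, the two-tailed p-value can also be computed directly from the hypergeometric distribution with only the standard library, by summing the probabilities of all tables (with the same margins) that are no more likely than the observed one:

```python
from math import comb

def fisher_two_tail(a, b, c, d):
    """Two-tailed Fisher exact p-value for the 2x2 table [[a, b], [c, d]]."""
    n = a + b + c + d
    row1, col1 = a + b, a + c

    def p(x):
        # hypergeometric P(top-left cell == x) with fixed margins
        return comb(col1, x) * comb(n - col1, row1 - x) / comb(n, row1)

    p_obs = p(a)
    lo = max(0, row1 - (n - col1))   # smallest feasible top-left cell
    hi = min(row1, col1)             # largest feasible top-left cell
    # sum over all tables at most as probable as the observed one
    # (small tolerance guards against floating-point ties)
    return sum(p(x) for x in range(lo, hi + 1) if p(x) <= p_obs * (1 + 1e-9))

pval = fisher_two_tail(20, 2, 2, 20)
print(pval)  # a very small p value: feature A is highly discriminative
```

For the table above the p-value is far below 0.05, confirming that feature A's skewed distribution across good and bad essays is statistically significant.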

R: DataFrame Sorting and Filtering

Data

ID Type Term Length Freq Cover Fisher
1 p nn 1 50127 1546 1.000000e+00
2 l the 1 19479 1537 1.279193e-02
……
475290 t the musics 2 1 1 1.0000000

The feature file vocab contains n-grams built from word forms, lemmas, and part-of-speech tags, together with each n-gram's length, frequency, coverage, and Fisher's exact test p-value. To improve the accuracy of downstream algorithms, the feature dimensionality needs to be reduced by filtering. This is done as follows:

Sorting and Filtering

# read in the feature file
feature <- read.delim("vocab")

# reduce the number of features
refined.feature <- feature[which(feature$Freq > 10 & feature$Cover > 2 & feature$Fisher < 0.05),]

# rank the remaining features by p value
refined.feature.order <- refined.feature[order(refined.feature$Fisher),]

In practice, R's filtering facilities are used to experiment with different threshold combinations and validate the resulting model's accuracy.
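The same filter-then-rank step can be sketched in plain Python; the field names follow the vocab file above, and the small in-memory feature list stands in for reading the file:

```python
# a few rows from the vocab file, as dictionaries
features = [
    {"Term": "nn",         "Freq": 50127, "Cover": 1546, "Fisher": 1.0},
    {"Term": "the",        "Freq": 19479, "Cover": 1537, "Fisher": 1.279193e-02},
    {"Term": "the musics", "Freq": 1,     "Cover": 1,    "Fisher": 1.0},
]

# keep features that are frequent, well covered, and significant
refined = [f for f in features
           if f["Freq"] > 10 and f["Cover"] > 2 and f["Fisher"] < 0.05]

# rank the surviving features by p value, most discriminative first
refined.sort(key=lambda f: f["Fisher"])
print([f["Term"] for f in refined])  # ['the']
```

The list comprehension plays the role of R's which() subsetting, and sort(key=...) mirrors order() on the Fisher column.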

R: Stratified Sampling

Data

The task column of the data index file data.csv has three categories: task 1, task 2, and task 3. From each category, 80% of the records are randomly drawn as training data; the remaining 20% serve as test data.

ID task file structure content language score
1 TASK1 161102007511.txt 3.0 3.0 3.0 9.0
2 TASK1 161102008210.txt 3.0 3.5 3.0 9.5
……
1935 TASK3 161102007425.txt 4.0 3.5 3.5 11

Sampling

Use the strata() function from the R sampling package for stratified sampling, and save the resulting train and test indexes as train.csv and test.csv.

# clear memory
rm(list=ls(all=T))

# load sampling package
library(sampling)

# read in data index
task.idx <- read.csv("data.csv")

# select train and test data
n <- round(4/5*nrow(task.idx)/3)
sub_train <- strata(task.idx, stratanames=c("task"), size=rep(n, 3), method="srswor")
data_train <- task.idx[sub_train$ID_unit,]
data_test <- task.idx[-sub_train$ID_unit,]

# save the index of train and test data
write.csv(data_train, file="train.csv", quote=F, row.names=F)
write.csv(data_test, file="test.csv", quote=F, row.names=F)
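The stratified 80/20 split can also be sketched in plain Python: group the rows by stratum, shuffle each group, and take the first 80% of each for training (simple random sampling without replacement within each stratum, as method="srswor" does above). The toy rows below are illustrative:

```python
import random
from collections import defaultdict

def stratified_split(rows, key, train_frac=0.8, seed=1):
    """Split rows into (train, test), drawing train_frac of each
    stratum defined by row[key] without replacement."""
    by_stratum = defaultdict(list)
    for row in rows:
        by_stratum[row[key]].append(row)
    rng = random.Random(seed)
    train, test = [], []
    for stratum_rows in by_stratum.values():
        rng.shuffle(stratum_rows)
        cut = round(train_frac * len(stratum_rows))
        train.extend(stratum_rows[:cut])
        test.extend(stratum_rows[cut:])
    return train, test

# 30 toy rows, 10 per task, standing in for data.csv
rows = [{"ID": i, "task": "TASK%d" % (i % 3 + 1)} for i in range(30)]
train, test = stratified_split(rows, "task")
print(len(train), len(test))  # 24 6
```

Because the split is done within each task, both the train and test sets preserve the original task proportions.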