Python: Keras Neural Networks

The Python module Keras supports several deep learning backends, including TensorFlow, CNTK, and Theano. It gives researchers a clean, easy-to-use API and is widely used in both industry and academia.

Installing Keras

pip install keras
pip install theano

Edit the Keras configuration file ~/.keras/keras.json to set the backend to Theano:

# change backend to theano
vim ~/.keras/keras.json

Implementing a Simple Neural Network with Keras

import numpy as np
from keras.models import Sequential
from keras.layers import Dense, Activation
from keras.optimizers import SGD

# create training data with NumPy: the XOR truth table
# x is the feature matrix and y is the label vector
x_train = np.array([[0, 0],
                    [0, 1],
                    [1, 0],
                    [1, 1]])
y_train = np.array([[0],
                    [1],
                    [1],
                    [0]])

# load NN model
model = Sequential()

# specify the number of neurons in hidden layer
num_neurons = 10

# use Dense to create a fully connected feed forward network
model.add(Dense(num_neurons, input_dim=2))

# specify the activation function for the hidden layer
model.add(Activation('tanh'))

# create the output layer
model.add(Dense(1))
model.add(Activation('sigmoid'))
model.summary()

# specify the learning rate for the stochastic gradient descent algorithm
sgd = SGD(lr=0.1)

# specify the loss function
model.compile(loss='binary_crossentropy', optimizer=sgd, metrics=['accuracy'])

# train the model and learn all the parameters
model.fit(x_train, y_train, epochs=1000)
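Under the hood, the compiled 2-10-1 network is just a composition of affine maps and nonlinearities. Below is a minimal NumPy sketch of the forward pass it computes, using random, untrained weights; the `forward()` helper and weight names are illustrative, not part of the Keras API:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def forward(x, w1, b1, w2, b2):
    """One forward pass of the 2-10-1 tanh/sigmoid network above."""
    hidden = np.tanh(x @ w1 + b1)      # Dense(10) followed by tanh
    return sigmoid(hidden @ w2 + b2)   # Dense(1) followed by sigmoid

# random (untrained) parameters with the same shapes Keras would create
rng = np.random.default_rng(0)
w1 = rng.normal(size=(2, 10))
b1 = np.zeros(10)
w2 = rng.normal(size=(10, 1))
b2 = np.zeros(1)

x = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
out = forward(x, w1, b1, w2, b2)
print(out.shape)  # (4, 1), one probability per input row
```

Training (model.fit above) then amounts to adjusting w1, b1, w2, b2 by stochastic gradient descent on the binary cross-entropy loss.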

10-fold Cross-Validation

Data

The data index file data.csv contains 1,935 records. Split the data into 10 equal folds for cross-validation.

ID task file structure content language score
1 TASK1 161102007511.txt 3.0 3.0 3.0 9.0
2 TASK1 161102008210.txt 3.0 3.5 3.0 9.5
……
1935 TASK3 161102007425.txt 4.0 3.5 3.5 11

Splitting the Data

Split the data with the createFolds() function from the R caret package:

# load package caret 
library(caret)

# read in data
data <- read.csv('data.csv')

# create 10 random folds
folds <- createFolds(data$score)

# add a new column folds to the original data
data$folds <- 0

for ( i in 1:length(folds) ) {
    data$folds[folds[[i]]] <- i
}

This produces a new index file data.csv with an extra folds column:

ID task file structure content language score folds
1 TASK1 161102007511.txt 3.0 3.0 3.0 9.0 3
2 TASK1 161102008210.txt 3.0 3.5 3.0 9.5 4
……
1935 TASK3 161102007425.txt 4.0 3.5 3.5 11 8
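For readers working outside R, the same random fold assignment can be sketched in plain Python. Note that caret's createFolds() additionally balances the folds on the outcome variable (score), which this simple sketch does not attempt:

```python
import random

def create_folds(n_rows, k=10, seed=42):
    """Assign each of n_rows row indices to one of k folds of
    near-equal size, returning a 1-based fold label per row."""
    idx = list(range(n_rows))
    random.Random(seed).shuffle(idx)
    labels = [0] * n_rows
    for pos, row in enumerate(idx):
        labels[row] = pos % k + 1  # fold labels 1..k
    return labels

folds = create_folds(1935)
print(len(folds), min(folds), max(folds))  # 1935 1 10
```

Each record's fold label can then be written back as a new column, just as the R loop above does.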

Perl: The LWP Module

The following Perl script uses the LWP module to download web pages in batch:

#!/usr/bin/perl

# usage:
# put this script and the index file in the same directory
# create a directory for your files, e.g. C:/corpus
# and run the following command:
# perl downloader.pl C:/corpus 1 10
# all the files will be saved in the directory C:/corpus
# check the error_log for anything that goes wrong

use strict;
use warnings;

use File::Spec;
use LWP;

my $ua = LWP::UserAgent->new;
my $dir = shift @ARGV;
my $start_id = shift @ARGV;
my $end_id = shift @ARGV;

open my $INDEX, '<', 'index' or die "Cannot open index: $!";
open my $LOG, '>', 'error_log' or die "Cannot open error_log: $!";

while ( defined(my $row = <$INDEX>) ) {
    next if $. < $start_id + 1; # skip index header
    last if $. > $end_id + 1;
    chomp $row;
    my @records = split /\t/, $row;
    my $url = $records[3]; # get url
    my $id = $records[5]; # get id

    $url = "http://www.thesite.com/Archives/" . $url;
    my $out_fn = File::Spec->catfile($dir, sprintf("%06d", $id));
    open my $OUT, '>', $out_fn or die "Cannot open $out_fn: $!";

    print 'Downloading file ', $id, "\n";
    print 'url: ', $url, "\n";
    print 'save to: ', $out_fn, "\n";

    my $response = $ua->get($url);
    if ( $response->is_success ) {
        print $OUT $response->content;
        print 'Done!', "\n";
    } else {
        print $LOG $id, "\t", 'Error in ', $url, "\n";
        print 'Error!! Please check error_log for more information.', "\n";
    }
    close $OUT;
}

__END__
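The core of the script, turning one tab-separated index row into a download URL and a zero-padded local filename, can be sketched in Python. The column positions and the base URL follow the Perl code above (thesite.com is a placeholder):

```python
import os

BASE_URL = "http://www.thesite.com/Archives/"

def parse_index_row(row, out_dir):
    """Split one tab-separated index row into (download URL, local path),
    mirroring the field positions used in the Perl script: the relative
    URL is in column 4 and the numeric id in column 6."""
    records = row.rstrip("\n").split("\t")
    url = BASE_URL + records[3]
    out_fn = os.path.join(out_dir, "%06d" % int(records[5]))
    return url, out_fn

# a hypothetical index row with the URL and id in the expected columns
row = "a\tb\tc\tpage42.html\te\t17\n"
url, out_fn = parse_index_row(row, "corpus")
print(url)     # http://www.thesite.com/Archives/page42.html
print(out_fn)  # e.g. corpus/000017
```

The actual download step would then fetch each url (for example with urllib.request) and write the response body to out_fn, logging failures as the Perl script does.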

R: Reading External Data

R can import data in many other formats, such as SAS, Excel, and CSV. For example, a SAS file sample.sas7bdat can be imported as follows:

# install and load the *haven* package
install.packages("haven", repos='http://cran.rstudio.com/')
library(haven)

# convert SAS format to an R data frame
sample.df <- read_sas("sample.sas7bdat")

After importing, the data can be manipulated as an ordinary R data frame.

Python: Fisher's Exact Test

The Python module fisher implements Fisher's exact test. It can be used to select highly discriminative features and thereby improve the accuracy of machine-learning algorithms.

Fisher's Exact Test

Automated essay scoring models typically extract highly discriminative n-gram features to distinguish high-scoring from low-scoring essays:

                Good Essay   Bad Essay
feature A               20           2
not feature A            2          20

The contingency table above shows that essays containing feature A tend to score higher. Fisher's exact test quantifies how skewed a feature's distribution is, so that highly discriminative features can be selected by p-value:

# import the fisher module
from fisher import pvalue

# compute the two-tailed p value
fisher_val = pvalue(20, 2, 2, 20).two_tail

The corresponding Perl code, using Inline::Python:

use Inline Python => <<'END';
from fisher import pvalue

def get_pvalue(a, b, c, d):
    return pvalue(a, b, c, d).two_tail

END

my $fisher_val = get_pvalue(20, 2, 2, 20);
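As a cross-check on the fisher module, the two-tailed p-value can also be computed directly from the hypergeometric distribution with only the standard library, by summing the probabilities of all tables (with the same margins) that are no more likely than the observed one:

```python
from math import comb

def fisher_two_tail(a, b, c, d):
    """Two-tailed Fisher exact p-value for the 2x2 table [[a, b], [c, d]]."""
    n = a + b + c + d
    row1, col1 = a + b, a + c

    def p(x):
        # hypergeometric P(top-left cell == x) with fixed margins
        return comb(col1, x) * comb(n - col1, row1 - x) / comb(n, row1)

    p_obs = p(a)
    lo = max(0, row1 - (n - col1))   # smallest feasible top-left cell
    hi = min(row1, col1)             # largest feasible top-left cell
    # sum over all tables at most as probable as the observed one
    # (small tolerance guards against floating-point ties)
    return sum(p(x) for x in range(lo, hi + 1) if p(x) <= p_obs * (1 + 1e-9))

pval = fisher_two_tail(20, 2, 2, 20)
print(pval)  # a very small p value: feature A is highly discriminative
```

For the table above the p-value is far below 0.05, confirming that feature A's skewed distribution across good and bad essays is statistically significant.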

R: DataFrame Sorting and Filtering

Data

ID Type Term Length Freq Cover Fisher
1 p nn 1 50127 1546 1.000000e+00
2 l the 1 19479 1537 1.279193e-02
……
475290 t the musics 2 1 1 1.0000000

The feature file vocab contains n-grams built from word forms, lemmas, and part-of-speech tags, together with each n-gram's length, frequency, coverage, and Fisher's exact test p-value. To improve the accuracy of downstream algorithms, the feature dimensionality needs to be reduced by filtering. This is done as follows:

Sorting and Filtering

# read in the feature file
feature <- read.delim("vocab")

# reduce the number of features
refined.feature <- feature[which(feature$Freq > 10 & feature$Cover > 2 & feature$Fisher < 0.05),]

# rank the remaining features by p value
refined.feature.order <- refined.feature[order(refined.feature$Fisher),]

In practice, R's filtering facilities are used to experiment with different threshold combinations and validate the resulting model's accuracy.
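The same filter-then-rank step can be sketched in plain Python; the field names follow the vocab file above, and the small in-memory feature list stands in for reading the file:

```python
# a few rows from the vocab file, as dictionaries
features = [
    {"Term": "nn",         "Freq": 50127, "Cover": 1546, "Fisher": 1.0},
    {"Term": "the",        "Freq": 19479, "Cover": 1537, "Fisher": 1.279193e-02},
    {"Term": "the musics", "Freq": 1,     "Cover": 1,    "Fisher": 1.0},
]

# keep features that are frequent, well covered, and significant
refined = [f for f in features
           if f["Freq"] > 10 and f["Cover"] > 2 and f["Fisher"] < 0.05]

# rank the surviving features by p value, most discriminative first
refined.sort(key=lambda f: f["Fisher"])
print([f["Term"] for f in refined])  # ['the']
```

The list comprehension plays the role of R's which() subsetting, and sort(key=...) mirrors order() on the Fisher column.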

R: Stratified Sampling

Data

The task column of the data index file data.csv has three categories: task 1, task 2, and task 3. From each category, 80% of the records are randomly drawn as training data; the remaining 20% serve as test data.

ID task file structure content language score
1 TASK1 161102007511.txt 3.0 3.0 3.0 9.0
2 TASK1 161102008210.txt 3.0 3.5 3.0 9.5
……
1935 TASK3 161102007425.txt 4.0 3.5 3.5 11

Sampling

Use the strata() function from the R sampling package for stratified sampling, and save the resulting train and test indexes as train.csv and test.csv.

# clear memory
rm(list=ls(all=T))

# load sampling package
library(sampling)

# read in data index
task.idx <- read.csv("data.csv")

# select train and test data
n <- round(4/5*nrow(task.idx)/3)
sub_train <- strata(task.idx, stratanames=c("task"), size=rep(n, 3), method="srswor")
data_train <- task.idx[sub_train$ID_unit,]
data_test <- task.idx[-sub_train$ID_unit,]

# save the index of train and test data
write.csv(data_train, file="train.csv", quote=F, row.names=F)
write.csv(data_test, file="test.csv", quote=F, row.names=F)
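The stratified 80/20 split can also be sketched in plain Python: group the rows by stratum, shuffle each group, and take the first 80% of each for training (simple random sampling without replacement within each stratum, as method="srswor" does above). The toy rows below are illustrative:

```python
import random
from collections import defaultdict

def stratified_split(rows, key, train_frac=0.8, seed=1):
    """Split rows into (train, test), drawing train_frac of each
    stratum defined by row[key] without replacement."""
    by_stratum = defaultdict(list)
    for row in rows:
        by_stratum[row[key]].append(row)
    rng = random.Random(seed)
    train, test = [], []
    for stratum_rows in by_stratum.values():
        rng.shuffle(stratum_rows)
        cut = round(train_frac * len(stratum_rows))
        train.extend(stratum_rows[:cut])
        test.extend(stratum_rows[cut:])
    return train, test

# 30 toy rows, 10 per task, standing in for data.csv
rows = [{"ID": i, "task": "TASK%d" % (i % 3 + 1)} for i in range(30)]
train, test = stratified_split(rows, "task")
print(len(train), len(test))  # 24 6
```

Because the split is done within each task, both the train and test sets preserve the original task proportions.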