AI::NaiveBayes1 - Bayesian prediction of categories

use AI::NaiveBayes1;
my $nb = AI::NaiveBayes1->new;

$nb->add_instances(attributes=>{model=>'H',place=>'B'},label=>'repairs=Y',cases=>30);
$nb->add_instances(attributes=>{model=>'H',place=>'B'},label=>'repairs=N',cases=>10);
$nb->add_instances(attributes=>{model=>'H',place=>'N'},label=>'repairs=Y',cases=>18);
$nb->add_instances(attributes=>{model=>'H',place=>'N'},label=>'repairs=N',cases=>16);
$nb->add_instances(attributes=>{model=>'T',place=>'B'},label=>'repairs=Y',cases=>22);
$nb->add_instances(attributes=>{model=>'T',place=>'B'},label=>'repairs=N',cases=>14);
$nb->add_instances(attributes=>{model=>'T',place=>'N'},label=>'repairs=Y',cases=> 6);
$nb->add_instances(attributes=>{model=>'T',place=>'N'},label=>'repairs=N',cases=>84);

$nb->train;

print "Model:\n" . $nb->print_model;

# Find the results for an unseen instance
my $result = $nb->predict
   (attributes => {model=>'T', place=>'N'});

foreach my $k (keys(%{ $result })) {
   print "for label $k P = " . $result->{$k} . "\n";
}

# Export the model to a string
my $string = $nb->export_to_YAML();

# Create the same model from the string
my $nb1 = AI::NaiveBayes1->import_from_YAML($string);

# Write the model to a file (shorter than model->string->file)
$nb->export_to_YAML_file('t/tmp1');

# Read the model from a file (shorter than file->string->model)
my $nb2 = AI::NaiveBayes1->import_from_YAML_file('t/tmp1');

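See Examples for more examples.

Description

This module implements the classic Naive Bayes machine learning algorithm.

Constructor Methods

new(): Creates a new AI::NaiveBayes1 object and returns it.
set_real(list_of_attributes): Declares the listed attributes to be real-valued. During training, their conditional probabilities are modeled with Gaussian (normal) distributions.
import_from_YAML($string): Creates a new AI::NaiveBayes1 object from a string containing a YAML representation of a model. Requires the YAML module.
import_from_YAML_file($file_name): Creates a new AI::NaiveBayes1 object from a file containing a YAML representation of a model. Requires the YAML module.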

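Methods

add_instance(attributes=>HASH, label=>STRING|ARRAY): Adds a training instance to the categorizer.
add_instances(attributes=>HASH, label=>STRING|ARRAY, cases=>NUMBER): Adds a number of identical training instances to the categorizer; cases gives the count.
export_to_YAML(): Returns a YAML representation of the AI::NaiveBayes1 object. Requires the YAML module.
export_to_YAML_file($file_name): Writes a YAML representation of the AI::NaiveBayes1 object to a file. Requires the YAML module.
print_model(): Returns a human-readable representation of the model. The model is assumed to have been trained before this method is called.
train(): Calculates the probabilities needed for categorization with the predict() method.
predict(attributes => HASH): Predicts the label of an unseen instance. The attributes must be in the same format as those passed to add_instance(). predict() returns a hash reference whose keys are label names and whose values are the corresponding probabilities.
labels: Returns a list of all labels the object knows about (in no particular order), or the number of labels when called in scalar context.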
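Since predict() is described as returning a hash reference of label => probability, a common next step is to pick the most probable label. The sketch below is one way to do that; best_label is a hypothetical helper, not part of the module, and the sample hash only stands in for a real predict() result.

```perl
use strict;
use warnings;

# Hypothetical helper (not part of AI::NaiveBayes1): return the label with the
# highest probability from a hash reference shaped like a predict() result.
# Ties are broken alphabetically so the answer is deterministic.
sub best_label {
    my ($result) = @_;
    my ($best) = sort { $result->{$b} <=> $result->{$a} || $a cmp $b }
                 keys %$result;
    return $best;
}

# Stand-in for a predict() result; these numbers are illustrative only.
my $result = { 'repairs=Y' => 0.1, 'repairs=N' => 0.9 };
print "most probable label: ", best_label($result), "\n";
```

Sorting is more than a single pass needs, but for the handful of labels a classifier typically has it keeps the helper short.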

Theory

Bayes' theorem is a way of inverting a conditional probability. It states:

          P(y|x) P(x)
P(x|y) = -------------
             P(y)

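and so on.

This is a fairly standard algorithm, explained in many machine learning texts (for example, "Data Mining" by Witten and Frank).
The algorithm relies on estimating P(A|C), where A is an arbitrary attribute and C is the class attribute. If A is not real-valued, this conditional probability is estimated using a table of all possible values of A and C.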
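As a worked illustration of the table-based estimate for attributes that are not real-valued, the following sketch recomputes by hand the prediction for model=T, place=N from the case counts in the synopsis. It assumes plain relative frequencies with no smoothing; the variable names are illustrative and are not taken from the module, whose output may differ in detail.

```perl
use strict;
use warnings;

# Case counts from the synopsis: [model, place, label, cases].
my @counts = (
    [ 'H', 'B', 'repairs=Y', 30 ], [ 'H', 'B', 'repairs=N', 10 ],
    [ 'H', 'N', 'repairs=Y', 18 ], [ 'H', 'N', 'repairs=N', 16 ],
    [ 'T', 'B', 'repairs=Y', 22 ], [ 'T', 'B', 'repairs=N', 14 ],
    [ 'T', 'N', 'repairs=Y',  6 ], [ 'T', 'N', 'repairs=N', 84 ],
);

# Build the frequency tables: cases per label, and per (label, attribute value).
my (%label_total, %model_given, %place_given);
my $total = 0;
for my $row (@counts) {
    my ($model, $place, $label, $cases) = @$row;
    $total                       += $cases;
    $label_total{$label}         += $cases;
    $model_given{$label}{$model} += $cases;
    $place_given{$label}{$place} += $cases;
}

# Unnormalized score P(C) * P(model=T|C) * P(place=N|C) for each label C.
my %score;
for my $label (keys %label_total) {
    my $n = $label_total{$label};
    $score{$label} = ($n / $total)
                   * ($model_given{$label}{'T'} / $n)
                   * ($place_given{$label}{'N'} / $n);
}

# Normalize so that the scores sum to 1.
my $sum = 0;
$sum += $_ for values %score;
my %posterior;
$posterior{$_} = $score{$_} / $sum for keys %score;

printf "for label %s P = %.4f\n", $_, $posterior{$_} for sort keys %posterior;
```

With these counts the posterior comes out at roughly 0.10 for repairs=Y and 0.90 for repairs=N.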
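If A is real-valued, the distribution P(A|C) is modeled as a normal distribution for each value C = c. During training, the mean (m) and standard deviation (s) of A are therefore computed for each C = c. During classification, P(A = a | C = c) is estimated with the Gaussian density, computed as follows: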

                   1               (a-m)^2
P(A=a|C=c) = ------------ * exp( - ------- )
             sqrt(2*Pi)*s           2*s^2

In the code, this boils down to the following lines:

$scores{$label} *=
   0.398942280401433 / $m->{real_stat}{$att}{$label}{stddev} *
   exp( -0.5 *
      ( ( $newattrs->{$att} - $m->{real_stat}{$att}{$label}{mean} )
        / $m->{real_stat}{$att}{$label}{stddev}
      ) ** 2
   );

that is, with the constant 0.398942280401433 standing for 1/sqrt(2*Pi):

P(A = a | C = c) = 0.398942280401433 / s * exp( -0.5 * ( ( a-m ) / s ) ** 2 );
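The expression above can be wrapped in a small standalone function to see where the constant comes from. This is a minimal sketch using only core Perl; gaussian_density and its parameter names are hypothetical and do not appear in the module.

```perl
use strict;
use warnings;

# 1 / sqrt(2 * Pi); Pi is derived from atan2 so that no extra module is needed.
my $pi             = 4 * atan2(1, 1);
my $inv_sqrt_2pi   = 1 / sqrt(2 * $pi);

# Gaussian density of $value for a given mean and standard deviation.
sub gaussian_density {
    my ($value, $mean, $sd) = @_;
    return $inv_sqrt_2pi / $sd * exp(-0.5 * (($value - $mean) / $sd) ** 2);
}

# The hard-coded constant in the formula agrees with 1 / sqrt(2 * Pi).
printf "1/sqrt(2*Pi)      = %.15f\n", $inv_sqrt_2pi;
# At the mean of a standard normal distribution the density equals that constant.
printf "density(0; 0, 1)  = %.15f\n", gaussian_density(0, 0, 1);
```

Note that the result is a density, not a probability, so it may exceed 1 when the standard deviation is small; only the ratio between labels matters once the scores are normalized.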

Examples

An example with real-valued attributes modeled by Gaussian distributions (from Witten I. and Frank E., "Data Mining" (the WEKA book), page 86):

# @relation weather
# 
# @attribute outlook {sunny, overcast, rainy}
# @attribute temperature real
# @attribute humidity real
# @attribute windy {TRUE, FALSE}
# @attribute play {yes, no}
# 
# @data
# sunny,85,85,FALSE,no
# sunny,80,90,TRUE,no
# overcast,83,86,FALSE,yes
# rainy,70,96,FALSE,yes
# rainy,68,80,FALSE,yes
# rainy,65,70,TRUE,no
# overcast,64,65,TRUE,yes
# sunny,72,95,FALSE,no
# sunny,69,70,FALSE,yes
# rainy,75,80,FALSE,yes
# sunny,75,70,TRUE,yes
# overcast,72,90,TRUE,yes
# overcast,81,75,FALSE,yes
# rainy,71,91,TRUE,no

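use AI::NaiveBayes1;
use YAML;   # provides YAML::DumpFile, used below

# Start from a fresh model so that no earlier training data is mixed in
my $nb = AI::NaiveBayes1->new;
$nb->set_real('temperature', 'humidity');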

$nb->add_instance(attributes=>{outlook=>'sunny',temperature=>85,humidity=>85,windy=>'FALSE'},label=>'play=no');
$nb->add_instance(attributes=>{outlook=>'sunny',temperature=>80,humidity=>90,windy=>'TRUE'},label=>'play=no');
$nb->add_instance(attributes=>{outlook=>'overcast',temperature=>83,humidity=>86,windy=>'FALSE'},label=>'play=yes');
$nb->add_instance(attributes=>{outlook=>'rainy',temperature=>70,humidity=>96,windy=>'FALSE'},label=>'play=yes');
$nb->add_instance(attributes=>{outlook=>'rainy',temperature=>68,humidity=>80,windy=>'FALSE'},label=>'play=yes');
$nb->add_instance(attributes=>{outlook=>'rainy',temperature=>65,humidity=>70,windy=>'TRUE'},label=>'play=no');
$nb->add_instance(attributes=>{outlook=>'overcast',temperature=>64,humidity=>65,windy=>'TRUE'},label=>'play=yes');
$nb->add_instance(attributes=>{outlook=>'sunny',temperature=>72,humidity=>95,windy=>'FALSE'},label=>'play=no');
$nb->add_instance(attributes=>{outlook=>'sunny',temperature=>69,humidity=>70,windy=>'FALSE'},label=>'play=yes');
$nb->add_instance(attributes=>{outlook=>'rainy',temperature=>75,humidity=>80,windy=>'FALSE'},label=>'play=yes');
$nb->add_instance(attributes=>{outlook=>'sunny',temperature=>75,humidity=>70,windy=>'TRUE'},label=>'play=yes');
$nb->add_instance(attributes=>{outlook=>'overcast',temperature=>72,humidity=>90,windy=>'TRUE'},label=>'play=yes');
$nb->add_instance(attributes=>{outlook=>'overcast',temperature=>81,humidity=>75,windy=>'FALSE'},label=>'play=yes');
$nb->add_instance(attributes=>{outlook=>'rainy',temperature=>71,humidity=>91,windy=>'TRUE'},label=>'play=no');

$nb->train;

my $printedmodel =  "Model:\n" . $nb->print_model;
my $p = $nb->predict(attributes=>{outlook=>'sunny',temperature=>66,humidity=>90,windy=>'TRUE'});

# Save the prediction to a file, then check it against the expected probabilities
YAML::DumpFile('file', $p);
die unless abs($p->{'play=no'}  - 0.792) < 0.001;
die unless abs($p->{'play=yes'} - 0.208) < 0.001;
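The two probabilities checked above can be reproduced without the module by combining table estimates for outlook and windy with Gaussian densities for temperature and humidity. The sketch below does this in core Perl. It assumes the sample standard deviation (dividing by n - 1) and no smoothing; these are assumptions of the sketch, not statements about the module's internals, and all helper names are hypothetical.

```perl
use strict;
use warnings;

# Weather data from the example: outlook, temperature, humidity, windy, play.
my @rows = map { [ split /,/ ] } split /\n/, <<'END';
sunny,85,85,FALSE,no
sunny,80,90,TRUE,no
overcast,83,86,FALSE,yes
rainy,70,96,FALSE,yes
rainy,68,80,FALSE,yes
rainy,65,70,TRUE,no
overcast,64,65,TRUE,yes
sunny,72,95,FALSE,no
sunny,69,70,FALSE,yes
rainy,75,80,FALSE,yes
sunny,75,70,TRUE,yes
overcast,72,90,TRUE,yes
overcast,81,75,FALSE,yes
rainy,71,91,TRUE,no
END

my %col   = (outlook => 0, temperature => 1, humidity => 2, windy => 3);
my @real  = qw(temperature humidity);
my @categ = qw(outlook windy);
my %query = (outlook => 'sunny', temperature => 66, humidity => 90, windy => 'TRUE');

# Mean and sample standard deviation (n - 1 in the denominator; an assumption).
sub mean_sd {
    my @x = @_;
    my $n = scalar @x;
    my $mean = 0;
    $mean += $_ / $n for @x;
    my $ss = 0;
    $ss += ($_ - $mean) ** 2 for @x;
    return ($mean, sqrt($ss / ($n - 1)));
}

# Gaussian density, written as in the Theory section.
sub gaussian_density {
    my ($value, $mean, $sd) = @_;
    return 0.398942280401433 / $sd * exp(-0.5 * (($value - $mean) / $sd) ** 2);
}

# Group the training rows by class label (column 4).
my %by_label;
push @{ $by_label{ $_->[4] } }, $_ for @rows;

my %score;
for my $label (keys %by_label) {
    my @group = @{ $by_label{$label} };
    my $n     = scalar @group;
    my $score = $n / scalar(@rows);                    # prior P(C = c)
    for my $att (@categ) {                             # table estimate
        my $i       = $col{$att};
        my $matches = grep { $_->[$i] eq $query{$att} } @group;
        $score *= $matches / $n;
    }
    for my $att (@real) {                              # Gaussian estimate
        my $i = $col{$att};
        my ($mean, $sd) = mean_sd(map { $_->[$i] } @group);
        $score *= gaussian_density($query{$att}, $mean, $sd);
    }
    $score{$label} = $score;
}

# Normalize the scores into probabilities.
my $total = 0;
$total += $_ for values %score;
my %posterior;
$posterior{$_} = $score{$_} / $total for keys %score;

printf "play=%s P = %.3f\n", $_, $posterior{$_} for sort keys %posterior;
```

Under these assumptions the hand computation agrees with the 0.792 and 0.208 values tested in the example to within the same 0.001 tolerance.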

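History

Algorithm::NaiveBayes by Ken Williams was not what I needed, so I wrote this module. Algorithm::NaiveBayes is oriented towards text categorization and includes smoothing and log probabilities. This module is a generic, basic Naive Bayes algorithm.

Thanks

I would like to thank Yung-chung Lin (xern@ cpan. org) for his implementation of the Gaussian model for continuous variables, and the following people for bug reports, support, and comments (in chronological order):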
Tom Dyson
Dan Von Kohorn
CPAN-testers (jlatour, Jost.Krieger)
Craig Talbert
Andrew Brian Clegg

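Author

In 2004 Yung-chung Lin provided the implementation of the Gaussian model for continuous variables.
This script is provided "as is", without express or implied warranty.
This is free software; you may modify and/or redistribute it under the same terms as Perl itself.
The module is available from CPAN (http://search.cpan.org/~vlado) and from http://www.cs.dal.ca/~vlado/srcperl/. The site is updated frequently.

See Also

Algorithm::NaiveBayes, Perl.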

Updated : 2007/10/22