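Trying document classification with liblinear
Introduction
I was curious how much data formatting, scaling, and parameter search would change the results, so I tried document classification with liblinear.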
liblinear
- http://www.csie.ntu.edu.tw/~cjlin/liblinear/
- Using version 1.93
Data used
- http://www.csie.ntu.edu.tw/~cjlin/libsvmtools/datasets/multiclass.html
- Uses the "news20" dataset
- 20 classes
- Training: 15935 instances; test: 3993 instances
- Number of features: 62061 (training), 62060 (test)
- news20.bz2 and news20.t.bz2 appear to contain pairs of word ID and TF value
# Number of documents per class in the training data
$ cut -f1 -d" " news20 | sort | uniq -c | sort -k2 -n
797 1
780 2
797 3
796 4
796 5
797 6
799 7
798 8
799 9
792 10
798 11
797 12
798 13
798 14
799 15
799 16
800 17
799 18
799 19
797 20
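For reference, each line of these files follows the LIBSVM sparse format: a class label followed by `index:value` pairs. Below is a minimal Python sketch of reading one such line. The helper name `parse_line` and the sample line are made up for illustration and are not part of liblinear.

```python
def parse_line(line):
    """Split one LIBSVM-format line into (label, {feature_id: value})."""
    tokens = line.split()
    label = int(tokens[0])
    features = {}
    for token in tokens[1:]:
        index, value = token.split(":")
        features[int(index)] = float(value)
    return label, features


# Hypothetical sample line: class 3, word 123 occurs 3 times, word 456 once.
label, features = parse_line("3 123:3 456:1")
print(label, features)
```

The later sketches in this post work on the same `{feature_id: value}` dictionary representation.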
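Document classification
I try the following.
- Use the data as is (TF values)
- Convert the features to binary values
- Convert the features to TF-IDF values
- Normalize (instance-wise normalization)
- Align the scale (feature-wise normalization)
- Grid search for the best parameters
1. Use the data as is (TF values)
Without any preprocessing, classify using raw word frequencies such as "123:3" as features.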
$ train -s 0 news20
$ predict news20.t news20.model result
Accuracy = 84.2224% (3363/3993)
# Training result with the default settings
$ train -s 1 news20
$ predict news20.t news20.model result
Accuracy = 82.2189% (3283/3993)
$ train -s 2 news20
$ predict news20.t news20.model result
Accuracy = 82.6947% (3302/3993)
$ train -s 3 news20
$ predict news20.t news20.model result
Accuracy = 82.1187% (3279/3993)
$ train -s 4 news20
$ predict news20.t news20.model result
Accuracy = 80.6411% (3220/3993)
$ train -s 5 news20
$ predict news20.t news20.model result
Accuracy = 80.0902% (3198/3993)
$ train -s 6 news20
$ predict news20.t news20.model result
Accuracy = 81.4676% (3253/3993)
$ train -s 7 news20
$ predict news20.t news20.model result
Accuracy = 84.2725% (3365/3993)
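To make the eight runs easier to compare, the following Python sketch ranks the accuracies above by solver type. The solver descriptions are written as I understand them from the `-s` option list in the liblinear 1.93 README (where `-s 1` is the default); check the `train` usage output for the authoritative wording. The accuracy values are copied from the runs above.

```python
# Solver names for the -s option, as recalled from the liblinear 1.93 README
# (an assumption to verify against the usage message printed by `train`).
SOLVERS = {
    0: "L2-regularized logistic regression (primal)",
    1: "L2-regularized L2-loss SVC (dual), default",
    2: "L2-regularized L2-loss SVC (primal)",
    3: "L2-regularized L1-loss SVC (dual)",
    4: "SVC by Crammer and Singer",
    5: "L1-regularized L2-loss SVC",
    6: "L1-regularized logistic regression",
    7: "L2-regularized logistic regression (dual)",
}

# Test accuracy (%) on raw TF features, copied from the runs above.
ACCURACY = {
    0: 84.2224, 1: 82.2189, 2: 82.6947, 3: 82.1187,
    4: 80.6411, 5: 80.0902, 6: 81.4676, 7: 84.2725,
}

# Solver types ordered from best to worst accuracy.
ranking = sorted(ACCURACY, key=ACCURACY.get, reverse=True)
for s in ranking:
    print(f"-s {s}  {ACCURACY[s]:.4f}%  {SOLVERS[s]}")
```

On raw TF values, the two logistic regression solvers (`-s 7` and `-s 0`) come out on top.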
2. Convert the features to binary values
Instead of TF, use features that are 0 or 1 depending on whether the word occurred.
An entry such as "123:3" becomes "123:1".
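The conversion script itself is not shown in this post. A minimal Python sketch of the idea, assuming LIBSVM-format input lines, could look like this (`binarize_line` and `convert_file` are hypothetical helper names):

```python
def binarize_line(line):
    """Replace every feature value in a LIBSVM-format line with 1."""
    tokens = line.split()
    out = [tokens[0]]  # keep the class label unchanged
    for pair in tokens[1:]:
        index, _value = pair.split(":")
        out.append(f"{index}:1")
    return " ".join(out)


def convert_file(src, dst):
    """Write a binarized copy of src to dst, one document per line."""
    with open(src) as fin, open(dst, "w") as fout:
        for line in fin:
            fout.write(binarize_line(line) + "\n")


print(binarize_line("3 123:3 456:1"))  # → 3 123:1 456:1
```

For example, `convert_file("news20", "news20.binary")` would produce a file with the name used in the commands below.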
$ train -s 0 news20.binary
$ predict news20.t.binary news20.binary.model result
Accuracy = 84.4979% (3374/3993)
$ train -s 1 news20.binary
$ predict news20.t.binary news20.binary.model result
Accuracy = 82.9952% (3314/3993)
$ train -s 2 news20.binary
$ predict news20.t.binary news20.binary.model result
Accuracy = 83.421% (3331/3993)
$ train -s 3 news20.binary
$ predict news20.t.binary news20.binary.model result
Accuracy = 82.5445% (3296/3993)
$ train -s 4 news20.binary
$ predict news20.t.binary news20.binary.model result
Accuracy = 81.0919% (3238/3993)
$ train -s 5 news20.binary
$ predict news20.t.binary news20.binary.model result
Accuracy = 79.99% (3194/3993)
$ train -s 6 news20.binary
$ predict news20.t.binary news20.binary.model result
Accuracy = 80.8916% (3230/3993)
$ train -s 7 news20.binary
$ predict news20.t.binary news20.binary.model result
Accuracy = 84.573% (3377/3993)
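3. Convert the features to TF-IDF values
Using DF values computed on the training data, convert entries such as "123:3" into TF-IDF values such as "123:0.24344".
TF = (frequency of the word in the document) / (total number of word occurrences in the document)
IDF = log( (number of documents) / (number of documents containing the word + 1) )
TFIDF = TF * IDF
Note: the +1 in the IDF is a computational convenience, since there appear to be unknown words (it avoids division by zero when DF is 0).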
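The formulas above can be sketched in Python as follows. This is an illustration under assumptions, not the script used for the experiments: the natural logarithm is assumed (the base is not stated), documents are `{word_id: tf}` dictionaries, and the function names and toy data are made up.

```python
import math
from collections import Counter


def document_frequency(train_docs):
    """Count, for each word ID, how many training documents contain it."""
    df = Counter()
    for features in train_docs:
        df.update(features.keys())
    return df


def tfidf(features, df, n_docs):
    """Apply TF * IDF with IDF = log(n_docs / (df + 1)); natural log assumed."""
    total = sum(features.values())  # total word occurrences in this document
    return {
        word: (tf / total) * math.log(n_docs / (df.get(word, 0) + 1))
        for word, tf in features.items()
    }


# Toy training set (hypothetical): three documents over word IDs 1..3.
train_docs = [{1: 2, 2: 2}, {1: 1}, {3: 4}]
df = document_frequency(train_docs)

# DF always comes from the training data, also when converting test documents.
print(tfidf({1: 2, 2: 2}, df, len(train_docs)))
print(tfidf({9: 1}, df, len(train_docs)))  # word 9 is unseen in training
```

Because of the +1, a word that never appears in the training data still gets a finite IDF of log(N/1), and a word that appears in N-1 of the N training documents gets an IDF of exactly 0.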
$ train -s 0 news20.tfidf
$ predict news20.t.tfidf news20.tfidf.model result
Accuracy = 85.3243% (3407/3993)
$ train -s 1 news20.tfidf
$ predict news20.t.tfidf news20.tfidf.model result
Accuracy = 86.9021% (3470/3993)
$ train -s 2 news20.tfidf
$ predict news20.t.tfidf news20.tfidf.model result
Accuracy = 86.9021% (3470/3993)
$ train -s 3 news20.tfidf
$ predict news20.t.tfidf news20.tfidf.model result
Accuracy = 86.5765% (3457/3993)
$ train -s 4 news20.tfidf
$ predict news20.t.tfidf news20.tfidf.model result
Accuracy = 86.7017% (3462/3993)
$ train -s 5 news20.tfidf
$ predict news20.t.tfidf news20.tfidf.model result
Accuracy = 83.3208% (3327/3993)
$ train -s 6 news20.tfidf
$ predict news20.t.tfidf news20.tfidf.model result
Accuracy = 77.6359% (3100/3993)
$ train -s 7 news20.tfidf
$ predict news20.t.tfidf news20.tfidf.model result
Accuracy = 85.3994% (3410/3993)
4. Normalize (instance-wise normalization)
Convert each document (each line) into a unit vector by dividing it by the vector's magnitude.
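As a minimal sketch of this step, assuming the magnitude means the Euclidean (L2) norm and that a document is a `{word_id: value}` dictionary (the function name is hypothetical):

```python
import math


def l2_normalize(features):
    """Scale a sparse vector to unit Euclidean length."""
    norm = math.sqrt(sum(v * v for v in features.values()))
    if norm == 0.0:
        return dict(features)  # empty or all-zero document: leave unchanged
    return {index: v / norm for index, v in features.items()}


unit = l2_normalize({123: 3.0, 456: 4.0})
print(unit)  # → {123: 0.6, 456: 0.8}
```

The same function applies unchanged to TF, binary, and TF-IDF vectors, which gives the three variants below.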
【4-1.TF.normalize】
$ train -s 0 news20.normalize
$ predict news20.t.normalize news20.normalize.model result
Accuracy = 83.2206% (3323/3993)
$ train -s 1 news20.normalize
$ predict news20.t.normalize news20.normalize.model result
Accuracy = 85.6248% (3419/3993)
$ train -s 2 news20.normalize
$ predict news20.t.normalize news20.normalize.model result
Accuracy = 85.5497% (3416/3993)
$ train -s 3 news20.normalize
$ predict news20.t.normalize news20.normalize.model result
Accuracy = 85.4495% (3412/3993)
$ train -s 4 news20.normalize
$ predict news20.t.normalize news20.normalize.model result
Accuracy = 85.4245% (3411/3993)
$ train -s 5 news20.normalize
$ predict news20.t.normalize news20.normalize.model result
Accuracy = 82.4192% (3291/3993)
$ train -s 6 news20.normalize
$ predict news20.t.normalize news20.normalize.model result
Accuracy = 75.8578% (3029/3993)
$ train -s 7 news20.normalize
$ predict news20.t.normalize news20.normalize.model result
Accuracy = 83.3208% (3327/3993)
【4-2.binary.normalize】
$ train -s 0 news20.binary.normalize
$ predict news20.t.binary.normalize news20.binary.normalize.model result
Accuracy = 82.9201% (3311/3993)
$ train -s 1 news20.binary.normalize
$ predict news20.t.binary.normalize news20.binary.normalize.model result
Accuracy = 85.3744% (3409/3993)
$ train -s 2 news20.binary.normalize
$ predict news20.t.binary.normalize news20.binary.normalize.model result
Accuracy = 85.3744% (3409/3993)
$ train -s 3 news20.binary.normalize
$ predict news20.t.binary.normalize news20.binary.normalize.model result
Accuracy = 85.124% (3399/3993)
$ train -s 4 news20.binary.normalize
$ predict news20.t.binary.normalize news20.binary.normalize.model result
Accuracy = 85.2492% (3404/3993)
$ train -s 5 news20.binary.normalize
$ predict news20.t.binary.normalize news20.binary.normalize.model result
Accuracy = 82.0436% (3276/3993)
$ train -s 6 news20.binary.normalize
$ predict news20.t.binary.normalize news20.binary.normalize.model result
Accuracy = 75.6824% (3022/3993)
$ train -s 7 news20.binary.normalize
$ predict news20.t.binary.normalize news20.binary.normalize.model result
Accuracy = 82.9201% (3311/3993)
【4-3.TFIDF.normalize】
$ train -s 0 news20.tfidf.normalize
$ predict news20.t.tfidf.normalize news20.tfidf.normalize.model result
Accuracy = 85.3243% (3407/3993)
$ train -s 1 news20.tfidf.normalize
$ predict news20.t.tfidf.normalize news20.tfidf.normalize.model result
Accuracy = 87.0523% (3476/3993)
$ train -s 2 news20.tfidf.normalize
$ predict news20.t.tfidf.normalize news20.tfidf.normalize.model result
Accuracy = 87.0523% (3476/3993)
$ train -s 3 news20.tfidf.normalize
$ predict news20.t.tfidf.normalize news20.tfidf.normalize.model result
Accuracy = 87.1275% (3479/3993)
$ train -s 4 news20.tfidf.normalize
$ predict news20.t.tfidf.normalize news20.tfidf.normalize.model result
Accuracy = 86.9522% (3472/3993)
$ train -s 5 news20.tfidf.normalize
$ predict news20.t.tfidf.normalize news20.tfidf.normalize.model result
Accuracy = 83.2206% (3323/3993)
$ train -s 6 news20.tfidf.normalize
$ predict news20.t.tfidf.normalize news20.tfidf.normalize.model result
Accuracy = 77.0348% (3076/3993)
$ train -s 7 news20.tfidf.normalize
$ predict news20.t.tfidf.normalize news20.tfidf.normalize.model result
Accuracy = 85.3243% (3407/3993)
5. Align the scale (feature-wise normalization)
Use svm-scale (bundled with libsvm-3.17) to bring the features onto a common scale.
$ svm-scale -l 0 -u 1 -s scale_params news20 > news20.scale
$ svm-scale -r scale_params news20.t > news20.t.scale
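Conceptually, this fits a per-feature range on the training data and reuses it on the test data. The Python sketch below illustrates that idea as plain min-max scaling to [0, 1]. It is a simplification, not a reimplementation of svm-scale: it assumes an absent feature counts as 0 when computing the range, and it simply drops features that were not seen in training. All names and the toy data are hypothetical.

```python
def fit_ranges(train_docs):
    """Per-feature (min, max) on training data; absent features count as 0."""
    ranges = {}
    for features in train_docs:
        for index, value in features.items():
            lo, hi = ranges.get(index, (0.0, 0.0))
            ranges[index] = (min(lo, value), max(hi, value))
    return ranges


def scale(features, ranges, lower=0.0, upper=1.0):
    """Map each value into [lower, upper] using the training ranges."""
    scaled = {}
    for index, value in features.items():
        if index not in ranges:
            continue  # unseen in training: dropped in this sketch
        lo, hi = ranges[index]
        if hi == lo:
            continue  # constant feature carries no information
        scaled[index] = lower + (upper - lower) * (value - lo) / (hi - lo)
    return scaled


train_docs = [{1: 2.0, 2: 1.0}, {1: 4.0}]
ranges = fit_ranges(train_docs)           # fitted on training data only
print(scale({1: 2.0, 2: 1.0}, ranges))    # a training document
print(scale({1: 8.0, 3: 1.0}, ranges))    # a test document
```

Note that test values can fall outside [0, 1] when a feature is larger in the test data than anywhere in the training data, because the range is never refitted on the test set.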
【5-1.TF.scale】
$ train -s 0 news20.scale
$ predict news20.t.scale news20.scale.model result
Accuracy = 83.5963% (3338/3993)
$ train -s 1 news20.scale
$ predict news20.t.scale news20.scale.model result
Accuracy = 81.718% (3263/3993)
$ train -s 2 news20.scale
$ predict news20.t.scale news20.scale.model result
Accuracy = 81.4425% (3252/3993)
$ train -s 3 news20.scale
$ predict news20.t.scale news20.scale.model result
Accuracy = 81.4676% (3253/3993)
$ train -s 4 news20.scale
$ predict news20.t.scale news20.scale.model result
Accuracy = 79.1635% (3161/3993)
$ train -s 5 news20.scale
$ predict news20.t.scale news20.scale.model result
Accuracy = 79.6143% (3179/3993)
$ train -s 6 news20.scale
$ predict news20.t.scale news20.scale.model result
Accuracy = 76.8345% (3068/3993)
$ train -s 7 news20.scale
$ predict news20.t.scale news20.scale.model result
Accuracy = 83.6213% (3339/3993)
【5-2.binary.scale】
Omitted, since the binary features are already scaled.
【5-3.TFIDF.scale】
$ train -s 0 news20.tfidf.scale
$ predict news20.t.tfidf.scale news20.tfidf.scale.model result
Accuracy = 83.972% (3353/3993)
$ train -s 1 news20.tfidf.scale
$ predict news20.t.tfidf.scale news20.tfidf.scale.model result
Accuracy = 83.0954% (3318/3993)
$ train -s 2 news20.tfidf.scale
$ predict news20.t.tfidf.scale news20.tfidf.scale.model result
Accuracy = 83.1205% (3319/3993)
$ train -s 3 news20.tfidf.scale
$ predict news20.t.tfidf.scale news20.tfidf.scale.model result
Accuracy = 82.7448% (3304/3993)
$ train -s 4 news20.tfidf.scale
$ predict news20.t.tfidf.scale news20.tfidf.scale.model result
Accuracy = 83.0453% (3316/3993)
$ train -s 5 news20.tfidf.scale
$ predict news20.t.tfidf.scale news20.tfidf.scale.model result
Accuracy = 82.1187% (3279/3993)
$ train -s 6 news20.tfidf.scale
$ predict news20.t.tfidf.scale news20.tfidf.scale.model result
Accuracy = 81.0168% (3235/3993)
$ train -s 7 news20.tfidf.scale
$ predict news20.t.tfidf.scale news20.tfidf.scale.model result
Accuracy = 83.972% (3353/3993)
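6. Grid search for the best parameters
Vary the parameter C and, for each solver, use the value that gives the best 5-fold cross-validation accuracy on the training data.
The values tried for C are 0.125, 0.25, 0.5, 1, 2, 4, and 8.
For data, I use the instance-wise normalized binary and TF-IDF versions.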
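The search procedure can be sketched in Python as below. This is an illustration, not the script used here: it assumes liblinear's `train` is on the PATH, relies on its `-v 5` cross-validation mode, and assumes the result is reported on a line of the form `Cross Validation Accuracy = ...%`. The function names are hypothetical.

```python
import re
import subprocess

C_VALUES = [0.125, 0.25, 0.5, 1, 2, 4, 8]


def parse_cv_accuracy(output):
    """Extract the accuracy (%) from the output of `train -v`."""
    match = re.search(r"Cross Validation Accuracy = ([0-9.]+)%", output)
    if match is None:
        raise ValueError("no cross validation accuracy found in output")
    return float(match.group(1))


def cv_accuracy(path, solver, c):
    """Run 5-fold cross validation with liblinear's train (must be on PATH)."""
    result = subprocess.run(
        ["train", "-v", "5", "-s", str(solver), "-c", str(c), path],
        capture_output=True, text=True, check=True,
    )
    return parse_cv_accuracy(result.stdout)


def best_c(path, solver, evaluate=cv_accuracy):
    """Return (best C, {C: accuracy}) over C_VALUES for one solver type."""
    scores = {c: evaluate(path, solver, c) for c in C_VALUES}
    return max(scores, key=scores.get), scores
```

For example, `best_c("news20.tfidf.normalize", 2)` would run seven cross-validation passes for `-s 2`. The chosen C is then used to retrain on the full training set and evaluate once on the test set, as in the runs below.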
【6-1.binary.normalize】
$ train -c 8 -s 0 news20.binary.normalize
$ predict news20.t.binary.normalize news20.binary.normalize.model result
Accuracy = 84.9236% (3391/3993)
$ train -c 1 -s 1 news20.binary.normalize
$ predict news20.t.binary.normalize news20.binary.normalize.model result
Accuracy = 85.3744% (3409/3993)
$ train -c 1 -s 2 news20.binary.normalize
$ predict news20.t.binary.normalize news20.binary.normalize.model result
Accuracy = 85.3744% (3409/3993)
$ train -c 2 -s 3 news20.binary.normalize
$ predict news20.t.binary.normalize news20.binary.normalize.model result
Accuracy = 85.4245% (3411/3993)
$ train -c 0.5 -s 4 news20.binary.normalize
$ predict news20.t.binary.normalize news20.binary.normalize.model result
Accuracy = 85.4245% (3411/3993)
$ train -c 2 -s 5 news20.binary.normalize
$ predict news20.t.binary.normalize news20.binary.normalize.model result
Accuracy = 82.4443% (3292/3993)
$ train -c 8 -s 6 news20.binary.normalize
$ predict news20.t.binary.normalize news20.binary.normalize.model result
Accuracy = 82.6446% (3300/3993)
$ train -c 8 -s 7 news20.binary.normalize
$ predict news20.t.binary.normalize news20.binary.normalize.model result
Accuracy = 84.9487% (3392/3993)
【6-2.tfidf.normalize】
$ train -c 8 -s 0 news20.tfidf.normalize
$ predict news20.t.tfidf.normalize news20.tfidf.normalize.model result
Accuracy = 86.7769% (3465/3993)
$ train -c 0.5 -s 1 news20.tfidf.normalize
$ predict news20.t.tfidf.normalize news20.tfidf.normalize.model result
Accuracy = 87.2527% (3484/3993)
$ train -c 0.5 -s 2 news20.tfidf.normalize
$ predict news20.t.tfidf.normalize news20.tfidf.normalize.model result
Accuracy = 87.2777% (3485/3993)
$ train -c 1 -s 3 news20.tfidf.normalize
$ predict news20.t.tfidf.normalize news20.tfidf.normalize.model result
Accuracy = 87.1275% (3479/3993)
$ train -c 0.5 -s 4 news20.tfidf.normalize
$ predict news20.t.tfidf.normalize news20.tfidf.normalize.model result
Accuracy = 87.0273% (3475/3993)
$ train -c 2 -s 5 news20.tfidf.normalize
$ predict news20.t.tfidf.normalize news20.tfidf.normalize.model result
Accuracy = 83.0954% (3318/3993)
$ train -c 8 -s 6 news20.tfidf.normalize
$ predict news20.t.tfidf.normalize news20.tfidf.normalize.model result
Accuracy = 82.9952% (3314/3993)
$ train -c 8 -s 7 news20.tfidf.normalize
$ predict news20.t.tfidf.normalize news20.tfidf.normalize.model result
Accuracy = 86.7518% (3464/3993)
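With the raw data and the classifier's default settings, accuracy was 82.2189%.
With a better feature representation, normalization, and parameter selection, it rose to 87.2777%, a gain of about 5 points.
References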
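As a quick check of that comparison, the two accuracies can be recomputed in Python from the correct-prediction counts reported above (3283 for the default run on raw TF, 3485 for `-c 0.5 -s 2` on normalized TF-IDF):

```python
TOTAL = 3993  # number of test documents

baseline = 100 * 3283 / TOTAL  # default settings on raw TF features
best = 100 * 3485 / TOTAL      # -c 0.5 -s 2 on normalized TF-IDF features
gain = best - baseline

print(f"{baseline:.4f}% -> {best:.4f}% (+{gain:.2f} points)")
# → 82.2189% -> 87.2777% (+5.06 points)
```

In absolute terms the difference is 202 more test documents classified correctly.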
- http://www.csie.ntu.edu.tw/~cjlin/papers/liblinear.pdf
- A Practical Guide to LIBLINEAR (Appendix L.)
- http://d.hatena.ne.jp/sleepy_yoshi/20120624/
- http://www.csie.ntu.edu.tw/~cjlin/papers/guide/guide.pdf