Feature Hashingを試す - Negative/Positive Thinking

はじめに

Feature Hashingについて気になったことがあったので試してみた。

Feature Hashingとは

Hashing trick
ハッシュ関数を使って、素性群をM次元ベクトルにする
- 一種の次元圧縮
Bag of wordsなどの素性をそのままハッシュ値にすることで、素性とIDのペアの辞書などが必要なくなる
- スパムフィルタでは、新語やミススペルでフィルタ回避されてしまうと対応すべき語が増え続ける(辞書が大きくなる)問題などに使える

ベクトルの作り方

いくつか提案されているが、各素性のhash値を計算してmod Mをとったインデクスの所に入れるものとしては主に2つがあるようなので、メモしておく。

Shiらの方法

Shiら(2009)
値をunsigned sumする
φ_i (x) = Σ_{ j:h(j)=i } x_j
- h : ハッシュ関数

Weinbergerらの方法

Weinbergerら(2009)
値をsigned sumする
φ_i (x) = Σ_{ j:h(j)=i } ξ(j) * x_j
- ξ : 1か-1を返すハッシュ関数
こちらの場合は、いくつか条件下で、ハッシュ化したφ(x)とφ(x')の内積の値の期待値は、もとのxとx'の内積の値なる(unbiased)

ちょっとした実験

ハッシュ関数による違いみたいなものがどのぐらいあるのか気になったので、試してみる。
liblinear形式とかの「a:b」のaの数字部分を文字列として扱ってFeatureHashingするみたいなことをしても別に問題ないはずだけど、本当に問題ないのか気になったので、一緒に試してみる。

news20.binaryのデータを使う。
http://www.csie.ntu.edu.tw/~cjlin/libsvmtools/datasets/binary.html#news20.binary

ハッシュ関数は書きやすそうなものとして、naiveな感じのものとFNV-1とmurmurhashで試してみる(深い意味はなし)。
Weinbergerらの方法でベクトルを作る。
Mの値は、2のベキ乗を使った方がよいかもしれないが、50万、10万、5万、1万、5000で見てみる。
liblinear-1.94を使用、オプションは「-v 2」以外全部デフォルト(C=1,L2正則化L2Loss SVC dual)。

$ ./a.out 10000 < news20.binary > news20.binary.10000
$ train -v 2 news20.binary.10000

コード

#include <iostream>
#include <sstream>
#include <string>
#include <vector>
#include <map>
#include <cmath>

class FeatureHashing {
  std::map<int,std::map<std::string,int> > tbl;

  unsigned int hash(const std::string& str) {
    return hash_murmur(str);
    //return hash_FNV1(str);
    //return hash_naive(str);
  }

  //murmurhash
  // http://en.wikipedia.org/wiki/MurmurHash
  unsigned int hash_murmur(const std::string& str) {
    const char *key = str.c_str();
    unsigned int len = str.length();
     static const unsigned int c1 = 0xcc9e2d51;
     static const unsigned int c2 = 0x1b873593;
     static const unsigned int r1 = 15;
     static const unsigned int r2 = 13;
     static const unsigned int m = 5;
     static const unsigned int n = 0xe6546b64;
   
     unsigned int hash = 123456789U;
   
     const int nblocks = len / 4;
     const unsigned int *blocks = (const unsigned int *) key;
     int i;
     for (i = 0; i < nblocks; i++) {
      unsigned int k = blocks[i];
      k *= c1;
      k = (k << r1) | (k >> (32 - r1));
      k *= c2;
     
      hash ^= k;
      hash = ((hash << r2) | (hash >> (32 - r2))) * m + n;
     }
   
     const unsigned char *tail = (const unsigned char *) (key + nblocks * 4);
     unsigned int k1 = 0;
   
     switch (len & 3) {
     case 3:
      k1 ^= tail[2] << 16;
     case 2:
      k1 ^= tail[1] << 8;
     case 1:
      k1 ^= tail[0];
     
      k1 *= c1;
      k1 = (k1 << r1) | (k1 >> (32 - r1));
      k1 *= c2;
      hash ^= k1;
     }
   
     hash ^= len;
     hash ^= (hash >> 16);
     hash *= 0x85ebca6b;
     hash ^= (hash >> 13);
     hash *= 0xc2b2ae35;
     hash ^= (hash >> 16);
   
     return hash;
  }
 
  //FNV-1
  // http://en.wikipedia.org/wiki/Fowler%E2%80%93Noll%E2%80%93Vo_hash_function
  unsigned int hash_FNV1(const std::string& str){
    unsigned int hash;

    hash = 2166136261U;
    for(int i=0; i<str.length(); i++){
      hash = (16777619U * hash) ^ (str[i]);
    }
   
    return hash;
  }
 
  //naive hash
  unsigned int hash_naive(const std::string& str){
    unsigned int h = 0;
    for(int i=0; i<str.length(); i++){
      h = h * 31 + str[i];
    }
    return h;
  }


  //binary hash(string -> {1,-1})
  int binary_hash(const std::string& str){
    unsigned int h = 0;
    for(int i=0; i<str.length(); i++){
      h = h * 37 + str[i];
    }
    return (h % 2 == 1) ? 1 : -1;
  }

public:
  FeatureHashing() { }

  std::vector<double> vectorize(int M, const std::map<std::string,double>& fts){
    std::vector<double> x(M,0);
    for(std::map<std::string,double>::const_iterator i=fts.begin(); i!=fts.end(); ++i){
      unsigned int h = hash(i->first);
      int xi = binary_hash(i->first);
      x[ h % M ] += xi * i->second;
      tbl[ h % M ][i->first] = 1;
    }
    return x;
  }

  void dump(int M){
    int n = 0;
    int zero = 0;
    int sum = 0;
    for(std::map<int, std::map<std::string,int> >::iterator i=tbl.begin(); i!=tbl.end(); ++i){
      //std::cout << (i->second).size() << std::endl;
      n++;
      sum += (i->second).size();
      if((i->second).size() == 0) zero++;
    }
    std::cerr << "N : " << n << std::endl;
    std::cerr << "Zero : " << zero << std::endl;
    std::cerr << "Ave. of hash collision : " << (sum/(double)M) << std::endl;
  }
};


std::vector<std::string> split(const std::string &str, const char c){
  std::vector<std::string> ret;
  std::string tmp = "";
  for(int i=0; i<str.length(); i++){
    if(tmp != "" && str[i] == c){
      ret.push_back(tmp);
      tmp = "";
    }
    else if(str[i] != c){
      tmp += str[i];
    }
  }
  if(tmp != "") ret.push_back(tmp);
  return ret;
}


int main(int argc, char** argv){
  int M = 0;
  if(argc == 2){
    std::stringstream ss(argv[1]);
    ss >> M;
  }
  if(M == 0) return 1;
  std::cerr << "M = " << M << std::endl;



  FeatureHashing fh;
  std::string line;
  int cnt = 0;
  while(std::getline(std::cin, line)){
    std::cerr << "\r" << cnt++;
    std::vector<std::string> sp = split(line, ' ');

    std::map<std::string,double> mp;
    for(int i=1; i<sp.size(); i++){
      double val;
      std::vector<std::string> fsp = split(sp[i], ':');
      std::stringstream ss(fsp[1]);
      ss >> val;
      mp[fsp[0]] = val;
    }

    std::vector<double> fvec = fh.vectorize(M, mp);

    std::cout << sp[0];
    for(int i=0; i<fvec.size(); i++){
      if(fabs(fvec[i])<1e-8) continue;
      std::cout << " " << i+1 << ":" << fvec[i];
    }
    std::cout << std::endl;
  }

  std::cerr << std::endl;
  fh.dump(M);

  return 0;
}

結果

表の一番上は何もしない場合の結果(ベースライン)。

naiveな感じの

Mの大きさ	使ったハッシュ値の種類数	2-fold cv Accuracy
(1,355,191)	-	(95.5791%)
500,000	466,615	95.094%
100,000	100,000	94.5439%
50,000	50,000	93.8188%
10,000	10,000	91.3983%
5,000	5,000	88.7678%

FNV-1

Mの大きさ	使ったハッシュ値の種類数	2-fold cv Accuracy
(1,355,191)	-	(95.5791%)
500,000	467,764	95.4441%
100,000	100,000	94.979%
50,000	50,000	94.3989%
10,000	10,000	91.4183%
5,000	5,000	88.7828%

murmurhash

Mの大きさ	使ったハッシュ値の種類数	2-fold cv Accuracy
(1,355,191)	-	(95.5791%)
500,000	467,088	95.5391%
100,000	100,000	95.059%
50,000	50,000	94.7139%
10,000	10,000	91.6183%
5,000	5,000	88.6777%

(M=50万のとき、使われないハッシュ値ができてる)
次元が1/10ぐらいになってもAccuracyがほとんど減ってない。
ハッシュ関数がかわると大きく変わるかなと思ったけど、そんなに大きな変化はしないみたい。