Lucene 3.1 Notes, Part 7 – Searching Japanese Documents with Morphological Analysis

Basics, part 7: searching Japanese documents using morphological analysis

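In Basics part 6 we searched using CJKAnalyzer. CJKAnalyzer is a so-called bi-gram analyzer: it splits text into overlapping two-character tokens, regardless of context or part of speech, and indexes those tokens. It needs no dictionary and analysis is fast, but the drawback is that the index grows large.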
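To make the bi-gram idea concrete, here is a minimal sketch in plain Java. It is not CJKAnalyzer's actual implementation (the real analyzer also handles non-CJK text and character types); the class name `BigramSketch` and the method `bigrams` are made up for illustration. It only shows how a sentence turns into overlapping two-character tokens.

```java
import java.util.ArrayList;
import java.util.List;

public class BigramSketch {
    // Hypothetical helper: split text into overlapping two-character tokens,
    // ignoring word boundaries and parts of speech.
    // Simplification: works on UTF-16 chars, so it assumes no surrogate pairs.
    public static List<String> bigrams(String text) {
        List<String> tokens = new ArrayList<>();
        for (int i = 0; i + 2 <= text.length(); i++) {
            tokens.add(text.substring(i, i + 2));
        }
        return tokens;
    }

    public static void main(String[] args) {
        // The sample sentence used in the test case later in this post.
        System.out.println(bigrams("貴社の記者が汽車で帰社した"));
    }
}
```

The 13-character sentence yields 12 tokens, including ones such as 「社の」 or 「者が」 that are not words at all. Those extra tokens are where the additional index size comes from.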

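There is a library called Sen that uses a dictionary to analyze parts of speech, and a JapaneseAnalyzer that makes it usable from Lucene. However, perhaps because Sen has not been maintained for a while, or because Sen is not in the Maven Central Repository, JapaneseAnalyzer does not seem to be included in Lucene 3.1.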

So I looked around a bit and found a project that ports JapaneseAnalyzer to Lucene 3.0.
・manabu/Lucene-Japanese-Analyzer – GitHub
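This project appears to use GoSen, a successor (?) to Sen.
GoSen seems to have evolved in several ways. For example, the parts of Sen that relied on Perl to build the dictionary can now all be done in pure Java. Unfortunately, GoSen is not mavenized.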

1. Download and build GoSen
 Download GoSen from http://itadaki.svn.sourceforge.net/viewvc/itadaki/GoSen/

$ ant
Buildfile: /Users/yusukey/Downloads/GoSen/build.xml

prepare-directories:

compile:
    [javac] /Users/yusukey/Downloads/GoSen/build.xml:39: warning: 'includeantruntime' was not set, defaulting to build.sysclasspath=last; set to false for repeatable builds
    [javac] Compiling 48 source files to /Users/yusukey/Downloads/GoSen/bin

jar:
      [jar] Building jar: /Users/yusukey/Downloads/GoSen/gosen-1.0beta.jar

2. Register GoSen in the local repository

$ mvn install:install-file -DgroupId=gosen -DartifactId=gosen -Dversion=1.0beta -Dpackaging=jar -Dfile=/Users/yusukey/Downloads/GoSen/gosen-1.0beta.jar 
[INFO] Scanning for projects...
[INFO] Searching repository for plugin with prefix: 'install'.
[INFO] ------------------------------------------------------------------------
[INFO] Building Unnamed - com.github.lucenejapaneseanalyzer:japaneseanalyzer:jar:0.0.1-SNAPSHOT
[INFO]    task-segment: [install:install-file] (aggregator-style)
[INFO] ------------------------------------------------------------------------
[INFO] [install:install-file {execution: default-cli}]
[INFO] Installing /Users/yusukey/Downloads/GoSen/gosen-1.0beta.jar to /Users/yusukey/.m2/repository/gosen/gosen/1.0beta/gosen-1.0beta.jar
[INFO] ------------------------------------------------------------------------
[INFO] BUILD SUCCESSFUL
[INFO] ------------------------------------------------------------------------
[INFO] Total time: < 1 second
[INFO] Finished at: Wed Apr 06 13:14:02 JST 2011
[INFO] Final Memory: 4M/81M
[INFO] ------------------------------------------------------------------------

3. Build the dictionary and copy it into the lucene-examples project

$ cd testdata/dictionary/
$ ant
Buildfile: /Users/yusukey/Downloads/GoSen/testdata/dictionary/build.xml

prepare-proxy:

check-build-status:

download:
      [get] Getting: http://chasen.aist-nara.ac.jp/stable/ipadic/ipadic-2.6.0.tar.gz
      [get] To: /Users/yusukey/Downloads/GoSen/testdata/dictionary/ipadic-2.6.0.tar.gz

unpack:
   [gunzip] Expanding /Users/yusukey/Downloads/GoSen/testdata/dictionary/ipadic-2.6.0.tar.gz to /Users/yusukey/Downloads/GoSen/testdata/dictionary/ipadic-2.6.0.tar
    [untar] Expanding: /Users/yusukey/Downloads/GoSen/testdata/dictionary/ipadic-2.6.0.tar into /Users/yusukey/Downloads/GoSen/testdata/dictionary
   [delete] Deleting: /Users/yusukey/Downloads/GoSen/testdata/dictionary/ipadic-2.6.0.tar

preprocess:

compile:

BUILD SUCCESSFUL
Total time: 23 seconds

$ cd ..
$ cp -r dictionary ~/lucene-examples

4. Download, build, and install Lucene-Japanese-Analyzer

$ git clone git@github.com:yusuke/Lucene-Japanese-Analyzer.git
$ cd Lucene-Japanese-Analyzer
$ mvn clean install -Dmaven.test.skip=true

It takes a bit of work, but JapaneseAnalyzer is now ready to use.
The following test case confirms that 「記者」 ("reporter") can be searched for correctly.

    @Test
    public void index() throws Exception {
        // Point the analyzer at its config file and at the dictionary directory built in step 3.
        System.setProperty("org.apache.lucene.ja.config.file","japanese-gosen-analyzer.xml");
        System.setProperty("sen.home","dictionary");

        // In-memory index; switch to the FSDirectory line to write the index to disk instead.
        Directory directory = new RAMDirectory();
//        Directory directory = FSDirectory.open(new File("gosenindex"));
        Analyzer analyzer = new GoSenAnalyzer();

        IndexWriterConfig iwc = new IndexWriterConfig(Version.LUCENE_31, analyzer);
        IndexWriter writer = new IndexWriter(directory, iwc);

        // Index one English document and one Japanese document.
        Document doc = new Document();
        doc.add(new Field("str_field", "quick brown fox jumped over the lazy dog.",
                Field.Store.YES, Field.Index.ANALYZED));
        writer.addDocument(doc);
        Document doc2 = new Document();
        doc2.add(new Field("str_field", "貴社の記者が汽車で帰社した",
                Field.Store.YES, Field.Index.ANALYZED));
        writer.addDocument(doc2);
        writer.close();

        // Search for 記者 with the same analyzer; only the Japanese document should match.
        IndexSearcher searcher = new IndexSearcher(directory, true);
        QueryParser parser = new QueryParser(Version.LUCENE_31, "str_field", analyzer);
        TopDocs td = searcher.search(parser.parse("記者"), 1000);
        assertThat(td.totalHits, is(1));
        searcher.close();
        directory.close();
    }

Maven-based source code that you can try right away is available here.
It uses Lucene 3.1 and JUnit 4.8.

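The result is the same as in Notes part 6, but because morphological analysis is used, the index should be more compact. The contents of an index can be inspected with a tool called Luke, but Luke does not seem to support Lucene 3.1 yet and I could not open the index files, so I skipped that this time.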
・Issue 35 – luke – Lucene 3.1 compatible luke version out of the box – Luke – Lucene Index Toolbox – Google Project Hosting

When I used FSDirectory to write the index to files and compared the sizes for CJKAnalyzer and JapaneseAnalyzer, I confirmed that the JapaneseAnalyzer index was slightly smaller.
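One way to make such a size comparison, sketched here with plain JDK APIs rather than anything from Lucene, is to add up the sizes of the files in each index directory. The class name `IndexSizeSketch` and the method `directorySize` are made up for illustration, and the directory names are whatever you pass on the command line (for example `gosenindex` from the commented-out FSDirectory line in the test, plus a directory of your own for the CJKAnalyzer index).

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.stream.Stream;

public class IndexSizeSketch {
    // Hypothetical helper: total size in bytes of all regular files under dir.
    public static long directorySize(Path dir) throws IOException {
        try (Stream<Path> files = Files.walk(dir)) {
            return files.filter(Files::isRegularFile)
                        .mapToLong(p -> p.toFile().length())
                        .sum();
        }
    }

    public static void main(String[] args) throws IOException {
        // Each argument is assumed to be an index directory written via FSDirectory.
        for (String name : args) {
            System.out.println(name + ": " + directorySize(Paths.get(name)) + " bytes");
        }
    }
}
```

With only two tiny documents, the difference will be small and dominated by fixed per-index overhead, so a larger corpus gives a more meaningful comparison.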

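Incidentally, there are also commercial morphological analysis libraries: a company called Basis Technology sells a module for Lucene. It may not be a widely known company, but it probably has the largest market share in language processing software.
・Rosette language processing platform for Lucene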

There is also a project called kuromoji which, like JapaneseAnalyzer, is based on MeCab (the software Sen is based on). It appears to be mavenized, so it may be easier to use than JapaneseAnalyzer.


yusuke.blog
