ElasticSearchイメージでサクッと全文検索してみる！！

ElasticSearch環境構築

環境
- Windows10Pro
- Docker for Windows
- Hyper-V有効化を忘れずに！

#elasticsearchイメージをダウンロード
docker pull elasticsearch
# コンテナ作成・起動
docker run -it -d --name elastic_search -p 9200:9200 -p 9300:9300 -v C:¥Users¥buto¥share:/mnt/share elasticsearch

全文検索に使うプラグインをインストール

# コンテナに入る
docker exec -it elastic_search /bin/bash

###### ここからコンテナ環境 ######

# ログイン直後はrootユーザー
su - elasticsearch
# elasticsearch-pluginコマンドでプラグイン操作
elasticsearch-plugin install kuromoji # 日本語対応
elasticsearch-plugin install analysis-icu # 文字の解析器
elasticsearch-plugin install ingest-attachment # ESにファイルを取り込む
# プラグインが追加された
elasticsearch-plugin list
# analysis-icu
# analysis-kuromoji
# ingest-attachment

プラグインをElasticSearchに反映させるためにコンテナ再起動（Docker for Windows GUIから実行） ※ コンテナ内でsudo systemctl elasticsearch restartできなかった。。。

PDFを取り込むパイプラインを作成

以下のリクエストでPDFをElasticSearchに流すパイプラインを作成する

curl --location --request PUT 'http://localhost:9200/_ingest/pipeline/attachment'
--header 'Content-type: application/json'
--data-raw '{
  "description" : "sample mapping",
  "processors" : [
    {
      "attachment" : {
        "field" : "data",
        "indexed_chars" : -1,
        "properties" : [
         "content",
         "content_type"
        ]
      }
    }
  ]
}'

attachmentという名前のパイプラインができる！

文字解析器を作成

PDFの文章を２～３文字ずつ区切って解析するよう設定

curl --location --request PUT 'http://localhost:9200/ngram_1'
--header 'Content-type: application/json'
--data-raw '{
    "settings": {
        "analysis": {
            "analyzer": {
                "my_ja_analyzer": {
                    "type": "custom",
                    "tokenizer": "my_tokenizer",
                    "char_filter": [
                        "icu_normalizer",
                        "kuromoji_iteration_mark"
                    ],
                    "filter": [
                        "kuromoji_baseform",
                        "kuromoji_part_of_speech",
                        "ja_stop",
                        "kuromoji_number",
                        "kuromoji_stemmer"
                    ]
                }
            },
            "tokenizer": {
                "my_tokenizer": {
                    "type": "ngram",
                    "min_gram": 2,
                    "max_gram": 3,
                    "token_chars": [
                        "letter",
                        "digit"
                    ]
                }
            }
        }
    },
    "mappings": {
        "properties": {
            "attachment.content": {
                "type": "text",
                "analyzer": "my_ja_analyzer",
                "search_analyzer": "my_ja_analyzer"
            }
        }
    }
}'

ngram_1という名前の解析器が作成される

PDFデータをBase64 エンコード

HTTPリクエストボディでPDFをパイプラインに送るため、PDFファイルをBase64 エンコードする ElasticSearchのVM（Linux）で以下のコマンドを実行

base64 -w 0 sample.pdf > sample.txt

sample.txtにBase64 エンコードした内容が出力される

PDFをパイプラインに投入

curl --location --request POST 'http://localhost:9200/ngram_1/_doc/1?pipeline=attachment'
--header 'Content-type: application/json'
--data-raw '{
    "index": {
        "_index": "ngram",
        "_type": "pdffile",
        "_id": "1",
        "pipeline": "attachment"
    },
    "data": "sample.txtの中身"
}'

これでElasticSearchにPDFの文章が取り込まれる！

文字列検索してみる

GETリクエストのqパラメータに検索したい文字列をセットする

curl --location --request GET 'http://localhost:9200/ngram_1/_search?q=検索ワード'

ヒットするとこんなレスポンスが返ってきます

{
    "took": 34,
    "timed_out": false,
    "_shards": {
        "total": 1,
        "successful": 1,
        "skipped": 0,
        "failed": 0
    },
    "hits": {
        "total": {
            "value": 1,
            "relation": "eq"
        },
        "max_score": 2.7496266,
        "hits": [
            {
                "_index": "ngram_1",
                "_type": "_doc",
                "_id": "1",
                "_score": 2.7496266,
                "_source": {
                    "data": "sample.txtの中身",
                    "attachment": {
                        "content_type": "application/pdf",
                        "content": "sample.txtをBase64デコードした内容"
                    },
                    "index": {
                        "pipeline": "attachment",
                        "_index": "ngram",
                        "_type": "pdffile",
                        "_id": "1"
                    }
                }
            }
        ]
    }
}

ヒットしなかった場合はこんなレスポンス

{

    "took": 6,
    "timed_out": false,
    "_shards": {
        "total": 1,
        "successful": 1,
        "skipped": 0,
        "failed": 0
    },
    "hits": {
        "total": {
            "value": 0,
            "relation": "eq"
        },
        "max_score": null,
        "hits": []
    }
}

やってみて

ElasticSearchのDockerイメージがあったので構築は楽でした！ PDFを取り込むためのパイプライン、解析器の作成などが理解できず時間が掛かりましたこの設定で大体の日本語はヒットするのですが、ヒットするべきでないワードでもヒットしてしまうことが多々ありました… AWSのElasticsearchServiceもありますが、あちらはanalysis-icuがサポートされていなかったので、ヒットしてほしい時にヒットできないことがありました（AWSの方はプラグイン追加はできない）全文検索をちょっと試してみたい時にどうぞ！！

buto > /dev/null

だいたい急に挑戦してゴールにたどり着かずに飽きる日々です

Docker ElasticSearchでPDF全文検索

ElasticSearch環境構築

全文検索に使うプラグインをインストール

PDFを取り込むパイプラインを作成

文字解析器を作成

PDFデータをBase64 エンコード

PDFをパイプラインに投入

文字列検索してみる

やってみて

ElasticSearch環境構築

全文検索に使うプラグインをインストール

PDFを取り込むパイプラインを作成

文字解析器を作成

PDFデータをBase64エンコード

PDFをパイプラインに投入

文字列検索してみる

やってみて

PDFデータをBase64 エンコード