Elasticsearch term & match

IT I Want to Study / Knowledge Notes 2023. 1. 1. 21:01

Creating a test index

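Create an index with a schema (using the nori morphological analyzer). The `_doc` level under `mappings` is the Elasticsearch 6.x form; on 7.x and later, put `properties` directly under `mappings`.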

PUT /cn-test-idx
{
  "settings": {
    "index": {
      "number_of_replicas": 1,
      "number_of_shards": 1,
      "analysis": {
        "analyzer": {
          "korean_analyzer": {
            "type": "custom",
            "tokenizer": "korean_tokenizer",
            "filter": [
              "nori_readingform",
              "lowercase",
              "nori_posfilter"
            ]
          }
        },
        "tokenizer": {
          "korean_tokenizer": {
            "type": "nori_tokenizer",
            "decompound_mode": "mixed"
          }
        },
        "filter": {
          "nori_posfilter": {
            "type": "nori_part_of_speech",
            "stoptags": [
              "E",
              "IC",
              "J",
              "MAG",
              "MAJ",
              "SP",
              "SSC",
              "SSO",
              "SC",
              "SE",
              "XPN",
              "XSA",
              "XSN",
              "XSV",
              "UNA",
              "NA",
              "VCP",
              "VSV",
              "VX",
              "VV"
            ]
          }
        }
      }
    }
  },
  "mappings": {
    "_doc": {
      "properties": {
        "keyword_field": {
          "type": "keyword"
        },
        "kor_contents": {
          "type": "text",
          "fields": {
            "keyword": {
              "type": "keyword",
              "ignore_above": 256
            }
          },
          "analyzer": "korean_analyzer"
        },
        "contents": {
          "type": "text",
          "fields": {
            "keyword": {
              "type": "keyword",
              "ignore_above": 256
            }
          }
        }
      }
    }
  }
}

Registering a test document

PUT /cn-test-idx/_doc/1
{
  "contents" : "여러개의 물건들",
  "kor_contents" : "여러개의 물건들",
  "keyword_field": "여러개의 물건들"
}

Test text

GET /cn-test-idx/_analyze?pretty
{
  "analyzer": "korean_analyzer",
  "text": "여러개의 물건들"
}

{
  "tokens" : [
    {
      "token" : "여러",
      "start_offset" : 0,
      "end_offset" : 2,
      "type" : "word",
      "position" : 0
    },
    {
      "token" : "개",
      "start_offset" : 2,
      "end_offset" : 3,
      "type" : "word",
      "position" : 1
    },
    {
      "token" : "물건",
      "start_offset" : 5,
      "end_offset" : 7,
      "type" : "word",
      "position" : 3
    }
  ]
}

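The test text is "여러개의 물건들". Analyzing it with the nori analyzer tokenizes it as shown above.

Tokens: 여러, 개, 물건

term query

A term query looks through the tokens stored in the inverted index of the given field (here, kor_contents) and finds documents that contain a token exactly equal to the query keyword. The keyword itself is not analyzed.
It is usually used for filtering.
Each of the following three queries matches the test document.

It is roughly analogous to field = 'token' in the WHERE clause of a SQL query.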

GET /cn-test-idx/_search
{
  "query": {
    "term": {
      "kor_contents": "여러"
    }
  }
}

GET /cn-test-idx/_search
{
  "query": {
    "term": {
      "kor_contents": "개"
    }
  }
}

GET /cn-test-idx/_search
{
  "query": {
    "term": {
      "kor_contents": "물건"
    }
  }
}
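
The lookup described above can be sketched as a toy model in Python. This is not Elasticsearch code. The names `build_inverted_index` and `term_query` are made up for illustration, and the token list is copied from the _analyze output earlier in this post. It shows why the stored tokens match while the original text does not.

```python
# Toy model of a term lookup; this is not Elasticsearch code.
from collections import defaultdict

# Tokens of kor_contents for document 1, copied from the _analyze output above.
DOC_TOKENS = {1: ["여러", "개", "물건"]}


def build_inverted_index(doc_tokens):
    """Map each token to the set of document ids that contain it."""
    index = defaultdict(set)
    for doc_id, tokens in doc_tokens.items():
        for token in tokens:
            index[token].add(doc_id)
    return index


def term_query(index, value):
    """Look the value up as-is; the search keyword is not analyzed."""
    return sorted(index.get(value, set()))


kor_contents_index = build_inverted_index(DOC_TOKENS)

print(term_query(kor_contents_index, "물건"))            # a stored token
print(term_query(kor_contents_index, "여러개의 물건들"))  # the original text
```

The second lookup finds nothing because the full string was never stored as a token in kor_contents. For an exact match on the whole string, a keyword field such as keyword_field is the place to run a term query.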

terms query

A term query accepts only one search value, but terms accepts several. A document matches if any one of the values equals a stored token.

GET /cn-test-idx/_search
{
  "query": {
    "terms": {
      "kor_contents": [
        "여러",
        "개"
      ]
    }
  }
}
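
As a rough sketch, terms behaves like a union of term lookups. The Python below is a toy model, not Elasticsearch code. `INDEX` and `terms_query` are hypothetical names, and the index contents come from the _analyze output earlier in this post.

```python
# Toy model of a terms lookup; this is not Elasticsearch code.

# token -> document ids for kor_contents, based on the _analyze output above.
INDEX = {"여러": {1}, "개": {1}, "물건": {1}}


def terms_query(index, values):
    """Return documents containing at least one of the values, looked up as-is."""
    hits = set()
    for value in values:
        hits |= index.get(value, set())
    return sorted(hits)


print(terms_query(INDEX, ["여러", "개"]))
print(terms_query(INDEX, ["가지"]))
```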

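match query

Like term, a match query looks for documents among the tokens stored in the inverted index. The difference is that the search keyword is analyzed first. If even one of the tokens produced by that analysis matches, the document is included in the results.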

GET /cn-test-idx/_search
{
  "query": {
    "match": {
      "kor_contents": "여러가지"
    }
  }
}

The search keyword is analyzed into two tokens, '여러' and '가지'. Because '여러' matches, the test document appears in the results.

GET /cn-test-idx/_analyze?pretty
{
  "analyzer": "korean_analyzer",
  "text": "여러가지"
}

{
  "tokens" : [
    {
      "token" : "여러",
      "start_offset" : 0,
      "end_offset" : 2,
      "type" : "word",
      "position" : 0
    },
    {
      "token" : "가지",
      "start_offset" : 2,
      "end_offset" : 4,
      "type" : "word",
      "position" : 1
    }
  ]
}
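
The analyze-then-lookup behavior can be sketched in Python as a toy model. This is not Elasticsearch code. `analyze` is a hypothetical stand-in for korean_analyzer that just returns the two _analyze outputs shown in this post, and `match_query` is a made-up name.

```python
# Toy model of a match query with the default "or" operator; not Elasticsearch code.

# Hypothetical stand-in for korean_analyzer: the _analyze outputs shown in this post.
ANALYZED = {
    "여러개의 물건들": ["여러", "개", "물건"],
    "여러가지": ["여러", "가지"],
}


def analyze(text):
    """Return the recorded tokens for a known text (raises KeyError otherwise)."""
    return ANALYZED[text]


def match_query(doc_tokens, query_text):
    """Analyze the query, then keep documents that share at least one token with it."""
    query_tokens = set(analyze(query_text))
    return sorted(
        doc_id for doc_id, tokens in doc_tokens.items() if query_tokens & set(tokens)
    )


DOCS = {1: analyze("여러개의 물건들")}
print(match_query(DOCS, "여러가지"))
```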

multi_match query

Used to search a single query string across multiple fields.

GET /cn-test-idx/_search
{
  "query": {
    "multi_match": {
      "fields": [
        "kor_contents",
        "contents"
      ],
      "query": "여러가지"
    }
  }
}

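match_phrase query

phrase: a single unit made up of two or more words.
Unlike match, a document is included only if every analyzed token of the search keyword is present and the tokens appear in the same consecutive order.

Queries that match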

GET /cn-test-idx/_search
{
  "query": {
    "match_phrase": {
      "kor_contents": "여러개"
    }
  }
}

GET /cn-test-idx/_search
{
  "query": {
    "match_phrase": {
      "kor_contents": "여러개의 물건들"
    }
  }
}

Query that does not match

GET /cn-test-idx/_search
{
  "query": {
    "match_phrase": {
      "kor_contents": "물건들 여러개"
    }
  }
}
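
The order requirement can be sketched with token positions. The Python below is a toy model, not Elasticsearch code. `match_phrase` is a made-up name. The document positions come from the _analyze output earlier in this post, while the positions for "여러개" and "물건들 여러개" are assumptions, since this post does not show their analysis.

```python
# Toy model of match_phrase (no slop); this is not Elasticsearch code.


def match_phrase(doc, query):
    """doc and query are lists of (token, position).

    True if one shift lines up every query token with the same token in doc.
    """
    doc_positions = {}
    for token, pos in doc:
        doc_positions.setdefault(token, set()).add(pos)

    first_token, first_pos = query[0]
    for start in doc_positions.get(first_token, ()):
        shift = start - first_pos
        if all(pos + shift in doc_positions.get(token, ()) for token, pos in query):
            return True
    return False


# Positions from the _analyze output for "여러개의 물건들".
DOC = [("여러", 0), ("개", 1), ("물건", 3)]

# Assumed analysis results for the two other queries.
QUERY_IN_ORDER = [("여러", 0), ("개", 1)]                # "여러개"
QUERY_REVERSED = [("물건", 0), ("여러", 2), ("개", 3)]   # "물건들 여러개"

print(match_phrase(DOC, QUERY_IN_ORDER))
print(match_phrase(DOC, DOC))
print(match_phrase(DOC, QUERY_REVERSED))
```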

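Between match and match_phrase, match_phrase looks like the one to use in a real service, but the requirement that the order also match, as in the example above, is an obstacle.

So for actual search I plan to use the following.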

GET /cn-test-idx/_search
{
  "query": {
    "match": {
      "kor_contents": {
        "query": "물건들 여러개",
        "operator": "and",
        "boost": 1
      }
    }
  }
}

Combining match with the operator set to and gives a query that requires every token of the search keyword to be present, regardless of order.
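
The difference between the two operators can be sketched as set logic. This Python is a toy model, not Elasticsearch code. `match_tokens` is a made-up name, and the token list for "물건들 여러개" is an assumption, since this post does not show its analysis.

```python
# Toy model of match with operator "or" vs "and"; this is not Elasticsearch code.


def match_tokens(doc_tokens, query_tokens, operator="or"):
    """'and' needs every query token in the document; 'or' needs at least one."""
    doc_set, query_set = set(doc_tokens), set(query_tokens)
    if operator == "and":
        return query_set <= doc_set
    return bool(query_set & doc_set)


DOC = ["여러", "개", "물건"]       # from the _analyze output for "여러개의 물건들"
REORDERED = ["물건", "여러", "개"]  # assumed tokens for "물건들 여러개"
PARTIAL = ["여러", "가지"]          # from the _analyze output for "여러가지"

print(match_tokens(DOC, REORDERED, operator="and"))  # order is ignored
print(match_tokens(DOC, PARTIAL, operator="and"))    # '가지' is missing
print(match_tokens(DOC, PARTIAL, operator="or"))     # '여러' is enough
```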
