7.9.4. TokenFilterStopWord
7.9.4.1. Summary
TokenFilterStopWord removes stop words from the tokens produced by
the tokenizer when searching documents.
Because tokens are removed at search time rather than at index time,
you can specify stop words even after adding the documents.
When you don't specify the column option, stop words are marked by
the is_stop_word column on the lexicon table.
7.9.4.2. Syntax
TokenFilterStopWord has one optional parameter, column:
TokenFilterStopWord
TokenFilterStopWord("column", "ignore")
7.9.4.3. Usage
Here is an example that uses the TokenFilterStopWord token filter:
Execution example:
plugin_register token_filters/stop_word
# [[0, 1337566253.89858, 0.000355720520019531], true]
table_create Memos TABLE_NO_KEY
# [[0, 1337566253.89858, 0.000355720520019531], true]
column_create Memos content COLUMN_SCALAR ShortText
# [[0, 1337566253.89858, 0.000355720520019531], true]
table_create Terms TABLE_PAT_KEY ShortText \
  --default_tokenizer TokenBigram \
  --normalizer NormalizerAuto \
  --token_filters TokenFilterStopWord
# [[0, 1337566253.89858, 0.000355720520019531], true]
column_create Terms memos_content COLUMN_INDEX|WITH_POSITION Memos content
# [[0, 1337566253.89858, 0.000355720520019531], true]
column_create Terms is_stop_word COLUMN_SCALAR Bool
# [[0, 1337566253.89858, 0.000355720520019531], true]
load --table Terms
[
{"_key": "and", "is_stop_word": true}
]
# [[0, 1337566253.89858, 0.000355720520019531], 1]
load --table Memos
[
{"content": "Hello"},
{"content": "Hello and Good-bye"},
{"content": "Good-bye"}
]
# [[0, 1337566253.89858, 0.000355720520019531], 3]
select Memos --match_columns content --query "Hello and"
# [
#   [
#     0,
#     1337566253.89858,
#     0.000355720520019531
#   ],
#   [
#     [
#       [
#         2
#       ],
#       [
#         [
#           "_id",
#           "UInt32"
#         ],
#         [
#           "content",
#           "ShortText"
#         ]
#       ],
#       [
#         1,
#         "Hello"
#       ],
#       [
#         2,
#         "Hello and Good-bye"
#       ]
#     ]
#   ]
# ]
The "and" token is marked as a stop word in the Terms table.
"Hello", whose content doesn't contain "and", is matched because
"and" is a stop word and is therefore removed from the query.
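The search-time behavior can be pictured with a small conceptual model. The following Python sketch is not Groonga's implementation: the functions tokenize, build_index and search are hypothetical names introduced only for illustration, the tokenizer is a simplified stand-in (lowercase and split on whitespace), and returning no records when every query token is a stop word is a choice made for this sketch, not documented Groonga behavior. It only shows that all tokens are indexed and that flagged tokens are dropped from the query before matching.

```python
# Conceptual model only; this is NOT how Groonga is implemented.

def tokenize(text):
    # Simplified stand-in for the tokenizer and normalizer:
    # lowercase the text and split it on whitespace.
    return text.lower().split()

def build_index(memos):
    # Index time: every token is indexed, stop words included.
    index = {}
    for memo_id, content in memos.items():
        for token in tokenize(content):
            index.setdefault(token, set()).add(memo_id)
    return index

def search(index, lexicon, query, column="is_stop_word"):
    # Search time: drop query tokens flagged in the lexicon column,
    # then require every remaining token to match (logical AND).
    tokens = [t for t in tokenize(query)
              if not lexicon.get(t, {}).get(column, False)]
    if not tokens:
        return []  # sketch-only choice, not documented behavior
    postings = [index.get(t, set()) for t in tokens]
    return sorted(set.intersection(*postings))

memos = {1: "Hello", 2: "Hello and Good-bye", 3: "Good-bye"}
index = build_index(memos)

# The lexicon row is added after indexing, mirroring the load into Terms.
lexicon = {"and": {"is_stop_word": True}}
matched = search(index, lexicon, "Hello and")
print(matched)
```

In this model, searching with an empty lexicon keeps "and" in the query, so only the record containing both words matches. Once "and" is flagged, the query reduces to "hello" and both records 1 and 2 match, which corresponds to the result shown above.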
You can mark stop words in a column other than is_stop_word by specifying the column option, as shown below.
Execution example:
plugin_register token_filters/stop_word
table_create Memos TABLE_NO_KEY
column_create Memos content COLUMN_SCALAR ShortText
table_create Terms TABLE_PAT_KEY ShortText \
  --default_tokenizer TokenBigram \
  --normalizer NormalizerAuto \
  --token_filters 'TokenFilterStopWord("column", "ignore")'
column_create Terms memos_content COLUMN_INDEX|WITH_POSITION Memos content
column_create Terms ignore COLUMN_SCALAR Bool
load --table Terms
[
{"_key": "and", "ignore": true}
]
load --table Memos
[
{"content": "Hello"},
{"content": "Hello and Good-bye"},
{"content": "Good-bye"}
]
select Memos --match_columns content --query "Hello and"
# [
#   [
#     0,
#     0.0,
#     0.0
#   ],
#   [
#     [
#       [
#         2
#       ],
#       [
#         [
#           "_id",
#           "UInt32"
#         ],
#         [
#           "content",
#           "ShortText"
#         ]
#       ],
#       [
#         1,
#         "Hello"
#       ],
#       [
#         2,
#         "Hello and Good-bye"
#       ]
#     ]
#   ]
# ]