Package org.apache.nlpcraft.model
Interface NCCustomWord
-
public interface NCCustomWord
A partially enriched token with a basic set of NLP properties, used by custom NER parsers.
See Also:
NCModelView.getParsers(), NCToken
-
Method Summary
Modifier and Type   Method                 Description
int                 getEndCharIndex()      Gets end character index of this word in the original text.
String              getLemma()             Gets the lemma of this word, a canonical form of this word.
String              getNormalizedText()    Gets normalized text for this word.
String              getOriginalText()      Gets original text for this word.
String              getPos()               Gets Penn Treebank POS tag for this word.
String              getPosDescription()    Gets description of Penn Treebank POS tag.
int                 getStartCharIndex()    Gets start character index of this word in the original text.
String              getStem()              Gets the stem of this word.
boolean             isBracketed()          Gets whether or not this word is surrounded by any of '[', ']', '{', '}', '(', ')' brackets.
boolean             isEnglish()            Tests whether the given token represents an English word.
boolean             isKnownWord()          Tests whether or not this token is found in Princeton WordNet database.
boolean             isQuoted()             Gets whether or not this word is surrounded by single or double quotes.
boolean             isStopWord()           Gets whether or not this word is a stopword.
boolean             isSwearWord()          Tests whether or not the given token is a swear word.
-
Method Detail
-
getNormalizedText
String getNormalizedText()
Gets normalized text for this word.
Returns:
Normalized text.
-
getOriginalText
String getOriginalText()
Gets original text for this word.
Returns:
Original text.
-
getStartCharIndex
int getStartCharIndex()
Gets start character index of this word in the original text.
Returns:
Start character index of this word.
-
getEndCharIndex
int getEndCharIndex()
Gets end character index of this word in the original text.
Returns:
End character index of this word.
-
getPos
String getPos()
Gets Penn Treebank POS tag for this word. Note that in addition to the standard Penn Treebank POS tags, NLPCraft introduces a synthetic '---' tag to indicate the POS tag of a multi-word part. Learn more at http://www.ling.upenn.edu/courses/Fall_2003/ling001/penn_treebank_pos.html
Returns:
Penn Treebank POS tag for this word.
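As a rough sketch of how calling code might tell the synthetic '---' tag apart from regular Penn Treebank tags, the helper below classifies the string returned by getPos(). The PosTags class and its method names are hypothetical and are not part of NLPCraft; only the '---' tag and the Penn Treebank noun tags come from the sources cited above.

```java
// Hypothetical helper, not part of NLPCraft: interprets the string returned by getPos().
final class PosTags {
    // Synthetic tag that NLPCraft uses for a multi-word part (per the getPos() description).
    static final String SYNTHETIC = "---";

    private PosTags() {}

    // True if the tag is the synthetic multi-word tag rather than a Penn Treebank tag.
    static boolean isSynthetic(String pos) {
        return SYNTHETIC.equals(pos);
    }

    // True for the Penn Treebank noun tags NN, NNS, NNP and NNPS.
    static boolean isNoun(String pos) {
        return pos != null && !isSynthetic(pos) && pos.startsWith("NN");
    }

    public static void main(String[] args) {
        System.out.println(isNoun("NNS"));      // plural noun tag
        System.out.println(isNoun("VB"));       // verb tag
        System.out.println(isSynthetic("---")); // synthetic multi-word tag
    }
}
```

Checking for the synthetic tag first keeps prefix-based tests such as startsWith("NN") from being applied to a value that is not a real Penn Treebank tag.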
-
getPosDescription
String getPosDescription()
Gets description of Penn Treebank POS tag. Learn more at http://www.ling.upenn.edu/courses/Fall_2003/ling001/penn_treebank_pos.html
Returns:
Description of Penn Treebank POS tag.
-
getLemma
String getLemma()
Gets the lemma of this word, a canonical form of this word. Note that stemming and lemmatization both reduce inflectional forms, and sometimes derivationally related forms, of a word to a common base form. Lemmatization refers to the use of a vocabulary and morphological analysis of words, normally aiming to remove inflectional endings only and to return the base or dictionary form of a word, which is known as the lemma. Learn more at https://nlp.stanford.edu/IR-book/html/htmledition/stemming-and-lemmatization-1.html
Returns:
Lemma of this word.
-
getStem
String getStem()
Gets the stem of this word. Note that stemming and lemmatization both reduce inflectional forms, and sometimes derivationally related forms, of a word to a common base form. Unlike lemmatization, stemming is a basic heuristic process that chops off the ends of words in the hope of achieving this goal correctly most of the time, and often includes the removal of derivational affixes. Learn more at https://nlp.stanford.edu/IR-book/html/htmledition/stemming-and-lemmatization-1.html
Returns:
Stem of this word.
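The difference between a stem and a lemma can be illustrated with a toy sketch. Everything below is hypothetical: the StemVsLemma class, its suffix list, and its three-entry vocabulary are made up for illustration and do not reflect the algorithms NLPCraft actually uses.

```java
import java.util.Map;

// Toy illustration only; NLPCraft's actual stemmer and lemmatizer are not shown on this page.
final class StemVsLemma {
    // Tiny hand-written vocabulary standing in for a real morphological dictionary.
    private static final Map<String, String> LEMMAS = Map.of(
        "ponies", "pony",
        "was", "be",
        "better", "good"
    );

    private StemVsLemma() {}

    // Crude heuristic stemmer: chops a few common endings without consulting any vocabulary.
    static String stem(String word) {
        String w = word.toLowerCase();
        for (String suffix : new String[] {"ies", "ing", "ed", "s"}) {
            // Only strip when more than two characters would remain.
            if (w.endsWith(suffix) && w.length() > suffix.length() + 2) {
                return w.substring(0, w.length() - suffix.length());
            }
        }
        return w;
    }

    // Dictionary lookup: returns the base form if known, otherwise the lowercased word.
    static String lemma(String word) {
        String w = word.toLowerCase();
        return LEMMAS.getOrDefault(w, w);
    }

    public static void main(String[] args) {
        for (String w : new String[] {"ponies", "was", "walking"}) {
            System.out.println(w + " -> stem=" + stem(w) + ", lemma=" + lemma(w));
        }
    }
}
```

In this sketch the stemmer turns "ponies" into the non-word "pon", while the vocabulary lookup returns the dictionary form "pony". That is the trade-off described above: heuristic truncation versus vocabulary-based analysis.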
-
isStopWord
boolean isStopWord()
Gets whether or not this word is a stopword. Stopwords are extremely common words that add little value in helping to understand user input and are excluded from processing entirely. For example, words like a, the, can, of, about, over, etc. are typical stopwords in English. NLPCraft has a built-in set of stopwords. Each model can also provide its own set of included and excluded stopwords.
Returns:
Whether or not this word is a stopword.
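A custom parser will often want to skip stopwords before matching. The following is a minimal sketch of that filtering step. WordView is a hypothetical stand-in that mirrors just two NCCustomWord methods so the example compiles without the NLPCraft library, and the stopword flags in the demo data are set by hand rather than computed.

```java
import java.util.ArrayList;
import java.util.List;

// Stand-in for the two NCCustomWord methods used here; the real interface has more methods.
interface WordView {
    String getNormalizedText();
    boolean isStopWord();
}

final class StopWordFilter {
    private StopWordFilter() {}

    // Keeps the normalized text of every word that is not flagged as a stopword.
    static List<String> contentWords(List<? extends WordView> words) {
        List<String> result = new ArrayList<>();
        for (WordView w : words) {
            if (!w.isStopWord()) {
                result.add(w.getNormalizedText());
            }
        }
        return result;
    }

    // Hypothetical factory for demo data; the stopword flag is supplied by the caller.
    static WordView word(String text, boolean stop) {
        return new WordView() {
            @Override public String getNormalizedText() { return text; }
            @Override public boolean isStopWord() { return stop; }
        };
    }

    public static void main(String[] args) {
        List<WordView> words = List.of(
            word("the", true), word("weather", false),
            word("of", true), word("london", false));
        System.out.println(contentWords(words));
    }
}
```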
-
isBracketed
boolean isBracketed()
Gets whether or not this word is surrounded by any of '[', ']', '{', '}', '(', ')' brackets.
Returns:
Whether or not this word is surrounded by any of '[', ']', '{', '}', '(', ')' brackets.
-
isQuoted
boolean isQuoted()
Gets whether or not this word is surrounded by single or double quotes.
Returns:
Whether or not this word is surrounded by single or double quotes.
-
isKnownWord
boolean isKnownWord()
Tests whether or not this token is found in the Princeton WordNet database.
Returns:
Princeton WordNet database inclusion flag.
-
isSwearWord
boolean isSwearWord()
Tests whether or not the given token is a swear word. NLPCraft has a built-in list of common English swear words.
Returns:
Swear word flag.
-
isEnglish
boolean isEnglish()
Tests whether the given token represents an English word. Note that this only checks that the token's text consists of characters of the English alphabet, i.e. the text does not necessarily have to be a known valid English word.
Returns:
Whether this token represents an English word.
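One possible reading of this character-level check is sketched below. The EnglishCheck class is hypothetical, and treating "characters of English alphabet" as the ASCII letters A-Z and a-z is an assumption; NLPCraft's actual rule (for example, how it treats digits, hyphens or apostrophes) is not specified here.

```java
// Hypothetical helper, not NLPCraft's implementation: one possible reading of the check.
final class EnglishCheck {
    private EnglishCheck() {}

    // True if the text is non-empty and every character is an ASCII letter A-Z or a-z.
    // Assumption: digits, punctuation and accented letters all fail the check.
    static boolean isEnglishAlphabet(String text) {
        if (text == null || text.isEmpty()) {
            return false;
        }
        for (int i = 0; i < text.length(); i++) {
            char c = text.charAt(i);
            boolean letter = (c >= 'a' && c <= 'z') || (c >= 'A' && c <= 'Z');
            if (!letter) {
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) {
        System.out.println(isEnglishAlphabet("qwzx"));       // letters only, though not a real word
        System.out.println(isEnglishAlphabet("na\u00efve")); // contains a non-ASCII letter
    }
}
```

Under this reading a string such as "qwzx" passes even though it is not a dictionary word, which matches the note that the text does not have to be a known valid English word.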
-