Interface NCCustomWord


  • public interface NCCustomWord
    A partially enriched token with a basic set of NLP properties, used by custom NER parsers.
    See Also:
    NCModelView.getParsers(), NCToken
    • Method Summary

      Modifier and Type Method Description
      int getEndCharIndex()
      Gets end character index of this word in the original text.
      String getLemma()
      Gets the lemma of this word, a canonical form of this word.
      String getNormalizedText()
      Gets normalized text for this word.
      String getOriginalText()
      Gets original text for this word.
      String getPos()
      Gets Penn Treebank POS tag for this word.
      String getPosDescription()
      Gets description of Penn Treebank POS tag.
      int getStartCharIndex()
      Gets start character index of this word in the original text.
      String getStem()
      Gets the stem of this word.
      boolean isBracketed()
      Gets whether or not this word is surrounded by any of '[', ']', '{', '}', '(', ')' brackets.
      boolean isEnglish()
      Tests whether the given token represents an English word.
      boolean isKnownWord()
      Tests whether or not this token is found in the Princeton WordNet database.
      boolean isQuoted()
      Gets whether or not this word is surrounded by single or double quotes.
      boolean isStopWord()
      Gets whether or not this word is a stopword.
      boolean isSwearWord()
      Tests whether or not the given token is a swear word.
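
The accessors summarized above are typically combined when deciding which words a custom NER parser should consider. The following is a minimal sketch of that idea, not NLPCraft code: `WordView` is a hypothetical stand-in that mirrors a few `NCCustomWord` method signatures so the example compiles on its own, and `CandidateFilterSketch` and its filtering rule are assumptions made for illustration.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical stand-in that mirrors a few NCCustomWord accessors so the
// sketch compiles without NLPCraft on the classpath.
interface WordView {
    String getNormalizedText();
    boolean isStopWord();
    boolean isQuoted();
    boolean isEnglish();
}

final class CandidateFilterSketch {
    // Keeps the normalized text of words that are not stopwords, not quoted,
    // and flagged as English. The rule itself is an illustrative assumption.
    static List<String> candidates(List<? extends WordView> words) {
        List<String> out = new ArrayList<>();
        for (WordView w : words) {
            if (!w.isStopWord() && !w.isQuoted() && w.isEnglish()) {
                out.add(w.getNormalizedText());
            }
        }
        return out;
    }

    // Fixture for trying the filter: builds a WordView with fixed flag values.
    static WordView word(String text, boolean stop, boolean quoted, boolean english) {
        return new WordView() {
            @Override public String getNormalizedText() { return text; }
            @Override public boolean isStopWord() { return stop; }
            @Override public boolean isQuoted() { return quoted; }
            @Override public boolean isEnglish() { return english; }
        };
    }
}
```

In a real parser the same predicate would be applied to the `NCCustomWord` instances the framework supplies, rather than to hand-built fixtures.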
    • Method Detail

      • getNormalizedText

        String getNormalizedText()
        Gets normalized text for this word.
        Returns:
        Normalized text.
      • getOriginalText

        String getOriginalText()
        Gets original text for this word.
        Returns:
        Original text.
      • getStartCharIndex

        int getStartCharIndex()
        Gets start character index of this word in the original text.
        Returns:
        Start character index of this word.
      • getEndCharIndex

        int getEndCharIndex()
        Gets end character index of this word in the original text.
        Returns:
        End character index of this word.
      • getLemma

        String getLemma()
        Gets the lemma of this word, i.e. its canonical form. Stemming and lemmatization both reduce inflectional forms, and sometimes derivationally related forms, of a word to a common base form. Lemmatization uses a vocabulary and morphological analysis of words, normally aiming to remove inflectional endings only and to return the base or dictionary form of a word, which is known as the lemma. Learn more at https://nlp.stanford.edu/IR-book/html/htmledition/stemming-and-lemmatization-1.html
        Returns:
        Lemma of this word.
      • getStem

        String getStem()
        Gets the stem of this word. Stemming and lemmatization both reduce inflectional forms, and sometimes derivationally related forms, of a word to a common base form. Unlike lemmatization, stemming is a basic heuristic process that chops off the ends of words in the hope of achieving this goal correctly most of the time, and it often includes the removal of derivational affixes. Learn more at https://nlp.stanford.edu/IR-book/html/htmledition/stemming-and-lemmatization-1.html
        Returns:
        Stem of this word.
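
The contrast between a lemma (vocabulary lookup) and a stem (heuristic suffix chopping) can be illustrated with a toy example. This is only a sketch: the tiny dictionary, the suffix list, and the class name `StemVsLemmaSketch` are all invented for illustration, and nothing here reflects the stemmer or lemmatizer NLPCraft actually uses.

```java
import java.util.Locale;
import java.util.Map;

final class StemVsLemmaSketch {
    // Toy vocabulary standing in for a real dictionary (hypothetical data).
    private static final Map<String, String> LEMMAS = Map.of(
        "ran", "run",
        "better", "good",
        "studies", "study"
    );

    // Lemmatization sketch: look the word up in the vocabulary and
    // fall back to the lowercased word when it is not listed.
    static String lemma(String word) {
        String w = word.toLowerCase(Locale.ROOT);
        return LEMMAS.getOrDefault(w, w);
    }

    // Stemming sketch: strip the first matching suffix from a short list,
    // provided at least three characters remain. No vocabulary is consulted.
    static String stem(String word) {
        String w = word.toLowerCase(Locale.ROOT);
        for (String suffix : new String[] {"ing", "ies", "es", "s"}) {
            if (w.length() > suffix.length() + 2 && w.endsWith(suffix)) {
                return w.substring(0, w.length() - suffix.length());
            }
        }
        return w;
    }
}
```

With this toy data, `lemma("studies")` yields the dictionary form `study`, while `stem("studies")` yields the truncated `stud`, which is not itself a word. Irregular forms such as `ran` are resolved only by the lookup.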
      • isStopWord

        boolean isStopWord()
        Gets whether or not this word is a stopword. Stopwords are extremely common words that add little value in understanding user input and are excluded from processing entirely. For example, words like a, the, can, of, about, over, etc. are typical stopwords in English. NLPCraft has a built-in set of stopwords. Each model can also provide its own sets of included and excluded stopwords.
        Returns:
        Whether or not this word is a stopword.
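
The description above does not say how the built-in set and a model's included and excluded stopwords are combined. The sketch below assumes the simplest reading: start from the built-in set, add the model's inclusions, then remove its exclusions. The class `StopWordSketch`, its method names, and the sample words are hypothetical and not part of NLPCraft.

```java
import java.util.HashSet;
import java.util.Set;

final class StopWordSketch {
    // Assumed combination rule: (builtIn + included) - excluded.
    static Set<String> effective(Set<String> builtIn,
                                 Set<String> included,
                                 Set<String> excluded) {
        Set<String> result = new HashSet<>(builtIn);
        result.addAll(included);
        result.removeAll(excluded);
        return result;
    }

    // Membership test against the effective set, using normalized text.
    static boolean isStopWord(String normalizedText, Set<String> stopWords) {
        return stopWords.contains(normalizedText);
    }
}
```

Under this assumption, a model that excludes `can` would see that word processed normally even though it appears in the built-in set.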
      • isBracketed

        boolean isBracketed()
        Gets whether or not this word is surrounded by any of '[', ']', '{', '}', '(', ')' brackets.
        Returns:
        Whether or not this word is surrounded by any of '[', ']', '{', '}', '(', ')' brackets.
      • isQuoted

        boolean isQuoted()
        Gets whether or not this word is surrounded by single or double quotes.
        Returns:
        Whether or not this word is surrounded by single or double quotes.
      • isKnownWord

        boolean isKnownWord()
        Tests whether or not this token is found in the Princeton WordNet database.
        Returns:
        Princeton WordNet database inclusion flag.
      • isSwearWord

        boolean isSwearWord()
        Tests whether or not this token is a swear word. NLPCraft has a built-in list of common English swear words.
        Returns:
        Swear word flag.
      • isEnglish

        boolean isEnglish()
        Tests whether this token represents an English word. Note that this only checks that the token's text consists of characters of the English alphabet, i.e. the text does not necessarily have to be a known, valid English word.
        Returns:
        Whether this token represents an English word.
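
The alphabet-only nature of this check can be made concrete with a small sketch. It assumes "characters of the English alphabet" means the ASCII letters a-z and A-Z. Whether the real implementation also accepts digits, apostrophes, or hyphens is not stated here, so `EnglishCheckSketch` is an illustrative guess rather than NLPCraft's logic.

```java
final class EnglishCheckSketch {
    // Returns true when the text is non-empty and every character is an
    // ASCII letter. No dictionary lookup is involved (assumed behavior).
    static boolean isEnglishAlphabet(String text) {
        if (text.isEmpty()) {
            return false;
        }
        for (int i = 0; i < text.length(); i++) {
            char c = text.charAt(i);
            boolean asciiLetter = (c >= 'a' && c <= 'z') || (c >= 'A' && c <= 'Z');
            if (!asciiLetter) {
                return false;
            }
        }
        return true;
    }
}
```

A string such as `qwrtp` passes this check even though it is not a word, which is why `isKnownWord()` is the better test when dictionary membership matters.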