codesimian
Class NaturalLanguage

java.lang.Object
  extended by codesimian.NaturalLanguage

public class NaturalLanguage
extends java.lang.Object

Converts strings of natural language, like this sentence, to arrays of numbers so computers can understand them better.


Nested Class Summary
static class NaturalLanguage.RemoveEverythingExceptLetters
           
 
Field Summary
protected  java.lang.String delimiters
           
protected  java.util.Vector<java.lang.String> exampleText
           
protected  java.util.HashMap<java.lang.String,java.lang.String[]> similarStrings
          Strings that can often be used interchangeably, like "3" with "e" in "b33r" or "beer". Also bigger strings, like "y" with "ies" in "penny" or "pennies".
protected  java.lang.String[] word
          The most common words have the lowest indexes.
protected  int[] wordsFound
           
protected  java.util.HashMap<java.lang.String,java.lang.Integer> wordToIndex
          Key is a word.
 
Constructor Summary
NaturalLanguage(int wordsToUse)
          Call addExampleText() with some sentences before use; they supply the text for the calculations.
NaturalLanguage(java.lang.String sentences, int wordsToUse)
          Same as NaturalLanguage(int), except that it also calls addExampleText(sentences).
 
Method Summary
 void addExampleText(java.lang.String t)
           
protected  void addSimilarStrings(java.lang.String[] similarToEachOther)
          Adds every string in the array to similarStrings, each as its own key, with the array as the value.
 void calculateNewMostCommonWords()
          Fills word[] (and wordToIndex) with the most common formatted words from exampleText.
 java.lang.String formatWord(java.lang.String word)
          Formats a word in a standard way.
 java.lang.String getDelimiters()
           
 int getIndex(java.lang.String theWord)
          Returns the index of a word: wordToIndex.get(word).intValue().
 java.lang.String getWord(int index)
          Returns word[index].
 boolean setDelimiters(java.lang.String delimiters)
           
 int[] tokenize(java.lang.String sentences)
          Returns a sequence of ints representing the most common words (as ranked from the example text) that occur in 'sentences'.
 int wordCount()
          Returns how many unique words are recognized, word.length.
 
Methods inherited from class java.lang.Object
clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait
 

Field Detail

wordToIndex

protected java.util.HashMap<java.lang.String,java.lang.Integer> wordToIndex
Key is a word. Value is an Integer index into word[].
For example, word[ wordToIndex.get("is").intValue() ] equals "is".


word

protected java.lang.String[] word
The most common words have the lowest indexes.


wordsFound

protected int[] wordsFound

delimiters

protected java.lang.String delimiters

similarStrings

protected java.util.HashMap<java.lang.String,java.lang.String[]> similarStrings
Strings that can often be used interchangeably, like "3" with "e" in "b33r" or "beer". Also bigger strings, like "y" with "ies" in "penny" or "pennies". Anywhere one string is found, an AI might think about what the text would mean if parts were replaced by a similar string. To cut down on the number of words, all similar words are treated as the same word.

Key is some string that is a key or value in similarStrings.
Value is a String[] array containing the key you searched for and all strings similar to it.


exampleText

protected java.util.Vector<java.lang.String> exampleText
Constructor Detail

NaturalLanguage

public NaturalLanguage(int wordsToUse)
Call addExampleText() with some sentences before use; they supply the text for the calculations.

Parameters:
wordsToUse - number of words that are recognized. All other words are ignored and never returned in a sequence.

NaturalLanguage

public NaturalLanguage(java.lang.String sentences,
                       int wordsToUse)
Same as NaturalLanguage(int), except that it also calls addExampleText(sentences).

Method Detail

getIndex

public int getIndex(java.lang.String theWord)
Returns the index of a word: wordToIndex.get(word).intValue(). If the word is not already formatted, call getIndex(formatWord(word)) instead. The most common word has index 0, the second most common has index 1, and so on. Returns -1 if the word does not exist and therefore has no index.
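
The lookup described above can be sketched as follows. This is a minimal, self-contained illustration, not the codesimian source: the class name IndexLookupSketch and the three sample words are hypothetical, and only the documented behavior (rank 0 for the most common word, -1 for an unknown word) is modeled.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical stand-in for illustration; not the codesimian implementation.
class IndexLookupSketch {
    // Sample words ordered from most to least common, as word[] is described.
    static final String[] WORD = {"the", "is", "cat"};
    static final Map<String, Integer> WORD_TO_INDEX = new HashMap<>();

    static {
        // Build the reverse mapping so that WORD[WORD_TO_INDEX.get(w)] equals w.
        for (int i = 0; i < WORD.length; i++) {
            WORD_TO_INDEX.put(WORD[i], i);
        }
    }

    // Returns the rank of an already formatted word, or -1 if it is unknown.
    static int getIndex(String formattedWord) {
        Integer index = WORD_TO_INDEX.get(formattedWord);
        return index == null ? -1 : index;
    }

    public static void main(String[] args) {
        System.out.println(getIndex("the")); // prints 0
        System.out.println(getIndex("is"));  // prints 1
        System.out.println(getIndex("dog")); // prints -1
    }
}
```

Checking for null before unboxing is what keeps an unknown word from raising a NullPointerException in this sketch.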


getWord

public java.lang.String getWord(int index)
Returns word[index]. The most common word has index 0, the second most common has index 1, and so on.


wordCount

public int wordCount()
Returns how many unique words are recognized, word.length. Indexes range from 0 to wordCount()-1.


setDelimiters

public boolean setDelimiters(java.lang.String delimiters)

getDelimiters

public java.lang.String getDelimiters()

addSimilarStrings

protected void addSimilarStrings(java.lang.String[] similarToEachOther)
Adds every string in the array to similarStrings, each as its own key, with the array as the value. If a String has already been added as a similar string, its value is updated to this new array.
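
One way to picture this mapping is the sketch below. It is an assumption-laden illustration rather than the actual method body: the class name SimilarStringsSketch is hypothetical, and the map is declared with the same key and value types as the similarStrings field.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical stand-in for illustration; not the codesimian implementation.
class SimilarStringsSketch {
    final Map<String, String[]> similarStrings = new HashMap<>();

    // Registers every string in the group as a key that maps to the whole group.
    // A string that was registered earlier is remapped to the newest group.
    void addSimilarStrings(String[] similarToEachOther) {
        for (String s : similarToEachOther) {
            similarStrings.put(s, similarToEachOther);
        }
    }

    public static void main(String[] args) {
        SimilarStringsSketch sketch = new SimilarStringsSketch();
        sketch.addSimilarStrings(new String[] {"3", "e"});
        sketch.addSimilarStrings(new String[] {"y", "ies"});
        System.out.println(String.join(",", sketch.similarStrings.get("ies"))); // prints y,ies
    }
}
```

In this sketch, only the strings in the new array are remapped. Members of an older group that are absent from the new array keep pointing at the old array; whether the real class reconciles such groups is not stated in this documentation.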


formatWord

public java.lang.String formatWord(java.lang.String word)
Formats a word in a standard way. Makes the word lower-case and changes most plurals to singular. Assumes the English language. Removes anything that is not a letter or digit. Returns null if the word does not look like a word.
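
A rough sketch of this kind of normalization is shown below. The class name FormatWordSketch and both plural rules are assumptions made for illustration; the real rules are not documented here. The rules are deliberately naive and will mangle some words (for example, "this" becomes "thi").

```java
// Hypothetical stand-in for illustration; not the codesimian implementation.
class FormatWordSketch {
    // Lower-cases the word, keeps only letters and digits, and applies two
    // naive English plural rules. Returns null if nothing word-like remains.
    static String formatWord(String word) {
        if (word == null) {
            return null;
        }
        StringBuilder kept = new StringBuilder();
        for (char c : word.toCharArray()) {
            if (Character.isLetterOrDigit(c)) {
                kept.append(Character.toLowerCase(c));
            }
        }
        String w = kept.toString();
        if (w.isEmpty()) {
            return null;
        }
        // Assumed rule 1: "pennies" -> "penny".
        if (w.length() > 3 && w.endsWith("ies")) {
            return w.substring(0, w.length() - 3) + "y";
        }
        // Assumed rule 2: "cats" -> "cat", but leave "glass" alone.
        if (w.length() > 3 && w.endsWith("s") && !w.endsWith("ss")) {
            return w.substring(0, w.length() - 1);
        }
        return w;
    }

    public static void main(String[] args) {
        System.out.println(formatWord("Pennies!")); // prints penny
        System.out.println(formatWord("Cats"));     // prints cat
        System.out.println(formatWord("..."));      // prints null
    }
}
```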


addExampleText

public void addExampleText(java.lang.String t)

calculateNewMostCommonWords

public void calculateNewMostCommonWords()
Fills word[] (and wordToIndex) with the most common formatted words from exampleText. Call this once after a lot of text has been entered with addExampleText(), and before tokenize().
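
The ranking step can be illustrated with a simple frequency count. The sketch below is not the codesimian algorithm: the class name MostCommonWordsSketch, the lower-case-and-split formatting, and the alphabetical tie-break are all assumptions chosen to keep the example deterministic.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;

// Hypothetical stand-in for illustration; not the codesimian implementation.
class MostCommonWordsSketch {
    // Returns up to wordsToUse words, ordered from most to least frequent.
    static String[] mostCommonWords(List<String> exampleText, int wordsToUse) {
        Map<String, Integer> counts = new HashMap<>();
        for (String text : exampleText) {
            // Assumed formatting: lower-case, split on anything but a-z and 0-9.
            for (String token : text.toLowerCase(Locale.ROOT).split("[^a-z0-9]+")) {
                if (!token.isEmpty()) {
                    counts.merge(token, 1, Integer::sum);
                }
            }
        }
        List<String> ranked = new ArrayList<>(counts.keySet());
        // Higher counts first; ties are broken alphabetically (an assumption).
        ranked.sort((a, b) -> {
            int byCount = Integer.compare(counts.get(b), counts.get(a));
            return byCount != 0 ? byCount : a.compareTo(b);
        });
        int n = Math.min(wordsToUse, ranked.size());
        return ranked.subList(0, n).toArray(new String[0]);
    }

    public static void main(String[] args) {
        List<String> text = new ArrayList<>();
        text.add("the cat saw the dog");
        text.add("the dog ran");
        System.out.println(String.join(" ", mostCommonWords(text, 3))); // prints the dog cat
    }
}
```

The position of each word in the returned array plays the role of its index, so a wordToIndex-style map would simply be the inverse of this array.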


tokenize

public int[] tokenize(java.lang.String sentences)
Returns a sequence of ints representing the most common words (as ranked from the example text) that occur in 'sentences'. Tokenizes and formats the words before checking whether they equal known words.
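
The following sketch shows one plausible shape for this conversion. It is not the actual method: the class name TokenizeSketch is hypothetical, the word-to-index map and delimiters are passed in as parameters, the delimiters string is assumed to be a set of separator characters (as java.util.StringTokenizer treats it), and formatting is reduced to lower-casing. Unknown words are skipped, following the note on wordsToUse that unrecognized words are never returned in a sequence.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.StringTokenizer;

// Hypothetical stand-in for illustration; not the codesimian implementation.
class TokenizeSketch {
    // Splits the text on the delimiter characters, lower-cases each token,
    // and keeps the index of every token that is a known word.
    static int[] tokenize(String sentences, Map<String, Integer> wordToIndex, String delimiters) {
        List<Integer> found = new ArrayList<>();
        StringTokenizer tokens = new StringTokenizer(sentences, delimiters);
        while (tokens.hasMoreTokens()) {
            String formatted = tokens.nextToken().toLowerCase(Locale.ROOT);
            Integer index = wordToIndex.get(formatted);
            if (index != null) {
                found.add(index); // unknown words are silently dropped
            }
        }
        int[] result = new int[found.size()];
        for (int i = 0; i < result.length; i++) {
            result[i] = found.get(i);
        }
        return result;
    }

    public static void main(String[] args) {
        Map<String, Integer> wordToIndex = new HashMap<>();
        wordToIndex.put("the", 0);
        wordToIndex.put("dog", 1);
        wordToIndex.put("cat", 2);
        int[] ids = tokenize("The cat, the DOG and the bird.", wordToIndex, " ,.");
        System.out.println(java.util.Arrays.toString(ids)); // prints [0, 2, 0, 1, 0]
    }
}
```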