T10 dictionary
This project aims to develop a T10 dictionary, an extension of the T9 dictionary commonly found in cell phones. First, let’s take a brief look at the main concept of a T10 dictionary. A T9 dictionary tries to predict the word that a user is entering on the basis of the characters it has seen so far. Similarly, a T10 dictionary tries to predict the sentence that the user is typing, one word at a time, on the basis of the words already entered. User behavior is observed and recorded to support this prediction. Each time the user enters a word, the dictionary suggests candidates for the next word in the sentence.
The program will start with no knowledge of user behavior and will build up a cache as the user works with the software. Whenever the user enters a word, the learning mechanism will be triggered. For example, if the user enters the sentence ‘Hello world’, the software will record that the word ‘Hello’ was followed by the word ‘world’. The next time the user enters ‘Hello’, the word ‘world’ will be suggested automatically. If the user then enters ‘Hello Raju’, both ‘Raju’ and ‘world’ will be suggested the next time ‘Hello’ is entered. To do this, the program will keep track of all the two-word pairs, or bigrams, that the user has entered. In the above case, two bigrams will be recorded: ‘Hello-world’ and ‘Hello-Raju’. Each time a word is entered, the program will search its cache of bigrams to see whether the entered word is the first word of any bigram. If so, the second word of that bigram will be suggested to the user. The frequency of each bigram, that is, the number of times it has occurred during usage, will also be recorded. So if ‘Hello’ has been followed by ‘world’ more often than by ‘Raju’, ‘world’ will appear higher in the suggestions than ‘Raju’. A bigram will only be recorded when both of its words appear in the same sentence.
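The bigram cache described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the project's actual implementation: the class name BigramCache, the methods learn and suggest, and the limit parameter are hypothetical, sentences are assumed to end at '.', '!' or '?', and words are matched case-sensitively.

```python
import re
from collections import Counter, defaultdict


class BigramCache:
    """Counts how often each word is followed by another within a sentence."""

    def __init__(self):
        # first word -> Counter of the words that have followed it
        self.followers = defaultdict(Counter)

    def learn(self, text):
        # Assumption: '.', '!' and '?' end a sentence. Splitting here
        # ensures that no bigram spans two sentences.
        for sentence in re.split(r"[.!?]+", text):
            words = sentence.split()
            for first, second in zip(words, words[1:]):
                self.followers[first][second] += 1

    def suggest(self, word, limit=5):
        # Most frequent followers come first; an unseen word yields nothing.
        if word not in self.followers:
            return []
        return [w for w, _ in self.followers[word].most_common(limit)]


cache = BigramCache()
cache.learn("Hello world. Hello Raju. Hello world.")
print(cache.suggest("Hello"))  # ['world', 'Raju']
```

Here ‘world’ ranks above ‘Raju’ because the bigram ‘Hello-world’ was seen twice and ‘Hello-Raju’ once. Storing one counter per first word keeps the lookup for a newly entered word to a single dictionary access.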
The GUI of the program will be a simple textbox wherein the user can start typing words.
As the program learns from this usage, it will display suggestions in small
pop-up lists, from which the user can either pick a suggestion or continue
typing a word manually. In either case, the pair of words most recently
entered will be recorded to update the frequency of that bigram for future
reference. When the program shuts down, it will save the corpus learned so far
to a formatted XML file so that training need not be repeated on the next use.
Optionally, the user can provide a text file containing sample sentences from
which the program will train itself automatically. Finally, the user will also
have controls to export the corpus explicitly and to clear the complete corpus
to begin anew.
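One possible way to save and restore the learned corpus is sketched below in Python, using the standard xml.etree.ElementTree module. The description does not specify the XML layout, so the element name corpus, the element name bigram and its attributes first, second and count are assumptions made for illustration, as are the function names corpus_to_xml and corpus_from_xml.

```python
import xml.etree.ElementTree as ET
from collections import Counter


def corpus_to_xml(bigrams):
    """Serialize a Counter keyed by (first, second) word pairs to XML text."""
    root = ET.Element("corpus")  # hypothetical root element name
    for (first, second), count in sorted(bigrams.items()):
        # One <bigram> element per pair; attribute names are assumptions.
        ET.SubElement(root, "bigram", first=first, second=second,
                      count=str(count))
    return ET.tostring(root, encoding="unicode")


def corpus_from_xml(xml_text):
    """Rebuild the Counter of bigram frequencies from the XML text."""
    bigrams = Counter()
    for node in ET.fromstring(xml_text).iter("bigram"):
        pair = (node.get("first"), node.get("second"))
        bigrams[pair] = int(node.get("count"))
    return bigrams


learned = Counter({("Hello", "world"): 2, ("Hello", "Raju"): 1})
xml_text = corpus_to_xml(learned)      # what would be written at shutdown
restored = corpus_from_xml(xml_text)   # what would be read at next start
print(restored == learned)             # True

learned.clear()                        # "clear the complete corpus"
```

Writing xml_text to a file at shutdown and reading it back at startup would avoid retraining. The same corpus_to_xml call could also back an explicit export control.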