Desert Academy (36)/Interim Report

Problem Definition
Along with many other aspects of human intelligence, language acquisition is not well understood as a computable system in artificial intelligence. To gain insight into this acquisition mechanism from the perspective of a computer, we intend to program a computer to ascertain the morphology, syntax, and semantics of any given language. We will implement techniques similar to those used by children to acquire language, evaluate their usefulness and applicability for various types of languages, and examine the inherent differences between human and computer learning of language. The program will be able to distinguish between languages in different samples of text, determine whether a given sample is a valid sample of language, and understand rudimentary meaning. Although many attempts have been made to build a computer that understands human language, none have fully succeeded in the way that we intend. We hope to understand the algorithmic basis of linguistics so that it can be treated as a computable system.

Problem Solution
We plan to implement the program primarily in Java, using the statistical occurrence of letter bigrams and word bigrams to learn the basic structure of the language, and using word roots and other clues to understand it. We will use statistical mechanisms to determine indicative frequency thresholds for given languages, and compare standard deviations to determine whether two samples of text are in the same language. Next, we will allow the program to determine the syntax of a given language by categorizing word bigrams (frequencies of two consecutive words), as we did with letter bigrams, and creating rudimentary part-of-speech categories for the language. Finally, the computer will learn semantics, using parts of speech and word roots to understand the relationships between individual words and the surface intentions of the speaker. Thus, the computer will be able to understand and possibly construct simple sentences.
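To make the comparison step concrete, here is a minimal sketch in Java (our planned implementation language) of measuring how far apart two bigram-frequency profiles are. The class and method names are our own illustrations, and the root-mean-square distance used here is an assumed stand-in for the statistical comparison we will eventually choose, not the final mechanism.

```java
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

// Illustrative sketch: compare two bigram-frequency profiles.
// A small distance suggests the two text samples are in the same language.
public class ProfileComparison {

    // Root-mean-square difference over every bigram seen in either profile.
    // A bigram absent from a profile is treated as having frequency 0.
    public static double distance(Map<String, Double> a, Map<String, Double> b) {
        Set<String> bigrams = new HashSet<>(a.keySet());
        bigrams.addAll(b.keySet());
        double sum = 0.0;
        for (String bg : bigrams) {
            double d = a.getOrDefault(bg, 0.0) - b.getOrDefault(bg, 0.0);
            sum += d * d;
        }
        return Math.sqrt(sum / bigrams.size());
    }

    public static void main(String[] args) {
        Map<String, Double> english = Map.of("th", 0.04, "he", 0.03);
        Map<String, Double> french = Map.of("le", 0.04, "es", 0.03);
        System.out.println(distance(english, french));
    }
}
```

A sample would then be assigned to whichever reference language minimizes this distance, or rejected as "not a valid language sample" if every distance exceeds a threshold.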

Progress to Date
So far, we have created a program in C++ to measure the relative frequencies of bigrams in a sample of text, and have constructed tables of these frequencies for English and French. First, input files are stripped of punctuation and capitalization to simplify bigram extraction. The program then reads each word in turn into an array, loops through the array checking for each possible combination of two letters, and outputs a color-coded table of relative frequencies. We have also constructed graphs of the relative frequencies for better visualization of the data. We chose the EU Charter as our input data because it has been carefully translated into several languages, and we have identified certain indicative bigrams for English and French.
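Although the working program is in C++, the core counting step can be sketched in Java (our planned implementation language) roughly as follows. Class and method names are our own illustrations, not the actual program.

```java
import java.util.HashMap;
import java.util.Map;

// Illustrative sketch of the letter-bigram frequency step described above.
public class BigramCounter {

    // Strip punctuation and capital letters, as the input files are preprocessed.
    public static String normalize(String text) {
        return text.toLowerCase().replaceAll("[^a-z ]", "");
    }

    // Count every two-letter combination within each word and return
    // relative frequencies (each count divided by the total bigram count).
    public static Map<String, Double> relativeFrequencies(String text) {
        Map<String, Integer> counts = new HashMap<>();
        int total = 0;
        for (String word : normalize(text).split("\\s+")) {
            for (int i = 0; i + 1 < word.length(); i++) {
                String bigram = word.substring(i, i + 2);
                counts.merge(bigram, 1, Integer::sum);
                total++;
            }
        }
        Map<String, Double> freqs = new HashMap<>();
        for (Map.Entry<String, Integer> e : counts.entrySet()) {
            freqs.put(e.getKey(), e.getValue() / (double) total);
        }
        return freqs;
    }

    public static void main(String[] args) {
        Map<String, Double> f = relativeFrequencies("The theory of the thing.");
        // Relative frequency of "th" in this sample (4 of 14 bigrams).
        System.out.println(f.get("th"));
    }
}
```

The resulting map corresponds to one row source for the color-coded table; comparing such maps across input files is what surfaces the indicative bigrams mentioned above.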

Expected Results
When complete, the computer should, given samples of text in different languages, be able to differentiate between the languages. Once the language of each text has been identified, the computer should process the texts of a given language to determine the parts of speech and their arrangement in a sentence, from which it can derive a basic grammatical structure for the language. The computer will then determine common roots among the words, from which, coupled with its knowledge of grammar, it will gain a literal understanding of the general meaning of each sentence, even if it lacks an understanding of the meaning of each individual word. This process will then be repeated for the other languages in the set of texts. If time permits, we hope to advance the program further, creating an algorithm through which the computer will interrelate the meanings of individual sentences to formulate a gestalt understanding of the body of text.
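The first step of this stage, counting word bigrams as raw material for the part-of-speech grouping, can be sketched in Java as follows; the class and method names are our own illustrations, not the planned implementation.

```java
import java.util.HashMap;
import java.util.Map;

// Illustrative sketch: count consecutive word pairs. Words that appear
// in similar bigram contexts would later be grouped into rudimentary
// part-of-speech categories.
public class WordBigrams {

    public static Map<String, Integer> count(String text) {
        String[] words = text.toLowerCase().replaceAll("[^a-z ]", "").split("\\s+");
        Map<String, Integer> counts = new HashMap<>();
        for (int i = 0; i + 1 < words.length; i++) {
            counts.merge(words[i] + " " + words[i + 1], 1, Integer::sum);
        }
        return counts;
    }

    public static void main(String[] args) {
        System.out.println(count("the cat saw the dog"));
    }
}
```

From counts like these, words that tend to follow the same neighbors (for example, whatever follows "the") can be clustered into the rudimentary categories described above.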

Introduction
Hello Team 36. My name is Peter Yanke and I have been involved with the Supercomputing Challenge for 7 years or more. I am an entrepreneur with 20-plus years of experience in computers and Internet-based projects and companies.

Project Notes

This sounds like a very interesting project. I would limit yourselves to the two languages you mentioned and create a more robust model around those two, rather than incorporating any more layers at this point. You have made some good progress, but remember that you want a working model, so if you have to simplify things to achieve that, now is the time to make that decision. Don't lose sight of your expected outcomes and creating reliable results.

Other Considerations
If you have time, it would be nice to see not only the ability to determine language, but also the ability to create a response of some kind in that language. This would give the model a real-world application and purpose.

Face to Face Evaluation
Remember that you have a face-to-face evaluation coming up in February, which should be a scaled-down version of your final presentation, including some working examples if possible. I would recommend having your code available for review as well.

Summary
I do think this project has some exciting real-world applications and look forward to seeing more as you move forward.

Team Comments
Thank you for taking the time to review our report and provide some suggestions. We still hope to create a program that could potentially deal with any language, but we have definitely taken your suggestions into account and are currently focusing primarily on the English language. We also plan to design our program to create a summary of a given text, demonstrating its understanding by creating a response, as per your suggestion.

Thanks again,

Megan Belzner