There is a big difference between recognizing natural language and understanding it. Machine Learning has given the impression that it has solved, or will solve, natural language processing, and that dictionaries, syntax, and all the traditional trappings of linguistics are simply not needed. This could not be further from the truth. The solution lies in understanding and, until now, human-like understanding has been impossible to achieve.
Linguistic Facts Tell Us What To Do
It takes children 4 to 14 years to learn a language. English takes 4 years to master and Finnish takes 14. Even with the most sophisticated brain, neural network, and high-quality sensors, it takes humans many years to learn a language. This gives us the most important clue toward our goal of getting computers to read and understand text like humans.
If you put children on a desert island without language, they may develop some primitive gestures and sounds, but they won’t come off the island ten years later speaking what is considered to be a modern, mature human language. Language is a system that has developed over thousands of years, accumulating words, concepts, and complex syntactic systems that children take years to master. It is a learned skill. The ability to read and write has to be taught separately, and that in itself takes many more years to master. Combined with the long journey to acquiring world knowledge, understanding language is a multi-layered, complex task.
If we expect computers to read and understand text, they must exploit the same linguistic and world knowledge as humans. This goes beyond pattern recognition, which is the realm of Machine Learning. You can give Machine Learning all the English text in the world, but it will still not know that grandma is a person and a female, nor be able to use that knowledge in a useful task. It could discover that those two words are related, but if you ask it what "female" means, it will not know. Machine Learning cannot “understand” language by simply looking at text.
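To make the grandma example concrete, here is a minimal Python sketch of hand-encoded knowledge. It is not NLP++ and not how VisualText stores its knowledge base; the names `KB`, `SYNONYMS`, `is_a`, and `attribute` are hypothetical and chosen only for illustration. The point is that the facts "grandma is a person" and "grandma is female" are written down by a human and can then be queried directly.

```python
# Hypothetical, hand-built concept hierarchy (not the NLP++ knowledge base format).
KB = {
    "person": {"parent": None, "attrs": {}},
    "grandparent": {"parent": "person", "attrs": {}},
    "grandmother": {"parent": "grandparent", "attrs": {"gender": "female"}},
}

# Surface words mapped to the concept they name (illustrative entries only).
SYNONYMS = {"grandma": "grandmother", "granny": "grandmother"}


def lookup(word):
    """Map a word to its concept name, falling back to the lowercased word."""
    key = word.lower()
    return SYNONYMS.get(key, key)


def is_a(word, concept):
    """Walk up the parent links to test whether word is a kind of concept."""
    node = lookup(word)
    while node is not None:
        if node == concept:
            return True
        node = KB.get(node, {}).get("parent")
    return False


def attribute(word, name):
    """Return the nearest value of an attribute, inherited from ancestors."""
    node = lookup(word)
    while node is not None and node in KB:
        if name in KB[node]["attrs"]:
            return KB[node]["attrs"][name]
        node = KB[node]["parent"]
    return None


print(is_a("grandma", "person"))
print(attribute("grandma", "gender"))
```

A downstream task could call `is_a` or `attribute` on any word it encounters, which is the kind of use of encoded knowledge the paragraph above describes.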
What Was Needed
Until the turn of the millennium, automating the understanding of human language seemed far from reach. Hand-built systems were cobbled together at universities with dictionaries, syntax rules, and semantics to do a reasonable job of parsing. But these were hard to modify and to extend beyond specific tasks and domains. What is worse, such systems were siloed, usually as a sequence of black boxes, with increasing degradation at each step in the processing. Such systems couldn’t operate in a cohesive or synergistic manner.
What was needed was a “glass box” system, where lexical, syntactic, and semantic processing could all work synergistically with feedback to achieve as accurate a result as possible. Integration with a knowledge base for linguistic, domain, and world knowledge was a critical component. Such a knowledge base could build conceptual representations on the fly for the raw text being processed. A unified programming language to address both the knowledge and the language processing would yield a flexible and extensible development environment.
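The "glass box" idea can be sketched in a few lines of Python. This is an illustrative toy, not the NLP++ architecture: the pass names, the `LEXICON` contents, and the `run` function are all assumptions. It shows only the inspectability aspect, where every pass reads and writes one shared structure and the state after each pass is kept for review; feedback between passes is not modeled.

```python
import re

# Hypothetical mini-lexicon; a real system would use a full dictionary.
LEXICON = {"the": "det", "dog": "noun", "cat": "noun", "chased": "verb"}


def tokenize(state):
    """Lexical pass: split the raw text into lowercase word tokens."""
    state["tokens"] = re.findall(r"[A-Za-z]+", state["text"].lower())


def tag(state):
    """Dictionary pass: attach a part of speech to every token."""
    state["tags"] = [LEXICON.get(t, "unknown") for t in state["tokens"]]


def find_noun_phrases(state):
    """Syntactic pass: match the pattern determiner + noun."""
    toks, tags = state["tokens"], state["tags"]
    state["noun_phrases"] = [
        f"{toks[i]} {toks[i + 1]}"
        for i in range(len(toks) - 1)
        if tags[i] == "det" and tags[i + 1] == "noun"
    ]


PASSES = [tokenize, tag, find_noun_phrases]


def run(text):
    """Run every pass in order, recording a snapshot of the state after each."""
    state = {"text": text}
    trace = []
    for step in PASSES:
        step(state)
        trace.append((step.__name__, dict(state)))
    return state, trace


state, trace = run("The dog chased the cat.")
for name, snapshot in trace:
    print(name, sorted(snapshot))
print(state["noun_phrases"])
```

Because each snapshot is kept, a developer who sees a wrong noun phrase can look at the output of the tagging pass and fix the lexicon entry or the pattern directly.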
NLP++ and VisualText
Such a system was developed between 1998 and 2001. The computer language NLP++ and its IDE, VisualText, were developed in order to construct computer programs that were in effect “Digital Human Readers”. NLP++ allowed for the syntactic matching of patterns in text, the use of linguistic and world knowledge, and the ability to create knowledge on-the-fly while processing.
Unlike Machine Learning, which requires large amounts of text for training, NLP++ relies on humans encoding the linguistic and world knowledge needed to do a task. Several advantages come with such a system:
- A system can be developed with a small corpus of text
- If a problem arises, it is easy to fix because 100% of the code for the system is visible (“glass box”)
- The system can easily be enhanced over time like any computer program
Automation Versus Building
With the advent of Machine Learning (ML), we have been seduced into thinking that everything can be learned automatically by computers. Machine Learning today cannot learn the principal players in court cases and incorporate them in a system to understand court cases. Machine Learning cannot learn that the length of an appendix (organ) is 8 to 10 centimeters and alert a doctor that a radiology report indicates appendicitis. Nor can Machine Learning learn that the son of a mother is also the son of the father in text about genealogy.
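The genealogy case shows what "building" such knowledge can look like. The Python sketch below is a toy under stated assumptions: the triple format, the relation names `son_of` and `married_to`, and the function `infer_sons` are hypothetical, and the rule deliberately ignores step-parents and other real-world complications. It simply encodes by hand the rule that a son of one spouse is also a son of the other.

```python
def infer_sons(facts):
    """Derive new son_of facts from son_of and married_to triples.

    Hand-written rule (a simplifying assumption):
    if X is son_of Y, and Y is married_to Z, then X is son_of Z.
    """
    facts = set(facts)

    # Marriage is symmetric, so index it in both directions.
    spouses = {}
    for relation, a, b in facts:
        if relation == "married_to":
            spouses[a] = b
            spouses[b] = a

    derived = set()
    for relation, child, parent in facts:
        if relation == "son_of" and parent in spouses:
            derived.add(("son_of", child, spouses[parent]))

    # Return only the facts that were not already stated in the text.
    return derived - facts


facts_from_text = {
    ("son_of", "Tom", "Mary"),
    ("married_to", "Mary", "John"),
}
print(infer_sons(facts_from_text))
```

The rule is a single readable function, so when it turns out to be too simple for a given corpus it can be refined in place rather than retrained.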
If, however, we build systems using the linguistic and world knowledge needed to perform a task, we CAN build Digital Human Readers that can understand text at the same level as human beings.
We have the tools. Now we need the wherewithal to build the necessary dictionaries, knowledge bases, and processing algorithms that allow us to use this technology across industry and truly revolutionize our capabilities in getting computers to read and understand text like humans.
Links for NLP++ and VisualText
- Two videos on the HPCC Systems YouTube channel: https://youtu.be/XhoI4rkaKN8, https://youtu.be/qT1nFLflHZw
- Open Source Website: http://visualtext.org
- GitHub repository: https://github.com/VisualText
- VSCode NLP++ Language Extension (IDE for NLP++): http://vscode.visualtext.org
- VisualText YouTube Channel: http://youtube.visualtext.org
- Formal language description: http://wiki.visualtext.org
David de Hilster is a Consulting Software Engineer on the HPCC Systems Development team and has been a computer scientist in the area of artificial intelligence and natural language processing (NLP) working for research institutions and groups in private industry and aerospace for more than 30 years. He earned both a B.S. in Mathematics and an M.A. in Linguistics from The Ohio State University. David has been with LexisNexis Risk Solutions Group since 2015 and his top responsibility is the ECL IDE, which he has been contributing to since 2016. David is one of the co-authors of NLP++, a computer language for processing human language, and of its IDE, VisualText.
David can frequently be found contributing to Facebook groups on natural language processing. In his spare time, David is active in several scientific endeavors and is an accomplished author, artist and filmmaker.