Exploring NLP with Coursera - Part 1

Recently, I decided to explore the world of Natural Language Processing. After some research online, I found the course provided by the Higher School of Economics on Coursera to be one of the most highly rated and decided to follow it. The course is simply called Natural Language Processing and covers a wide range of topics in NLP, from basic to advanced. I decided to document my learning by enriching the material provided during the course with additional exploration of various subjects that might not have been specifically discussed or presented on the course page.

Ultimately, this series of posts will cover my notes from my learning journey. It will also serve as a good reference point to quickly remind myself, and anyone else for that matter, of the major topics in the course. Please feel free to leave a comment or suggestions on how to improve my note-taking moving forward.

Disclaimer: This series of posts simply represents a distilled version of the content from the course. It acts as a reminder for myself, as well as for anybody who would like a set of notes equivalent to the course.

Let’s get started with Week 1.

Main approaches in NLP

NLP approaches can be summarized as follows:

  1. Rule-based methods
    • Regular expressions
    • Context-free grammars
  2. Probabilistic modeling and machine learning
    • Likelihood maximization
    • Linear classifiers
  3. Deep Learning
    • Recurrent Neural Networks
    • Convolutional Neural Networks
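
As a small taste of the rule-based family, here is a minimal regular-expression sketch in Python (my own illustration, not from the course) that extracts structured fields from a fixed sentence pattern:

```python
# Rule-based extraction: a single regular expression with named groups.
# Real rule-based systems use much richer patterns or full grammars.
import re

pattern = re.compile(
    r"from (?P<origin>[A-Z][\w ]+?) to (?P<destination>[A-Z][\w ]+?) on (?P<date>\d{1,2} \w+)"
)

match = pattern.search("Show me flights from Boston to San Francisco on 12 December")
if match:
    print(match.groupdict())
    # {'origin': 'Boston', 'destination': 'San Francisco', 'date': '12 December'}
```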

Semantic Slot Filling

Semantic slot filling is a problem in Natural Language Processing that describes the process of tagging words or tokens that carry meaning in a sentence in order to make sense of the text. Several approaches are used in the area of semantic slot filling:

  • Context-free grammar (CFG) - This method represents a set of rules that replace non-terminal symbols (symbols or words in a sentence that can be replaced) with terminal ones (symbols or words that can't be replaced). For example, in "Show me flights from Boston[origin] to San Francisco[destination] on 12 December[date]". Since it is rule-based, the grammar needs to be written manually, potentially with the involvement of a linguist. This approach usually has high precision but lower recall.
  • Conditional Random Field (CRF) - To continue with the example of "Show me flights from Boston to San Francisco", we could also build a machine learning model by training on a corpus of tokenized data with features that we prepare ourselves (feature engineering). Examples of such features could be:
    • Is the word capitalized?
    • Is the word in a list of city names?
    • What is the previous word?
    • What is the previous slot?

    Then we could use a model from a class of discriminative models called Conditional Random Fields (CRF). They are nicely explained by Aditya Prasad in his post. They can be a good candidate for NLP tasks because contextual information, i.e. the state of the neighbors, affects the current prediction of the model. In this scenario we essentially maximize the probability of the tag sequence given the text. The high-level formulas are below.

[Image: NLP Coursera - Week 1 - Semantic Slot Filling CRF]
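
For reference, here is the standard linear-chain CRF formulation (my own reconstruction of the high-level idea; the exact notation on the slide may differ). It maximizes the probability of a tag sequence y given the word sequence x:

```latex
p(y \mid x) = \frac{1}{Z(x)} \exp\Big( \sum_{t=1}^{T} \sum_{k} \lambda_k \, f_k(y_{t-1}, y_t, x, t) \Big),
\qquad
Z(x) = \sum_{y'} \exp\Big( \sum_{t=1}^{T} \sum_{k} \lambda_k \, f_k(y'_{t-1}, y'_t, x, t) \Big),
\qquad
\hat{y} = \arg\max_{y} \; p(y \mid x)
```

And a minimal sketch of the feature-engineering idea using the sklearn-crfsuite library (the library choice, the toy gazetteer, and the slot labels are my own assumptions, not the course's):

```python
# Token-level features for a CRF slot tagger; the "previous slot" feature is
# handled internally by the CRF's transition weights.
import sklearn_crfsuite

CITIES = {"boston", "san", "francisco"}  # toy gazetteer

def token_features(tokens, i):
    word = tokens[i]
    return {
        "word.lower": word.lower(),
        "is_capitalized": word[0].isupper(),
        "in_city_list": word.lower() in CITIES,
        "prev_word": tokens[i - 1].lower() if i > 0 else "<BOS>",
    }

tokens = ["Show", "me", "flights", "from", "Boston", "to", "San", "Francisco"]
X = [[token_features(tokens, i) for i in range(len(tokens))]]
y = [["O", "O", "O", "O", "origin", "O", "destination", "destination"]]

crf = sklearn_crfsuite.CRF(algorithm="lbfgs", max_iterations=50)
crf.fit(X, y)
print(crf.predict(X)[0])
```
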
  • Long Short-Term Memory (LSTM) networks - a deep learning approach. In this scenario we simply feed the sequence of words (vectorized, potentially via one-hot encoding) into a neural network with a certain architecture/topology and numerous parameters. A minimal sketch follows below.
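
Here is a minimal sketch of such an LSTM slot tagger, assuming TensorFlow/Keras and hypothetical vocabulary and label sizes (the course does not prescribe an implementation):

```python
# A minimal LSTM slot tagger sketch: each token in the input sequence
# receives a slot label, e.g. O/origin/destination/date.
import numpy as np
import tensorflow as tf

VOCAB_SIZE = 10_000  # hypothetical vocabulary size
NUM_SLOTS = 4        # hypothetical number of slot labels

model = tf.keras.Sequential([
    tf.keras.layers.Embedding(VOCAB_SIZE, 64),             # word id -> dense vector
    tf.keras.layers.Bidirectional(
        tf.keras.layers.LSTM(64, return_sequences=True)),  # one output per token
    tf.keras.layers.Dense(NUM_SLOTS, activation="softmax"),  # slot probabilities
])
model.compile(optimizer="adam", loss="sparse_categorical_crossentropy")

# Toy batch: one padded sentence of 8 word ids and its 8 slot labels.
x = np.random.randint(1, VOCAB_SIZE, size=(1, 8))
y = np.random.randint(0, NUM_SLOTS, size=(1, 8))
model.fit(x, y, epochs=1, verbose=0)
```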

Deep Learning vs Traditional NLP

While Deep Learning performs quite well in many NLP tasks, it's important not to forget about traditional methods. Some reasons are:

  • Traditional methods perform quite well for tasks such as sequence labeling, which is basically a pattern recognition method where a categorical label is assigned to each member of a sequence of observed values (Wikipedia).
  • The word2vec method, which is actually not even deep learning but is inspired by neural networks, shares very similar ideas with distributional semantics methods. Despite not being a deep learning method, word2vec acts like a two-layer “neural net” and vectorizes words (see the small Gensim sketch after this list). A detailed introduction to word2vec is here.
  • We can also use knowledge from traditional NLP to improve deep learning methods and approaches.
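
A tiny word2vec demonstration with Gensim (my own toy corpus, so the learned vectors are not meaningful; shown only to illustrate the idea of vectorizing words):

```python
# Train word2vec on a toy corpus with Gensim and inspect the word vectors.
from gensim.models import Word2Vec

sentences = [
    ["show", "me", "flights", "from", "boston", "to", "san", "francisco"],
    ["mary", "got", "the", "football", "and", "went", "to", "the", "kitchen"],
]
model = Word2Vec(sentences, vector_size=50, window=3, min_count=1, epochs=50)

print(model.wv["football"][:5])                 # first 5 dims of a 50-d vector
print(model.wv.most_similar("boston", topn=3))  # nearest words in vector space
```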

Nevertheless, Deep Learning currently seems to be the future of NLP and will become more common in the years to come.

The next section, “Brief overview of the next weeks”, is skipped since it simply summarizes what’s to come.

Linguistic Knowledge in NLP

Natural Language Processing and understanding is not only about mathematics but also linguistics. Thus it’s important to cover the following NLP pyramid.

To understand the above pyramid, let’s take a sentence as an example. There are multiple stages of analysis of that sentence:

  • Morphology - Morphology is the study of the structure and formation of words; basically, everything that concerns individual words in the sentence. At this stage we care about:
    • forms of words (sawed, sawn, sawing) further reading
    • part-of-speech tags (nouns, pronouns, verbs, etc.)
    • different cases (nominative, accusative, etc.) further reading
    • genders (masculine, feminine) further reading
    • tenses (present, past, future, etc.)
  • Syntax - is about the relationships between words within the sentence. It represents the set of rules, principles, and processes that govern the structure of sentences in a given language, usually including word order. An interesting video summary is here. Simple sentences follow a basic structure: subject - verb - object. For example: “The girl[subject] bought[verb] a book[object]”.
  • Semantics - the next step in the pyramid; it represents the meaning and interpretation of words, signs, and sentence structure. Some resources for further reading: Linguistics 001 and What does semantics study.
  • Pragmatics - the next level of abstraction in the pyramid; it studies the practical meaning of words within various interactional contexts. For more explanation of pragmatics, read here.

Some libraries and tools for exploring each of the stages in the pyramid:

  1. NLTK Library - contains many small and useful datasets with markup, as well as various preprocessing tools (tokenization, normalization, etc.). It also contains some pre-trained models for POS tagging and parsing (see the quick example after this list). A great book for further exploring NLTK and, in general, Python in Natural Language Processing (NLP) is Natural Language Processing with Python.
  2. Stanford Parser - used for syntactic analysis. A natural language parser is a program that works out the grammatical structure of sentences, for instance which groups of words go together (as “phrases”) and which words are the subject or object of a verb.
  3. spaCy - a Python library for text analysis, e.g. for word embeddings and topic modeling. It is designed specifically for production use and helps you build applications that process and “understand” large volumes of text. It can be used to build information extraction or natural language understanding systems, or to pre-process text for deep learning.
  4. Gensim - a Python library for text analysis, unsupervised topic modeling, and NLP. More information is on Wikipedia, and some core concepts are explored here.
  5. MALLET - similar to Gensim but written in Java.
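
A quick taste of the morphology-level tools in NLTK (tokenization and POS tagging); this sketch assumes the relevant NLTK data packages have been downloaded:

```python
# Tokenize a sentence and tag each token with its part of speech using NLTK.
import nltk

nltk.download("punkt", quiet=True)                       # tokenizer models
nltk.download("averaged_perceptron_tagger", quiet=True)  # POS tagger model

sentence = "Show me flights from Boston to San Francisco on 12 December"
tokens = nltk.word_tokenize(sentence)  # split into individual words
print(nltk.pos_tag(tokens))            # e.g. [('Show', 'VB'), ('me', 'PRP'), ...]
```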

To explore further the relationships between words, such as synonyms (two different words with similar meanings) or homonyms (two or more words with the same spelling but different meanings and origins), libraries such as WordNet - a lexical database for English - and BabelNet - the multilingual alternative to WordNet - would be useful.
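
For instance, a minimal sketch of exploring synonyms and hypernyms with WordNet through NLTK (assumes the wordnet corpus has been downloaded):

```python
# Look up a synset for "football" and inspect its synonyms and hypernyms.
import nltk

nltk.download("wordnet", quiet=True)
from nltk.corpus import wordnet as wn

football = wn.synsets("football")[0]  # first sense of "football"
print(football.lemma_names())         # synonyms within the same synset
print(football.hypernyms())           # broader concepts, e.g. a kind of ball
```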

Linguistic knowledge + Deep Learning

In NLP, one of the tasks is reasoning. Let’s say there is a story: “Mary got the football, she went to the kitchen, she left the ball there.” Now we have a question about this story: where is the football? LSTM networks, a particular type of recurrent neural network, could be used in this scenario.

[Image: NLP Coursera - Week 1 - Linguistic Knowledge + Deep Learning]

In the picture above, the red lines (or edges) stand for co-reference, which means that Mary and she represent the same entity - that is, Mary. Meanwhile, the green line represents hypernyms (see Wikipedia); in our case, football is a type of ball.

In summary, knowledge of linguistics allows us to identify a method for tackling an NLP challenge; in our case, by identifying these relationships we could add more edges to our DAG-LSTM approach. DAG-LSTM stands for Directed Acyclic Graph - Long Short-Term Memory Recurrent Neural Network (a mouthful).

Syntax: dependency & constituency trees

Another example of linguistic knowledge used in applied NLP is representing syntax via dependency or constituency trees. The images below show examples. In the case of dependency trees (left image), the sentence is parsed based on the various dependencies present in it (e.g. subject, object, modifier, etc.). In the case of constituency trees, a parser parses the sentence from bottom to top to get a hierarchical structure, where each node represents a syntactic element (e.g. noun, noun phrase, verb, verb phrase). Parsing into a hierarchical structure makes it possible to find named entities (NEs) such as New York City, since NEs are most likely to be noun phrases. The spaCy sketch below illustrates both the dependency relations and the noun phrases.
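
A minimal dependency-parse sketch with spaCy (assumes the small English model has been installed via `python -m spacy download en_core_web_sm`):

```python
# Parse a sentence with spaCy and print dependency relations, noun phrases,
# and named entities.
import spacy

nlp = spacy.load("en_core_web_sm")
doc = nlp("The girl bought a book in New York City")

for token in doc:
    print(token.text, token.dep_, token.head.text)   # word, relation, head word

print([chunk.text for chunk in doc.noun_chunks])     # noun phrases
print([(ent.text, ent.label_) for ent in doc.ents])  # e.g. ('New York City', 'GPE')
```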

[Image: NLP Coursera - Week 1 - Linguistic Knowledge + Deep Learning]

It is also useful in sentiment analysis, where we try to predict the sentiment of a given text. In the image above (3), we can see that a parser similar to the one previously mentioned produces a tree structure whose nodes are individual words, each with a certain sentiment; by looking at the nodes, the overall sentiment can be calculated. This is an example of using recurrent neural networks or DAG networks to predict sentiment, although in practice simple classification is often sufficient.