Natural Language Processing and Text Analysis

My main area of focus is natural language processing (NLP). I studied for a Master's in Computer Speech, Text and Internet Technology at Cambridge University in 2008, and since then I have worked exclusively in machine learning, mostly in NLP. In recent years I have moved into freelance data science consultancy, focusing on NLP.

I have built NLP pipelines from scratch, and worked on natural language dialogue systems, document classifiers and text-based recommender systems. For these tasks I have used both traditional machine learning techniques and state-of-the-art approaches such as neural networks.

Natural Language Processing technologies that I use

I have worked with a variety of NLP models and techniques, including:

Bag of words, tf*idf, cosine similarity
NLP pipelines, lemmatisation, parsers, chunkers
Deep neural networks
- convolutional neural networks (text as well as images)
- RNN, LSTM
- Seq2seq, word2vec, doc2vec
- see a live demo of a CNN for author identification
Clustering: Latent Dirichlet Allocation
- This is useful for extracting topics from a set of unstructured documents, for example legal documents, survey responses, factory error reports, etc.
Search engines and search term recommenders
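
To make the first item in the list above concrete, here is a minimal sketch of bag of words, tf*idf weighting and cosine similarity, written in plain Python with only the standard library. It is an illustration under simplifying assumptions, not code from any of the projects described on this page: the whitespace tokenizer, the unsmoothed idf formula, the function names and the three example sentences are all invented for the example.

```python
import math
from collections import Counter


def tokenize(text):
    """Lowercase and split on whitespace (a deliberately naive tokenizer)."""
    return text.lower().split()


def tfidf_vectors(documents):
    """Return one sparse tf*idf vector (a dict of term -> weight) per document."""
    bags = [Counter(tokenize(doc)) for doc in documents]  # bag of words per document
    n_docs = len(bags)
    # Document frequency: the number of documents that contain each term.
    doc_freq = Counter(term for bag in bags for term in bag)
    vectors = []
    for bag in bags:
        total = sum(bag.values())
        vectors.append({
            # Term frequency times unsmoothed inverse document frequency.
            term: (count / total) * math.log(n_docs / doc_freq[term])
            for term, count in bag.items()
        })
    return vectors


def cosine_similarity(a, b):
    """Cosine of the angle between two sparse vectors stored as dicts."""
    dot = sum(weight * b.get(term, 0.0) for term, weight in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


# Hypothetical example documents, invented for this sketch.
docs = [
    "the machine stopped because the motor overheated",
    "motor overheated and the machine stopped",
    "survey respondents asked for better maternity care",
]
vectors = tfidf_vectors(docs)
sim_related = cosine_similarity(vectors[0], vectors[1])
sim_unrelated = cosine_similarity(vectors[0], vectors[2])
print(sim_related, sim_unrelated)
```

The first two sentences share several terms, so they score higher than the first and third, which share none. In practice a library vectoriser with smoothing and a proper tokenizer would usually replace these hand-written functions.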

Clustering of documents in the topic Natural Language Processing

NLP software

I work with the following software:

TensorFlow – an open source library for machine learning, specially designed for deep neural networks
Keras – a high-level Python library for building and training neural networks, which runs on top of TensorFlow
Python NLTK – the Natural Language Toolkit. This is an excellent resource for many common business problems involving text and documents.
R – the leading statistics language and software, allowing you to conduct sophisticated statistical tests, train machine learning models, and produce stunning graphs.

Examples of past Natural Language Processing projects

NLP projects I have worked on for major household names include

a program to identify key opinion leaders from medical publications
a deep learning model to classify clinical trial protocols by degree of complexity; the classification is then fed into a pricing model, allowing Boehringer Ingelheim to plan the cost of clinical trials
a dashboard allowing the non-profit White Ribbon Alliance to analyse over 1 million free-text survey responses
a spoken dialogue system to control a smart home
an unsupervised text analysis program to analyse text descriptions of manufacturing defects
a model to classify jobseekers’ CVs into industries and salary bands
analysis of survey responses

Thomas Wood, freelance data scientist in London, UK. Speciality area: natural language processing (NLP)
