My main area of focus is natural language processing (NLP). I studied for a Master's in Computer Speech, Text and Internet Technology at Cambridge University in 2008, and since then I have worked exclusively in machine learning, mostly in NLP. In recent years I have moved into freelance data science consultancy, focusing on NLP.
I have built NLP pipelines from scratch, and worked on natural language dialogue systems, document classifiers and text-based recommender systems. For these tasks I have used both traditional machine learning techniques and state-of-the-art approaches such as neural networks.
Natural Language Processing technologies that I use
I have worked on a variety of NLP models and techniques, including:
- Bag of words, tf*idf, cosine similarity
- NLP pipelines, lemmatisation, parsers, chunkers
- Deep neural networks
- Convolutional neural networks (for text as well as images)
- Recurrent neural networks (RNN), including LSTM
- Seq2seq, word2vec, doc2vec
- See a live demo of a CNN for author identification
- Clustering and topic modelling: Latent Dirichlet Allocation, which is useful for extracting topics from a set of unstructured documents, for example legal documents, survey responses or factory error reports
- Search engines and search term recommenders
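As a rough illustration of the first item in the list above, here is a minimal sketch of bag-of-words TF-IDF weighting combined with cosine similarity, written in plain Python. The whitespace tokenizer, the smoothed idf formula, the function names and the three sample sentences are all assumptions made for this example; a real project would more likely rely on an established library and a proper tokenizer.

```python
import math
from collections import Counter


def tokenize(text):
    """Lowercase and split on whitespace (a deliberately naive tokenizer)."""
    return text.lower().split()


def tfidf_vectors(documents):
    """Return one {term: weight} dict per document, weighted by tf * idf."""
    tokenized = [tokenize(doc) for doc in documents]
    n_docs = len(tokenized)
    # Document frequency: in how many documents does each term appear?
    doc_freq = Counter(term for tokens in tokenized for term in set(tokens))
    vectors = []
    for tokens in tokenized:
        counts = Counter(tokens)
        vectors.append({
            # Smoothed idf (an assumption of this sketch) keeps the weight
            # positive even for terms that occur in every document.
            term: (count / len(tokens))
            * (math.log((1 + n_docs) / (1 + doc_freq[term])) + 1)
            for term, count in counts.items()
        })
    return vectors


def cosine_similarity(a, b):
    """Cosine of the angle between two sparse vectors stored as dicts."""
    dot = sum(weight * b.get(term, 0.0) for term, weight in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


# Hypothetical sample documents, invented for illustration only.
docs = [
    "the machine stopped because the belt snapped",
    "the belt snapped and the machine stopped",
    "survey respondents asked for better maternity care",
]
vectors = tfidf_vectors(docs)
sim_related = cosine_similarity(vectors[0], vectors[1])
sim_unrelated = cosine_similarity(vectors[0], vectors[2])
```

Here the first two sentences share most of their vocabulary, so `sim_related` comes out much higher than `sim_unrelated`, which is zero because those two sentences have no words in common. Word order is ignored entirely, which is the defining trade-off of a bag-of-words representation.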
NLP software
I work with the following software:
- TensorFlow – an open source library for machine learning, designed primarily for deep neural networks
- Keras – a high-level Python API for building and training neural networks, which runs on top of TensorFlow
- Python NLTK – the Natural Language Toolkit. This is an excellent resource for many common business problems involving text and documents.
- R – a leading language and environment for statistics, allowing you to conduct sophisticated statistical tests, train machine learning models and produce publication-quality graphs.
Examples of past Natural Language Processing projects
NLP projects I have worked on for major household names include:
- a program to identify key opinion leaders from medical publications
- a deep learning model to classify clinical trial protocols by degree of complexity; the classification is then fed into a pricing model, allowing Boehringer Ingelheim to plan the cost of clinical trials
- a dashboard allowing the non-profit White Ribbon Alliance to analyse over 1 million free-text survey responses
- a spoken dialogue system to control a smart home
- an unsupervised text analysis program to analyse text descriptions of manufacturing defects
- a model to classify jobseekers’ CVs into industries and salary bands
- analysis of survey responses