My main area of focus is natural language processing (NLP). I completed a Master's in Computer Speech, Text and Internet Technology at Cambridge University in 2008, and since then I have worked exclusively in machine learning, mostly in NLP. In recent years I have moved into freelance data science consultancy, focusing on NLP.

I have built NLP pipelines from scratch and worked on natural language dialogue systems, document classifiers and text-based recommender systems. For these tasks I have used both traditional machine learning techniques and state-of-the-art approaches such as neural networks.

Natural Language Processing technologies that I use

I have worked with a variety of NLP models and techniques, including

Clustering of documents by topic: topic detection is an NLP technique that lets you discover common themes in a set of unstructured documents.
  • Bag of words, tf*idf, cosine similarity
  • NLP pipelines, lemmatisation, parsers, chunkers
  • Deep neural networks
  • Clustering: Latent Dirichlet Allocation
    • This is useful for extracting topics from a set of unstructured documents, for example legal documents, survey responses, factory error reports, etc.
  • Search engines and search term recommenders
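
As a rough illustration of the first item in the list above, here is a minimal sketch of bag-of-words tf*idf weighting and cosine similarity in plain Python. The toy documents, the function names and the smoothed idf formula are all assumptions chosen for illustration; this is not code from any client project, and a production pipeline would normally add lemmatisation and use an established library.

```python
import math
from collections import Counter


def tokenize(text):
    """Lowercase and split on whitespace; a real pipeline would also lemmatise."""
    return text.lower().split()


def tfidf_vectors(documents):
    """Return one {term: weight} dict per document using tf * idf weighting."""
    tokenized = [tokenize(doc) for doc in documents]
    n_docs = len(tokenized)
    # Document frequency: the number of documents each term appears in.
    doc_freq = Counter(term for tokens in tokenized for term in set(tokens))
    vectors = []
    for tokens in tokenized:
        counts = Counter(tokens)
        vector = {}
        for term, count in counts.items():
            tf = count / len(tokens)
            # Smoothed idf (an assumption): terms that occur in every
            # document still receive a small positive weight.
            idf = math.log((1 + n_docs) / (1 + doc_freq[term])) + 1
            vector[term] = tf * idf
        vectors.append(vector)
    return vectors


def cosine_similarity(a, b):
    """Cosine of the angle between two sparse {term: weight} vectors."""
    dot = sum(weight * b.get(term, 0.0) for term, weight in a.items())
    norm_a = math.sqrt(sum(weight * weight for weight in a.values()))
    norm_b = math.sqrt(sum(weight * weight for weight in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


# Hypothetical example texts, loosely modelled on defect reports and surveys.
docs = [
    "the pump failed because the seal was worn",
    "worn seal caused the pump to leak",
    "survey respondents asked for longer opening hours",
]
vectors = tfidf_vectors(docs)
sim_related = cosine_similarity(vectors[0], vectors[1])
sim_unrelated = cosine_similarity(vectors[0], vectors[2])
print(f"defect report vs defect report: {sim_related:.2f}")
print(f"defect report vs survey answer: {sim_unrelated:.2f}")
```

The two defect reports share several terms and so score higher than the defect report and the survey answer, which have no words in common. Similarity scores of this kind can feed into document clustering or a simple search engine.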

NLP software

I work with the following tools and libraries

  • TensorFlow – an open source library for machine learning, specially designed for deep neural networks
  • Keras – a high-level Python API for building and training neural networks on top of TensorFlow
  • Python NLTK – the Natural Language Toolkit. This is an excellent resource for many common business problems involving text and documents.
  • R – the leading statistics language and software, allowing you to conduct sophisticated statistical tests, train machine learning models, and produce stunning graphs.

Examples of past Natural Language Processing projects

NLP projects I have worked on for major household names include

  • a program to identify key opinion leaders from medical publications
  • a deep learning model to classify clinical trial protocols by degree of complexity; the classification feeds into a pricing model that allows Boehringer Ingelheim to plan the cost of clinical trials
  • a dashboard allowing the non-profit White Ribbon Alliance to analyse over 1 million free-text survey responses
  • a spoken dialogue system to control a smart home
  • an unsupervised text analysis program to analyse text descriptions of manufacturing defects
  • a model to classify jobseekers’ CVs into industries and salary bands
  • analysis of survey responses