Freelance data scientist hourly rate

How much is a freelance data scientist’s hourly rate? There is no simple answer and it doesn’t make much sense to talk about average rates. The main factors determining a freelance data scientist’s hourly rate are location, speciality and experience level, not to mention the data scientist’s business acumen and negotiation skills.

It can be hard to find out exactly what hourly rate a freelance data scientist can charge because freelance work is not always advertised and when it is, the rate is usually negotiable.

How do qualifications and experience level affect a freelance data scientist’s hourly rate?

A freelance data scientist’s hourly rate varies according to where they lie on the scale from novice to expert consultant

Let’s imagine a number of scenarios. I am quoting some typical figures for UK- or US-based freelance data scientist hourly rates but the numbers would vary by location.

Novice freelance data scientist hourly rate

A novice freelance data scientist who is fresh out of university can probably use online marketplaces such as Upwork to apply for small freelance data science gigs, and will be competing with people of similar experience levels worldwide. The competition will include people in low-income countries and people who have learnt data science on a bootcamp course rather than a degree programme, and the clients may not always know how to distinguish qualified freelancers from unqualified, so there will be a race to the bottom on novice freelance data scientist hourly rates. You can read my post on how to become a freelance data scientist for more information about joining Upwork and other marketplaces, and what to expect.

My Upwork profile. You can put an hourly rate on your Upwork profile.

Hourly rates are advertised on Upwork, and a novice freelance data scientist can most likely charge an hourly rate of $50. Most of the gigs they take on will be relatively low-level, perhaps mundane tasks in Python rather than cutting-edge machine learning, and this is reflected in an inexperienced freelance data scientist’s hourly rate. However, there are some highly paid expert jobs, mainly under the “UK only” or “US only” filter, which pay a market rate better suited to your location.

Freelance data scientist hourly rates averaged around $80/hour for contracts in Natural Language Processing on Upwork in 2020. There were a few very highly-paid outliers, but the majority of Upwork contracts which I could find were in the <$50/hour range.

Moderately experienced freelance data scientist hourly rate

A freelance data scientist with a moderate amount of experience can charge a slightly higher hourly rate. Imagine a freelance data scientist with a Masters or PhD and a few years of experience working in small, little-known companies.

A moderately experienced freelance data scientist can use their connections and network on LinkedIn to find more challenging work. Perhaps they will have a degree of experience in a particular industry and they can leverage that. A moderately experienced freelance data scientist could charge an hourly rate of about $100.

I have found a few interesting freelance jobs simply from in-person and online networking. For example, a university alumnus might recommend a colleague who has an interesting NLP problem that is right up my street. These kinds of leads can be very valuable when they come in.

Contractor freelance data scientist hourly rate

A freelance data scientist with a decade or more of experience can charge much higher hourly rates. Let us imagine a freelance data scientist who has been working in machine learning since before the term “data science” was widely used. This person has worked across a range of industries and prefers to work in comfortable long-term contracts for a single client, alongside the client’s permanent employees. The rates for contracting are good and a contractor freelance data scientist can charge hourly rates up to $200.

In the UK it’s possible to get some feel for what hourly rates a freelance data scientist can charge, simply by looking at the advertised rates for contractor roles.

I’ve taken a sample of 54 data science contract positions in natural language processing which were advertised in London in 2018-2020 on a variety of marketplaces. I have taken the daily or hourly rates which were either advertised or stated by the recruiter when I spoke on the phone. You can see that there is a lot of variation but the contracted hourly rates averaged around $115/hour. In general, freelance data science contracts in natural language processing or computer vision tended to pay higher hourly rates than contracts in general data science, and the hourly rates are also higher than the rates for employment.

It should be noted that I avoided startups and small companies when choosing the advertised contract roles, so these are generally the higher paying roles on the UK market.

Freelance data scientist hourly rates averaged around $115/hour for contracts in Natural Language Processing in London in 2018-2020, however there was a lot of variation.

Expert consultant freelance data scientist hourly rate

Finally, there are the consultants. A consultant data scientist does not apply for jobs or gigs on any kind of marketplace, but rather uses networking and even direct sales pitches. This person operates effectively as a small company and may compete against small and medium-sized consultancies, and this is reflected in the consultant freelance data scientist’s rates.

The consultant may have two decades of experience, will have written a series of books and may be an in-demand speaker or lecturer. A consultant will help companies with their long term data strategy, rather than simply complete a pre-defined task. Large companies know to look for the consultant for difficult problems.

An expert consultant freelance data scientist does not have an explicit hourly rate, but charges clients on a per-project basis. They may waive fees if a project does not deliver the expected result. The sales process may involve a series of presentations and unpaid proof-of-concepts, and the consultant may enlist a sales representative to help with sales. Working from the fixed price charged to clients, the expert consultant freelance data scientist has an effective hourly rate of $500 and up.
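To make the arithmetic behind an effective hourly rate concrete, here is a minimal sketch. The function name and all the figures are hypothetical and chosen purely for illustration; they are not real project data. The point is that unpaid sales time counts towards the hours over which the fixed fee is spread.

```python
def effective_hourly_rate(project_fee, delivery_hours, unpaid_sales_hours=0.0):
    """Fixed project fee divided by all hours spent, paid or unpaid."""
    total_hours = delivery_hours + unpaid_sales_hours
    if total_hours <= 0:
        raise ValueError("total hours must be positive")
    return project_fee / total_hours


# Hypothetical example: a $60,000 fixed-price project taking 100 hours to
# deliver, plus 20 unpaid hours of presentations and proof-of-concept work.
rate = effective_hourly_rate(60_000, delivery_hours=100, unpaid_sales_hours=20)
print(f"Effective hourly rate: ${rate:.0f}/hour")
```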

How does a freelance data scientist’s hourly rate vary by location?

Location is a very important factor in determining a freelance data scientist’s hourly rate. In general, the USA, especially the West Coast, has the highest demand and highest rates. The geographical variation in data scientist salaries and rates is dampened slightly by the fact that a lot of work can be done remotely, however the rates still vary hugely between locations.

I was unable to obtain comprehensive data on freelance data scientists’ hourly rates between countries, so as a proxy I have used permanent data scientists’ reported salaries to estimate the geographical variation in four countries.

We can see that North America has higher rates than Europe across the board. Within countries there is also considerable variation, with cities paying considerably more. In addition, the wealth of a country is not necessarily a predictor of the hourly rate. Data science is a very important field in the USA, but less so in, for example, France, which is much more conservative when it comes to technology.

I suspect also that larger countries such as the USA, UK and Germany tend to have more demand for data scientists than wealthy smaller countries, because companies in large countries have huge customer bases and larger datasets to work with. Within Europe I have definitely found a divide between north and south, with the UK, Germany, the Netherlands and Scandinavia having very highly paid freelance and permanent data science jobs on offer, while southern European countries such as Spain and Italy do not have such a high demand for data science services and consequently pay much lower freelance data science hourly rates.

Estimated data scientist hourly rates in four countries. Data from payscale.com.

How does a freelance data scientist’s hourly rate vary by speciality?

In recent years, demand has grown for data science specialists who can work with more generalist data scientists and analysts. For example, experts in natural language processing, computer vision, and deep learning libraries are in high demand, and can consequently charge hourly rates much higher than a data scientist who works with the basic toolkit in Scikit-Learn. I would venture to say that freelance data scientists in these niche areas could charge double the hourly rates of their generalist counterparts, although I do not currently have data to back this up.

Conclusion

There are many factors affecting a freelance data scientist’s hourly rate.

First of all, the data scientist’s location is an important factor, with many companies in the US preferring to hire US-based freelancers, even if the work is completely remote.

Secondly, freelancers with an in-demand speciality such as natural language processing can increase their hourly rates accordingly, as they do not need to compete with so many people.

Thirdly, the freelance data scientist must know how to negotiate rates, and find clients directly. If the freelancer uses marketplaces such as Upwork, the competition will force a race to the bottom on cost, and the freelancer will wish to avoid this.

Fast Stylometry Tutorial

Click to try the live demo of the Forensic Stylometry library faststylometry.

I’m introducing a Python library I’ve written, called faststylometry, which allows you to compare authors of texts by their writing style. The science of comparing authors’ fingerprints in this way is called forensic stylometry.

You will need a basic knowledge of Python to run this tutorial.

The faststylometry library uses the Burrows’ Delta algorithm, a well-known stylometric technique. The library can also calculate the probability that two books were by the same author.

I wrote this library to improve my understanding, and also because the existing libraries I could find were focused around generating graphs but did not go as far as calculating probabilities.

Here I am giving a walkthrough of how to conduct a basic stylometry analysis using the faststylometry library. We will test the Burrows’ Delta code on two “unknown” texts: Sense and Sensibility by Jane Austen, and Villette by Charlotte Brontë. Both authors are in our training corpus.

Graph showing the proximities of authors’ writing styles calculated using Principal Component Analysis (PCA).

Burrows’ Delta algorithm

The Burrows’ delta is a statistic which expresses the distance between two authors’ writing styles. A high number like 3 implies that the two authors are very dissimilar, whereas a low number like 0.2 would imply that two books are very likely to be by the same author. Here is a link to a useful explanation of the maths and thinking behind Burrows’ Delta and how it works.

The Burrows’ delta is calculated by comparing the relative frequencies of function words such as “inside”, “and”, etc, in the two texts, taking into account their natural variation between authors.
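The calculation described above can be sketched in a few lines of plain Python. This is a simplified illustration and not the faststylometry implementation: the function names, the three toy “texts” and the three-word vocabulary are my own assumptions, and I have used the population standard deviation for the z-scores. Real analyses use entire books and a larger vocabulary.

```python
from collections import Counter
from statistics import mean, pstdev


def relative_frequencies(tokens, vocab):
    """Relative frequency of each vocabulary word in one text."""
    counts = Counter(tokens)
    return [counts[word] / len(tokens) for word in vocab]


def burrows_delta(freqs_a, freqs_b, corpus_freqs):
    """Mean absolute difference of the two texts' z-scores over the vocabulary.

    corpus_freqs holds one frequency list per reference text and supplies the
    standard deviation used to z-score each word.
    """
    z_differences = []
    for i in range(len(freqs_a)):
        sigma = pstdev([freqs[i] for freqs in corpus_freqs])
        if sigma == 0:
            continue  # a word with no variation tells us nothing
        # The corpus mean cancels out when two z-scores are subtracted.
        z_differences.append(abs(freqs_a[i] - freqs_b[i]) / sigma)
    return mean(z_differences)


# Toy data, invented for illustration only.
vocab = ["the", "and", "of"]
texts = {
    "a": "the cat and the dog and the bird".split(),
    "b": "the sum of the parts of the whole".split(),
    "c": "the cat and the dog of the house".split(),
}
freqs = {name: relative_frequencies(tokens, vocab) for name, tokens in texts.items()}
corpus = list(freqs.values())

print("delta(a, c):", burrows_delta(freqs["a"], freqs["c"], corpus))
print("delta(a, b):", burrows_delta(freqs["a"], freqs["b"], corpus))
```

In this toy example text “a” comes out closer to “c” than to “b”, because “a” and “c” use “and” and “of” at more similar rates.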

Installing faststylometry

If you’re using Python, you can install the library with the following command:

pip install faststylometry

The Jupyter notebook for this walkthrough is here.

Loading the library

First, we start up Python and import the stylometry library:

from faststylometry import Corpus
from faststylometry import load_corpus_from_folder
from faststylometry import tokenise_remove_pronouns_en
from faststylometry import calculate_burrows_delta
from faststylometry import predict_proba, calibrate, get_calibration_curve

The library depends on NLTK (the Natural Language Toolkit), so the first time that you are using it, you may need to run the following commands in your Python environment if you want to use the inbuilt tokeniser:

import nltk
nltk.download("punkt")

Loading some books into a corpus

I have provided some test data for you to play with the library, which you can download from the project Github here. It’s a selection of classic literature from Project Gutenberg, such as Jane Austen and Charles Dickens. Due to copyright, I cannot provide more modern books, but you can always obtain them elsewhere.

If you are using Git, you can download the sample texts with this command:

git clone https://github.com/fastdatascience/faststylometry

Make sure the book texts are in the folder faststylometry/data/train on your computer, and each file is named “author name”_-_”book title”.txt, for example:

The training data supplied in the repository, a selection of English literature from Project Gutenberg, with structured filenames.
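As an illustration of the naming convention, here is a small sketch that splits a filename of this form into its author and title parts. The helper name and the example filename are hypothetical, and this is not the code the library itself uses to read the folder.

```python
from pathlib import Path


def parse_book_filename(filename):
    """Split 'author_-_title.txt' into (author, title)."""
    stem = Path(filename).stem  # drop the directory and the .txt extension
    author, separator, title = stem.partition("_-_")
    if not separator:
        raise ValueError(f"expected 'author_-_title.txt', got {filename!r}")
    return author, title


# Hypothetical example filename following the convention described above.
print(parse_book_filename("faststylometry/data/train/austen_-_pride_and_prejudice.txt"))
```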

You can now load the books into the library, and tokenise them using English rules:

train_corpus = load_corpus_from_folder("faststylometry/data/train")

train_corpus.tokenise(tokenise_remove_pronouns_en)

Alternatively, you can add books to your corpus using the following process:

corpus = Corpus()
# book_text is a string containing the whole text of the book
corpus.add_book("Jane Austen", "Pride and Prejudice", book_text)

Finding the authorship of an unknown book

I have also provided some “unknown” books for us to test the performance of the algorithm. Imagine we have come across a work for which the author is unknown. The books I’ve included are Sense and Sensibility, written by Jane Austen (but marked as “janedoe”), and Villette, written by Charlotte Brontë, which I have marked as “currerbell”, Brontë’s real pseudonym. They are in the folder faststylometry/data/test.

The two test documents which I have included in the library. “Currer Bell” is really Charlotte Brontë and “Jane Doe” is really Jane Austen.

Here is the code to load the unknown documents into a new corpus, and tokenise it so that it is ready for analysis:

# Load Sense and Sensibility, written by Jane Austen (marked as "janedoe")
# and Villette, written by Charlotte Brontë (marked as "currerbell", Brontë's real pseudonym)

test_corpus = load_corpus_from_folder("faststylometry/data/test", pattern=None)
# You can set pattern to a string value to just load a subset of the corpus.

test_corpus.tokenise(tokenise_remove_pronouns_en)

Calculate Burrows’ Delta for both unknown texts

Now we have a training corpus consisting of known authors, and a test corpus containing two “unknown” authors. The library will give us the Burrows’ Delta statistic as a matrix (Pandas dataframe) for both unknown texts (x-axis) vs all known authors (y-axis):

calculate_burrows_delta(train_corpus, test_corpus, vocab_size = 50)

We can see that the lowest values in each column, so the most likely candidates, are Brontë and Austen – who are indeed the true authors of Villette and Sense and Sensibility.

author        currerbell – villette    janedoe – sense_and_sensibility
austen        0.997936                 0.444582
bronte        0.521358                 0.933160
carroll       1.116466                 1.433247
conan_doyle   0.867025                 1.094766
dickens       0.800223                 1.050542
swift         1.480868                 1.565499
It’s possible to take a peek and see which tokens are being used for the stylometric analysis:

Preview of the 50 tokens used for our stylometric analysis. Note that these are all function words, excluding pronouns (which could obviously vary between first- and third-person writing styles and not be indicative of authorship).

Calibrate the model and calculate the probability of each candidate in the training set being the author

The Burrows’ delta statistic above can be a little hard to interpret, and sometimes what we would like is a probability value. How likely is Jane Austen to be the author of Sense and Sensibility?

We can do this by calibrating the model. The model looks at the Burrows’ delta values between known authors and works out which values most commonly indicate same authorship:

calibrate(train_corpus)

After calling the calibrate method, we can now ask the model to give us the probabilities corresponding to the delta values in the above table:

predict_proba(train_corpus, test_corpus)

You can see that we now have a 76% probability that Villette was written by Charlotte Brontë.

author        currerbell – villette    janedoe – sense_and_sensibility
austen        0.324233                 0.808401
bronte        0.757315                 0.382278
carroll       0.231463                 0.079831
conan_doyle   0.445207                 0.246974
dickens       0.510598                 0.280685
swift         0.067123                 0.049068

As an aside: by default the library uses Scikit-Learn’s Logistic Regression to calculate the calibration curve of the model. Alternatively, we can tell it which model to use by supplying an argument to the calibrate method.

Plot the calibration curve

We can plot the calibration curve for a range of delta values:

import numpy as np
import matplotlib.pyplot as plt
x_values = np.arange(0, 3, 0.1)
plt.plot(x_values, train_corpus.probability_model.predict_proba(np.reshape(x_values, (-1, 1)))[:,1])
plt.xlabel("Burrows delta")
plt.ylabel("Probability of same author")
plt.title("Calibration curve of the Burrows Delta probability model\nUsing Logistic Regression with correction for class imbalance")
Calibration curve of the Burrows Delta probability model, using Logistic Regression with correction for class imbalance

We can see that a value of 0 for delta would correspond to a near certainty that two books are by the same author, while a value of 2 corresponds to a near certainty that they are by different authors.
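To see why the curve has this shape, note that a logistic regression on a single feature is a straight line in log-odds space. As a rough sketch, we can recover the slope and intercept of that line from two (delta, probability) pairs in the Villette columns of the two tables earlier in this tutorial, and check the result against a third row. This assumes the default one-feature logistic model; the helper names are my own.

```python
import math


def logit(p):
    """Log-odds of a probability."""
    return math.log(p / (1.0 - p))


def logistic(x):
    """Inverse of logit: maps log-odds back to a probability."""
    return 1.0 / (1.0 + math.exp(-x))


# (delta, probability) pairs for Villette, read off the two tables.
austen = (0.997936, 0.324233)
bronte = (0.521358, 0.757315)

# Two points are enough to fix a straight line in log-odds space.
slope = (logit(austen[1]) - logit(bronte[1])) / (austen[0] - bronte[0])
intercept = logit(bronte[1]) - slope * bronte[0]

# Check against a third row: Swift has delta 1.480868 and probability 0.067123.
swift_estimate = logistic(intercept + slope * 1.480868)
print(f"slope = {slope:.2f}, intercept = {intercept:.2f}")
print(f"recovered probability for Swift = {swift_estimate:.3f}")
```

The slope is negative, so the probability of same authorship falls as the delta grows, which is the behaviour shown in the calibration curve.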


Plot the ROC curve

The ROC curve and the AUC metric are useful ways of measuring the performance of a classifier.

The Burrows’ Delta method, when used as a two-class text classifier (different author vs. same author), has an incredibly easy task, because it has learnt from entire books. So we would expect the classifier to perform very well.

We can perform the ROC evaluation using cross-validation. The calibration code above has taken every book out of the training corpus in turn, trained a Burrows model on the remainder, and tested it against the withheld book. We take the probability scores resulting from this, and calculate the ROC curve.

An AUC score of 0.5 means that a classifier is doing no better than random guessing, and 1.0 is a perfect score. Let’s see how well our model performs.

First, let’s get the ground truths (False = different author, True = same author) and Burrows’ delta values for all the comparisons that can be made within the training corpus:

ground_truths, deltas = get_calibration_curve(train_corpus)
The ground truths are Boolean values and the deltas are typically between 0 and 3. These delta values cannot be used directly to compute a ROC curve, since for that we would need probabilities.

We get the probabilities of each comparison the model has made by putting the Burrows’ delta values back through the trained Scikit-Learn model:

probabilities = train_corpus.probability_model.predict_proba(np.reshape(deltas, (-1, 1)))[:,1]
The Scikit-Learn classifier has transformed the delta values into convenient probability values, lying between 0 and 1, which can be used to calculate the ROC curve.

We can put the probabilities and ground truths into Scikit-Learn to calculate the ROC curve:

from sklearn.metrics import roc_curve, auc
fpr, tpr, thresholds = roc_curve(ground_truths, probabilities)

We can now calculate the AUC score. If our model is good at identifying authorship, we should see a number close to 1.0.

roc_auc = auc(fpr, tpr)
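For intuition about what this number means: the AUC equals the probability that a randomly chosen same-author comparison receives a higher score than a randomly chosen different-author comparison. The following sketch computes it that way in plain Python. The function name and the toy labels and scores are invented for illustration and are not taken from the training corpus.

```python
def auc_by_ranking(labels, scores):
    """Fraction of (same-author, different-author) pairs ranked correctly.

    Ties count as half a correct pair.
    """
    positives = [s for label, s in zip(labels, scores) if label]
    negatives = [s for label, s in zip(labels, scores) if not label]
    if not positives or not negatives:
        raise ValueError("need at least one example of each class")
    correct = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                correct += 1.0
            elif p == n:
                correct += 0.5
    return correct / (len(positives) * len(negatives))


# Toy labels and scores, invented for illustration.
toy_truths = [True, True, False, False, False]
toy_scores = [0.9, 0.4, 0.5, 0.2, 0.1]
print(auc_by_ranking(toy_truths, toy_scores))
```

In the toy data, five of the six possible pairs are ranked correctly, so the score is 5/6.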

Finally, we can plot the ROC curve:

plt.figure()
plt.plot(fpr, tpr, color='darkorange',
         lw=2, label='ROC curve (area = %0.4f)' % roc_auc)
plt.plot([0, 1], [0, 1], color='navy', lw=2, linestyle='--')
plt.xlim([0.0, 1.0])
plt.ylim([0.0, 1.05])
plt.xlabel('False Positive Rate')
plt.ylabel('True Positive Rate')
plt.title('Receiver operating characteristic curve of the Burrows\' Delta classifier\noperating on entire books')
plt.legend(loc="lower right")
plt.show()
The ROC curve of the Burrows’ delta classifier trained with cross-validation within the training set. Note the surprisingly high AUC, 0.9974, indicating that on this training set (entire books), we have something close to a perfect classifier.

Segment the corpus and display the various books in a scatter graph using Principal Component Analysis

We can also visualise the stylistic similarities between the books in the training corpus, by calculating their differences and using Principal Component Analysis (PCA) to display them in 2D space.

For this, we need to use the Python machine learning library Scikit-Learn.

from sklearn.decomposition import PCA
import re
import pandas as pd

We can re-load the training corpus, and take segments of 80,000 words, so that we can include different sections of each book in our analysis.

# Reload the training corpus as the "test corpus", re-tokenise it, and segment it this time
test_corpus = load_corpus_from_folder("faststylometry/data/train")
test_corpus.tokenise(tokenise_remove_pronouns_en)
split_test_corpus = test_corpus.split(80000)
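Conceptually, segmenting means cutting each book’s token list into consecutive chunks of a fixed length. Here is a minimal sketch of that idea in plain Python. It is not the library’s own split method, and keeping the shorter final chunk is an assumption of this sketch rather than a statement about how faststylometry behaves.

```python
def split_into_segments(tokens, segment_length):
    """Cut a token list into consecutive chunks of segment_length tokens.

    The final chunk may be shorter; this sketch keeps it.
    """
    if segment_length <= 0:
        raise ValueError("segment_length must be positive")
    return [
        tokens[start:start + segment_length]
        for start in range(0, len(tokens), segment_length)
    ]


# Toy example with a segment length of 4 instead of 80,000.
toy_tokens = "it is a truth universally acknowledged that a single man".split()
for segment in split_into_segments(toy_tokens, 4):
    print(segment)
```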

Now we calculate the Burrows’ delta statistic on the book segments:

df_delta = calculate_burrows_delta(train_corpus, split_test_corpus)

We are interested in the array of z-scores

df_z_scores = split_test_corpus.df_author_z_scores
The Burrows’ delta is calculated from z-scores of every common word in the vocabulary, which represent the fingerprint of the book and can be compared across books.

The above array is too big to display directly, so we need to reduce it to two-dimensional space to show it in a graph. We can do this using Scikit-Learn’s principal component analysis model, setting it to 2 dimensions:

pca_model = PCA(n_components=2)
pca_matrix = pca_model.fit_transform(df_z_scores)
Matrix of PCA coordinates for the books in our split test set, derived using PCA from the 50-dimensional Z-scores that were calculated for each book.

It would be nice to plot the book sections on a graph, using the same colour for every book by the same author. Since the Z-scores matrix is indexed by author and book name, we can use a regex to take everything before the first hyphen. This gives us the plain author name with the book title removed:

authors = df_z_scores.index.map(lambda x : re.sub(" - .+", "", x))
The list of the authors’ names for all the book segments.
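To see what the regex does in isolation, here is a small standalone sketch. The index labels are hypothetical examples in “author - title” form, not values copied from the actual dataframe.

```python
import re

# Hypothetical index labels in "author - title" form.
toy_labels = [
    "austen - pride_and_prejudice",
    "conan_doyle - the_lost_world",
    "swift",  # no separator, so the label is left unchanged
]

# " - .+" matches from the first " - " to the end of the string.
toy_authors = [re.sub(" - .+", "", label) for label in toy_labels]
print(toy_authors)
```

Because the match starts at the first “ - ”, a title that itself contains “ - ” would still be removed in full.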

We can join the PCA-derived coordinates and the author names together into one dataframe:

df_pca_by_author = pd.DataFrame(pca_matrix)
df_pca_by_author["author"] = authors

Now we can plot the individual books on a single graph:

plt.figure(figsize=(15,15)) 

for author, pca_coordinates in df_pca_by_author.groupby("author"):
    plt.scatter(*zip(*pca_coordinates.drop("author", axis=1).to_numpy()), label=author)
for i in range(len(pca_matrix)):
    plt.text(pca_matrix[i][0], pca_matrix[i][1],"  " + df_z_scores.index[i], alpha=0.5)

plt.legend()

plt.title("Representation using PCA of works in training corpus")
Graph of the books in the training corpus, segmented into 80,000 tokens, with Burrows’ delta Z-scores calculated for the top 50 words, and reduced to 2 dimensions using PCA.

Taking it further

The code for the ROC/AUC means that we can try out different parameters, changing the vocabulary size, document length, or preprocessing steps, to see how this affects the performance of the Burrows’ delta method.

I would be interested to find out how the delta performs on other languages, and if it would be beneficial to perform morphological analysis as a preprocessing step in the case of inflected languages. For example, in the Turkish word “teyzemle” (with my aunt), should this be treated as teyze+m+le, with the two suffixes input separately?

You can try this kind of experiment by replacing the function tokenise_remove_pronouns_en with a tokenisation function of your choice. The only constraint is that it must convert a string to a list of strings.
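As a starting point, here is a minimal sketch of a custom tokenisation function that satisfies this constraint. The function name is hypothetical, and it does no morphological analysis: it simply lower-cases the text and splits it into words.

```python
import re


def tokenise_lowercase_words(text):
    """Hypothetical tokeniser: lower-case the text and return its words."""
    # \w+ matches runs of letters, digits and underscores, including
    # accented and non-Latin letters, so "teyzemle" stays one token.
    return re.findall(r"\w+", text.lower())


print(tokenise_lowercase_words("Teyzemle geldim."))
```

You would then pass this function to the corpus’s tokenise method in place of tokenise_remove_pronouns_en. A morphological variant could return “teyze”, “m” and “le” as separate strings instead.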
