
How to deploy your Python project on Databricks

20 Apr, 2022

In recent years, Databricks has become an increasingly popular environment for machine learning projects. In the last year or so it has become available on all the major clouds, putting it within reach of nearly every data science team working in the cloud.

One of the core features of Databricks is its notebooks. They make setting up an interactive working environment with large datasets on a Spark cluster a matter of minutes: create a notebook, attach a cluster and you are good to go. But that ease comes with a downside: notebooks in general are not the best environment to develop quality code in. Writing unit tests for code in a notebook is difficult, and often skipped. And since Databricks even allows scheduling notebooks as jobs, notebooks that were written without much structure (sometimes without even placing the functionality in functions) and have never been properly tested end up as production jobs. Such ‘production notebooks’ make my inner engineer shiver.

So if scheduling your notebook is not the right way, how should you deploy production jobs on Databricks?

Step 1: Create a package

The first step is to create a Python package. Structure your code in short functions, group these in (sub)modules, and write unit tests. If necessary, create mock data to test your data wrangling functionality. Add a pre-commit hook that runs linting and type checking, for example with packages like pylint, black, flake8 and/or mypy. Such an environment allows you to keep your code quality high and prevents many mistakes.
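
To make this concrete, here is a minimal sketch of what such a small, testable function and its unit test could look like. It assumes pytest and pyspark are installed locally; the function and column names are purely illustrative.

from pyspark.sql import DataFrame, SparkSession
import pyspark.sql.functions as F


def add_revenue(df: DataFrame) -> DataFrame:
    """Add a revenue column computed as price * quantity."""
    return df.withColumn("revenue", F.col("price") * F.col("quantity"))


def test_add_revenue():
    # A tiny local Spark session with mock data is enough; no cluster needed.
    spark = SparkSession.builder.master("local[1]").getOrCreate()
    df = spark.createDataFrame([(2.0, 3)], ["price", "quantity"])
    assert add_revenue(df).first()["revenue"] == 6.0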

Next up is the entry point: a single function that will be the starting point of your job. This function typically parses any arguments that were passed to the job and then calls the appropriate functionality from your package. Argument parsing can be done old-school with argparse or with tools like click or typer. Finally, you will need to register this function as a console_scripts entry point in your setup.cfg.
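
As an illustration, a minimal entry point built around argparse could look like the sketch below; the run_pipeline function and the argument names are placeholders for whatever your package actually exposes.

import argparse

from yourpackage.pipeline import run_pipeline  # hypothetical module inside your package


def main() -> None:
    """Parse the job arguments and hand them over to the actual package code."""
    parser = argparse.ArgumentParser(description="Run the yourpackage job.")
    parser.add_argument("--date", required=True, help="Date to process, e.g. 2022-04-20")
    parser.add_argument("--env", default="dev", help="Target environment")
    args = parser.parse_args()
    run_pipeline(date=args.date, env=args.env)

Assuming this function lives in yourpackage/main.py, the corresponding registration in setup.cfg would be a console_scripts line like yourentrypoint = yourpackage.main:main under [options.entry_points].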

You can find a practical guide to creating a Python package and registering an entry point here.

Step 2: Create a wheel

Once you have created your package and you have created the entry point, the next step is to create a wheel. This is required to install your code on a Databricks cluster:

python -m build . --wheel

This will create a file named something like yourpackage-version-py3-none-any.whl in a directory called dist/. That file contains all the code in the package and the appropriate metadata.

Now you are ready to upload this wheel to DBFS. If you haven’t already, install and configure the Databricks CLI. Then uploading your wheel is a matter of running

databricks fs cp dist/<…>.whl dbfs:/some/place/appropriate 

And you are all set!

If you want to interactively use the functionality from your wheel in a Databricks notebook, you can run

%pip install --force-reinstall "/dbfs/path/to/yourpackage-version-py3-none-any.whl"

And subsequently import anything from your package with

from yourpackage import … 

But please – do not even think about running that in production!

Alternative: PyPI

Even better than uploading your wheel to DBFS would be to upload it to a private PyPI (Python Package Index) such as Artifactory. This would require you to have such an index available – and setting one up is no piece of cake. If you do have one available, you can simply replace the whl key in the json files below by something like "pypi": { "package": "yourpackage", "repo": "https://my-pypi-mirror.com" }.

Step 3: Define a job

Once your package is built, it is time to define your job. You will need to create a json file that describes your job. Here is a minimal example to get you started:

{
    "name": "my-job",
    "existing_cluster_id": "1234-567890-reef123",
    "libraries": [
        {"whl": "dbfs:/path/to/yourpackage-version-py3-none-any.whl"}
    ],
    "python_wheel_task": {
        "package_name": "yourpackage",
        "entry_point": "yourentrypoint",
        "parameters": ["some", "parameters"]
    }
}

This file specifies that your wheel has to be installed on the cluster in order to run the job, and that the job consists of running the entry point from your package with the given parameters.

Unfortunately, good, complete and legible documentation on the exact format of this file is hard to find. As an alternative, you can create a job using the web UI and then look up its json definition.

A common change with respect to the example above is to not specify an existing cluster, but to create a new cluster every time the job is launched:

{
    "name": "my-job",
    "new_cluster": {
        "spark_version": "9.1.x-scala2.12",
        "node_type_id": "i3.xlarge",
        "num_workers": 25
    },
    "libraries": [
        {"whl": "dbfs:/path/to/yourpackage-version-py3-none-any.whl"}
    ],
    "python_wheel_task": {
        "package_name": "yourpackage",
        "entry_point": "yourentrypoint",
        "parameters": ["some", "parameters"]
    }
}

Once you have the job definition, creating the job is as simple as running

databricks jobs create --json-file job-config.json

If the job already exists and you want to update it, you’ll need to find out the job id and then overwrite the configuration of the job:

export JOB_NAME="my-job"  # exported so that jq can read it via env.JOB_NAME
JOB_ID=$(databricks jobs list --output json | jq -r '[.jobs[] | select(.settings.name == env.JOB_NAME)][0] | .job_id')
databricks jobs reset --job-id "${JOB_ID}" --json-file job-config.json
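
If you prefer Python over shell scripting, the same create-or-update flow can be expressed against the Databricks Jobs REST API (version 2.1 below). This is only a sketch: it assumes DATABRICKS_HOST and DATABRICKS_TOKEN are set in your environment, reads job-config.json from the current directory, and leaves out error handling and pagination of the job list.

import json
import os

import requests

host = os.environ["DATABRICKS_HOST"]  # e.g. https://<your-workspace>.cloud.databricks.com
headers = {"Authorization": f"Bearer {os.environ['DATABRICKS_TOKEN']}"}

with open("job-config.json") as f:
    settings = json.load(f)

# Look for an existing job with the same name.
jobs = requests.get(f"{host}/api/2.1/jobs/list", headers=headers).json().get("jobs", [])
matching = [job["job_id"] for job in jobs if job["settings"]["name"] == settings["name"]]

if matching:
    # Overwrite the existing job definition.
    response = requests.post(
        f"{host}/api/2.1/jobs/reset",
        headers=headers,
        json={"job_id": matching[0], "new_settings": settings},
    )
else:
    # No job with this name exists yet: create it.
    response = requests.post(f"{host}/api/2.1/jobs/create", headers=headers, json=settings)
response.raise_for_status()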

Running and scheduling

Now that you have your job defined, chances are that you want to run it. You can trigger it from the web interface, or use the Databricks CLI:

databricks jobs run-now --job-id <your-job-id>

You can even override the parameters by providing --python-params.

Now there is only one thing left: scheduling the job to automatically run at the required times. The simplest way is to add a schedule key to your job config:

    "schedule": {
        "quartz_cron_expression": "45 6 12 * * ?",
        "timezone_id": "UTC",
        "pause_status": "UNPAUSED"
    },

This would schedule your job every day at 6 minutes and 45 seconds past noon in UTC. You can find the details of the Quartz Cron syntax here.

Step 4: Advanced scheduling

If you want more detailed control over when and how your job runs, there are several options. For example, you may want to have a job that consists of multiple tasks: e.g. data ingestion and preparation, model training, and prediction. Some documentation can be found here.

If you happen to have an Airflow instance available, the DatabricksSubmitRunOperator is your friend. It will accept the contents of your job-config.json (as a dictionary, and without the “name” key) and define, run and keep track of your jobs for you.
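
To give an idea of what that looks like, below is a minimal sketch of an Airflow DAG that submits the job configuration from this post as a one-off run. It assumes the apache-airflow-providers-databricks package is installed and that a Databricks connection named databricks_default is configured; adjust the identifiers to your own setup.

from datetime import datetime

from airflow import DAG
from airflow.providers.databricks.operators.databricks import DatabricksSubmitRunOperator

# Same content as job-config.json, minus the "name" key.
job_config = {
    "new_cluster": {
        "spark_version": "9.1.x-scala2.12",
        "node_type_id": "i3.xlarge",
        "num_workers": 25,
    },
    "libraries": [{"whl": "dbfs:/path/to/yourpackage-version-py3-none-any.whl"}],
    "python_wheel_task": {
        "package_name": "yourpackage",
        "entry_point": "yourentrypoint",
        "parameters": ["some", "parameters"],
    },
}

with DAG(
    dag_id="my_databricks_job",
    start_date=datetime(2022, 1, 1),
    schedule_interval="@daily",
    catchup=False,
) as dag:
    DatabricksSubmitRunOperator(
        task_id="run_my_job",
        databricks_conn_id="databricks_default",
        json=job_config,
    )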

Wrap up

I hope this blog has given you enough leads to set up and schedule a Databricks job, and that you will never consider putting another notebook into production. Although notebooks are really nice for exploring your data interactively, they are not the right environment for writing quality code. Putting your code in a package, writing unit tests and applying linting and type checking creates an environment where good code thrives.
