At my current client, we use Sonatype Nexus to store our artifacts. The repository is secured with a username and password, both for publishing and for downloading artifacts.
Spark supports additional repositories through the --repositories option.
We use it like this:
pyspark \
  --repositories https://readonly:[email protected]/repository/maven-public/ \
  --packages com.example:foobar:1.0.0
Unfortunately, we ran into the following issue:
==== repo-1: tried
  https://readonly:[email protected]/repository/maven-public/com/example/foobar/1.0.0/foobar-1.0.0.pom
  -- artifact com.example#foobar;1.0.0!foobar.jar:
  https://readonly:[email protected]/repository/maven-public/com/example/foobar/1.0.0/foobar-1.0.0.jar
::::::::::::::::::::::::::::::::::::::::::::::
::          UNRESOLVED DEPENDENCIES         ::
::::::::::::::::::::::::::::::::::::::::::::::
:: com.example#foobar;1.0.0: not found
::::::::::::::::::::::::::::::::::::::::::::::
The strange thing is that the URL is correct. With curl we can download the dependency:
curl -s -o /dev/null -v https://readonly:[email protected]/repository/maven-public/com/example/foobar/1.0.0/foobar-1.0.0.pom
* Hostname was NOT found in DNS cache
* Trying 35...
* Connected to foobar.com (35.xxx.xxx.x) port 443 (#0)
* successfully set certificate verify locations:
*   CAfile: none
  CApath: /etc/ssl/certs
...
...
200 OK
Okay, let’s debug this by using Ivy directly.
Ivy reads its repository configuration from a settings file, so I tried:
<ivysettings>
  <settings defaultResolver="nexus"/>
  <property name="nexus-public" value="https://nexus/repository/maven-public"/>
  <resolvers>
    <ibiblio name="nexus" m2compatible="true" root="${nexus-public}"/>
  </resolvers>
</ivysettings>
curl -L -O "http://search.maven.org/remotecontent?filepath=org/apache/ivy/ivy/2.4.0/ivy-2.4.0.jar"
java -jar ivy-2.4.0.jar -settings ivy.settings -dependency com.example foobar 1.0.0 -debug
Here we end up with the same error, so the problem lies with Ivy rather than with Spark.
==== nexus: tried
  https://readonly:[email protected]/repository/maven-public/com/example/foobar/1.0.0/foobar-1.0.0.pom
  -- artifact com.example#foobar;1.0.0!foobar.jar:
  https://readonly:[email protected]/repository/maven-public/com/example/foobar/1.0.0/foobar-1.0.0.jar
::::::::::::::::::::::::::::::::::::::::::::::
::          UNRESOLVED DEPENDENCIES         ::
::::::::::::::::::::::::::::::::::::::::::::::
:: com.example#foobar;1.0.0: not found
::::::::::::::::::::::::::::::::::::::::::::::
With the -debug option we find the following:
HTTP response status: 401 url=https://readonly:[email protected]/repository/maven-public/com/example/foobar/1.0.0/foobar-1.0.0.jar
CLIENT ERROR: Unauthorized url=https://readonly:[email protected]/repository/maven-public/com/example/foobar/1.0.0/foobar-1.0.0.jar
nexus: resource not reachable for com/example#foobar;1.0.0: res=https://readonly:[email protected]/repository/maven-public/com/example/foobar/1.0.0/foobar-1.0.0.jar
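The 401 suggests that Ivy requested the URL without turning the embedded credentials into an HTTP Basic Authorization header, which is what curl does with them. As a rough illustration of that step (not Ivy's or curl's actual code), the Python sketch below pulls the user info out of a URL and builds the header. The helper name `basic_auth_header` and the example URL are made up for the illustration.

```python
import base64
from urllib.parse import unquote, urlsplit


def basic_auth_header(url):
    """Return (url_without_credentials, Authorization header value or None)."""
    parts = urlsplit(url)
    if parts.username is None:
        # No "user:password@" prefix, so there is nothing to convert.
        return url, None

    # Basic auth is base64("user:password"); user info in URLs is percent-encoded.
    user = unquote(parts.username)
    password = unquote(parts.password or "")
    token = base64.b64encode(f"{user}:{password}".encode("utf-8")).decode("ascii")

    # Rebuild the URL without the credentials prefix.
    host = parts.hostname or ""
    if parts.port is not None:
        host = f"{host}:{parts.port}"
    clean_url = parts._replace(netloc=host).geturl()
    return clean_url, f"Basic {token}"


# Hypothetical URL, for illustration only.
clean, header = basic_auth_header(
    "https://user:pass@nexus.example.com/repository/maven-public/"
)
print(clean)
print(header)
```

A client that skips this conversion sends an unauthenticated request, and a secured repository answers with 401.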
Now that we understand the issue (Ivy gets a 401 Unauthorized despite the credentials in the URL), we can start googling. I found this StackOverflow question.
So let’s replace the basic authentication in the URL with a credentials block.
<ivysettings>
  <settings defaultResolver="nexus"/>
  <property name="nexus-public" value="https://nexus/repository/maven-public"/>
  <credentials host="nexus" realm="Sonatype Nexus Repository Manager" username="readonly" passwd="secret_password"/>
  <resolvers>
    <ibiblio name="nexus" m2compatible="true" root="${nexus-public}"/>
  </resolvers>
</ivysettings>
Now everything works like a charm. Time to fix the pyspark command.
pyspark \
  --packages com.example:foobar:1.0.0 \
  --conf spark.jars.ivySettings=/tmp/ivy.settings
Now Spark is able to download the packages as well. I’m a happy camper again.
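The same two options can presumably be passed from a Python script instead of the command line. This is a minimal sketch, assuming pyspark is installed and that the options are set before the session (and its JVM) is created. The helper names `ivy_conf` and `build_session` and the app name are made up; the package coordinates and settings path are the ones from the command above.

```python
def ivy_conf(packages, ivy_settings):
    """Spark options equivalent to --packages and spark.jars.ivySettings."""
    return {
        "spark.jars.packages": ",".join(packages),
        "spark.jars.ivySettings": ivy_settings,
    }


def build_session(conf):
    # Imported lazily so the helper above also works without pyspark installed.
    from pyspark.sql import SparkSession

    builder = SparkSession.builder.appName("nexus-packages")
    for key, value in conf.items():
        builder = builder.config(key, value)
    return builder.getOrCreate()


conf = ivy_conf(["com.example:foobar:1.0.0"], "/tmp/ivy.settings")
# spark = build_session(conf)  # needs pyspark and access to the Nexus host
```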
All that is left is to add this to our init script, so that new Dataproc clusters start with this setup.
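One way to do that without committing the password to the script is to generate the settings file at cluster start. The sketch below is only an assumption about how such a step could look, not our actual init script: the environment variable names (`NEXUS_HOST`, `NEXUS_URL`, `NEXUS_USER`, `NEXUS_PASSWORD`, `IVY_SETTINGS_PATH`) and the helper `build_ivy_settings` are made up, and the fallback values are the placeholders used in this post. It builds the same credentials-based settings with Python's standard XML library, which also escapes special characters in the password.

```python
import os
import xml.etree.ElementTree as ET


def build_ivy_settings(host, repo_url, username, password,
                       realm="Sonatype Nexus Repository Manager"):
    """Return an ivysettings document with a credentials block as a string."""
    root = ET.Element("ivysettings")
    ET.SubElement(root, "settings", {"defaultResolver": "nexus"})
    ET.SubElement(root, "property", {"name": "nexus-public", "value": repo_url})
    ET.SubElement(root, "credentials", {
        "host": host,
        "realm": realm,
        "username": username,
        "passwd": password,
    })
    resolvers = ET.SubElement(root, "resolvers")
    # ${nexus-public} is resolved by Ivy, not by Python, so it stays literal.
    ET.SubElement(resolvers, "ibiblio", {
        "name": "nexus",
        "m2compatible": "true",
        "root": "${nexus-public}",
    })
    return ET.tostring(root, encoding="unicode")


if __name__ == "__main__":
    # Hypothetical environment variables; the defaults are placeholders.
    path = os.environ.get("IVY_SETTINGS_PATH", "/tmp/ivy.settings")
    xml_text = build_ivy_settings(
        host=os.environ.get("NEXUS_HOST", "nexus"),
        repo_url=os.environ.get("NEXUS_URL", "https://nexus/repository/maven-public"),
        username=os.environ.get("NEXUS_USER", "readonly"),
        password=os.environ.get("NEXUS_PASSWORD", "secret_password"),
    )
    with open(path, "w", encoding="utf-8") as handle:
        handle.write(xml_text)
    # The file contains a password, so restrict it to the owner.
    os.chmod(path, 0o600)
```

The resulting file can then be referenced through spark.jars.ivySettings exactly as in the pyspark command above.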