Python Spark Tutorial – How to resolve Python Spark Failed to find the data source: com.databricks.spark.xml?


Problem Statement

When you try to read an XML file with a Spark session, you may end up getting the error below:

org.apache.spark.SparkClassNotFoundException: [DATA_SOURCE_NOT_FOUND] Failed to find the data source: com.databricks.spark.xml. Please find packages at `https://spark.apache.org/third-party-projects.html`.
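
For context, here is a minimal read that raises this error when the spark-xml JAR is not on the classpath (the SourceFiles/books.xml path is only an illustrative example):

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Fails with [DATA_SOURCE_NOT_FOUND] when the spark-xml JAR is missing,
# because Spark cannot resolve the com.databricks.spark.xml format
df = spark.read \
    .format('com.databricks.spark.xml') \
    .options(rowTag='book') \
    .load('SourceFiles/books.xml')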

Solution

In PySpark, you can leverage the spark-xml package, an external library created by Databricks, to handle reading and writing XML files. The package introduces a data source for importing XML files into PySpark DataFrames and a data sink for exporting PySpark DataFrames to XML files.


The reader uses the data source format provided by Databricks, but to use it we need to download the corresponding JAR, built against the same Scala version as our Spark installation. You can check the Scala version in the jars folder under your Spark home; in my case it is

C:\spark\spark-3.4.1-bin-hadoop3\spark-3.4.1-bin-hadoop3\jars

Now you need to check the Scala version: look for the scala-library JAR in that folder (for example, scala-library-2.12.17.jar). Here we can see the Scala version is 2.12, so we need to download the spark-xml build for Scala 2.12 (the spark-xml_2.12 artifact).

Once it is downloaded, save it to the Spark jars directory:

  • C:\spark\spark-3.4.1-bin-hadoop3\spark-3.4.1-bin-hadoop3\jars
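
Alternatively, instead of copying the JAR by hand, you can let Spark pull the package from Maven Central when the session starts by setting the spark.jars.packages configuration. Below is a minimal sketch (the 0.18.0 release used here is only an assumption; pick the version that matches your Scala build):

from pyspark.sql import SparkSession

# Spark resolves and downloads the artifact at session startup;
# the _2.12 suffix must match the Scala version of your Spark install,
# and 0.18.0 is an example release number
spark = SparkSession.builder \
    .config('spark.jars.packages', 'com.databricks:spark-xml_2.12:0.18.0') \
    .getOrCreate()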

Once the JAR is in place, we can use the ‘com.databricks.spark.xml’ format to read the XML file. Below is the code for reading an XML file:

import os
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

def read_data_from_xml(file_name):
    file = os.path.join('SourceFiles', file_name)
    # rowTag tells spark-xml which XML element becomes one DataFrame row
    df = spark.read \
        .format('com.databricks.spark.xml') \
        .options(rowTag='book') \
        .load(file)
    return df
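
Since the package also provides a data sink, you can write a DataFrame back out as XML in the same way. Here is a short sketch of calling the function above and then writing the result (the books.xml input and the TargetFiles/books_out output path are illustrative; rootTag and rowTag name the wrapping element and the per-row element):

# Read a sample file with the function above, then write it back as XML
df = read_data_from_xml('books.xml')  # 'books.xml' is an example file name
df.show(truncate=False)

df.write \
    .format('com.databricks.spark.xml') \
    .options(rowTag='book', rootTag='books') \
    .mode('overwrite') \
    .save('TargetFiles/books_out')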

In the next post, we will see how to add a schema to the DataFrame creation.

Happy Coding
