Python Spark Tutorial – How to resolve issues in reading JSON file in Python Spark DataFrame?


Problem Statement:

When we read JSON data using the Spark session object as shown below, we sometimes encounter an error about the internal corrupt record column:

df = spark.read.json(file)

Since Spark 2.3, the queries from raw JSON/CSV files are disallowed when the referenced columns only include the internal corrupt record column (named _corrupt_record by default)

Solution:

Spark requires the JSON to be in a specific format to read it properly: by default, each line of the file must be a self-contained JSON object (the JSON Lines format), so an entire document must fit on a single line.
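To see the difference concretely, here is a small stdlib-only sketch (no Spark needed) that parses both layouts line by line, the way Spark's default JSON reader does. The sample data and the parse_line_by_line helper are illustrations, not part of Spark's API:

```python
import json

# JSON Lines layout: one self-contained JSON object per line.
jsonl_text = '{"id": 1, "name": "a"}\n{"id": 2, "name": "b"}\n'

# Pretty-printed layout: a single object spread over several lines.
pretty_text = json.dumps({"id": 1, "name": "a"}, indent=2)

def parse_line_by_line(text):
    """Parse each line as its own JSON document, as Spark does by default."""
    records, corrupt = [], []
    for line in text.splitlines():
        if not line.strip():
            continue
        try:
            records.append(json.loads(line))
        except json.JSONDecodeError:
            corrupt.append(line)  # Spark would route these to _corrupt_record
    return records, corrupt

records, corrupt = parse_line_by_line(jsonl_text)
# JSON Lines parses cleanly: two records, nothing corrupt.

records2, corrupt2 = parse_line_by_line(pretty_text)
# Every line of the pretty-printed file fails on its own,
# which is exactly why Spark flags such files as corrupt by default.
```

With the pretty-printed file, no individual line is valid JSON, so a line-at-a-time reader recovers nothing.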

For JSON documents that span multiple lines, Spark's read method provides a multiline option that we can set to true, like below:

import os

def read_data_from_json(self, file_name):
    # Build the path to the source file and read it as multi-line JSON
    file = os.path.join('SourceFiles', file_name)
    df = spark.read.option("multiline", "true").json(file)
    return df

With this option set, we can avoid the _corrupt_record error and read multi-line JSON documents without issue.
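The effect of the option can be modeled in plain Python: with multiline off, each line is parsed on its own; with it on, the whole file is handed to the parser as one document. This is an illustration of the semantics only, not Spark's implementation, and load_records is a hypothetical helper name:

```python
import json

def load_records(text, multiline=False):
    """Illustrative model of spark.read.option("multiline", ...).json():
    multiline=False parses per line; multiline=True parses the whole text."""
    if multiline:
        data = json.loads(text)  # the entire file is one JSON document
        return data if isinstance(data, list) else [data]
    return [json.loads(line) for line in text.splitlines() if line.strip()]

pretty = '{\n  "id": 1,\n  "name": "a"\n}'
records = load_records(pretty, multiline=True)  # works: one record
# load_records(pretty) would raise json.JSONDecodeError,
# mirroring the _corrupt_record behaviour in Spark.
```

The trade-off is the same as in Spark: multi-line parsing cannot split the input by newlines, so the file is consumed as a single unit.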

In the next post we will see how we can read an XML file using an external Databricks package. Till then,

Happy Coding
