Python Spark Tutorial – How to resolve the issue of Null Value in SparkDataframe when reading from source like JSON with Schema binding?

How to resolve the issue of Null Value in SparkDataframe when reading from source like JSON with Schema binding_

Problem:

Spark dataframe gives null values for some of the columns which the dataframe is created with Schema bind.

For example, if we have a json file like this 

[
  {
    "firstname": "Neha",
    "middlename": "B",
    "lastname": "Sharma",
    "id": "10002",
    "gender": "F",
    "salary": 6000
  },
  {
    "firstname": "Harsha",
    "middlename": "",
    "lastname": "N",
    "id": "10005",
    "gender": "F",
    "salary": 4500
  },

And while reading the JSON file to spark dataframe, we define some pre-defined schema using structType and StructField, let’s say my schema definition is:

# Here we are trying to give the Schema definition to the data
self.schema = StructType([
    StructField("first_name", StringType(), True),
    StructField("middle_name", StringType(), True),
    StructField("last_name", StringType(), True),
    StructField("unique_id", StringType(), True),
    StructField("gender", StringType(), True),
    StructField("salary_in_doller", IntegerType(), True)
])

Now lets bind this schema to the JSON read method 

def convert_json_to_df(self, file_name):
    print("Converting Data from JSON to DF with schema")
    file = os.path.join('SourceFiles', file_name)
    df = spark.read.schema(self.schema).option("multiline", "true").json(path=file)
    df.printSchema()
    return df

Now let’s print the output of the spark dataframe, so here you can see all the column values turns to null except the gender, why? Because the gender attribute name and spelling were same in both, the schema and the json file. 

In Json: “gender”: “F”,

In Schema: StructField(“gender”, StringType(), True),

Conclusion: So we understood one thing, while reading a JSON file in spark, make sure that the columns name in your schema and JSON file exactly matches. It will show Null values otherwise.

Syncing the schema structField and Json file.

New Json File:

[
  {
    "first_name": "Neha",
    "middle_name": "B",
    "last_name": "Sharma",
    "unique_id": "10002",
    "gender": "F",
    "salary_in_doller": 6000
  },
  {
    "first_name": "Harsha",
    "middle_name": "",
    "last_name": "N",
    "unique_id": "10005",
    "gender": "F",
    "salary_in_doller": 4500
  },
[
  {
    "first_name": "Neha",
    "middle_name": "B",
    "last_name": "Sharma",
    "unique_id": "10002",
    "gender": "F",
    "salary_in_doller": 6000
  },
  {
    "first_name": "Harsha",
    "middle_name": "",
    "last_name": "N",
    "unique_id": "10005",
    "gender": "F",
    "salary_in_doller": 4500
  },

Now lets change schema:

# Here we are trying to give the Schema definition to the data
self.schema = StructType([
    StructField("first_name", StringType(), True),
    StructField("middle_name", StringType(), True),
    StructField("last_name", StringType(), True),
    StructField("unique_id", StringType(), True),
    StructField("gender", StringType(), True),
    StructField("salary_in_doller", IntegerType(), True)
])

Now both the json and schema have the same name, so we can proceed reading the json file with above schema

def convert_json_to_df(self, file_name):
    print("Converting Data from JSON to DF with schema")
    file = os.path.join('SourceFiles', file_name)
    df = spark.read.schema(self.schema).option("multiline", "true").json(path=file)
    df.printSchema()
    return df

So the output would be like below. Here we can see all the fields extract its value properly from JSON file.

Next Upcoming post: 

  1. StructType Inside StructTypes
  2. Tool to read the header of any source file and create a excel with Header and ask user to fill the excel with attribute mapping to datatypes and then the tool will convert that excel to Spark Schema Files. “Source Feed to Spark Mapped Schema Conversion”

Leave a Reply

Your email address will not be published. Required fields are marked *