Blog Post

Reading JSON Data with Python

,

Recently I’ve been looking at archiving some data at SQL Saturday. As a start, I needed to read some of the archive data I have in Python. This post looks at the basics of reading in JSON data in Python, one of the more versatile languages for working with data.

Another post for me that is simple and hopefully serves as an example for people trying to get blogging as #SQLNewBloggers.

Reading JSON Files in Python

I have some data in the SQL Saturday repo in JSON format. This is schedule information, which is exported from Sessionize. I also have XML data, but I decided not to mess with that for now.

Getting this data into a dataset is actually easy in Python. Here are the basics. First, we need to import a few modules. In Python, lots of functionality is from various modules, which aren’t available until added to your workspace. However, they are easy to import.

We need a few modules:

  • json – used to work with json data
  • os – used to work with files and call OS functions.
  • pandas – used for creating dataframes
  • chardet – functions to detect encoding

I’ll import these, though from os I’ll only get a few things.

# Basic import of a JSON file

import json
import chardet
import pandas as pd
import os

Once I’ve done this, I can use these modules in my code.

Now for the code. The first thing is to find my files. I’ve stored json files in a “raw” folder, which I assume is below the place where I’m running the code. In this case, I have two files.

2024-05-06 14_11_40

Here’s a little setup code that sets the path (which could be an argument to the file), but creates a path to the files and starts a loop:

mypath = '.\raw'
onlyfiles = [f for f in os.listdir(mypath) if os.path.isfile(os.path.join(mypath, f)) and f.endswith('.json') ]
# loop through the files
for f in onlyfiles:

In Python, once I want to create a set of code in a loop, I need to indent it, so the next few lines are indented below the for statement above. I’ll repeat that for clarity.

In the loop, I want to do a few things. First, I get a path to the file with the os.path.join  command, which builds me a path that works in various functions. Next, I want to use the chardet module to detect the encoding of the file. They should all be the same, but I had some issues when expecting the default encoding (this post helped). This ensures I get the correct encoding for the file.

Lastly, I’ll open the file.

# loop through the files
for f in onlyfiles:
    currentfile = os.path.join(mypath,f)
    enc=chardet.detect(open(currentfile,'rb').read())['encoding']
with open(currentfile,'r', encoding=enc) as json_data:

Again, I have a with statement and want to run other code, so I’ll indent the next line. The next line reads in the file using the json module, which has a load() function. This function knows how to parse a json file so that we can work with it.

Lastly, outside of the with command, I’ll return to the for loop be de-indenting one level and calling a pandas function to take a portion of the JSON file and load it into a dataframe. Think of a dataframe like a resultset in SQL or a datatable in C#. In this case, I’ll take the “sessions” structure. I print the first five rows with the head() function.

The entire code structure looks like this:

# Basic import of a JSON file
import json
import chardet
import pandas as pd
import os
mypath = '.\raw'
onlyfiles = [f for f in os.listdir(mypath) if os.path.isfile(os.path.join(mypath, f)) and f.endswith('.json') ]
# loop through the files
for f in onlyfiles:
    currentfile = os.path.join(mypath,f)
    enc=chardet.detect(open(currentfile,'rb').read())['encoding']
    with open(currentfile,'r', encoding=enc) as json_data:
        data = json.load(json_data)
    # get session data from json
    df = pd.DataFrame(data['sessions'])
    # print the head
    print(df.head())

The results look like the image below. I’m not covering how to run Python or anything else, but you can see the first five rows with the session title and a couple other elements.

2024-05-06 14_20_09

The raw JSON looks like this for the first file with the sessions element.

2024-05-06 14_21_21

Now I can work with the data and query, transform, rewrite, store in a database, whatever. I’ll cover how to move this data in another post.

SQL New Blogger

This post took me about 15 minutes to write, mostly because of looking up some links. The code itself was for something I was already doing, so after getting the code working, I wrote this post using it.

This is a good example of something you could write to show that you are building some data warehouse skills, which are valuable for many employers.

Original post (opens in new tab)
View comments in original post (opens in new tab)

Rate

You rated this post out of 5. Change rating

Share

Share

Rate

You rated this post out of 5. Change rating