Reading 15 most common file formats used in Data Science (using Python)

Here in this post, we will see how we can read most common file formats used in Data Science (using Python) with code.

When you are working as a Data Scientist or Data Analyst, your most of the time is utilized in reading the dataset from different sources (including data saved in different file formats and different data types), do some data wrangling to get data into shape, and then finally converting and saving it to different file formats so that it can be used to build Machine Learning models.

When as a Data Scientist you start working on any project that required dealing with structured data like Excel file, SQL etc. along with unstructured data which contain text and are not in form of rows and columns like emails, images etc. So it’s important to have a basic idea of how to read most common file formats used in Data Science using the language they are comfortable with, either Python or R.

Let’start our way towards reading most common file formats used in Data Science (using Python).

List of file formats

Here’s a below list of common file formats used in Data Science:

CSV
Text Files
JSON
Microsoft Excel File
SAS
SQL
Python Pickle File
Stata
HDF5
HTML
ZIP
PDF
DOCX
Images
Google Bigquery

Let’s see how to read most common file formats used in Data Science (using Python):

Reading the data from CSV file

CSV stands for comma separated values which is a text-based file format that store data in a tabular form similar to a spreadsheet or a database table and generally use a comma to separate values and has an extension of .csv. This is how a CSV file looks like when you open it with Notepad:

most common file formats used in Data Science

Here’s a code which shows how you can import CSV file:

import pandas as pd
filename ="C:\\Users\\Pankaj\\Downloads\\Testfile.csv"
data = pd.read_csv(filename)

for more info, check here.

Reading the data from Microsoft Excel File

Microsoft Excel is a spreadsheet developed by Microsoft. Sometimes you will see two different file format for Excel. First is XLS which is an extension for the version of Excel prior to 2007 while XLSX is what you see nowadays created on the version of Excel 2007 and onward. The important difference between both file extensions is that XLS is a binary format while XLSX is an Open XML format.

Here’s a code which shows how you can import .xlsx file:

import pandas as pd
filename = "C:\\Users\\Pankaj\\Downloads\\Testfile.xlsx"
df = pd.read_excel(name, sheetname = “Test”)

for more detailed info check here.

Reading the data from Zip File

ZIP is an archive file format that supports lossless data compression. You can read a zip file by importing the “zipfile” module.

Here’s a python code which you can use to read the file that is inside the “Test.zip”.

import zipfile
archive = zipfile.ZipFile('Test.zip', 'r')
df = archive.read('Test.csv')

for more info on zipfile module check here.

Reading the data from SQL

SQL stands for Structured Query Language. SQL lets you managing data held in a relational database management system. Here’s a below code which shows you how to execute your query from SQLite database file. In create_engine you have to pass the details which are required to connect with the RDBMS like driver, dialect, host, port or even username and password.

import pandas as pd
from sqlalchemy import create_engine
engine = create_engine('sqlite:///Master.sqlite')
with engines.connect() as con:
     rs = con.execute("SELECT * FROM orders")
data = pd.DataFrame(rs.fetchmany(size=5))
data.columns = rs.keys()

more info on sqlalchemy module check here.

Reading data from Pickle file

It’s a file type native to Python and these files are serialized i.e. convert the object to bytestream. The pickle module implements binary protocols for serializing and de-serializing a Python object structure. Here’s a code which shows how you can import .pkl file:

import pickle
filename ="C:\\Users\\Pankaj\\Downloads\\Testfile.pkl"
with open(filename, 'rb') as file:
     data = pickle.load(file)
print(data)
Output-
{'Smith' : 23, 'John' : 22, 'Steve' : 26}

for more info check here.

Reading the data from SAS File

SAS is a file extension for an ASCII file used with Statistical Analysis Software. SAS stands for Statistical Analysis Software. SAS files contain the source code for a program or sub-program used for Advanced Analytics, Multivariate analysis, Business Intelligence, Predictive Analytics and Data management. Its a standard for computational analysis.

Here’s a code which shows how you can import .sas7bdat file:

import pandas as pd
from sas7bdat import SAS7BDAT
filename ="C:\\Users\\Pankaj\\Downloads\\Testfile.sas7bdat"
with SAS7BDAT(filename) as file:
     data = file.to_data_frame()

for more info, check here.

Reading the data from Stata File

Stata is a combination of “Statistics” + “data” and it’s widely used in academic social science and research. Here’s a code which shows how you can import .dta file:

import pandas as pd
filename ="C:\\Users\\Pankaj\\Downloads\\Testfile.dta"
data = pd.read_stata(filename)

for more info, check here.

Reading the data from HDF5 File

HDF5 stands for Hierarchical Data format version 5. It is a standard for storing a large quantity of numerical data where a dataset can be of hundreds of GB’s or TB’s. HDF5 can scale to ExaBytes. Here’s a code which shows how you can import .hdf5 file:

import h5py
filename ="C:\\Users\\Pankaj\\Downloads\\Testfile.hdf5"
data = h5py.file(filename, 'r')

for more detailed info, check here.

Reading data from MATLAB file

MATLAB stands for matrix laboratory) and it is a proprietary programming language developed by MathWorks. It’s basically an Industry standard in engineering and science and the data is saved as .mat files. Here’s a code which shows how you can import .mat file:

import scipy.io
filename ="C:\\Users\\Pankaj\\Downloads\\Testfile.mat"
mat = scipy.io.loadmat(filename)
print(type(mat))
Output-
<class 'dict'>

for more info, check here.

Reading data from HTML file

import requests
url = "https://www.wikipedia.org"
r = requests.get(url)
text = r.text

for more info, check here.

Reading data from JSON file

JSON stands for JavaScript Object Notation(JSON), which is a lightweight, text-based open standard designed for exchanging the data over the web. Here’s a code which shows how you can import .json file:

import json
filename ="C:\\Users\\Pankaj\\Downloads\\Testfile.json"
with open(filename, 'r') as json_file:
     json_data = json.load(json_file)
print(type(json_data))

To know more detailed information about reading JSON file, check here.

Reading data from Google Bigquery file

projectid = "xxxxxxxx"
data_frame = read_gbq('SELECT * FROM test_dataset.test_table', projectid)

for more info check here.

Reading data from PDF file

PDF stands for Portable Document Format, a file format developed by Adobe to present documents, including text formatting and images, in a manner independent of application software, hardware, and operating systems. Here’s a code which shows how you can import and read the .pdf file:

# importing required modules
import PyPDF2
filename ="C:\\Users\\Pankaj\\Downloads\\Testfile.pdf"
# creating a pdf file object
pdfFileObj = open(filename, 'rb')
# creating a pdf reader object
pdfReader = PyPDF2.PdfFileReader(pdfFileObj)
# creating a page object
pageObj = pdfReader.getPage(0)
# extracting text from page
print(pageObj.extractText())
# closing the pdf file object
pdfFileObj.close()

for more info, check here

.Reading data from Microsoft Word file

Microsoft Word is a widely used commercial word processor designed by Microsoft and comes as a part of the Microsoft Office suite of productivity. Similar to Excel you have seen two file extensions .doc and .docx. The .doc format was an extension until the 2003 version of Word. Later Microsoft introduced the .docx as the new default format with version 2007. The main difference between both of the extensions is that with .doc, the document is stored in a binary file that also contains the related formatting information whereas a .docx file is basically kind of a zip file that contains all the XML files related to the document.

Here’s a code which shows how you can import .docx file:

import docx
def getText(filename):
    doc = docx.Document(filename)
fullText = []
for para in doc.paragraphs:
fullText.append(para.text)
return '\n'.join(fullText)

for more info, check here.

Reading image file

As a data scientist, you surely find yourself working on image processing projects where you have to have to read or write an image file. There may be a colour image which is 3-Dimensional, having RGB value or a Grayscale image which are 2-Dimensional in nature.

Here’s a code which shows how you can import image file:

import matplotlib.image as img
import matplotlib.pyplot as plt
# Read Images
filename ="C:\\Users\\Pankaj\\Downloads\\Testfile.png"
img = img.imread(filename)
# Output Images
plt.imshow(img)

You can check Official Pandas Documentation for other types of file formats that i haven’t covered. In this article, I have tried to show how you can read the most common file formats used in Data Science (using Python), which are used by data scientist or data analyst on daily basis.

Also, you can find other posts on Data Science here on our blog. Hope you like my post. If you have any query, feel free to use the comment section below.

List of file formats

Let’s see how to read most common file formats used in Data Science (using Python):

Related posts:

Leave a Reply Cancel reply