Here in this post, we will see how we can read most common file formats used in Data Science (using Python) with code.
When you are working as a Data Scientist or Data Analyst, your most of the time is utilized in reading the dataset from different sources (including data saved in different file formats and different data types), do some data wrangling to get data into shape, and then finally converting and saving it to different file formats so that it can be used to build Machine Learning models.
When as a Data Scientist you start working on any project that required dealing with structured data like Excel file, SQL etc. along with unstructured data which contain text and are not in form of rows and columns like emails, images etc. So it’s important to have a basic idea of how to read most common file formats used in Data Science using the language they are comfortable with, either Python or R.
Let’start our way towards reading most common file formats used in Data Science (using Python).
List of file formats
Here’s a below list of common file formats used in Data Science:
- CSV
- Text Files
- JSON
- Microsoft Excel File
- SAS
- SQL
- Python Pickle File
- Stata
- HDF5
- HTML
- ZIP
- DOCX
- Images
- Google Bigquery
Let’s see how to read most common file formats used in Data Science (using Python):
Reading the data from CSV file
CSV stands for comma separated values which is a text-based file format that store data in a tabular form similar to a spreadsheet or a database table and generally use a comma to separate values and has an extension of .csv. This is how a CSV file looks like when you open it with Notepad:
Here’s a code which shows how you can import CSV file:
import pandas as pd filename ="C:\\Users\\Pankaj\\Downloads\\Testfile.csv" data = pd.read_csv(filename)
for more info, check here.
Reading the data from Microsoft Excel File
Microsoft Excel is a spreadsheet developed by Microsoft. Sometimes you will see two different file format for Excel. First is XLS which is an extension for the version of Excel prior to 2007 while XLSX is what you see nowadays created on the version of Excel 2007 and onward. The important difference between both file extensions is that XLS is a binary format while XLSX is an Open XML format.
Here’s a code which shows how you can import .xlsx file:
import pandas as pd filename = "C:\\Users\\Pankaj\\Downloads\\Testfile.xlsx" df = pd.read_excel(name, sheetname = “Test”)
for more detailed info check here.
Reading the data from Zip File
ZIP is an archive file format that supports lossless data compression. You can read a zip file by importing the “zipfile” module.
Here’s a python code which you can use to read the file that is inside the “Test.zip”.
import zipfile archive = zipfile.ZipFile('Test.zip', 'r') df = archive.read('Test.csv')
for more info on zipfile module check here.
Reading the data from SQL
SQL stands for Structured Query Language. SQL lets you managing data held in a relational database management system. Here’s a below code which shows you how to execute your query from SQLite database file. In create_engine you have to pass the details which are required to connect with the RDBMS like driver, dialect, host, port or even username and password.
import pandas as pd from sqlalchemy import create_engine engine = create_engine('sqlite:///Master.sqlite') with engines.connect() as con: rs = con.execute("SELECT * FROM orders") data = pd.DataFrame(rs.fetchmany(size=5)) data.columns = rs.keys()
more info on sqlalchemy module check here.
Reading data from Pickle file
It’s a file type native to Python and these files are serialized i.e. convert the object to bytestream. The pickle module implements binary protocols for serializing and de-serializing a Python object structure. Here’s a code which shows how you can import .pkl file:
import pickle filename ="C:\\Users\\Pankaj\\Downloads\\Testfile.pkl" with open(filename, 'rb') as file: data = pickle.load(file) print(data) Output- {'Smith' : 23, 'John' : 22, 'Steve' : 26}
for more info check here.
Reading the data from SAS File
SAS is a file extension for an ASCII file used with Statistical Analysis Software. SAS stands for Statistical Analysis Software. SAS files contain the source code for a program or sub-program used for Advanced Analytics, Multivariate analysis, Business Intelligence, Predictive Analytics and Data management. Its a standard for computational analysis.
Here’s a code which shows how you can import .sas7bdat file:
import pandas as pd from sas7bdat import SAS7BDAT filename ="C:\\Users\\Pankaj\\Downloads\\Testfile.sas7bdat" with SAS7BDAT(filename) as file: data = file.to_data_frame()
for more info, check here.
Reading the data from Stata File
Stata is a combination of “Statistics” + “data” and it’s widely used in academic social science and research. Here’s a code which shows how you can import .dta file:
import pandas as pd filename ="C:\\Users\\Pankaj\\Downloads\\Testfile.dta" data = pd.read_stata(filename)
for more info, check here.
Reading the data from HDF5 File
HDF5 stands for Hierarchical Data format version 5. It is a standard for storing a large quantity of numerical data where a dataset can be of hundreds of GB’s or TB’s. HDF5 can scale to ExaBytes. Here’s a code which shows how you can import .hdf5 file:
import h5py filename ="C:\\Users\\Pankaj\\Downloads\\Testfile.hdf5" data = h5py.file(filename, 'r')
for more detailed info, check here.
Reading data from MATLAB file
MATLAB stands for matrix laboratory) and it is a proprietary programming language developed by MathWorks. It’s basically an Industry standard in engineering and science and the data is saved as .mat files. Here’s a code which shows how you can import .mat file:
import scipy.io filename ="C:\\Users\\Pankaj\\Downloads\\Testfile.mat" mat = scipy.io.loadmat(filename) print(type(mat)) Output- <class 'dict'>
for more info, check here.
Reading data from HTML file
import requests url = "https://www.wikipedia.org" r = requests.get(url) text = r.text
for more info, check here.
Reading data from JSON file
JSON stands for JavaScript Object Notation(JSON), which is a lightweight, text-based open standard designed for exchanging the data over the web. Here’s a code which shows how you can import .json file:
import json filename ="C:\\Users\\Pankaj\\Downloads\\Testfile.json" with open(filename, 'r') as json_file: json_data = json.load(json_file) print(type(json_data))
To know more detailed information about reading JSON file, check here.
Reading data from Google Bigquery file
projectid = "xxxxxxxx" data_frame = read_gbq('SELECT * FROM test_dataset.test_table', projectid)
for more info check here.
Reading data from PDF file
PDF stands for Portable Document Format, a file format developed by Adobe to present documents, including text formatting and images, in a manner independent of application software, hardware, and operating systems. Here’s a code which shows how you can import and read the .pdf file:
# importing required modules import PyPDF2 filename ="C:\\Users\\Pankaj\\Downloads\\Testfile.pdf" # creating a pdf file object pdfFileObj = open(filename, 'rb') # creating a pdf reader object pdfReader = PyPDF2.PdfFileReader(pdfFileObj) # creating a page object pageObj = pdfReader.getPage(0) # extracting text from page print(pageObj.extractText()) # closing the pdf file object pdfFileObj.close()
for more info, check here
.Reading data from Microsoft Word file
Microsoft Word is a widely used commercial word processor designed by Microsoft and comes as a part of the Microsoft Office suite of productivity. Similar to Excel you have seen two file extensions .doc and .docx. The .doc format was an extension until the 2003 version of Word. Later Microsoft introduced the .docx as the new default format with version 2007. The main difference between both of the extensions is that with .doc, the document is stored in a binary file that also contains the related formatting information whereas a .docx file is basically kind of a zip file that contains all the XML files related to the document.
Here’s a code which shows how you can import .docx file:
import docx def getText(filename): doc = docx.Document(filename) fullText = [] for para in doc.paragraphs: fullText.append(para.text) return '\n'.join(fullText)
for more info, check here.
Reading image file
As a data scientist, you surely find yourself working on image processing projects where you have to have to read or write an image file. There may be a colour image which is 3-Dimensional, having RGB value or a Grayscale image which are 2-Dimensional in nature.
Here’s a code which shows how you can import image file:
import matplotlib.image as img import matplotlib.pyplot as plt # Read Images filename ="C:\\Users\\Pankaj\\Downloads\\Testfile.png" img = img.imread(filename) # Output Images plt.imshow(img)
You can check Official Pandas Documentation for other types of file formats that i haven’t covered. In this article, I have tried to show how you can read the most common file formats used in Data Science (using Python), which are used by data scientist or data analyst on daily basis.
Also, you can find other posts on Data Science here on our blog. Hope you like my post. If you have any query, feel free to use the comment section below.
Leave a Reply