Data Analysis using NumPy

NumPy (Numerical Python) is an open source Python library and the fundamental package for scientific computing in Python. It provides a multidimensional array object, various derived objects (such as masked arrays and matrices), and an assortment of routines for fast operations on arrays.
At the core of the NumPy package is the ndarray object. It encapsulates n-dimensional arrays of homogeneous data types, with many operations performed in compiled code for performance. Heterogeneous records are also supported through structured arrays, where each element is a record with named fields of different types.
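A minimal sketch of the difference between a homogeneous array and a structured (heterogeneous) one. The field names and sample values here are made up for illustration and are not part of the case study further down.

```python
import numpy as np

# Mixing integers and a float: NumPy stores every element with one common dtype.
mixed = np.array([1, 2, 3.5])
print(mixed.dtype)  # float64

# A structured dtype lets each record carry fields of different types.
# 'name' and 'age' are illustrative field names.
person = np.dtype([('name', 'U10'), ('age', 'i4')])
people = np.array([('Asha', 30), ('Ravi', 25)], dtype=person)

print(people['name'])       # all values of the 'name' field
print(people['age'].sum())  # 55
```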
Why is NumPy fast?
Vectorization describes the absence of any explicit looping, indexing, etc., in the code. These things still take place, of course, just “behind the scenes” in optimized, pre-compiled C code. For example, to multiply two matrices A and B, we only need to write:
# Perform matrix multiplication
C = np.dot(A, B)
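To make the contrast concrete, here is a small sketch that computes the same matrix product twice: once with explicit Python loops and once with np.dot. The sample matrices and the helper name matmul_loops are assumptions made for illustration.

```python
import numpy as np

A = np.array([[1, 2], [3, 4]])
B = np.array([[5, 6], [7, 8]])

def matmul_loops(a, b):
    """Matrix product written with explicit Python loops (hypothetical helper)."""
    rows, inner, cols = a.shape[0], a.shape[1], b.shape[1]
    out = np.zeros((rows, cols), dtype=a.dtype)
    for i in range(rows):
        for j in range(cols):
            for k in range(inner):
                out[i, j] += a[i, k] * b[k, j]
    return out

C_loops = matmul_loops(A, B)
C_vec = np.dot(A, B)  # same computation, with the loops running in compiled code

print(C_vec)
print(np.array_equal(C_loops, C_vec))  # True
```

Both versions give the same result. The vectorized call is shorter and, on large arrays, usually much faster because the looping is not done by the Python interpreter.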
Install NumPy
Python is a prerequisite for installing NumPy. Prefer a Python version newer than 3.8, so that the interpreter is still covered by the security and support life cycle.
On Ubuntu, we can use the apt install command to install NumPy on top of Python.
sudo apt install python3-numpy
On Windows, we can use the pip command below.
pip install numpy
If conda is available, we can use it to install NumPy as well.
conda install numpy
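Whichever installer is used, a quick way to check that the installation worked is to import the package and print its version. This is only a small smoke test, and the version printed depends on what was installed.

```python
import numpy as np

# Print the installed NumPy version (the value varies by installation).
print(np.__version__)

# Tiny sanity check: 0 + 1 + 2
print(np.arange(3).sum())  # 3
```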
Create Basic ndarray
Let's get started by creating a basic homogeneous multi-dimensional array and exploring its properties.
>>> import numpy as np
>>> arr1 = np.array([[2, 3, 4],[5, 6, 6]])
>>> arr1.ndim
2
>>> arr1.shape
(2, 3)
>>> arr1.size
6
>>> arr1.dtype
dtype('int64')
>>> arr1.itemsize
8
>>> arr1.data
<memory at 0x7f185926d2b0>
ndarray.ndim
the number of axes (dimensions) of the array.
ndarray.shape
the dimensions of the array. This is a tuple of integers indicating the size of the array in each dimension. For a matrix with n rows and m columns, shape will be (n,m). The length of the shape tuple is therefore the number of axes, ndim.
ndarray.size
the total number of elements of the array. This is equal to the product of the elements of shape.
ndarray.dtype
an object describing the type of the elements in the array. One can create or specify dtypes using standard Python types. Additionally, NumPy provides types of its own; numpy.int32, numpy.int16, and numpy.float64 are some examples.
ndarray.itemsize
the size in bytes of each element of the array. For example, an array of elements of type float64 has itemsize 8 (=64/8), while one of type float32 has itemsize 4 (=32/8). It is equivalent to ndarray.dtype.itemsize.
ndarray.data
the buffer containing the actual elements of the array. Normally, we won’t need to use this attribute because we will access the elements in an array using indexing facilities.
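The relationships between these attributes can be checked directly. The sketch below reuses the values from the earlier array but sets dtype=np.int32 explicitly, an assumption made so that the item size does not depend on the platform's default integer type.

```python
import numpy as np

arr = np.array([[2, 3, 4], [5, 6, 6]], dtype=np.int32)

# The length of the shape tuple is the number of axes.
print(len(arr.shape) == arr.ndim)  # True

# size is the product of the entries of shape: 2 * 3 = 6.
print(arr.size == arr.shape[0] * arr.shape[1])  # True

# itemsize matches dtype.itemsize: 32 bits / 8 = 4 bytes.
print(arr.itemsize, arr.dtype.itemsize)  # 4 4

# Total bytes held by the elements: 6 elements * 4 bytes = 24.
total_bytes = arr.size * arr.itemsize
print(total_bytes, arr.nbytes)  # 24 24
```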
Case Study
Let's take a case study: analysing a school marksheet to find important things such as the total or average marks for a particular grade. The marksheet is heterogeneous data, so we define its layout with np.dtype and then add the records using np.rec.array with the marksheet dtype.
To find the total marks in each grade, we find the unique grades, then loop over them, filtering the records by grade and summing the totals.
The same logic gives the first (highest) mark in each grade by taking the maximum instead of the sum.
Finally, from the dict of grouped totals, we find which grade has the highest total mark.
import numpy as np
import operator
# Define a structured data type
marksheet = np.dtype([('Grade', 'U10'),('Section', 'U10'),('name', 'U10'), ('total', 'f4')])
# Create a structured array
schoolmarks = np.rec.array([('G1', 'A', 'Nagappan', 55.0),
('G1', 'A', 'Lavanya', 85.5),
('G2', 'A', 'Krithik', 68.0)], dtype=marksheet)
print(schoolmarks)
print(schoolmarks.name) # Accessing the 'name' field
print(schoolmarks.Grade == 'G1') # Boolean mask: True where Grade is 'G1'
totalMarks = schoolmarks[schoolmarks.Grade == 'G1']
print('filter the marks of G1 grade', totalMarks)
print('Grade g1 total marks:', np.sum(totalMarks.total))
print('Grade g1 average marks:', np.average(totalMarks.total))
# Group by 'grade' and sum 'total'
unique_grades = np.unique(schoolmarks.Grade)
grouped_grade_sumofmarks = {grade: np.sum(schoolmarks[schoolmarks.Grade == grade].total) for grade in unique_grades}
print('total mark in each grade', grouped_grade_sumofmarks)
print(type(grouped_grade_sumofmarks))
grouped_grade_firstmarks = {grade: np.max(schoolmarks[schoolmarks.Grade == grade].total) for grade in unique_grades}
print('first mark in each grade', grouped_grade_firstmarks)
# Find the key with the maximum value
max_key = max(grouped_grade_sumofmarks.items(), key=operator.itemgetter(1))[0]
print("Grade with max total mark:", max_key, grouped_grade_sumofmarks.items())
Running the above script produces the output below.
# print(schoolmarks)
[('G1', 'A', 'Nagappan', 55. )
('G1', 'A', 'Lavanya', 85.5)
('G2', 'A', 'Krithik', 68. )]
# print(schoolmarks.name)
['Nagappan' 'Lavanya' 'Krithik']
# print(schoolmarks.Grade == 'G1')
[ True True False]
filter the marks of G1 grade [('G1', 'A', 'Nagappan', 55. ) ('G1', 'A', 'Lavanya', 85.5)]
Grade g1 total marks: 140.5
Grade g1 average marks: 70.25
total mark in each grade {'G1': 140.5, 'G2': 68.0}
<class 'dict'>
first mark in each grade {'G1': 85.5, 'G2': 68.0}
Grade with max total mark: G1 dict_items([('G1', 140.5), ('G2', 68.0)])
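As an alternative to looping over the unique grades in Python, the grouping can also be done with array operations only. This is a sketch of a different technique from the script above, not a replacement for it: it uses np.unique with return_inverse together with np.bincount and np.maximum.at, and it works on plain arrays holding the same grades and totals as the marksheet.

```python
import numpy as np

# Same grades and totals as in the marksheet above, as plain arrays.
grades = np.array(['G1', 'G1', 'G2'])
totals = np.array([55.0, 85.5, 68.0])

# inverse[i] is the index into unique_grades of the grade of record i.
unique_grades, inverse = np.unique(grades, return_inverse=True)

# Sum of totals per grade, using the group index as the bin number.
sums = np.bincount(inverse, weights=totals)

# Highest mark per grade: start from -inf and take a running maximum per group.
maxes = np.full(len(unique_grades), -np.inf)
np.maximum.at(maxes, inverse, totals)

best_grade = unique_grades[np.argmax(sums)]

print(dict(zip(unique_grades.tolist(), sums.tolist())))   # {'G1': 140.5, 'G2': 68.0}
print(dict(zip(unique_grades.tolist(), maxes.tolist())))  # {'G1': 85.5, 'G2': 68.0}
print(best_grade)  # G1
```

For three records the dictionary comprehension in the script is just as good. The array-only form mainly pays off when there are many records and many groups.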