Data Analysis using NumPy

Oct 16, 2024

NumPy (Numerical Python) is an open source library and the fundamental package for scientific computing in Python. It provides a multidimensional array object, various derived objects (such as masked arrays and matrices), and an assortment of routines for fast operations on arrays.

At the core of the NumPy package is the ndarray object. It encapsulates n-dimensional arrays of homogeneous data types, with many operations performed in compiled code for performance. Heterogeneous records are also supported through structured arrays, where each element combines several named fields of different types.
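As a small illustration of the homogeneous/heterogeneous distinction, the sketch below builds one plain array and one structured array. The field names label and value are hypothetical and chosen only for this example.

```python
import numpy as np

# A plain ndarray holds a single element type: mixing an int with a
# float makes NumPy store every element as float64.
mixed = np.array([1, 2.5, 3])
print(mixed.dtype)  # float64

# A structured dtype gives each element several named, typed fields.
# The field names 'label' and 'value' are hypothetical.
point = np.dtype([('label', 'U5'), ('value', 'f8')])
records = np.array([('a', 1.0), ('b', 2.5)], dtype=point)
print(records['value'])  # [1.  2.5]
```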

Why is NumPy fast?

Vectorization describes the absence of any explicit looping, indexing, etc., in the code. These things still take place, of course, but "behind the scenes" in optimized, pre-compiled C code. For example, a matrix multiplication needs only a single call:

# Perform matrix multiplication (A and B are NumPy arrays with compatible shapes)
C = np.dot(A, B)
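To make the contrast concrete, here is a minimal sketch that computes the same product twice: once with explicit Python loops and once with np.dot. The helper matmul_loops and the sample values in A and B are assumptions made for illustration only.

```python
import numpy as np

# Sample 2x2 inputs (illustrative values).
A = np.array([[1.0, 2.0], [3.0, 4.0]])
B = np.array([[5.0, 6.0], [7.0, 8.0]])

def matmul_loops(a, b):
    """Matrix product written with explicit loops (hypothetical helper)."""
    n, k = a.shape
    k2, m = b.shape
    if k != k2:
        raise ValueError("inner dimensions must match")
    out = np.zeros((n, m))
    for i in range(n):
        for j in range(m):
            for p in range(k):
                out[i, j] += a[i, p] * b[p, j]
    return out

C_loop = matmul_loops(A, B)
C_vec = np.dot(A, B)  # the same computation, with the loops run in compiled code

print(C_vec)
# [[19. 22.]
#  [43. 50.]]
print(np.allclose(C_loop, C_vec))  # True
```

Both versions give the same result; the vectorized call simply moves the three nested loops out of Python.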

Install NumPy

Python is a prerequisite for installing NumPy. Install a Python version newer than 3.8 that is still within its security and support life cycle.

On Ubuntu, we can use the apt install command to install NumPy on top of Python.

sudo apt install python3-numpy

On Windows, we can use the pip command below.

pip install numpy

If we have conda, we can install NumPy with it as well.

conda install numpy
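Whichever command is used, a quick way to confirm the installation is to import the package and print its version. This is a minimal sketch; the variable name version_major is only illustrative.

```python
import numpy as np

# If the import succeeds, NumPy is installed for this interpreter.
print(np.__version__)

# The leading component of the version string is the major version.
version_major = int(np.__version__.split(".")[0])
print("NumPy major version:", version_major)
```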

Create Basic ndarray

Let's get started by creating a basic homogeneous multi-dimensional array and exploring its properties.

>>> import numpy as np
>>> arr1 = np.array([[2, 3, 4],[5, 6, 6]])
>>> arr1.ndim
2
>>> arr1.shape
(2, 3)
>>> arr1.size
6
>>> arr1.dtype
dtype('int64')
>>> arr1.itemsize
8
>>> arr1.data
<memory at 0x7f185926d2b0>

ndarray.ndim

the number of axes (dimensions) of the array.

ndarray.shape

the dimensions of the array. This is a tuple of integers indicating the size of the array in each dimension. For a matrix with n rows and m columns, shape will be (n,m). The length of the shape tuple is therefore the number of axes, ndim.

ndarray.size

the total number of elements of the array. This is equal to the product of the elements of shape.

ndarray.dtype

an object describing the type of the elements in the array. One can create or specify dtypes using standard Python types. Additionally, NumPy provides types of its own; numpy.int32, numpy.int16, and numpy.float64 are some examples.

ndarray.itemsize

the size in bytes of each element of the array. For example, an array of elements of type float64 has itemsize 8 (=64/8), while one of type int32 has itemsize 4 (=32/8). It is equivalent to ndarray.dtype.itemsize.

ndarray.data

the buffer containing the actual elements of the array. Normally, we won’t need to use this attribute because we will access the elements in an array using indexing facilities.
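The attributes above can be checked together in one short script. The sketch below assumes an explicit dtype of int32 so that the itemsize is the same on every platform; the variable names are illustrative.

```python
import numpy as np

# Fix the element type explicitly so itemsize does not depend on the platform.
arr = np.array([[2, 3, 4], [5, 6, 6]], dtype=np.int32)

print(arr.ndim)      # 2
print(arr.shape)     # (2, 3)
print(arr.size)      # 6, the product of the entries of shape
print(arr.itemsize)  # 4, because int32 uses 32/8 bytes per element
print(arr.nbytes)    # 24, which is size * itemsize

# Converting to float64 doubles the per-element storage.
as_float = arr.astype(np.float64)
print(as_float.itemsize)  # 8
```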

Case Study

Let's take a case study: analysing a school marksheet to find values such as the total or average marks for a particular grade. The marksheet holds heterogeneous data, so we define a structured data type with np.dtype and then add the records using np.rec.array with that marksheet dtype.

We also find the total marks in each grade by finding the unique grades, then looping over them to filter the records and sum the totals.

The same logic gives the first (highest) mark in each grade by taking the maximum instead of the sum.

Finally, from the per-grade totals, we search the dict to find which grade has the highest total.

import numpy as np
import operator

# Define a structured data type
marksheet = np.dtype([('Grade', 'U10'),('Section', 'U10'),('name', 'U10'), ('total', 'f4')])

# Create a structured array
schoolmarks = np.rec.array([('G1', 'A', 'Nagappan', 55.0), 
                 ('G1', 'A', 'Lavanya', 85.5),
                 ('G2', 'A', 'Krithik', 68.0)], dtype=marksheet)

print(schoolmarks)
print(schoolmarks.name)  # Accessing the 'name' field
print(schoolmarks.Grade == 'G1')   # Boolean mask selecting the records of grade 'G1'

totalMarks = schoolmarks[schoolmarks.Grade == 'G1']

print('filter the marks of G1 grade', totalMarks)

print('Grade g1 total marks:', np.sum(totalMarks.total))

print('Grade g1 average marks:', np.average(totalMarks.total))

# Group by 'grade' and sum 'total'
unique_grades = np.unique(schoolmarks.Grade)
grouped_grade_sumofmarks = {grade: np.sum(schoolmarks[schoolmarks.Grade == grade].total) for grade in unique_grades}
print('total mark in each grade', grouped_grade_sumofmarks)
print(type(grouped_grade_sumofmarks))

# Group by 'grade' and take the highest 'total' (the first mark)
grouped_grade_firstmarks = {grade: np.max(schoolmarks[schoolmarks.Grade == grade].total) for grade in unique_grades}
print('first mark in each grade', grouped_grade_firstmarks)


# Find the key with the maximum value
max_key = max(grouped_grade_sumofmarks.items(), key=operator.itemgetter(1))[0]

print("Grade with max total mark:", max_key, grouped_grade_sumofmarks.items())

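Running the above script produces the output below; the lines starting with # are annotations showing which print call produced each part. On NumPy 2.x, the dict entries are displayed with scalar wrappers such as np.str_('G1') and np.float32(140.5) rather than as plain values.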

# print schoolmarks
[('G1', 'A', 'Nagappan', 55. ) 
 ('G1', 'A', 'Lavanya', 85.5)
 ('G2', 'A', 'Krithik', 68. )]

# print schoolmarks.name
['Nagappan' 'Lavanya' 'Krithik']

# print schoolmarks.Grade == 'G1'
[ True  True False]

filter the marks of G1 grade [('G1', 'A', 'Nagappan', 55. ) ('G1', 'A', 'Lavanya', 85.5)]

Grade g1 total marks: 140.5

Grade g1 average marks: 70.25

total mark in each grade {'G1': 140.5, 'G2': 68.0}

<class 'dict'>
first mark in each grade {'G1': 85.5, 'G2': 68.0}

Grade with max total mark: G1 dict_items([('G1', 140.5), ('G2', 68.0)])
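The per-grade loop in the script can also be expressed without a Python-level loop. The sketch below is an alternative technique, not the approach used in the script above: it uses np.unique with return_inverse=True to label each record with its grade index, then np.bincount with weights to sum the totals per grade. The variable names inverse, totals_per_grade, and best_grade are illustrative.

```python
import numpy as np

# Same structured dtype and records as in the case study.
marksheet = np.dtype([('Grade', 'U10'), ('Section', 'U10'),
                      ('name', 'U10'), ('total', 'f4')])
schoolmarks = np.rec.array([('G1', 'A', 'Nagappan', 55.0),
                            ('G1', 'A', 'Lavanya', 85.5),
                            ('G2', 'A', 'Krithik', 68.0)], dtype=marksheet)

# inverse[i] is the index into unique_grades of record i's grade.
unique_grades, inverse = np.unique(schoolmarks.Grade, return_inverse=True)

# Sum the 'total' field per grade index in a single vectorized call.
totals_per_grade = np.bincount(inverse, weights=schoolmarks.total)

# The grade whose summed total is largest.
best_grade = unique_grades[np.argmax(totals_per_grade)]

print(unique_grades)     # ['G1' 'G2']
print(totals_per_grade)  # [140.5  68. ]
print(best_grade)        # G1
```

For three records the difference is negligible; the vectorized form mainly matters when the marksheet has many rows and many grades.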

DevGroves Technologies

About author

DevGroves Technologies is an IT consulting and services start-up focused predominantly on web technologies, building static websites, workflow-based CRM websites, e-commerce websites, and reporting websites tailored to customer needs. We also support the open source community by writing blogs about how, why, and where these technologies should be used.
