NumPy Explained: The Heart of Data Analysis in Python – Part II
Introduction
In the previous part of NumPy Explained, you read about what NumPy is, why to use it, how to import it, the different ways of creating NumPy arrays, and array manipulation methods. Now let's move forward and learn more about NumPy, one of the core libraries for data analysis.
Data Types in NumPy
We all know about the data types in Python. Its most commonly used built-in scalar types are string, integer, float, boolean, and complex. NumPy offers a much wider range of data types, each identified by a one-character code:
- Integer (i) - integers are whole numbers, either positive or negative.
- Boolean (b) - boolean represents true or false values.
- Unsigned Integer (u) - unsigned integers are non-negative whole numbers.
- Float (f) - float represents decimal numbers.
- Complex (c) - complex numbers have a real and an imaginary part.
- Timedelta (m) - timedelta represents differences in time, such as days, hours, and minutes.
- Datetime (M) - datetime represents dates and times.
- Object (O) - an object can hold any Python object, making it a flexible data type.
- String (S) - a byte string holds fixed-length raw bytes, such as ASCII text.
- Unicode String (U) - a unicode string holds fixed-length text made of Unicode characters.
- Fixed chunk of memory for other types (void) (v) - void is a fixed-size block of raw memory; it is also the basis of structured data types, which can hold multiple fields with different data types.
Here are examples of the data types listed above.
import numpy as np

# Integer Data Type (int)
int_arr = np.array([1, 2, 3, 4], dtype=np.int32)
print(int_arr.dtype)

# Boolean Data Type (bool)
# np.bool_ is used because the bare np.bool alias is missing in some NumPy releases
bool_arr = np.array([True, False, True], dtype=np.bool_)
print(bool_arr.dtype)

# Float Data Type (float)
float_arr = np.array([1.0, 2.5, 3.7], dtype=np.float64)
print(float_arr.dtype)

# Complex Data Type (complex)
complex_arr = np.array([1 + 2j, 2 - 3j], dtype=np.complex128)
print(complex_arr.dtype)

# Datetime Data Type (datetime64)
datetime_arr = np.array(['2023-09-13', '2023-09-14'], dtype=np.datetime64)
print(datetime_arr.dtype)

# Object Data Type (object)
# The np.object alias was removed from NumPy; the built-in object works everywhere
object_arr = np.array(["Hello", 123, True], dtype=object)
print(object_arr.dtype)

# String Data Type (byte string)
# np.bytes_ replaces np.string_, which was removed in NumPy 2.0
string_arr = np.array(["apple", "banana", "cherry"], dtype=np.bytes_)
print(string_arr.dtype)

# Timedelta Data Type (timedelta64)
timedelta_arr = np.array([np.timedelta64(3, 'D'), np.timedelta64(5, 'h')], dtype=np.timedelta64)
print(timedelta_arr.dtype)

# Unsigned Integer Data Type (uint)
uint_arr = np.array([10, 20, 30], dtype=np.uint32)
print(uint_arr.dtype)

# Unicode String Data Type (Unicode)
# np.str_ replaces np.unicode_, which was removed in NumPy 2.0
unicode_arr = np.array(["你好", "こんにちは"], dtype=np.str_)
print(unicode_arr.dtype)

# Void Data Type (void), used here as a structured dtype
# 'S10' gives the byte-string field an explicit length of 10
void_arr = np.array([(1, 'apple'), (2, 'banana')], dtype=[('id', np.int32), ('fruit', 'S10')])
print(void_arr.dtype)
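The one-character codes in the list above can be read back from any array through `dtype.kind`. The following is a minimal sketch; the `samples` dictionary and its labels are made up for illustration. It also shows `astype`, which returns a converted copy of an array.

```python
import numpy as np

# Illustrative arrays, one per kind of data type
samples = {
    "integer": np.array([1, 2, 3], dtype=np.int32),
    "unsigned": np.array([1, 2, 3], dtype=np.uint8),
    "float": np.array([1.5, 2.5]),
    "complex": np.array([1 + 2j]),
    "boolean": np.array([True, False]),
    "datetime": np.array(["2023-09-13"], dtype="datetime64[D]"),
    "unicode": np.array(["apple"]),
}

# dtype.kind is the one-character code of the data type family
kinds = {name: arr.dtype.kind for name, arr in samples.items()}
print(kinds)

# astype returns a converted copy; float values are truncated toward zero
as_int = np.array([1.9, -2.7]).astype(np.int64)
print(as_int)
```

Checking `dtype.kind` is a convenient way to branch on the family of a type (any integer, any float) without caring about its exact size.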
Copy and View
Duplicating data, arrays, or variables is a routine task for data analysts. NumPy provides two ways to get a new array object from an existing one: copy and view. Understanding the difference between the two is important for managing memory and avoiding unintended side effects. Let's explore both concepts with examples:
Copy: A copy is a new array with its own, completely independent data and memory allocation. Modifying a copy does not affect the original array, and vice versa. You can think of it as a 'copy by value' method.
# Create an original NumPy array
original_array = np.array([1, 2, 3, 4, 5])

# Create a copy of the original array
copied_array = original_array.copy()

# Check the copied and original arrays
print(original_array)
print(copied_array)

# Modify the copied array
copied_array[0] = 100
print(copied_array)

# The original array is unchanged
print(original_array)
In this example, modifying copied_array did not affect original_array.
View: A view is a new array object that shares the same data and memory as the original array. It acts as a reference to the original array's data, so modifying the view also changes the original array, and vice versa. You can think of it as a 'copy by reference' method.
# Create an original NumPy array
original_array = np.array([1, 2, 3, 4, 5])

# Create a view of the original array
view_array = original_array.view()

# Check the original and view arrays
print(original_array)
print(view_array)

# Modify the view array
view_array[0] = 100
print(view_array)

# The original array has changed as well
print(original_array)
Here, the modification to view_array is also reflected in original_array.
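When it is not obvious whether an array is a copy or a view, you can check it directly. The sketch below uses `np.shares_memory` and the `base` attribute; the variable names are illustrative only.

```python
import numpy as np

original = np.array([1, 2, 3, 4, 5])
copied = original.copy()
viewed = original.view()
sliced = original[1:4]  # basic slicing also returns a view

# True when the two arrays use overlapping memory
copy_shares = np.shares_memory(original, copied)
view_shares = np.shares_memory(original, viewed)
slice_shares = np.shares_memory(original, sliced)
print(copy_shares, view_shares, slice_shares)

# base is None for an array that owns its data,
# and points to the owning array for a view
print(copied.base is None, viewed.base is original)
```

Note that basic slicing returns a view, so writing into a slice changes the original array unless you call `.copy()` first.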
Array Iterating
Iterating over arrays is a basic building block of data processing, manipulation, and analysis. By understanding NumPy's iteration techniques, you can keep your code readable even when working with large, multi-dimensional datasets. NumPy provides several methods for iterating through an array.
For Loop: Traditional for loops can be used to iterate through NumPy arrays. However, element-by-element Python loops are often slower than NumPy's vectorized operations.
for_arr = np.array([1, 2, 3, 4, 5])

for element in for_arr:
    print(element)
But a for loop becomes cumbersome when we work with arrays of three or more dimensions.
for_arr = np.array([[[1, 2, 3], [4, 5, 6]], [[7, 8, 9], [10, 11, 12]]])

for x in for_arr:
    for y in x:
        for z in y:
            print(z)
As you can see, we used a one-level loop for the 1D array and a three-level nested loop for the 3D array, so an n-dimensional array needs a loop nested n levels deep. This quickly becomes painful for analysts.
nditer: np.nditer is a multidimensional iterator object that works on anything from a 1D array to an n-dimensional array, so a single loop replaces the nested loops shown earlier. Each element of the array is visited using Python's standard iterator interface.
iter_arr = np.array([[[1, 2], [3, 4]], [[5, 6], [7, 8]]])
for x in np.nditer(iter_arr):
    print(x)
np.nditer() also lets you specify the iteration order (C order or Fortran order) and iteration flags for more complex use cases.
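To make the ordering and flag options more concrete, here is a small sketch. The array `grid` is made up for illustration; `order` and `op_flags` are parameters of `np.nditer`.

```python
import numpy as np

grid = np.array([[1, 2, 3], [4, 5, 6]])

# C order visits the array row by row, Fortran order column by column
c_order = [int(x) for x in np.nditer(grid, order="C")]
f_order = [int(x) for x in np.nditer(grid, order="F")]
print(c_order)
print(f_order)

# By default nditer is read-only; the readwrite flag allows in-place updates
doubled = grid.copy()
with np.nditer(doubled, op_flags=["readwrite"]) as it:
    for x in it:
        x[...] = 2 * x
print(doubled)
```

For simple element-wise arithmetic like this doubling, a vectorized expression such as `grid * 2` is the more idiomatic choice; the loop is shown only to illustrate the flag.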
ndenumerate: np.ndenumerate also avoids nested loops and, in addition, yields the index of each element along with its value.
enumerate_arr = np.array([1, 2, 3])

for idx, x in np.ndenumerate(enumerate_arr):
    print(idx, x)

arr_2 = np.array([[1, 2, 3, 4], [5, 6, 7, 8]])

# For a 2D array, each index is a (row, column) tuple
for idx, x in np.ndenumerate(arr_2):
    print(idx, x)
Joining Array
Joining arrays is essential for tasks such as data preprocessing, feature engineering, and combining datasets. NumPy's array manipulation capabilities make it a preferred choice for handling such operations efficiently.
Concatenate: Concatenation combines arrays along an existing axis. It is useful for joining arrays with compatible shapes. For 2D arrays, we can join along axis 0 (rows, placing one array below the other) or along axis 1 (columns, placing arrays side by side).
arr1 = np.array([1, 2, 3])

arr2 = np.array([4, 5, 6])

result = np.concatenate((arr1, arr2))

print(result)
We don't need the axis argument when concatenating 1D arrays. Let's look at a 2D array.
arr1 = np.array([[1, 2], [3, 4]])
arr2 = np.array([[5, 6]])

# Concatenate arr2 to the bottom of arr1
result = np.concatenate((arr1, arr2), axis=0)
print(result)
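For comparison, here is a short sketch of joining along axis 1. The arrays `left` and `right` are illustrative; note that they must have the same number of rows for this to work.

```python
import numpy as np

left = np.array([[1, 2], [3, 4]])
right = np.array([[5], [6]])  # one column, two rows

# Place right next to left, column-wise
side_by_side = np.concatenate((left, right), axis=1)
print(side_by_side)
print(side_by_side.shape)
```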
Stacking: Stacking is similar to concatenation. np.stack joins arrays along a new axis, while the helper functions hstack, vstack, and dstack join arrays horizontally (column-wise), vertically (row-wise), and along the third axis (depth-wise).
Stack: stack joins arrays along a new axis given by the axis argument. For 1D inputs, the new axis can be 0 or 1.
arr1 = np.array([1, 2, 3])

arr2 = np.array([4, 5, 6])

stack_arr = np.stack((arr1, arr2), axis=1)

print(stack_arr)
HStack: hstack stacks arrays horizontally (column-wise). For 1D arrays, this simply joins them end to end.
arr1 = np.array([1, 2, 3])

arr2 = np.array([4, 5, 6])

hstack_arr = np.hstack((arr1, arr2))

print(hstack_arr)
VStack: vstack stacks arrays vertically (row-wise), so each 1D input becomes one row of the result.
arr1 = np.array([1, 2, 3])

arr2 = np.array([4, 5, 6])

vstack_arr = np.vstack((arr1, arr2))

print(vstack_arr)
DStack: dstack stacks arrays along the third axis, that is, depth-wise.
arr1 = np.array([1, 2, 3])

arr2 = np.array([4, 5, 6])

dstack_arr = np.dstack((arr1, arr2))

print(dstack_arr)
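The differences between the stacking functions are easiest to see in the shapes they produce. The sketch below applies each of them to the same two 1D arrays; the names `a`, `b`, and `shapes` are illustrative.

```python
import numpy as np

a = np.array([1, 2, 3])
b = np.array([4, 5, 6])

# Result shape of each stacking function for two arrays of shape (3,)
shapes = {
    "stack axis=0": np.stack((a, b), axis=0).shape,
    "stack axis=1": np.stack((a, b), axis=1).shape,
    "hstack": np.hstack((a, b)).shape,
    "vstack": np.vstack((a, b)).shape,
    "dstack": np.dstack((a, b)).shape,
}

for name, shape in shapes.items():
    print(name, shape)
```

Only hstack keeps the result one-dimensional here; the other calls all add at least one axis.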
Merging: Merging combines arrays based on a common key or column, similar to a database join, and is typically used when you have datasets with common columns or keys. NumPy has no np.merge or np.join function. For structured arrays, the join_by helper in numpy.lib.recfunctions joins two arrays on a key field.
from numpy.lib import recfunctions as rfn

arr1 = np.array([(1, 'Alice'), (2, 'Bob')], dtype=[('id', int), ('name', 'U10')])

arr2 = np.array([(2, 'Charlie'), (3, 'David')], dtype=[('id', int), ('name', 'U10')])

# Outer join of arr1 and arr2 on the 'id' key; the clashing 'name' fields
# come back as 'name1' and 'name2', and missing entries get default fill values
merged_arr = rfn.join_by('id', arr1, arr2, jointype='outer', usemask=False)

print(merged_arr)
Array Splitting
Array splitting is an essential operation in data preprocessing and analysis. It allows you to divide large datasets into manageable chunks, extract specific portions for analysis, or prepare data for machine learning tasks such as training and testing.
NumPy's array-splitting functions make these tasks more efficient and readable by providing versatile ways to split arrays along different axes.
Split: np.split() splits an array into multiple subarrays of equal size along a specified axis. The length of that axis must be evenly divisible by the number of sections; otherwise NumPy raises a ValueError.
arr = np.array([1, 2, 3, 4, 5, 6])

# Split the array into three equal parts along the first axis (axis=0)
subarrays = np.split(arr, 3)

print(subarrays)
Array Split: np.array_split() also splits an array into subarrays along a specified axis, but allows subarrays of unequal size. It behaves like np.split() while also accepting arrays whose length is not evenly divisible by the number of sections.
arr = np.array([1, 2, 3, 4, 5, 6, 7])

# Split the array into four subarrays of unequal sizes
subarrays = np.array_split(arr, 4)

print(subarrays)
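The difference between the two functions shows up as soon as the array length is not divisible by the number of sections. The following sketch, using a made-up seven-element array, also shows that np.split accepts a list of split indices instead of a section count.

```python
import numpy as np

values = np.arange(1, 8)  # seven elements: 1 to 7

# np.split refuses an uneven division
try:
    np.split(values, 3)
    split_failed = False
except ValueError:
    split_failed = True

# np.array_split makes the leading subarrays one element longer instead
parts = np.array_split(values, 3)
sizes = [len(p) for p in parts]
print(split_failed, sizes)

# Passing a list of indices splits before positions 2 and 5
head, middle, tail = np.split(values, [2, 5])
print(head, middle, tail)
```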
HSplit: The np.hsplit() function splits an array horizontally (column-wise).
arr_2d = np.array([[1, 2, 3, 4], [5, 6, 7, 8]])

# Split the 2D array into two subarrays along columns
# (the number of columns must be divisible by the number of sections)
subarrays = np.hsplit(arr_2d, 2)

print(subarrays)
VSplit: np.vsplit(), on the other hand, splits an array vertically (row-wise).
arr_2d = np.array([[1, 2], [3, 4], [5, 6], [7, 8]])

# Split the 2D array into two subarrays along rows
# (the number of rows must be divisible by the number of sections)
subarrays = np.vsplit(arr_2d, 2)

print(subarrays)
DSplit: np.dsplit() splits an array along its third axis (depth). The dsplit method requires an array with at least three dimensions and raises an error otherwise.
arr_3d = np.array([[[1, 2, 3], [4, 5, 6]], [[7, 8, 9], [10, 11, 12]]])

# Split the 3D array into three subarrays along the depth axis
subarrays = np.dsplit(arr_3d, 3)

print(subarrays)
Array Search
Searching for specific elements or elements based on some conditions within NumPy arrays is a common task in data manipulation and analysis. NumPy offers powerful and efficient tools for searching through arrays, enabling you to locate elements, filter data, and perform advanced searches. Let's begin with the fundamental task of searching for specific elements within a NumPy array.
Using Boolean Indexing: Boolean indexing lets you build a boolean mask that marks which elements meet a particular condition, and then use that mask to select those elements.
arr = np.array([1, 2, 3, 4, 5, 6, 7, 8, 9, 10])

# Search for elements greater than 5
result = arr[arr > 5]

print(result)
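Masks can also be stored in a variable and combined. Here is a brief sketch, with illustrative variable names, that selects elements matching two conditions at once. The element-wise operators `&` and `|` are used instead of `and` and `or`, and each condition needs its own parentheses.

```python
import numpy as np

arr = np.array([1, 2, 3, 4, 5, 6, 7, 8, 9, 10])

# Elements that are greater than 3 AND even
mask = (arr > 3) & (arr % 2 == 0)
selected = arr[mask]

# Summing a boolean mask counts the matching elements
count = int(mask.sum())

print(mask)
print(selected, count)
```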
np.where(): Called with only a condition, np.where() returns the indices of the elements that satisfy it, so you can use it to search for elements and obtain their positions. Called with three arguments, np.where(condition, x, y) instead builds a new array that takes values from x where the condition is true and from y elsewhere; a scalar such as 0 is broadcast to the shape of the array.
arr = np.array([1, 2, 3, 4, 5, 6, 7, 8, 9, 10])

# Find indices of even elements
indices = np.where(arr % 2 == 0)
print(indices)

# Replace the odd values with 0 (the scalar 0 is broadcast)
result = np.where(arr % 2 == 1, 0, arr)
print(result)
np.searchsorted(): The np.searchsorted() function finds the indices at which elements should be inserted into a sorted array to keep it in order. It assumes that the input array is already sorted.
arr = np.array([1, 3, 4, 8, 9])

# Find the index where 6 should be inserted to maintain array order
index = np.searchsorted(arr, 6)

print(index)
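When the sorted array contains duplicates, the `side` parameter decides whether the insertion point falls before or after the equal values. The sketch below uses a made-up array and also shows that several values can be looked up in one call.

```python
import numpy as np

sorted_arr = np.array([1, 3, 3, 3, 8, 9])

# Insertion point before (left) or after (right) the run of 3s
left = np.searchsorted(sorted_arr, 3, side="left")
right = np.searchsorted(sorted_arr, 3, side="right")
print(left, right)

# Several values can be searched at once
many = np.searchsorted(sorted_arr, [0, 4, 10])
print(many)
```

The difference `right - left` gives the number of occurrences of the value in the sorted array.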
Array Sort
Sorting allows you to arrange data in a meaningful order, making it easier to analyze, visualize, and process. NumPy's array sorting capabilities are crucial for various data manipulation tasks, such as finding the minimum or maximum values, identifying outliers, and organizing data for visualization.
np.sort: The np.sort() function returns a sorted copy of the array and leaves the original unchanged.
arr = np.array([3, 1, 4, 1, 5, 9, 2, 6, 5, 3, 5])

# Sorting array
sorted_arr = np.sort(arr)

print(sorted_arr)
np.argsort: The np.argsort() function returns the indices that would sort the array. This is useful when you want to access the original array in sorted order.
arr = np.array([3, 1, 4, 1, 5, 9, 2, 6, 5, 3, 5])

sorted_indices = np.argsort(arr)

sorted_arr = arr[sorted_indices]

print(arr)
print(sorted_arr)
print(sorted_indices)
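A common use of np.argsort is reordering one array by the values of another. The following sketch assumes two parallel arrays; the `scores` values are invented for illustration.

```python
import numpy as np

names = np.array(["Max", "Luffy", "Lucy", "David"])
scores = np.array([72, 95, 88, 60])  # hypothetical scores, one per name

# Indices that sort the scores from lowest to highest
order = np.argsort(scores)

# Apply the same ordering to the names
ascending = names[order]
descending = names[order[::-1]]

print(ascending)
print(descending)
```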
Sorting Structured Arrays by Columns: NumPy allows you to sort structured arrays by specific fields or columns.
data = np.array([(3, 'Max'), (1, 'Luffy'), (4, 'Lucy'), (2, 'David')],
                dtype=[('age', int), ('name', 'U10')])

sorted_data = np.sort(data, order='age')

print(data)

print(sorted_data)
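The `order` argument also accepts a list of field names, which sorts by the first field and breaks ties with the next one. Here is a minimal sketch with made-up records in which some ages repeat.

```python
import numpy as np

people = np.array(
    [(30, "Max"), (25, "Luffy"), (30, "Lucy"), (25, "David")],
    dtype=[("age", int), ("name", "U10")],
)

# Sort by age first, then alphabetically by name within the same age
by_age_then_name = np.sort(people, order=["age", "name"])
names = by_age_then_name["name"].tolist()

print(by_age_then_name)
```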
Conclusion
In this final installment of our "NumPy Explained" series, we've delved even deeper into NumPy. Throughout this series, we've explored the diverse world of NumPy, starting with its essential features, such as Importing and Creating arrays, Array Iteration, Joining Arrays, and much more. As we conclude this series, we hope you've gained a solid understanding of NumPy's capabilities and how it forms the foundation of data analysis in Python. It's not just a library; it's the heart of data analysis, offering the efficiency and flexibility needed to tackle real-world data challenges. Thank you for joining us on this NumPy exploration. Stay curious, keep coding, and embrace the power of Python and NumPy in your data analysis endeavors.