Python water quality EDA and Potability analysis | by James McNeill | Jul, 2023


Understanding the data

First, we need to understand the data we are working with. Because the file is a CSV, we load it with the standard pandas function read_csv.

# Import pandas, then load the dataset for review as a DataFrame
import pandas as pd
df = pd.read_csv("../input/water-potability/water_potability.csv")

# Review the first five observations
df.head()

The code assigns the DataFrame returned by read_csv to the variable df.

As with any dataset you process, reviewing a sample of records helps you get comfortable with the data. A DataFrame has a large number of methods, and the pandas API reference is a great resource for exploring them. One of these is the head method, which returns the first five rows by default, as shown in output 1.1. To display a different number of rows, pass an integer inside the parentheses, for example df.head(10). Two alternatives for sampling the DataFrame are i) sample (df.sample()), which selects rows at random (one row by default), and ii) tail (df.tail()), which selects the last n rows (five by default).

Output 1.1 First five record details from the DataFrame

When calling a method, the parentheses after the method name tell the Python interpreter to execute it and return the result.
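To make the row-sampling calls discussed above concrete, here is a minimal sketch. The DataFrame named demo and its values are hypothetical stand-ins, since the article's CSV file is not loaded here; only head, tail and sample are taken from the text.

```python
import pandas as pd

# Hypothetical stand-in data; not the article's water_potability.csv
demo = pd.DataFrame({
    "ph": [7.0, 6.5, 8.1, 7.4, 6.9, 7.2, 8.0],
    "Potability": [0, 1, 0, 1, 0, 0, 1],
})

first_three = demo.head(3)                      # first 3 rows instead of the default 5
last_two = demo.tail(2)                         # last 2 rows
random_two = demo.sample(n=2, random_state=42)  # 2 random rows; the seed makes the draw repeatable

print(first_three)
print(last_two)
print(random_two)
```

Passing random_state is optional; it is included here only so that repeated runs select the same rows.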

Checking the memory usage of a DataFrame is a common task, particularly when memory is constrained, for example when the dataset to import may be larger than the memory available to the Python session. pandas holds a DataFrame entirely in memory, so users should understand how much memory is available when performing these processing steps.

# Display information about the DataFrame - contains memory details
df.info(memory_usage="deep")

The code above produces output 1.2. Passing memory_usage="deep" makes pandas inspect the actual contents of each column when reporting memory usage. By default, pandas reports a quicker estimate based on each column's data type and the number of rows, which can understate the footprint of object columns such as strings. If you need an accurate figure, include the argument shown above.

Output 1.2 Provides an overview of the features and details of memory usage
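The difference between the default estimate and the deep measurement can be seen with a small sketch. The demo DataFrame and its source column are hypothetical; an object column of strings is used on purpose, because that is where the two figures diverge. The related method memory_usage is used here so the byte counts can be compared directly.

```python
import pandas as pd

# Hypothetical data: one numeric column and one object (string) column
demo = pd.DataFrame({
    "ph": [7.0, 6.5, 8.1],
    "source": pd.Series(["river", "reservoir", "groundwater"], dtype=object),
})

# Shallow figure: counts only the 8-byte references held in the object column
shallow_bytes = demo.memory_usage(deep=False).sum()

# Deep figure: also counts the Python string objects those references point to
deep_bytes = demo.memory_usage(deep=True).sum()

print(f"shallow: {shallow_bytes} bytes, deep: {deep_bytes} bytes")
```

For a DataFrame that contains only numeric columns, the two figures would be expected to match, since there are no Python objects to inspect.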

Output 1.2 shows a range of details: the column names and data types, the class of the object, and the number of non-null values in each column. The table contains 3,276 rows in total, but the Sulfate column has only 2,495 non-null values, which leaves 781 missing. These missing values can be reviewed to see whether they follow a pattern shared with other columns. We will review a data visualization technique later in the article that can help with pattern recognition.
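The missing-value counts described above can also be computed directly rather than read off the info output. This is a minimal sketch on hypothetical data; the column names echo the dataset, but the values are invented for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical values; only the column names echo the article's dataset
demo = pd.DataFrame({
    "ph": [7.0, np.nan, 8.1, 7.4],
    "Sulfate": [368.5, np.nan, np.nan, 356.9],
    "Potability": [0, 0, 1, 1],
})

missing_counts = demo.isna().sum()   # number of missing values per column
missing_share = demo.isna().mean()   # fraction of missing values per column

print(missing_counts)
print(missing_share)
```

Applied to the article's df, the same two calls would give a per-column count and share of missing entries.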

At the import step, users could have set the Dtype of a column explicitly if the defaults were not what was expected. The results above show that decimal numbers are given a float Dtype and whole numbers an int Dtype. pandas also defaults to the widest standard size for these numeric columns (64-bit, or 8 bytes per value) so that the full range of potential input values is covered. Users should check whether each column really needs that range; if a smaller range is expected going forward, a smaller type could be assigned. Doing so reduces the memory footprint of the DataFrame and can improve processing performance.
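As a rough illustration of assigning smaller types at import, the sketch below reads the same small CSV twice, once with defaults and once with the dtype argument of read_csv. The CSV text and the chosen types (float32, int8) are assumptions for illustration, not the author's choices.

```python
import io

import pandas as pd

# Hypothetical CSV content standing in for the real file
csv_text = "Hardness,Potability\n204.89,0\n129.42,1\n224.23,0\n"

# Default import: pandas picks 64-bit types for numeric columns
default_df = pd.read_csv(io.StringIO(csv_text))

# Explicit import: request narrower types for each column
compact_df = pd.read_csv(
    io.StringIO(csv_text),
    dtype={"Hardness": "float32", "Potability": "int8"},
)

default_bytes = default_df.memory_usage(deep=True).sum()
compact_bytes = compact_df.memory_usage(deep=True).sum()

print(default_df.dtypes)
print(compact_df.dtypes)
print(f"default: {default_bytes} bytes, compact: {compact_bytes} bytes")
```

Narrower types trade range and precision for memory: float32 keeps roughly seven significant digits and int8 only holds values from -128 to 127, so the choice should be checked against the values each column actually contains.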

The info method also reveals the structure of the DataFrame, which several other attributes and functions expose directly. This metadata lets programmers check the basic dimensions: the number of rows and columns.

# Shape of the DataFrame - shows tuple of (#Rows, #Columns)
print(df.shape)
# Find the number of rows within a DataFrame
print(len(df))
# Extracting information from the shape tuple
print(f'Number of rows: {df.shape[0]} \nNumber of columns: {df.shape[1]}')

When accessing an attribute in Python, such as shape, no parentheses are required. An attribute is a value stored on an object and is read directly by name, whereas a method, as reviewed earlier, is a function defined within a class and must be called with parentheses. A deeper dive into how Python classes work would be needed to cover the finer details. For now, we can continue with the code above, which produces the values displayed in output 1.3.

Output 1.3 Metadata showing the structure of the DataFrame
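The distinction between an attribute and a method can be sketched with a tiny class. WaterSample, its fields and its is_neutral method are hypothetical names invented for this illustration; they are not part of pandas or the dataset.

```python
class WaterSample:
    """Hypothetical class used only to contrast attributes and methods."""

    def __init__(self, ph, hardness):
        # Attributes: plain values stored on the object
        self.ph = ph
        self.hardness = hardness

    def is_neutral(self, tolerance=0.5):
        # Method: a function defined in the class; it must be called with ()
        return abs(self.ph - 7.0) <= tolerance


sample = WaterSample(ph=7.2, hardness=180.0)

ph_value = sample.ph            # attribute access: no parentheses
neutral = sample.is_neutral()   # method call: parentheses run the function

print(ph_value, neutral)
```

In the same way, df.shape reads a stored value while df.head() runs a function and returns its result.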

The first row of output 1.3 shows the shape attribute, which is a tuple: two values enclosed in parentheses. The code above accesses each position within this tuple to display the first and second values. Because Python uses zero-based indexing, the first value is returned by placing 0 inside square brackets and the second by placing 1. The tuple holds the number of rows in the first position, followed by the number of columns in the second. An alternative way to find the number of rows is the built-in function len, which returns the length of the DataFrame.
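As an alternative to indexing the tuple by position, the two values can be unpacked into named variables in one step. This is a small sketch on a hypothetical DataFrame called demo, not the article's df.

```python
import pandas as pd

# Hypothetical stand-in data with 3 rows and 2 columns
demo = pd.DataFrame({"ph": [7.0, 6.5, 8.1], "Potability": [0, 1, 0]})

# Tuple unpacking: the first element goes to n_rows, the second to n_cols
n_rows, n_cols = demo.shape

print(f"Number of rows: {n_rows} \nNumber of columns: {n_cols}")
print(len(demo) == n_rows)            # len returns the row count
print(len(demo.columns) == n_cols)    # the columns index gives the column count
```

Unpacking gives each dimension a descriptive name, which can be easier to read than shape[0] and shape[1] further down a script.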


