Getting Started With Python Data Analysis Library Pandas
Exploring pandas functionality

Explore data analysis with Python. Pandas DataFrames make manipulating your data easy, from selecting or replacing columns and indices to reshaping your data.
Pandas is a popular Python package for data science, and with good reason: it offers powerful, expressive and flexible data structures that make data manipulation and analysis easy, among many other things. The DataFrame is one of these structures.
Let’s get started…
1. How To Create a Pandas DataFrame
There are two core objects in pandas: the DataFrame and the Series.
DataFrame
A DataFrame is a table. It contains an array of individual entries, each of which has a certain value. Each entry corresponds to a row (or record) and a column.
For example, consider the following simple DataFrame:
code:
import pandas as pd
pd.DataFrame({'Yes': [50, 21], 'No': [131, 2]})
The output is a small table with two columns, Yes and No, and two rows labeled 0 and 1.
DataFrame entries are not limited to integers. For instance, here’s a DataFrame whose values are strings:
code:
df1 = pd.DataFrame({'Bob': ['I liked it.', 'It was awful.'], 'Sue': ['Pretty good.', 'Bland.']})
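The dictionary-list constructor assigns the column labels, but just uses an ascending count from 0 for the row labels. Sometimes we want to assign these ourselves, which we can do with the index parameter. A minimal sketch, reusing the review data from above:

```python
import pandas as pd

# Assign row labels explicitly with the index parameter
reviews = pd.DataFrame(
    {'Bob': ['I liked it.', 'It was awful.'],
     'Sue': ['Pretty good.', 'Bland.']},
    index=['Product A', 'Product B'])
```

Now the rows can be addressed by name, e.g. reviews.loc['Product A'].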
Series
A Series is a sequence of data values. If a DataFrame is a table, a Series is a list. And in fact you can create one with nothing more than a list:
code:
s1 = pd.Series([1, 2, 3, 4, 5])
s1
A Series is, in essence, a single column of a DataFrame. So you can assign row labels to the Series the same way as before, using an index parameter. However, a Series does not have a column name; it only has one overall name:
code:
s2 = pd.Series([30, 35, 40], index=['2015 Sales', '2016 Sales', '2017 Sales'], name='Product A')
2. How To Read The Data File
Being able to create a DataFrame or Series by hand is handy. But, most of the time, we won’t actually be creating our own data by hand. Instead, we’ll be working with data that already exists.
Let’s now set aside our toy datasets and see what a real dataset looks like when we read it into a DataFrame. We’ll use the pd.read_csv() function to read the data into a DataFrame.
code:
wine_review = pd.read_csv("C:/wine-reviews/winemag-data-130k-v2.csv")
Note: the data file (the Wine Reviews dataset) can be downloaded from Kaggle.
We can examine the contents of the resultant DataFrame using the head() command, which grabs the first five rows:
code:
wine_review.head()
3. Indexing in pandas
The indexing operator and attribute selection are nice because they work just like they do in the rest of the Python ecosystem. As a novice, this makes them easy to pick up and use. However, pandas has its own accessor operators, loc and iloc. For more advanced operations, these are the ones you’re supposed to be using.
Index-based selection:
Pandas indexing works in one of two paradigms. The first is index-based selection: selecting data based on its numerical position in the data. iloc follows this paradigm.
To select the first row of data in a DataFrame, we may use the following:
code:
wine_review.iloc[0]
Both loc and iloc are row-first, column-second. This is the opposite of what we do in native Python, which is column-first, row-second.
This means that it’s marginally easier to retrieve rows, and marginally harder to retrieve columns. To get a column with iloc, we can do the following:
code:
wine_review.iloc[:, 1]
On its own, the : operator, which also comes from native Python, means "everything". When combined with other selectors, however, it can be used to indicate a range of values. For example, to select the country column from just the first, second, and third row, we would do:
code:
wine_review.iloc[:3, 1]
Or, to select just the second and third entries, we would do:
code:
wine_review.iloc[1:3, 1]
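iloc also accepts a list of positions, and negative numbers count backwards from the end of the data, just as in native Python. A quick sketch on a toy Series (not the wine dataset):

```python
import pandas as pd

s = pd.Series([10, 20, 30, 40, 50])

# A list of positions selects exactly those entries
picked = s.iloc[[0, 2, 4]]

# Negative slicing counts from the end: the last two entries
tail = s.iloc[-2:]
```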
Label-based selection:
The second paradigm for attribute selection is the one followed by the loc operator: label-based selection. In this paradigm, it’s the data index value, not its position, which matters.
For example, to get the first entry in the file, we would now do the following:
code:
wine_review.loc[0, 'country']
iloc is conceptually simpler than loc because it ignores the dataset’s indices. When we use iloc we treat the dataset like a big matrix (a list of lists), one that we have to index into by position. loc, by contrast, uses the information in the indices to do its work. Since your dataset usually has meaningful indices, it’s usually easier to do things using loc instead. For example, here’s one operation that’s much easier using loc:
code:
wine_review.loc[:, ['taster_name', 'taster_twitter_handle', 'points']]
Choosing between loc and iloc:
When choosing or transitioning between loc and iloc, there is one “gotcha” worth keeping in mind, which is that the two methods use slightly different indexing schemes.
iloc uses the Python stdlib indexing scheme, where the first element of the range is included and the last one excluded. So 0:10 will select entries 0,...,9. loc, meanwhile, indexes inclusively. So 0:10 will select entries 0,...,10.
Why the change? Remember that loc can index any stdlib type: strings, for example. If we have a DataFrame with index values Apples, ..., Potatoes, ..., and we want to select "all the alphabetical fruit choices between Apples and Potatoes", then it's a lot more convenient to index df.loc['Apples':'Potatoes'] than it is to index something like df.loc['Apples':'Potatoet'] (t coming after s in the alphabet).
This is particularly confusing when the DataFrame index is a simple numerical list, e.g. 0,...,1000. In this case df.iloc[0:1000] will return 1000 entries, while df.loc[0:1000] will return 1001 of them! To get 1000 elements using loc, you will need to go one lower and ask for df.loc[0:999].
Otherwise, the semantics of using loc are the same as those for iloc.
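The off-by-one difference is easy to verify on a small Series (a sketch, not the wine dataset):

```python
import pandas as pd

# A Series with the default integer index 0..4
s = pd.Series(range(5))

n_iloc = len(s.iloc[0:3])  # exclusive: positions 0, 1, 2
n_loc = len(s.loc[0:3])    # inclusive: labels 0, 1, 2, 3
```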
4. Conditional selection
So far we’ve been indexing various strides of data, using structural properties of the DataFrame itself. To do interesting things with the data, however, we often need to ask questions based on conditions.
For example, suppose that we’re interested specifically in better-than-average wines produced in Italy.
We can start by checking if each wine is Italian or not:
code:
wine_review.country == 'Italy'
This operation produces a Series of True/False booleans based on the country of each record. This result can then be used inside loc to select the relevant data.
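As a sketch with a small stand-in DataFrame (hypothetical data, not the wine reviews), the boolean Series goes straight into loc:

```python
import pandas as pd

df = pd.DataFrame({'country': ['Italy', 'US', 'Italy'],
                   'points': [87, 92, 95]})

# The comparison yields a boolean Series; loc keeps the True rows
italian = df.loc[df.country == 'Italy']
```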
We also wanted to know which ones are better than average. Wines are reviewed on an 80-to-100 point scale, so this could mean wines that accrued at least 90 points.
We can use the ampersand (&) to bring the two questions together:
code:
wine_review.loc[(wine_review.country == 'Italy') & (wine_review.points >= 90)]
Suppose we’ll buy any wine that’s made in Italy or which is rated above average. For this we use a pipe (|):
code:
wine_review.loc[(wine_review.country == 'Italy') | (wine_review.points >= 90)]
Pandas also comes with built-in methods for missing data: isnull() and its companion notnull(). These let you highlight values which are (or are not) empty (NaN). For example, to filter out wines lacking a price tag in the dataset, here's what we would do:
code:
wine_review.loc[wine_review.price.notnull()]
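Another handy built-in selector is isin, which keeps rows whose value appears in a given list. A minimal sketch with a toy DataFrame (hypothetical data, not the wine reviews):

```python
import pandas as pd

df = pd.DataFrame({'country': ['Italy', 'US', 'France', 'Chile']})

# isin keeps rows whose value appears in the given list
subset = df.loc[df.country.isin(['Italy', 'France'])]
```

On the wine dataset, the equivalent call would select, say, all Italian and French wines in one step.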
All of the code above can be run in a notebook environment such as Google Colab.

