Pandas Notes - A Reference¶
The dataset being used here is a sample Sales dataset.
[1]:
import pandas as pd
pd.set_option('display.max_rows', 500)
df = pd.read_csv("./data/sales_data_sample.csv", encoding='Latin-1')
Get rows with NaNs in any column
[2]:
df[df.isna().any(axis=1)]
[2]:
ORDERNUMBER | QUANTITYORDERED | PRICEEACH | ORDERLINENUMBER | SALES | ORDERDATE | STATUS | QTR_ID | MONTH_ID | YEAR_ID | ... | ADDRESSLINE1 | ADDRESSLINE2 | CITY | STATE | POSTALCODE | COUNTRY | TERRITORY | CONTACTLASTNAME | CONTACTFIRSTNAME | DEALSIZE | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 10107 | 30 | 95.70 | 2 | 2871.00 | 2/24/2003 0:00 | Shipped | 1 | 2 | 2003 | ... | 897 Long Airport Avenue | NaN | NYC | NY | 10022 | USA | NaN | Yu | Kwai | Small |
1 | 10121 | 34 | 81.35 | 5 | 2765.90 | 5/7/2003 0:00 | Shipped | 2 | 5 | 2003 | ... | 59 rue de l'Abbaye | NaN | Reims | NaN | 51100 | France | EMEA | Henriot | Paul | Small |
2 | 10134 | 41 | 94.74 | 2 | 3884.34 | 7/1/2003 0:00 | Shipped | 3 | 7 | 2003 | ... | 27 rue du Colonel Pierre Avia | NaN | Paris | NaN | 75508 | France | EMEA | Da Cunha | Daniel | Medium |
3 | 10145 | 45 | 83.26 | 6 | 3746.70 | 8/25/2003 0:00 | Shipped | 3 | 8 | 2003 | ... | 78934 Hillside Dr. | NaN | Pasadena | CA | 90003 | USA | NaN | Young | Julie | Medium |
4 | 10159 | 49 | 100.00 | 14 | 5205.27 | 10/10/2003 0:00 | Shipped | 4 | 10 | 2003 | ... | 7734 Strong St. | NaN | San Francisco | CA | NaN | USA | NaN | Brown | Julie | Medium |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
2818 | 10350 | 20 | 100.00 | 15 | 2244.40 | 12/2/2004 0:00 | Shipped | 4 | 12 | 2004 | ... | C/ Moralzarzal, 86 | NaN | Madrid | NaN | 28034 | Spain | EMEA | Freyre | Diego | Small |
2819 | 10373 | 29 | 100.00 | 1 | 3978.51 | 1/31/2005 0:00 | Shipped | 1 | 1 | 2005 | ... | Torikatu 38 | NaN | Oulu | NaN | 90110 | Finland | EMEA | Koskitalo | Pirkko | Medium |
2820 | 10386 | 43 | 100.00 | 4 | 5417.57 | 3/1/2005 0:00 | Resolved | 1 | 3 | 2005 | ... | C/ Moralzarzal, 86 | NaN | Madrid | NaN | 28034 | Spain | EMEA | Freyre | Diego | Medium |
2821 | 10397 | 34 | 62.24 | 1 | 2116.16 | 3/28/2005 0:00 | Shipped | 1 | 3 | 2005 | ... | 1 rue Alsace-Lorraine | NaN | Toulouse | NaN | 31000 | France | EMEA | Roulet | Annette | Small |
2822 | 10414 | 47 | 65.52 | 9 | 3079.44 | 5/6/2005 0:00 | On Hold | 2 | 5 | 2005 | ... | 8616 Spinnaker Dr. | NaN | Boston | MA | 51003 | USA | NaN | Yoshido | Juri | Medium |
2676 rows × 25 columns
Check non-null counts and Dtypes per column
[3]:
df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2823 entries, 0 to 2822
Data columns (total 25 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 ORDERNUMBER 2823 non-null int64
1 QUANTITYORDERED 2823 non-null int64
2 PRICEEACH 2823 non-null float64
3 ORDERLINENUMBER 2823 non-null int64
4 SALES 2823 non-null float64
5 ORDERDATE 2823 non-null object
6 STATUS 2823 non-null object
7 QTR_ID 2823 non-null int64
8 MONTH_ID 2823 non-null int64
9 YEAR_ID 2823 non-null int64
10 PRODUCTLINE 2823 non-null object
11 MSRP 2823 non-null int64
12 PRODUCTCODE 2823 non-null object
13 CUSTOMERNAME 2823 non-null object
14 PHONE 2823 non-null object
15 ADDRESSLINE1 2823 non-null object
16 ADDRESSLINE2 302 non-null object
17 CITY 2823 non-null object
18 STATE 1337 non-null object
19 POSTALCODE 2747 non-null object
20 COUNTRY 2823 non-null object
21 TERRITORY 1749 non-null object
22 CONTACTLASTNAME 2823 non-null object
23 CONTACTFIRSTNAME 2823 non-null object
24 DEALSIZE 2823 non-null object
dtypes: float64(2), int64(7), object(16)
memory usage: 551.5+ KB
Get counts of number of Nulls per column
[4]:
df.isnull().sum().reset_index()
[4]:
index | 0 | |
---|---|---|
0 | ORDERNUMBER | 0 |
1 | QUANTITYORDERED | 0 |
2 | PRICEEACH | 0 |
3 | ORDERLINENUMBER | 0 |
4 | SALES | 0 |
5 | ORDERDATE | 0 |
6 | STATUS | 0 |
7 | QTR_ID | 0 |
8 | MONTH_ID | 0 |
9 | YEAR_ID | 0 |
10 | PRODUCTLINE | 0 |
11 | MSRP | 0 |
12 | PRODUCTCODE | 0 |
13 | CUSTOMERNAME | 0 |
14 | PHONE | 0 |
15 | ADDRESSLINE1 | 0 |
16 | ADDRESSLINE2 | 2521 |
17 | CITY | 0 |
18 | STATE | 1486 |
19 | POSTALCODE | 76 |
20 | COUNTRY | 0 |
21 | TERRITORY | 1074 |
22 | CONTACTLASTNAME | 0 |
23 | CONTACTFIRSTNAME | 0 |
24 | DEALSIZE | 0 |
Get dataframe excluding all NaNs in a particular Column
[5]:
no_nan_addresses = df[df['ADDRESSLINE2'].notnull()]
len(no_nan_addresses)
[5]:
302