Pandas Notes - A Reference¶

The dataset being used here is a sample Sales dataset.

[1]:
import pandas as pd
pd.set_option('display.max_rows', 500)

df = pd.read_csv("./data/sales_data_sample.csv", encoding='Latin-1')
  • Get rows with NaNs in any column

[2]:
df[df.isna().any(axis=1)]
[2]:
ORDERNUMBER QUANTITYORDERED PRICEEACH ORDERLINENUMBER SALES ORDERDATE STATUS QTR_ID MONTH_ID YEAR_ID ... ADDRESSLINE1 ADDRESSLINE2 CITY STATE POSTALCODE COUNTRY TERRITORY CONTACTLASTNAME CONTACTFIRSTNAME DEALSIZE
0 10107 30 95.70 2 2871.00 2/24/2003 0:00 Shipped 1 2 2003 ... 897 Long Airport Avenue NaN NYC NY 10022 USA NaN Yu Kwai Small
1 10121 34 81.35 5 2765.90 5/7/2003 0:00 Shipped 2 5 2003 ... 59 rue de l'Abbaye NaN Reims NaN 51100 France EMEA Henriot Paul Small
2 10134 41 94.74 2 3884.34 7/1/2003 0:00 Shipped 3 7 2003 ... 27 rue du Colonel Pierre Avia NaN Paris NaN 75508 France EMEA Da Cunha Daniel Medium
3 10145 45 83.26 6 3746.70 8/25/2003 0:00 Shipped 3 8 2003 ... 78934 Hillside Dr. NaN Pasadena CA 90003 USA NaN Young Julie Medium
4 10159 49 100.00 14 5205.27 10/10/2003 0:00 Shipped 4 10 2003 ... 7734 Strong St. NaN San Francisco CA NaN USA NaN Brown Julie Medium
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
2818 10350 20 100.00 15 2244.40 12/2/2004 0:00 Shipped 4 12 2004 ... C/ Moralzarzal, 86 NaN Madrid NaN 28034 Spain EMEA Freyre Diego Small
2819 10373 29 100.00 1 3978.51 1/31/2005 0:00 Shipped 1 1 2005 ... Torikatu 38 NaN Oulu NaN 90110 Finland EMEA Koskitalo Pirkko Medium
2820 10386 43 100.00 4 5417.57 3/1/2005 0:00 Resolved 1 3 2005 ... C/ Moralzarzal, 86 NaN Madrid NaN 28034 Spain EMEA Freyre Diego Medium
2821 10397 34 62.24 1 2116.16 3/28/2005 0:00 Shipped 1 3 2005 ... 1 rue Alsace-Lorraine NaN Toulouse NaN 31000 France EMEA Roulet Annette Small
2822 10414 47 65.52 9 3079.44 5/6/2005 0:00 On Hold 2 5 2005 ... 8616 Spinnaker Dr. NaN Boston MA 51003 USA NaN Yoshido Juri Medium

2676 rows × 25 columns

  • Check non-null counts and Dtypes per column

[3]:
df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2823 entries, 0 to 2822
Data columns (total 25 columns):
 #   Column            Non-Null Count  Dtype
---  ------            --------------  -----
 0   ORDERNUMBER       2823 non-null   int64
 1   QUANTITYORDERED   2823 non-null   int64
 2   PRICEEACH         2823 non-null   float64
 3   ORDERLINENUMBER   2823 non-null   int64
 4   SALES             2823 non-null   float64
 5   ORDERDATE         2823 non-null   object
 6   STATUS            2823 non-null   object
 7   QTR_ID            2823 non-null   int64
 8   MONTH_ID          2823 non-null   int64
 9   YEAR_ID           2823 non-null   int64
 10  PRODUCTLINE       2823 non-null   object
 11  MSRP              2823 non-null   int64
 12  PRODUCTCODE       2823 non-null   object
 13  CUSTOMERNAME      2823 non-null   object
 14  PHONE             2823 non-null   object
 15  ADDRESSLINE1      2823 non-null   object
 16  ADDRESSLINE2      302 non-null    object
 17  CITY              2823 non-null   object
 18  STATE             1337 non-null   object
 19  POSTALCODE        2747 non-null   object
 20  COUNTRY           2823 non-null   object
 21  TERRITORY         1749 non-null   object
 22  CONTACTLASTNAME   2823 non-null   object
 23  CONTACTFIRSTNAME  2823 non-null   object
 24  DEALSIZE          2823 non-null   object
dtypes: float64(2), int64(7), object(16)
memory usage: 551.5+ KB
  • Get counts of number of Nulls per column

[4]:
df.isnull().sum().reset_index()
[4]:
index 0
0 ORDERNUMBER 0
1 QUANTITYORDERED 0
2 PRICEEACH 0
3 ORDERLINENUMBER 0
4 SALES 0
5 ORDERDATE 0
6 STATUS 0
7 QTR_ID 0
8 MONTH_ID 0
9 YEAR_ID 0
10 PRODUCTLINE 0
11 MSRP 0
12 PRODUCTCODE 0
13 CUSTOMERNAME 0
14 PHONE 0
15 ADDRESSLINE1 0
16 ADDRESSLINE2 2521
17 CITY 0
18 STATE 1486
19 POSTALCODE 76
20 COUNTRY 0
21 TERRITORY 1074
22 CONTACTLASTNAME 0
23 CONTACTFIRSTNAME 0
24 DEALSIZE 0
  • Get dataframe excluding all NaNs in a particular Column

[5]:
no_nan_addresses = df[df['ADDRESSLINE2'].notnull()]

len(no_nan_addresses)
[5]:
302