DataFrame¶
You can think of a DataFrame as a spreadsheet or SQL table. It is the most commonly used pandas object. Rows are referred to as axis 0 and columns as axis 1, and each axis can have multiple levels (though multiple levels are uncommon).
Creating a DataFrame¶
The example below creates the sample DataFrame used throughout this page
import pandas as pd
df = pd.DataFrame(data=[['apple', 'fruit', 85, 3],  # natural line breaking is allowed for code within parentheses ()
                        ['orange', 'fruit', 68, 1],
                        ['banana', 'fruit', 69, 2],
                        ['beetroot', 'vegetable', 68, 1],
                        ['carrot', 'vegetable', 45, 2]],
                  columns=['name', 'type', 'health_index', 'cost_index'])
df # prints the dataframe
Result: {F34356671}
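Since the rendered table is attached rather than reproduced inline, here is a quick self-contained sanity check of the frame built above:

```python
import pandas as pd

df = pd.DataFrame(data=[['apple', 'fruit', 85, 3],
                        ['orange', 'fruit', 68, 1],
                        ['banana', 'fruit', 69, 2],
                        ['beetroot', 'vegetable', 68, 1],
                        ['carrot', 'vegetable', 45, 2]],
                  columns=['name', 'type', 'health_index', 'cost_index'])

print(df.shape)          # (5, 4)
print(list(df.columns))  # ['name', 'type', 'health_index', 'cost_index']
```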
Datatypes & conversions¶
Available datatypes¶
object (string, lists, dict, tuple and other complex objects), int64, float64, datetime64
#string
s='hello world!' #s[0:2] returns 'he', s[:-1] returns 'hello world'
#list
a=[1,4,5,1,3,6] # a[0] gets the first element; a[-1] gets the last. Elements can be added with a.append(elem), removed by position with a.pop(i), or removed by value with a.remove(val)
#dict
d={'movie_name':'star wars', 'rating':5} # d['movie_name'] gets 'star wars'. Dict elements can be added or removed using d['release_year']=1977 or del d['rating']
#tuple
t=('apple','iphone','7 plus') # similar to a list, t[0] gets the first element. Tuples are immutable: values cannot be added or changed
#complex objects (combination of two or more of above)
c={'product':('apple','iphone','7 plus'),'features':['bluetooth','wifi']}
Get data types for each column in a data frame¶
df.dtypes # shows the data type of each column in the dataframe df
Convert int column to float¶
You can convert each series independently using the series.astype function
df['mycol'] = df.mycol.astype(float)
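A minimal sketch of astype on a column like the sample frame's health_index:

```python
import pandas as pd

df = pd.DataFrame({'health_index': [85, 68, 69, 68, 45]})
df['health_index'] = df.health_index.astype(float)
print(df.health_index.dtype)  # float64
```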
Convert string/object to numeric¶
Pandas auto-detects data types. A numeric column with nulls or a few string values is usually treated as object. This piece of code replaces those values with 0 and converts the column to int
import numpy as np
df.ID = (pd.to_numeric(df.ID, errors='coerce')  # coerce converts non-numeric values to NaN
         .fillna(0)                             # replaces NaN with 0
         .astype(np.int64))                     # type conversion
If there are no issues in your column, you can just run pd.to_numeric like shown below. It will convert to float if you have decimal values or int otherwise.
df['mycol'] = pd.to_numeric(df.mycol)
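A runnable sketch of the coerce-fill-convert chain above, using a hypothetical messy column with one bad value and one null:

```python
import pandas as pd
import numpy as np

s = pd.Series(['1', '2', 'bad', None])  # hypothetical ID-like column
cleaned = pd.to_numeric(s, errors='coerce').fillna(0).astype(np.int64)
print(cleaned.tolist())  # [1, 2, 0, 0]
```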
Other conversions¶
dict(obj) # converts a mapping or an iterable of key-value pairs to a dict
list(tup) # converts tuple to list
tuple(list) # converts list to tuple
Select columns¶
Simple select¶
df[col1] # Returns series
df[[col1]] # Returns dataframe with 1 col
df.loc[:,[col1]] #Same as above, : refers to all rows
df.iloc[:,[0]] # Returns dataframe with 1st col
df[[col1,col2]] # Returns dataframe with 2 cols
df.loc[:,[col1,col2]] # Same as above
df.iloc[:,0:3] # Returns dataframe with first 3 columns (0,1,2). Note: slices use plain colon syntax, not a list
More creative selects from a list: https://teamtreehouse.com/community/how-do-i-get-the-first-4-and-last-4-items-of-a-list
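The single-bracket vs double-bracket distinction can be checked directly; a sketch using a cut-down version of this page's sample frame:

```python
import pandas as pd

df = pd.DataFrame({'name': ['apple', 'orange'],
                   'type': ['fruit', 'fruit'],
                   'health_index': [85, 68],
                   'cost_index': [3, 1]})

print(type(df['name']).__name__)      # Series
print(type(df[['name']]).__name__)    # DataFrame
print(list(df.iloc[:, 0:3].columns))  # ['name', 'type', 'health_index']
```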
Select from multi-level index using tuples¶
The tuples () can have 2 or more levels
df[[('label','count'),('label','mean')]] # selects 2 columns (first under tree label->count and second label->mean); note the outer list around the tuples
Creating multi-level index using tuples¶
The tuples () can have 2 or more levels
pd.MultiIndex.from_tuples([('Parent1', 'Child1'), ('Parent1', 'Child2')], names=['Parent', 'Child'])
Alternatively, you can construct a list and convert to tuple using zip:
arrays = [['Parent1','Parent1'],
['Child1', 'Child2']]
tuples = list(zip(*arrays))
pd.MultiIndex.from_tuples(tuples, names=['Parent', 'Child'])
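Putting the two subsections together, a small sketch that builds multi-level columns and then selects with a list of tuples (the level names here are made up for illustration):

```python
import pandas as pd

cols = pd.MultiIndex.from_tuples(
    [('label', 'count'), ('label', 'mean'), ('other', 'x')],
    names=['Parent', 'Child'])
mdf = pd.DataFrame([[1, 2.0, 3], [4, 5.0, 6]], columns=cols)

sel = mdf[[('label', 'count'), ('label', 'mean')]]  # outer list, inner tuples
print(sel.shape)  # (2, 2)
```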
Programmatically select columns¶
df[[col for col in df if 'prob' in col]] # selects all columns containing 'prob' in column name
Get distribution of values in a column¶
df.health_index.value_counts() # record count for each val
df.health_index.value_counts(normalize=True) # % distribution
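On this page's sample frame, the type column distributes as follows (a self-contained sketch):

```python
import pandas as pd

df = pd.DataFrame({'name': ['apple', 'orange', 'banana', 'beetroot', 'carrot'],
                   'type': ['fruit', 'fruit', 'fruit', 'vegetable', 'vegetable']})

print(df.type.value_counts().to_dict())                # {'fruit': 3, 'vegetable': 2}
print(df.type.value_counts(normalize=True).to_dict())  # {'fruit': 0.6, 'vegetable': 0.4}
```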
Filter rows¶
Filters internally use masking. You can just run the filter inside to see the mask data frame.
df.loc[df.col1.isin(['apple','orange']),:] # Returns all records where col1 is 'apple' or 'orange'
df.loc[~df.col1.isin(['apple','orange']),:] # Returns all records where col1 is not 'apple' or 'orange'
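A sketch of both isin filters on the sample frame:

```python
import pandas as pd

df = pd.DataFrame({'name': ['apple', 'orange', 'banana', 'beetroot', 'carrot'],
                   'type': ['fruit', 'fruit', 'fruit', 'vegetable', 'vegetable']})

kept = df.loc[df.name.isin(['apple', 'orange']), :]
print(kept.name.tolist())     # ['apple', 'orange']

dropped = df.loc[~df.name.isin(['apple', 'orange']), :]
print(dropped.name.tolist())  # ['banana', 'beetroot', 'carrot']
```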
Sort¶
Sort rows¶
df.sort_values('trips_per_user') #Sorts rows based on values in trips_per_user
Sort columns¶
df.sort_index(axis=1,level=0) # Sorts columns by their names. axis 1 refers to columns. level 0 is the highest/default level
Group by¶
Group by example¶
groupby returns a GroupBy object holding each group name and its data frame. It can be iterated through manually or summarized using the functions below. NULL values (NA, NaT) in the group key are ignored and will not form a group.
df.groupby('email_domain')
df.groupby(column_list)
Tip: You can also group by a field that is not in the selector (even from another dataframe, as long as the indexes match up)
df['cost_index'].groupby(df.type).sum()
Tip: Group by fields are added to index in the resulting dataframe. Use reset_index() to move the index to columns
df.groupby(column_list).sum().reset_index()
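A worked group-by on the sample frame, aggregating cost_index by type and moving the group key back to a column:

```python
import pandas as pd

df = pd.DataFrame({'type': ['fruit', 'fruit', 'fruit', 'vegetable', 'vegetable'],
                   'cost_index': [3, 1, 2, 1, 2]})

agg = df.groupby('type')['cost_index'].sum().reset_index()
print(agg.type.tolist())        # ['fruit', 'vegetable']
print(agg.cost_index.tolist())  # [6, 3]
```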
Aggregate functions¶
Apply a common operation on all non-grouped columns. If an aggregate function cannot be applied to a column (e.g. a string column), these nuisance columns are silently ignored
.size() # count
.nunique() # distinct count
.sum() # supports prod, min, max, mean, median, first, last, std, var
.head(5) # returns first 5 rows in each group by order as in dataframe. also supports tail(n), nlargest(n), nsmallest(n)
.nth([4,-1]) # returns the 5th row from the top (position 4, 0-indexed) and the last row of each group
apply various operations on each of the columns
.agg({'col1':['mean','count'],'col2':['var']})
% of total within groups
df['health_index_%ofgroup']=df.health_index/df.groupby('type').health_index.transform('sum')
Transform functions¶
The transform function is similar to a window aggregate function over the entire window. It aggregates the data and broadcasts the result back to the original shape, preserving the index
df.groupby('type')['health_index'].transform('sum') # supports all aggregate functions like min, max, mean, std, var
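On the sample frame, each row receives its group's total (fruit rows sum to 222, vegetable rows to 113):

```python
import pandas as pd

df = pd.DataFrame({'type': ['fruit', 'fruit', 'fruit', 'vegetable', 'vegetable'],
                   'health_index': [85, 68, 69, 68, 45]})

group_total = df.groupby('type')['health_index'].transform('sum')
print(group_total.tolist())  # [222, 222, 222, 113, 113]
```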
Group Filters (SQL:having)¶
dff.groupby('B').filter(lambda x: len(x['C']) > 2) #Refer Lambda functions
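A sketch of the same HAVING-style filter on the sample frame; keeping only groups with more than 2 rows leaves just the fruits:

```python
import pandas as pd

df = pd.DataFrame({'type': ['fruit', 'fruit', 'fruit', 'vegetable', 'vegetable'],
                   'name': ['apple', 'orange', 'banana', 'beetroot', 'carrot']})

big_groups = df.groupby('type').filter(lambda g: len(g) > 2)
print(big_groups.name.tolist())  # ['apple', 'orange', 'banana']
```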
Window aggregate functions¶
Overall¶
Applying a math function on a series aggregates over the entire series. When applied after a group by, it returns the aggregate value for each group
df.health_index.sum()
We could use this to divide on individual values to create % of total columns
df['health_index_%oftotal']=df.health_index/df.health_index.sum()
Simple Cumulative¶
df.cost_index.cumsum() # similar functions: cummin, cummax, cumprod
Expanding Window¶
Another way of achieving cumulative sum:
df.cost_index.expanding().sum()
Rolling Window¶
Similar to expanding, rolling window is also supported. Here is how to do a rolling sum of current record and previous (n-1) records
df.cost_index.rolling(2).sum()
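A sketch on the sample cost_index values; min_periods=1 keeps the first row from becoming NaN:

```python
import pandas as pd

s = pd.Series([3, 1, 2, 1, 2])  # cost_index from the sample frame
print(s.rolling(2, min_periods=1).sum().tolist())  # [3.0, 4.0, 3.0, 3.0, 3.0]
```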
Lead/Lag¶
df.cost_index.shift(2) # Get value from 2 records before (lag)
df.cost_index.shift(-1) # Get value from the next record (lead)
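A sketch on the sample cost_index values; positive shifts lag (pull from earlier rows), negative shifts lead (pull from later rows), and the vacated positions become NaN:

```python
import pandas as pd

s = pd.Series([3, 1, 2, 1, 2])
lagged = s.shift(1)   # previous record; first value becomes NaN
led = s.shift(-1)     # next record; last value becomes NaN
print(lagged.tolist()[1:])  # [3.0, 1.0, 2.0, 1.0]
print(led.tolist()[:-1])    # [1.0, 2.0, 1.0, 2.0]
```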
Resample¶
Used for time-series grouping, to create bins of equally split time intervals
df_re.groupby('group').resample('1D').sum()
To check if window aggregate functions are working fine, validate alongside your original dataset like below:
pd.concat([df,df.cost_index.rolling(2,min_periods=1).mean()], axis=1)
Note: No exact equivalent of rows between n preceding and n following was found in pandas. Please update when you find it.
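One approximation, assuming a symmetric window of n rows on each side: a centered rolling window of width 2n+1 behaves like rows between n preceding and n following (edges use however many rows are available, via min_periods=1):

```python
import pandas as pd

s = pd.Series([1, 2, 3, 4, 5])
n = 1  # rows on each side of the current row
out = s.rolling(2 * n + 1, center=True, min_periods=1).sum()
print(out.tolist())  # [3.0, 6.0, 9.0, 12.0, 9.0]
```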
Joins¶
Simple join¶
train_X.join(train[y_col]) # joins on index. Returns dataframe with all columns from train_X & y_col from train
Concat columns¶
concat on axis 1 is like join based on index
pd.concat([df1, df4], axis=1)
SQL like join¶
SQL like join can be achieved with merge:
pd.merge(left, right, how='inner', on=None, left_on=None, right_on=None,
left_index=False, right_index=False, sort=True,
suffixes=('_x', '_y'), copy=True, indicator=False,
validate=None)
Aggregate and join back to original dataset¶
df.join(df.groupby('month')['A'].sum(), on='month', rsuffix='_r') # aggregates column A to month level and adds that to each of the records in df and is returned
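A runnable sketch of the same pattern on the sample frame; here the aggregate is renamed first so it joins back without needing a suffix:

```python
import pandas as pd

df = pd.DataFrame({'type': ['fruit', 'fruit', 'fruit', 'vegetable', 'vegetable'],
                   'cost_index': [3, 1, 2, 1, 2]})

totals = df.groupby('type')['cost_index'].sum().rename('cost_total')
joined = df.join(totals, on='type')
print(joined.cost_total.tolist())  # [6, 6, 6, 3, 3]
```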
Formatting¶
Numeric formating¶
df.style.format({'B': '{:0<4.0f}', 'D': '{:+.2%}', 'C': '±{:.2f}'})
# {:+.2%} is a % with 2 decimals and an explicit sign
# ±{:.2f} is 2 decimals with a ± sign in front
# {:0<4.0f} left-aligns the value (<), pads with 0s to a minimum width of 4, and shows no decimals
Numeric format guides: here
Gradient Background¶
import seaborn as sns
cm = sns.light_palette('green', as_cmap=True)
df.style.background_gradient(cmap=cm, subset=['B', 'C', 'D']) # applies a heatmap to columns B, C, D
Bars¶
df.style.bar(subset=['B']) # applies bars to column B
Nesting multiple styles¶
df.style.format({'B': '{:0<4.0f}', 'D': '{:+.2f}'}).bar(subset=['B'])
More styles: here
Transform¶
Swap level & Axis¶
df.swaplevel(i=0,j=1,axis=1) # swaps first 2 levels (0 & 1) on columns (axis 1)
df.swaplevel(i=0,j=1,axis=0) # swaps first 2 levels (0 & 1) of the row index
Map¶
df['housing'].map({'yes': 1, 'no': 0}) # changes values 'yes' to 1 and 'no' to 0
Apply¶
using lambda functions for column transformations (the function is applied element-wise; each value in the column is passed to the lambda)
df['delayed'] = df['arr_delay'].apply(lambda x: x > 0)
using lambda function across columns using axis=1
df['Value'] = df.apply(lambda row: my_test(row['a'], row['c']), axis=1)
Flatten multi-level column index¶
df.columns=df.columns.get_level_values(0)
Pivot index levels to column¶
This is like a pivot table operation. Creates N columns based on N unique values in the row index. If you need to pivot on a non-index column, first add it to the index using df.set_index(col, append=True)
df=df.unstack() # This unstacks the right most index. To be more specific use order or name like df.unstack(0) or df.unstack('Year')
# Vice versa: to pivot a column level to row index
df=df.stack()
Binning based on value (cut)¶
cut splits continuous data into equi-distant bins. Bins are skewed when there are outliers in the data
pd.cut([1,2,3,4,5,6,7,8], 4, labels=False)
# Output: array([0, 0, 1, 1, 2, 2, 3, 3])
pd.cut([1,2,3,4,5,6,7,20], 4, labels=False)
# Output: array([0, 0, 0, 0, 0, 1, 1, 3])
Binning based on records (qcut)¶
qcut splits any data into equi-sized bins
pd.qcut([1,2,3,4,5,6,7,20], 4, labels=False)
# Output: array([0, 0, 1, 1, 2, 2, 3, 3])
this can be used in a group by to get decile level metrics
df.groupby(pd.qcut(df.col1,10, labels=False)).mean()
Add/Remove rows & columns¶
Add Rows¶
Adding 1 row manually
df.append({'name':'banana', 'type':'fruit', 'health_index':88, 'cost_index':98},ignore_index=True) # New index sequence will be created when ignore_index is True. Else will use original index and could result in duplicate indexes
Adding rows from another dataframe with matching columns (new columns will be created for non-matching columns; missing columns will be NULLs)
df.append(new_df)
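Note that DataFrame.append was deprecated in pandas 1.4 and removed in 2.0; a sketch of the same row-append via pd.concat:

```python
import pandas as pd

df = pd.DataFrame({'name': ['apple'], 'type': ['fruit']})
new_row = pd.DataFrame([{'name': 'banana', 'type': 'fruit'}])
out = pd.concat([df, new_row], ignore_index=True)
print(out.name.tolist())  # ['apple', 'banana']
```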
Add columns¶
to add a column manually
df = df.assign(new_col=[1,2,3,4,5]) # assign takes keyword arguments and returns a new dataframe
# Alternate approach
df['new_col']=[1,2,3,4,5]
# Add column dynamically creating a column based on a logic on existing column
df['new_col']=[1 if x>0 else 0 for x in df.col]
to create a new data frame with additional column
pd.concat([df,df.cost_index/2], axis=1) # this can also be used to create a dataframe from 2 or more series
Remove rows¶
Drop using index¶
df.drop([0,1]) # removes rows with index 0 & 1
Remove columns¶
drop using column names
df.drop('cost_index', axis=1)
Remove a level from multi-level index¶
drop the highest level from multi-level column index
df.columns=df.columns.droplevel()
drop a level from multi-level row index
df.index = df.index.droplevel(level=0)
Visualizations¶
Simple charts¶
df.cost_index.plot(kind='bar') # kinds: barh, hist, box, kde, density, area, scatter, hexbin, pie
df.plot(subplots=True, figsize=(6, 6), layout=(3, -1)) # Layout -1 is to infer the number of columns based on data
Histogram¶
%matplotlib inline
df.hist(bins=100, figsize=(40,30))
Boxplot for each group¶
%matplotlib inline
df.groupby('type').boxplot()
Scatter Matrix¶
from pandas.plotting import scatter_matrix
scatter_matrix(df, alpha=0.2, figsize=(6, 6), diagonal='kde')
Bootstrap Plot¶
from pandas.plotting import bootstrap_plot
bootstrap_plot(data, size=50, samples=500, color='grey')
More visualizations¶
Nesting them all¶
Use \ at the end of a line for continuation on the next line
data[['email_domain', y_col]]\
    .groupby('email_domain')\
    .agg({'label': ['mean', 'count']})\
    .sort_values([('label', 'count'), ('label', 'mean')], ascending=False)\
    .style\
    .format({'B': '{:0<4.0f}', 'D': '{:+.2f}'})\
    .bar(subset=['B'])