Python Functions Basics
Functions Overview
value_counts(), sort_values(), groupby(), agg(), pivot_table(), set_index(), loc[], sort_index()
Detailed Breakdown
1. value_counts()
df['column_name'].value_counts(normalize=True, sort=True)
Key points:
- Typically called on a Series (a single column); DataFrame.value_counts() also exists in pandas 1.1+ and counts unique rows
- normalize=True gives proportions instead of counts
- sort=True is the default and sorts by frequency, descending
Common mistakes:
- Calling it on the entire DataFrame when you want counts for a single column (DataFrame.value_counts() counts unique rows instead)
- Forgetting that it returns a Series, not a DataFrame
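A minimal sketch of the points above, assuming a small made-up DataFrame with a hypothetical `fruit` column. It shows counts, proportions, and one way to turn the resulting Series into a DataFrame.

```python
import pandas as pd

# Hypothetical data, for illustration only.
df = pd.DataFrame(
    {"fruit": ["apple", "banana", "apple", "cherry", "apple", "banana"]}
)

# Counts per unique value, most frequent first (sort=True is the default).
counts = df["fruit"].value_counts()

# Proportions instead of counts; the values sum to 1.
shares = df["fruit"].value_counts(normalize=True)

# The result is a Series; reset_index() converts it to a DataFrame if needed.
counts_df = counts.reset_index()

print(counts)
print(shares)
```

The column names produced by `reset_index()` here differ between pandas versions, so it is safer to rename them explicitly than to rely on the defaults.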
2. sort_values()
df.sort_values(by='column_name', ascending=False)
Key points:
- Can sort by multiple columns using a list: by=['col1', 'col2']
- ascending=False for descending order
- Returns a new DataFrame unless inplace=True
Common mistakes:
- Forgetting to assign the result to a variable if not using inplace=True
- Using wrong column names
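The sorting behaviour described above can be sketched as follows. The DataFrame and its `name`, `team`, and `score` columns are made up for illustration.

```python
import pandas as pd

# Hypothetical data, for illustration only.
df = pd.DataFrame(
    {
        "name": ["Ann", "Bob", "Cy", "Di"],
        "team": ["b", "a", "b", "a"],
        "score": [70, 90, 85, 60],
    }
)

# Single column, highest score first. The result must be assigned,
# because sort_values() returns a new DataFrame by default.
by_score = df.sort_values(by="score", ascending=False)

# Multiple columns: team ascending, then score descending within each team.
by_team_then_score = df.sort_values(by=["team", "score"], ascending=[True, False])

print(by_score)
print(by_team_then_score)
```

Note that `df` itself keeps its original row order after both calls.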
3. groupby()
df.groupby('column_name')['value_column'].operation()
Key points:
- Groups data by one or more columns
- Often followed by an aggregation function
- Can group by multiple columns: groupby(['col1', 'col2'])
Common mistakes:
- Forgetting to specify which column to perform operations on after grouping
- Not chaining an aggregation method after groupby()
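As a rough illustration of the pattern above, here `sum()` and `mean()` stand in for the generic `operation()` placeholder. The `region`, `product`, and `sales` columns are hypothetical.

```python
import pandas as pd

# Hypothetical data, for illustration only.
df = pd.DataFrame(
    {
        "region": ["north", "south", "north", "south"],
        "product": ["a", "a", "b", "b"],
        "sales": [10, 20, 30, 40],
    }
)

# Group by one column, select the value column, then aggregate.
total_by_region = df.groupby("region")["sales"].sum()

# Group by several columns; the result is indexed by a MultiIndex.
mean_by_both = df.groupby(["region", "product"])["sales"].mean()

print(total_by_region)
print(mean_by_both)
```

Without the final aggregation call, `df.groupby("region")` is only a lazy GroupBy object and holds no computed results.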
4. agg()
df.groupby('column_name').agg({'col1': 'mean', 'col2': 'sum'})
Key points:
- Can apply multiple functions to different columns
- Works well with groupby()
- Can use custom functions or numpy functions
Common mistakes:
- Incorrect syntax when specifying multiple functions
- Forgetting to import numpy if using numpy functions
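One way to write the aggregation forms mentioned above is sketched below. The data, the column names, and the `value_range` helper are all assumptions made for the example.

```python
import pandas as pd

# Hypothetical data, for illustration only.
df = pd.DataFrame(
    {
        "team": ["a", "a", "b", "b"],
        "points": [1, 3, 5, 7],
        "minutes": [10, 20, 30, 40],
    }
)


def value_range(series):
    """Hypothetical custom aggregation: max minus min of a group."""
    return series.max() - series.min()


# One function per column, passed as a dict.
summary = df.groupby("team").agg({"points": "mean", "minutes": "sum"})

# Several functions for one column go in a list; the result
# has two-level column labels such as ("points", "mean").
multi = df.groupby("team").agg({"points": ["mean", value_range]})

# Named aggregation gives flat, explicit output column names.
named = df.groupby("team").agg(
    avg_points=("points", "mean"),
    total_minutes=("minutes", "sum"),
)

print(summary)
print(multi)
print(named)
```

Named aggregation is often easier to read downstream because it avoids the two-level column labels that the list form produces.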
5. pivot_table()
pd.pivot_table(df, values='value_col', index='index_col', columns='column_col', aggfunc='mean', fill_value=0)
Key points:
- Creates a spreadsheet-style pivot table
- aggfunc defaults to mean if not specified
- Can use multiple columns for index or columns
Common mistakes:
- Confusing the roles of index, columns, and values parameters
- Forgetting to specify aggfunc when you want something other than mean
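To make the roles of `index`, `columns`, and `values` concrete, here is a small sketch with made-up `city`, `year`, and `sales` columns. It builds the same table twice, once with the mean and once with the sum.

```python
import pandas as pd

# Hypothetical data, for illustration only.
df = pd.DataFrame(
    {
        "city": ["Oslo", "Oslo", "Rome", "Rome", "Oslo"],
        "year": [2020, 2021, 2020, 2020, 2020],
        "sales": [10, 20, 30, 50, 30],
    }
)

# index -> row labels, columns -> column labels, values -> cells to aggregate.
mean_table = pd.pivot_table(
    df, values="sales", index="city", columns="year",
    aggfunc="mean", fill_value=0,
)

# Same layout, but summing instead of averaging.
sum_table = pd.pivot_table(
    df, values="sales", index="city", columns="year",
    aggfunc="sum", fill_value=0,
)

print(mean_table)
print(sum_table)
```

Rome has no 2021 rows in this data, so that cell is filled with 0 because of `fill_value=0`. Without it the cell would be NaN.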
6. set_index()
df.set_index(['col1', 'col2'])
Key points:
- Creates a new index for the DataFrame
- Can create a MultiIndex by passing a list of column names
- Returns a new DataFrame unless inplace=True
Common mistakes:
- Forgetting to assign the result to a variable if not using inplace=True
- Not realizing that the columns used for the index are removed from the DataFrame body
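The following sketch shows what happens to the columns when they become the index. It assumes a made-up DataFrame, and the population figures are placeholders rather than real data.

```python
import pandas as pd

# Hypothetical data, for illustration only (figures are placeholders).
df = pd.DataFrame(
    {
        "country": ["NO", "NO", "IT"],
        "city": ["Oslo", "Bergen", "Rome"],
        "population": [700, 290, 2800],
    }
)

# Passing a list builds a MultiIndex; the result must be assigned.
indexed = df.set_index(["country", "city"])

# The index columns are no longer in the body; reset_index() brings them back.
restored = indexed.reset_index()

# drop=False keeps the column in the body as well as in the index.
kept = df.set_index("country", drop=False)

print(indexed)
print(list(indexed.columns))
```

After `set_index`, only `population` remains as a regular column in `indexed`, while `df` itself is unchanged.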
7. loc[]
df.loc[row_indexer, column_indexer]
Key points:
- Uses labels/index values for selection
- Can select both rows and columns
- For slicing on MultiIndex, use tuples: df.loc[('country', 'city'):, :]
Common mistakes:
- Using integer positions instead of labels (that's what iloc[] is for)
- Incorrect syntax for MultiIndex slicing
- Forgetting the comma when selecting all rows or all columns (e.g., df.loc[:, 'column'])
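A short sketch of label-based selection on a MultiIndex follows. The index values and numbers are invented for the example. The index is sorted first, since slicing a MultiIndex with tuples generally requires a sorted index.

```python
import pandas as pd

# Hypothetical data, for illustration only (figures are placeholders).
df = pd.DataFrame(
    {"population": [700, 290, 2800, 1400], "area": [454, 465, 1285, 182]},
    index=pd.MultiIndex.from_tuples(
        [("NO", "Oslo"), ("NO", "Bergen"), ("IT", "Rome"), ("IT", "Milan")],
        names=["country", "city"],
    ),
).sort_index()  # sorted order: (IT, Milan), (IT, Rome), (NO, Bergen), (NO, Oslo)

# All rows, one column: note the comma after the row slice.
one_col = df.loc[:, "population"]

# One row, all columns: a tuple addresses both index levels.
one_row = df.loc[("NO", "Oslo"), :]

# Slice from a given (country, city) label to the end.
from_rome = df.loc[("IT", "Rome"):, :]

# A single outer-level label selects every row under it.
norway = df.loc["NO"]

# Position-based access is iloc[], not loc[].
first_row = df.iloc[0]

print(from_rome)
print(norway)
```

In this example `from_rome` contains the Rome, Bergen, and Oslo rows, because the slice follows the sorted index order rather than the order in which the rows were created.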
8. sort_index()
df.sort_index(level=['col1', 'col2'], ascending=[True, False])
Key points:
- Sorts the DataFrame by the index
- For MultiIndex, use the level parameter to specify which levels to sort by
- Can sort different levels in different directions using the ascending parameter
Common mistakes:
- Confusing sort_index() with sort_values()
- Incorrect syntax for sorting multiple levels of a MultiIndex
- Forgetting that it returns a new DataFrame unless inplace=True
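The difference between sorting by index and sorting by values can be sketched like this. The example uses a made-up DataFrame whose index levels are named `col1` and `col2` to match the call shown earlier.

```python
import pandas as pd

# Hypothetical data, for illustration only.
df = pd.DataFrame(
    {"sales": [1, 2, 3, 4]},
    index=pd.MultiIndex.from_tuples(
        [("b", 1), ("a", 2), ("b", 2), ("a", 1)],
        names=["col1", "col2"],
    ),
)

# Default: every index level ascending.
by_index = df.sort_index()

# col1 ascending, col2 descending within each col1 value.
mixed = df.sort_index(level=["col1", "col2"], ascending=[True, False])

# sort_values() orders by column contents, not by index labels.
by_values = df.sort_values(by="sales", ascending=False)

print(by_index)
print(mixed)
print(by_values)
```

All three calls return new DataFrames, so `df` keeps its original order unless the result is assigned back.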
Remember, practice makes perfect! Keep working with these functions, and soon they'll become second nature. Happy coding!