Table of Contents

Pandas

Useful Syntax

  • irises.describe()
  • irises[‘sepal length’].head()
  • irises[[“sepal length”, “petal width”]].head()
  • irises.iloc[5:10]
  • irises[irises[“sepal length”] > 5.0]
  • irises[“sepal length”].max()
  • irises[‘species’].unique()
  • irises.sort_values(by=”sepal length”, ascending=False).head(1)
  • irises.sort_values(by=”sepal length”, ascending=False).iloc[5:10]
  • irises.sort_values(by=”sepal length”, ascending=False).loc[5:10]
  • irises[‘x’] = 3.14
  • irises.rename(columns={‘species’: ‘type’})
  • del irises[‘x’]
  • c.copy()
  • apply() .apply(func, axis=0, raw=False, result_type=None, args=(), **kwargs)
  • G[‘prevalence’] = G[‘cases’] / G[‘population’]
  • df.columns.difference(col_vals)
  • pd.concat(melted_sections)
  • df_v = df[df[key] == v]
  • df = pd.read_sql_query(''' select count(*) as interactions, count(distinct user_id) as users, count(distinct book_id) as books, sum(is_read) reads, sum(rating > 0) ratings, sum(is_reviewed) reviews from interactions ''', conn)
  • ubdf.merge(ubdf, on='book_id')\ .query('user_id_x != user_id_y')\ .groupby(['user_id_x', 'user_id_y'], as_index=False).count()\ .rename(columns={'book_id':'count'})\ .query(f'count >= {threshold}')
  • d = df.to_dict(‘index’)[0] #dataframe to dictionary, map index and first value
  • helper(**d) #unpack dictionary, it makes a list of the values (like dict.values?)
  • A[:, i:i+1] slicing a column as a vector in numpy matrix
  • samples = choice(len(X), size = k, replace=False) np.random.choice(array, size=x, replace=False)
  • df[column] = df[column].mean(), .mode()
  • df[column].fillna(new_val)
  • np.argwhere(y==0).flatten (flattens 2d array)
    • np.argwhere(np.isfinite(arr))[-1] = get last non-NaN index
  • df = df - df[drop_cols]
  • df = df.drop(drop_cols, errors=”ignore”, axis=1)
  • reshape((-2))
  • if (not_missing > missing).any()
  • list(product(*values))

Pandas Troubleshooting:

def pandas_troubleshoot(a,b):  # pass in the two dfs to compare as variables a and b
    from pandas.testing import assert_frame_equal
    
    def canonicalize_tibble(X, remove_index=True):
        var_names = sorted(X.columns)
        Y = X[var_names].copy()
        Y.sort_values(by=var_names, inplace=remove_index)
        Y.reset_index(drop=remove_index, inplace=remove_index)
        return Y
    
    a_hat = canonicalize_tibble(a)
    b_hat = canonicalize_tibble(b)
    try:
        assert_frame_equal(a_hat, b_hat, check_dtype=True,
                           check_index_type='equiv', check_column_type='equiv',
                           check_frame_type=True, check_names=True, 
                           by_blocks=False, check_exact=False, 
                           check_datetimelike_compat=False, 
                           check_categorical=True, check_like=False,
                           check_freq=True, check_flags=True,
                           rtol=1e-05, atol=1e-08, obj='DataFrame')
        print('True')
        return True
    except AssertionError as e:
        print(e)
        return e

##Joins:

  • Merge: C = A.merge (B, on=['country', 'year'])
  • Inner-join (A, B) (default): Keep only rows of A and B where the on-keys match in both. ONLY MATCH
  • Outer-join (A, B): Keep all rows of both frames, but merge rows when the on-keys match. For non-matches, fill in missing values with not-a-number (NaN) values. KEEP ALL, FILL BLANKS
  • Left-join (A, B): Keep all rows of A. Only merge rows of B whose on-keys match A.
  • Right-join (A, B): Keep all rows of B. Only merge rows of A whose on-keys match B. C = A.merge (B, on=['country', 'year'], how='right')

Reminders:

  • Dont forget to make a copy of every single df!!!
  • One way to convert pandas df to dict, is to set one column (key) as index and then use dict(df.iloc[:, x], where x is the column (value) index
  • For timestamp objects, you can do arithmetic operations such as + or -
  • If you get the error, “referenced before assignment”, create a variable for it in the beginning. See problem 21 ex 4 or problem 24 (existing_journal).
  • To convert a df to an sql table: conn = dfs_to_conn({‘my_table’: my_df}) cur = conn.cursor() cur.execute(‘select A, B, C from my_table’) result = cur.fetchall() conn.close() Convert list of tuples (ans) back to a df
  • To access a row, use .loc or .iloc
  • When you have an optional parameter, e.g. X = None, making a variable (or assigning the result of the if clause back to the variable) will help (Prob 24, Ex 4)
  • To return the opposite of a mask, you can use - or ~ .map can be used to pass in a dict(key will be mapped on to values) Prob 20, Ex 4