Computing for Data Analytics
Table of Contents
Pandas
Useful Syntax
- irises.describe()
- irises[‘sepal length’].head()
- irises[[“sepal length”, “petal width”]].head()
- irises.iloc[5:10]
- irises[irises[“sepal length”] > 5.0]
- irises[“sepal length”].max()
- irises[‘species’].unique()
- irises.sort_values(by=”sepal length”, ascending=False).head(1)
- irises.sort_values(by=”sepal length”, ascending=False).iloc[5:10]
- irises.sort_values(by=”sepal length”, ascending=False).loc[5:10]
- irises[‘x’] = 3.14
- irises.rename(columns={‘species’: ‘type’})
- del irises[‘x’]
- c.copy()
- apply() .apply(func, axis=0, raw=False, result_type=None, args=(), **kwargs)
- G[‘prevalence’] = G[‘cases’] / G[‘population’]
- df.columns.difference(col_vals)
- pd.concat(melted_sections)
- df_v = df[df[key] == v]
- df = pd.read_sql_query(''' select count(*) as interactions, count(distinct user_id) as users, count(distinct book_id) as books, sum(is_read) reads, sum(rating > 0) ratings, sum(is_reviewed) reviews from interactions ''', conn)
- ubdf.merge(ubdf, on='book_id')\ .query('user_id_x != user_id_y')\ .groupby(['user_id_x', 'user_id_y'], as_index=False).count()\ .rename(columns={'book_id':'count'})\ .query(f'count >= {threshold}')
- d = df.to_dict(‘index’)[0] #dataframe to dictionary, map index and first value
- helper(**d)#unpack dictionary, it makes a list of the values (like dict.values?)
- A[:, i:i+1] slicing a column as a vector in numpy matrix
- samples = choice(len(X), size = k, replace=False) np.random.choice(array, size=x, replace=False)
- df[column] = df[column].mean(), .mode()
- df[column].fillna(new_val)
- np.argwhere(y==0).flatten (flattens 2d array)
    - np.argwhere(np.isfinite(arr))[-1] = get last non-NaN index
 
- df = df - df[drop_cols]
- df = df.drop(drop_cols, errors=”ignore”, axis=1)
- reshape((-2))
- if (not_missing > missing).any()
- list(product(*values))
Pandas Troubleshooting:
def pandas_troubleshoot(a,b):  # pass in the two dfs to compare as variables a and b
    from pandas.testing import assert_frame_equal
    
    def canonicalize_tibble(X, remove_index=True):
        var_names = sorted(X.columns)
        Y = X[var_names].copy()
        Y.sort_values(by=var_names, inplace=remove_index)
        Y.reset_index(drop=remove_index, inplace=remove_index)
        return Y
    
    a_hat = canonicalize_tibble(a)
    b_hat = canonicalize_tibble(b)
    try:
        assert_frame_equal(a_hat, b_hat, check_dtype=True,
                           check_index_type='equiv', check_column_type='equiv',
                           check_frame_type=True, check_names=True, 
                           by_blocks=False, check_exact=False, 
                           check_datetimelike_compat=False, 
                           check_categorical=True, check_like=False,
                           check_freq=True, check_flags=True,
                           rtol=1e-05, atol=1e-08, obj='DataFrame')
        print('True')
        return True
    except AssertionError as e:
        print(e)
        return e
##Joins:
- Merge: C = A.merge (B, on=['country', 'year'])
- Inner-join (A, B) (default): Keep only rows of A and B where the on-keys match in both. ONLY MATCH
- Outer-join (A, B): Keep all rows of both frames, but merge rows when the on-keys match. For non-matches, fill in missing values with not-a-number (NaN) values. KEEP ALL, FILL BLANKS
- Left-join (A, B): Keep all rows of A. Only merge rows of B whose on-keys match A.
- Right-join (A, B): Keep all rows of B. Only merge rows of A whose on-keys match B. C = A.merge (B, on=['country', 'year'], how='right')
Reminders:
- Dont forget to make a copy of every single df!!!
- One way to convert pandas df to dict, is to set one column (key) as index and then use dict(df.iloc[:, x], where x is the column (value) index
- For timestamp objects, you can do arithmetic operations such as + or -
- If you get the error, “referenced before assignment”, create a variable for it in the beginning. See problem 21 ex 4 or problem 24 (existing_journal).
- To convert a df to an sql table: conn = dfs_to_conn({‘my_table’: my_df}) cur = conn.cursor() cur.execute(‘select A, B, C from my_table’) result = cur.fetchall() conn.close() Convert list of tuples (ans) back to a df
- To access a row, use .loc or .iloc
- When you have an optional parameter, e.g. X = None, making a variable (or assigning the result of the if clause back to the variable) will help (Prob 24, Ex 4)
- To return the opposite of a mask, you can use - or ~ .map can be used to pass in a dict(key will be mapped on to values) Prob 20, Ex 4