Installation¶

Simple installation¶

To install this package run the following command

$ pip install git+https://github.com/aminbouraiss/pdhelpers.git

From a cloned repository¶

To install it from a cloned repository:

$ git clone https://github.com/aminbouraiss/pdhelpers.git

Then just run the setup.py file from that directory:

$ sudo python setup.py install

Instantiating the class¶

In [1]:

import pdHelpers
helpers = pdHelpers.Helpers()

Formatting values¶

The following methods changes the format of the displayed values to faciliate data analysis.

Round floats¶

This method automatically rounds values displayed by pandas

In [5]:

import random
import pdHelpers
helpers = pdHelpers.Helpers()


df = helpers.generate_Df() # Generate a test DataFrame
df.C = df.C.apply(lambda x: random.uniform(1, 10))
df.C

Out[5]:

  3.400638
  7.984441
  7.023958
  7.388083
Name: C, dtype: float64

In [7]:

helpers.round_floats() # Automatically round floats
df.C

Out[7]:

 3.40
 7.98
 7.02
 7.39
Name: C, dtype: float64

In [8]:

df.C.tolist() # The exported data is untouched

Out[8]:

[3.400637971393313, 7.98444055902644, 7.023957929542492, 7.38808312498648]

Change the columns width¶

Change the maximum column width (defaults to 150 characters wide).

In [1]:

import random
import pdHelpers

helpers = pdHelpers.Helpers()

# Create a dataFrame with a column 150 characters wide
df = helpers.generate_Df() # Generate a test DataFrame
df.C = df.C.apply(lambda x: random.uniform(1, 10))
longDf = (df[['A','C']]
 .assign(J='long text ' * 15 )
)

# Truncated values are replaced by an ellipsis.
longDf

Out[1]:

	A	C	J
0	1.0	8.803177	long text long text long text long text long t...
1	1.0	9.123195	long text long text long text long text long t...
2	1.0	4.611685	long text long text long text long text long t...
3	1.0	8.317267	long text long text long text long text long t...

In [2]:

# Change the column width to 300 characters
helpers.columnsWidth(300)

# The column is no longer truncated
longDf

Out[2]:

	A	C	J
0	1.0	8.803177	long text long text long text long text long text long text long text long text long text long text long text long text long text long text long text
1	1.0	9.123195	long text long text long text long text long text long text long text long text long text long text long text long text long text long text long text
2	1.0	4.611685	long text long text long text long text long text long text long text long text long text long text long text long text long text long text long text
3	1.0	8.317267	long text long text long text long text long text long text long text long text long text long text long text long text long text long text long text

Display more columns¶

Changes the maximum number of untruncated columns (defaults to 40).

In [1]:

import numpy as np
import pandas as pd
import pdHelpers

helpers = pdHelpers.Helpers()

# Create a dataFrame 24 columns wide
cols = np.random.rand(5,24)
df = pd.DataFrame(cols)

# Truncated columns are replaced by an ellipsis.
df

Out[1]:

	0	1	2	3	4	5	6	7	8	9	...	14	15	16	17	18	19	20	21	22	23
0	0.072880	0.763096	0.813992	0.278522	0.156192	0.376699	0.074667	0.339499	0.464146	0.555443	...	0.720152	0.791934	0.886861	0.243501	0.912817	0.745485	0.310526	0.596069	0.559586	0.787782
1	0.243877	0.993720	0.623229	0.119818	0.640403	0.977420	0.237943	0.668025	0.735992	0.464591	...	0.132466	0.171589	0.839932	0.457072	0.405125	0.507673	0.796509	0.376811	0.439065	0.124278
2	0.342922	0.614482	0.889817	0.691211	0.156495	0.340927	0.610162	0.754342	0.724617	0.094956	...	0.603796	0.298899	0.406135	0.563099	0.467862	0.128168	0.975015	0.673482	0.720828	0.110346
3	0.692669	0.580550	0.305752	0.582524	0.551636	0.796934	0.717072	0.876030	0.913818	0.036550	...	0.831621	0.526914	0.698815	0.984493	0.339227	0.435308	0.158343	0.702512	0.937638	0.144690
4	0.401414	0.122228	0.435908	0.509301	0.292995	0.269782	0.573507	0.526350	0.086608	0.906121	...	0.106145	0.686782	0.770461	0.197319	0.294803	0.231304	0.799670	0.585498	0.068396	0.556537

5 rows × 24 columns

In [2]:

# Set the maximum columns displayed to 30
helpers.maxColumns(30)

# All the columns are displayed
df

Out[2]:

	0	1	2	3	4	5	6	7	8	9	10	11	12	13	14	15	16	17	18	19	20	21	22	23
0	0.072880	0.763096	0.813992	0.278522	0.156192	0.376699	0.074667	0.339499	0.464146	0.555443	0.476052	0.052743	0.497787	0.900314	0.720152	0.791934	0.886861	0.243501	0.912817	0.745485	0.310526	0.596069	0.559586	0.787782
1	0.243877	0.993720	0.623229	0.119818	0.640403	0.977420	0.237943	0.668025	0.735992	0.464591	0.887112	0.195534	0.215006	0.651067	0.132466	0.171589	0.839932	0.457072	0.405125	0.507673	0.796509	0.376811	0.439065	0.124278
2	0.342922	0.614482	0.889817	0.691211	0.156495	0.340927	0.610162	0.754342	0.724617	0.094956	0.494325	0.831894	0.898089	0.097960	0.603796	0.298899	0.406135	0.563099	0.467862	0.128168	0.975015	0.673482	0.720828	0.110346
3	0.692669	0.580550	0.305752	0.582524	0.551636	0.796934	0.717072	0.876030	0.913818	0.036550	0.317423	0.134839	0.266308	0.411994	0.831621	0.526914	0.698815	0.984493	0.339227	0.435308	0.158343	0.702512	0.937638	0.144690
4	0.401414	0.122228	0.435908	0.509301	0.292995	0.269782	0.573507	0.526350	0.086608	0.906121	0.435981	0.426492	0.297920	0.094129	0.106145	0.686782	0.770461	0.197319	0.294803	0.231304	0.799670	0.585498	0.068396	0.556537

Analyzing a dataFrame¶

Return the value count for each column (incuding NaNs)¶

Returns the value count (including NaNs) for each column in a dict.

In [1]:

import numpy as np
import pandas as pd

# Create a dataFrame 24 columns wide
cols = np.random.rand(5,4)
df = pd.DataFrame(cols,columns=list('ABCD'))

df.A[:2] = df.A[4]
df.C[1:4] = df.C[0]
df.D[3] = df.D[2]
df.C[4] = None
df.D[4] = None
df.A[1] = None
df.D[:3] = 5

df

Out[1]:

	A	B	C	D
0	0.565309	0.006483	0.354735	5.000000
1	NaN	0.825698	0.354735	5.000000
2	0.954067	0.725308	0.354735	5.000000
3	0.153352	0.606400	0.354735	0.846074
4	0.565309	0.693732	NaN	NaN

In [2]:

import pdHelpers

# Instantiate the module
helpers = pdHelpers.Helpers()

# Print the count for the column C
colCount = helpers.valueCount(df)
print(colCount['C'])

      Value  Count Column
0  0.354735      4      C
1       NaN      1      C

Display the value count for each column (Including NaNs)¶

Displays the value count (including NaNs) for each column.

In [1]:

import numpy as np
import pandas as pd

# Create a dataFrame 24 columns wide
cols = np.random.rand(5,4)
df = pd.DataFrame(cols,columns=list('ABCD'))

df.A[:2] = df.A[4]
df.C[1:4] = df.C[0]
df.D[3] = df.D[2]
df.C[4] = None
df.D[4] = None
df.A[1] = None
df.D[:3] = 5

print(df)

          A         B         C         D
0  0.233378  0.510658  0.909901  5.000000
1       NaN  0.908558  0.909901  5.000000
2  0.000251  0.973638  0.909901  5.000000
3  0.721885  0.664116  0.909901  0.263451
4  0.233378  0.359453       NaN       NaN

In [2]:

import pdHelpers

# Instantiate the pandas helper module
helpers = pdHelpers.Helpers()

# Show the value count for each column
helpers.showValueCount(df)

Column A

	Value	Count	Column
0	0.233378	2	A
1	0.721885	1	A
2	0.000251	1	A
3	NaN	1	A

Column B

	Value	Count	Column
0	0.973638	1	B
1	0.359453	1	B
2	0.664116	1	B
3	0.908558	1	B
4	0.510658	1	B

Column C

	Value	Count	Column
0	0.909901	4	C
1	NaN	1	C

Column D

	Value	Count	Column
0	5.000000	3	D
1	0.263451	1	D
2	NaN	1	D

Find columns matching a Regex pattern¶

In [1]:

import numpy as np
import pandas as pd

# Create a dataFrame 24 columns wide
cols = np.random.rand(5,3)
df = pd.DataFrame(cols,columns=['Foo','Bar','FooBar'])

print(df)

        Foo       Bar    FooBar
0  0.017451  0.331383  0.496636
1  0.950639  0.054829  0.056220
2  0.845740  0.359787  0.809131
3  0.640001  0.264786  0.146170
4  0.308577  0.832595  0.948198

In [2]:

import pdHelpers

# Instantiate the pandas helper module
helpers = pdHelpers.Helpers()

# Return the column names ending with 'ar'
helpers.searchCols(df,'ar$')

Out[2]:

Index([u'Bar', u'FooBar'], dtype='object')

Find a dataFrame’s dimensions and Metrics¶

Find a DataFrame’s dimensions and metrics, the dimensions are columns matching one the following dtypes:

datetime64
category
object
datetime

In [3]:

import pdHelpers

# Instantiate the pandas helper module
helpers = pdHelpers.Helpers()

# Generate a dataFrame
df = helpers.generate_Df()
print(df.info())
df

<class 'pandas.core.frame.DataFrame'>
Int64Index: 4 entries, 0 to 3
Data columns (total 6 columns):
A    4 non-null float64
B    4 non-null datetime64[ns]
C    4 non-null float32
D    4 non-null int32
E    4 non-null category
F    4 non-null object
dtypes: category(1), datetime64[ns](1), float32(1), float64(1), int32(1), object(1)
memory usage: 260.0+ bytes
None

Out[3]:

	A	B	C	D	E	F
0	1.0	2013-01-02	1.0	3	test	foo
1	1.0	2013-01-02	1.0	3	train	foo
2	1.0	2013-01-02	1.0	3	test	foo
3	1.0	2013-01-02	1.0	3	train	foo

In [2]:

# Find the dataFrame's dimensions
helpers.findDimensions(df)

Out[2]:

['B', 'E', 'F']

In [3]:

# Find the dataFrame's metrics
helpers.findMetrics(df)

Out[3]:

['A', 'C', 'D']

Compare columns between dataFrames¶

commonCols - Get common columns¶

Perform a discrepancy check between two data frames, it compares on the common dimensions and metrics between the two data frames.

This method returns a dict with three keys:

commonDims: The dimensions present in both columns.both.
commonMetrics: The metrics present in both columns.
commonCols: The columns present in df1 not present in df2 (metrics + dimensions)

diffCols - Get diverging columns¶

Get the names of the columns present in the first dataframe specified and absent in the second.

This method returns a dict with three keys:

dimDiff: The dimensions present in df1 not present in df2.
metricDiff: The metrics present in df1 not present in df2.
allcolsDif: The columns present in df1 not present in df2 (metrics + dimensions).

In [8]:

import pdHelpers
import pprint

# Instantiate the pandas helper module
helpers = pdHelpers.Helpers()

# Generate the dataFrames
df1 = helpers.generate_Df()
df2 = (df[['B','D']]
         .assign(H=df.D*3))

# Print the two dataframes
separation = "_" * 40
print("df1\n{0}\n\n{1}\n".format(separation,df1))
print("df2\n{0}\n\n{1}\n".format(separation,df2))

# find the commmon columns
common = helpers.commonCols(df1,df2)
print("Common columns\n{0}\n".format(separation))
pprint.pprint(common)

# find the diverging columns
difference = helpers.diffCols(df1,df2)
print("\nDiverging columns\n{0}\n".format(separation))
pprint.pprint(difference)

df1
________________________________________

     A          B    C  D      E    F
0  1.0 2013-01-02  1.0  3   test  foo
1  1.0 2013-01-02  1.0  3  train  foo
2  1.0 2013-01-02  1.0  3   test  foo
3  1.0 2013-01-02  1.0  3  train  foo

df2
________________________________________

           B  D  H
0 2013-01-02  3  9
1 2013-01-02  3  9
2 2013-01-02  3  9
3 2013-01-02  3  9

Common columns
________________________________________

{'commonCols': ['B', 'D'], 'commonDims': ['B'], 'commonMetrics': ['D']}

Diverging columns
________________________________________

{'allcolsDif': ['A', 'C', 'E', 'F'],
 'dimDiff': ['E', 'F'],
 'metricDiff': ['A', 'C']}

Generate a sample dataFrame¶

Generate a sample pandas DataFrame with the following column types:

float64
datetime64[ns]
float32
int32
int32
category
object

In [1]:

import pdHelpers

# Instantiate the pandas helper module
helpers = pdHelpers.Helpers()

# Generate a sample dataFrame
df = helpers.generate_Df()

df.info()

df

<class 'pandas.core.frame.DataFrame'>
Int64Index: 4 entries, 0 to 3
Data columns (total 6 columns):
A    4 non-null float64
B    4 non-null datetime64[ns]
C    4 non-null float32
D    4 non-null int32
E    4 non-null category
F    4 non-null object
dtypes: category(1), datetime64[ns](1), float32(1), float64(1), int32(1), object(1)
memory usage: 260.0+ bytes

Out[1]:

	A	B	C	D	E	F
0	1.0	2013-01-02	1.0	3	test	foo
1	1.0	2013-01-02	1.0	3	train	foo
2	1.0	2013-01-02	1.0	3	test	foo
3	1.0	2013-01-02	1.0	3	train	foo

Delete columns safely¶

Safely delete a DataFrame column whithout raising an error if the column doesn’t exist.

In [2]:

import pdHelpers

# Instantiate the pandas helper module
helpers = pdHelpers.Helpers()

df = helpers.generate_Df()
df

Out[2]:

	A	B	C	D	E	F
0	1.0	2013-01-02	1.0	3	test	foo
1	1.0	2013-01-02	1.0	3	train	foo
2	1.0	2013-01-02	1.0	3	test	foo
3	1.0	2013-01-02	1.0	3	train	foo

In [3]:

# Try to delete a non existing column
helpers.deleteCol(df,'G')
df

Out[3]:

	A	B	C	D	E	F
0	1.0	2013-01-02	1.0	3	test	foo
1	1.0	2013-01-02	1.0	3	train	foo
2	1.0	2013-01-02	1.0	3	test	foo
3	1.0	2013-01-02	1.0	3	train	foo

Appending and replacing new dates to a dataFrame¶

The following method appends new rows to a dataframe from another dataframe based on its date index. Duplicate rows are replaced by rows from the newer dataFrame.

Let’s first create two sample dataFrames:

In [9]:

import pandas as pd
import numpy as np


def generateValues(timeSeries):
    for i in range(len(timeSeries)):
        yield np.random.randint(1, 10)


# Set the time series
ts1 = pd.date_range('2012-04-01', '2012-04-06')
ts2 = pd.date_range('2012-04-03', '2012-04-08')

# Generate the dataFrames
df1 = pd.DataFrame({'Date': ts1, 'Values': list(generateValues(ts1))})
df2 = pd.DataFrame({'Date': ts2, 'Values': list(generateValues(ts2))})

separation = "_" * 40
print("df1\n{0}\n\n{1}\n".format(separation,df1))
print("df2\n{0}\n\n{1}\n".format(separation,df2))

df1
________________________________________

        Date  Values
0 2012-04-01       7
1 2012-04-02       6
2 2012-04-03       8
3 2012-04-04       4
4 2012-04-05       9
5 2012-04-06       9

df2
________________________________________

        Date  Values
0 2012-04-03       6
1 2012-04-04       6
2 2012-04-05       1
3 2012-04-06       7
4 2012-04-07       7
5 2012-04-08       3

You can append the new values based on a column containing datetime values:

In [6]:

import pdHelpers

# Instantiate the pandas helper module
helpers = pdHelpers.Helpers()

# append the new dates
newDf = helpers.appendNewDates(df1,df2,index_column='Date')
newDf

Out[6]:

	Date	Values
0	2012-04-01	6
1	2012-04-02	6
0	2012-04-03	6
1	2012-04-04	9
2	2012-04-05	4
3	2012-04-06	9
4	2012-04-07	6
5	2012-04-08	8

You can also append the new values based the dataFrame’s date index:

In [7]:

df1_indexed = df1.set_index('Date')
df2_indexed = df2.set_index('Date')

df_indexed = helpers.appendNewDates(df1_indexed, df2_indexed)
df_indexed

Out[7]:

	Values
Date
2012-04-01	6
2012-04-02	6
2012-04-03	6
2012-04-04	9
2012-04-05	4
2012-04-06	9
2012-04-07	6
2012-04-08	8

Convert a column to datetime index¶

Converts a date column to a datetime format and sets it as a sorted index.

In [10]:

import numpy as np
import pandas as pd
import pdHelpers

# Instantiate the pandas helper module
helpers = pdHelpers.Helpers()

# Create the time series
ts = ('2012-04-0{}'.format(x) for x in range(1,9))

# Generate the dataFrame
df = pd.DataFrame(dict(Dates=list(ts), Values=[
                        np.random.randint(1, 10) for x in range(1,9)]))

# convert the date column and set it as index
indexed_df = helpers.setDateIndex(df,'Dates')

# Print the results
separation = "_" * 40
print("The original dataframe dtypes\n{0}".format(separation))
info = df.info()
print("\nThe original dataFrame\n{0}\n\n{1}\n".format(separation,df))
print("\nThe converted dataFrame\n{0}\n\n{1}\n".format(separation,indexed_df))

The original dataframe dtypes
________________________________________
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 8 entries, 0 to 7
Data columns (total 2 columns):
Dates     8 non-null object
Values    8 non-null int64
dtypes: int64(1), object(1)
memory usage: 200.0+ bytes

The original dataFrame
________________________________________

        Dates  Values
0  2012-04-01       2
1  2012-04-02       1
2  2012-04-03       3
3  2012-04-04       3
4  2012-04-05       9
5  2012-04-06       6
6  2012-04-07       3
7  2012-04-08       1


The converted dataFrame
________________________________________

            Values
Date
2012-04-01       2
2012-04-02       1
2012-04-03       3
2012-04-04       3
2012-04-05       9
2012-04-06       6
2012-04-07       3
2012-04-08       1