Installation

Simple installation

To install this package run the following command

$ pip install git+https://github.com/aminbouraiss/pdhelpers.git

From a cloned repository

To install it from a cloned repository:

$ git clone https://github.com/aminbouraiss/pdhelpers.git

Then just run the setup.py file from that directory:

$ sudo python setup.py install

Instantiating the class

In [1]:
import pdHelpers
helpers = pdHelpers.Helpers()

Formatting values

The following methods changes the format of the displayed values to faciliate data analysis.

Round floats

This method automatically rounds values displayed by pandas

In [5]:
import random
import pdHelpers
helpers = pdHelpers.Helpers()


df = helpers.generate_Df() # Generate a test DataFrame
df.C = df.C.apply(lambda x: random.uniform(1, 10))
df.C
Out[5]:
0    3.400638
1    7.984441
2    7.023958
3    7.388083
Name: C, dtype: float64
In [7]:
helpers.round_floats() # Automatically round floats
df.C
Out[7]:
0   3.40
1   7.98
2   7.02
3   7.39
Name: C, dtype: float64
In [8]:
df.C.tolist() # The exported data is untouched
Out[8]:
[3.400637971393313, 7.98444055902644, 7.023957929542492, 7.38808312498648]

Change the columns width

Change the maximum column width (defaults to 150 characters wide).

In [1]:
import random
import pdHelpers

helpers = pdHelpers.Helpers()

# Create a dataFrame with a column 150 characters wide
df = helpers.generate_Df() # Generate a test DataFrame
df.C = df.C.apply(lambda x: random.uniform(1, 10))
longDf = (df[['A','C']]
 .assign(J='long text ' * 15 )
)

# Truncated values are replaced by an ellipsis.
longDf
Out[1]:
A C J
0 1.0 8.803177 long text long text long text long text long t...
1 1.0 9.123195 long text long text long text long text long t...
2 1.0 4.611685 long text long text long text long text long t...
3 1.0 8.317267 long text long text long text long text long t...
In [2]:
# Change the column width to 300 characters
helpers.columnsWidth(300)

# The column is no longer truncated
longDf
Out[2]:
A C J
0 1.0 8.803177 long text long text long text long text long text long text long text long text long text long text long text long text long text long text long text
1 1.0 9.123195 long text long text long text long text long text long text long text long text long text long text long text long text long text long text long text
2 1.0 4.611685 long text long text long text long text long text long text long text long text long text long text long text long text long text long text long text
3 1.0 8.317267 long text long text long text long text long text long text long text long text long text long text long text long text long text long text long text

Display more columns

Changes the maximum number of untruncated columns (defaults to 40).

In [1]:
import numpy as np
import pandas as pd
import pdHelpers

helpers = pdHelpers.Helpers()

# Create a dataFrame 24 columns wide
cols = np.random.rand(5,24)
df = pd.DataFrame(cols)

# Truncated columns are replaced by an ellipsis.
df
Out[1]:
0 1 2 3 4 5 6 7 8 9 ... 14 15 16 17 18 19 20 21 22 23
0 0.072880 0.763096 0.813992 0.278522 0.156192 0.376699 0.074667 0.339499 0.464146 0.555443 ... 0.720152 0.791934 0.886861 0.243501 0.912817 0.745485 0.310526 0.596069 0.559586 0.787782
1 0.243877 0.993720 0.623229 0.119818 0.640403 0.977420 0.237943 0.668025 0.735992 0.464591 ... 0.132466 0.171589 0.839932 0.457072 0.405125 0.507673 0.796509 0.376811 0.439065 0.124278
2 0.342922 0.614482 0.889817 0.691211 0.156495 0.340927 0.610162 0.754342 0.724617 0.094956 ... 0.603796 0.298899 0.406135 0.563099 0.467862 0.128168 0.975015 0.673482 0.720828 0.110346
3 0.692669 0.580550 0.305752 0.582524 0.551636 0.796934 0.717072 0.876030 0.913818 0.036550 ... 0.831621 0.526914 0.698815 0.984493 0.339227 0.435308 0.158343 0.702512 0.937638 0.144690
4 0.401414 0.122228 0.435908 0.509301 0.292995 0.269782 0.573507 0.526350 0.086608 0.906121 ... 0.106145 0.686782 0.770461 0.197319 0.294803 0.231304 0.799670 0.585498 0.068396 0.556537

5 rows × 24 columns

In [2]:
# Set the maximum columns displayed to 30
helpers.maxColumns(30)

# All the columns are displayed
df
Out[2]:
0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23
0 0.072880 0.763096 0.813992 0.278522 0.156192 0.376699 0.074667 0.339499 0.464146 0.555443 0.476052 0.052743 0.497787 0.900314 0.720152 0.791934 0.886861 0.243501 0.912817 0.745485 0.310526 0.596069 0.559586 0.787782
1 0.243877 0.993720 0.623229 0.119818 0.640403 0.977420 0.237943 0.668025 0.735992 0.464591 0.887112 0.195534 0.215006 0.651067 0.132466 0.171589 0.839932 0.457072 0.405125 0.507673 0.796509 0.376811 0.439065 0.124278
2 0.342922 0.614482 0.889817 0.691211 0.156495 0.340927 0.610162 0.754342 0.724617 0.094956 0.494325 0.831894 0.898089 0.097960 0.603796 0.298899 0.406135 0.563099 0.467862 0.128168 0.975015 0.673482 0.720828 0.110346
3 0.692669 0.580550 0.305752 0.582524 0.551636 0.796934 0.717072 0.876030 0.913818 0.036550 0.317423 0.134839 0.266308 0.411994 0.831621 0.526914 0.698815 0.984493 0.339227 0.435308 0.158343 0.702512 0.937638 0.144690
4 0.401414 0.122228 0.435908 0.509301 0.292995 0.269782 0.573507 0.526350 0.086608 0.906121 0.435981 0.426492 0.297920 0.094129 0.106145 0.686782 0.770461 0.197319 0.294803 0.231304 0.799670 0.585498 0.068396 0.556537

Analyzing a dataFrame

Return the value count for each column (incuding NaNs)

Returns the value count (including NaNs) for each column in a dict.

In [1]:
import numpy as np
import pandas as pd

# Create a dataFrame 24 columns wide
cols = np.random.rand(5,4)
df = pd.DataFrame(cols,columns=list('ABCD'))

df.A[:2] = df.A[4]
df.C[1:4] = df.C[0]
df.D[3] = df.D[2]
df.C[4] = None
df.D[4] = None
df.A[1] = None
df.D[:3] = 5

df
Out[1]:
A B C D
0 0.565309 0.006483 0.354735 5.000000
1 NaN 0.825698 0.354735 5.000000
2 0.954067 0.725308 0.354735 5.000000
3 0.153352 0.606400 0.354735 0.846074
4 0.565309 0.693732 NaN NaN
In [2]:
import pdHelpers

# Instantiate the module
helpers = pdHelpers.Helpers()

# Print the count for the column C
colCount = helpers.valueCount(df)
print(colCount['C'])
      Value  Count Column
0  0.354735      4      C
1       NaN      1      C

Display the value count for each column (Including NaNs)

Displays the value count (including NaNs) for each column.

In [1]:
import numpy as np
import pandas as pd

# Create a dataFrame 24 columns wide
cols = np.random.rand(5,4)
df = pd.DataFrame(cols,columns=list('ABCD'))

df.A[:2] = df.A[4]
df.C[1:4] = df.C[0]
df.D[3] = df.D[2]
df.C[4] = None
df.D[4] = None
df.A[1] = None
df.D[:3] = 5

print(df)
          A         B         C         D
0  0.233378  0.510658  0.909901  5.000000
1       NaN  0.908558  0.909901  5.000000
2  0.000251  0.973638  0.909901  5.000000
3  0.721885  0.664116  0.909901  0.263451
4  0.233378  0.359453       NaN       NaN
In [2]:
import pdHelpers

# Instantiate the pandas helper module
helpers = pdHelpers.Helpers()

# Show the value count for each column
helpers.showValueCount(df)

Column A

Value Count Column
0 0.233378 2 A
1 0.721885 1 A
2 0.000251 1 A
3 NaN 1 A

Column B

Value Count Column
0 0.973638 1 B
1 0.359453 1 B
2 0.664116 1 B
3 0.908558 1 B
4 0.510658 1 B

Column C

Value Count Column
0 0.909901 4 C
1 NaN 1 C

Column D

Value Count Column
0 5.000000 3 D
1 0.263451 1 D
2 NaN 1 D

Find columns matching a Regex pattern

In [1]:
import numpy as np
import pandas as pd

# Create a dataFrame 24 columns wide
cols = np.random.rand(5,3)
df = pd.DataFrame(cols,columns=['Foo','Bar','FooBar'])

print(df)
        Foo       Bar    FooBar
0  0.017451  0.331383  0.496636
1  0.950639  0.054829  0.056220
2  0.845740  0.359787  0.809131
3  0.640001  0.264786  0.146170
4  0.308577  0.832595  0.948198
In [2]:
import pdHelpers

# Instantiate the pandas helper module
helpers = pdHelpers.Helpers()

# Return the column names ending with 'ar'
helpers.searchCols(df,'ar$')
Out[2]:
Index([u'Bar', u'FooBar'], dtype='object')

Find a dataFrame’s dimensions and Metrics

Find a DataFrame’s dimensions and metrics, the dimensions are columns matching one the following dtypes:

  • datetime64
  • category
  • object
  • datetime
In [3]:
import pdHelpers

# Instantiate the pandas helper module
helpers = pdHelpers.Helpers()

# Generate a dataFrame
df = helpers.generate_Df()
print(df.info())
df

<class 'pandas.core.frame.DataFrame'>
Int64Index: 4 entries, 0 to 3
Data columns (total 6 columns):
A    4 non-null float64
B    4 non-null datetime64[ns]
C    4 non-null float32
D    4 non-null int32
E    4 non-null category
F    4 non-null object
dtypes: category(1), datetime64[ns](1), float32(1), float64(1), int32(1), object(1)
memory usage: 260.0+ bytes
None
Out[3]:
A B C D E F
0 1.0 2013-01-02 1.0 3 test foo
1 1.0 2013-01-02 1.0 3 train foo
2 1.0 2013-01-02 1.0 3 test foo
3 1.0 2013-01-02 1.0 3 train foo
In [2]:
# Find the dataFrame's dimensions
helpers.findDimensions(df)
Out[2]:
['B', 'E', 'F']
In [3]:
# Find the dataFrame's metrics
helpers.findMetrics(df)
Out[3]:
['A', 'C', 'D']

Compare columns between dataFrames

commonCols - Get common columns

Perform a discrepancy check between two data frames, it compares on the common dimensions and metrics between the two data frames.

This method returns a dict with three keys:

  • commonDims: The dimensions present in both columns.both.
  • commonMetrics: The metrics present in both columns.
  • commonCols: The columns present in df1 not present in df2 (metrics + dimensions)

diffCols - Get diverging columns

Get the names of the columns present in the first dataframe specified and absent in the second.

This method returns a dict with three keys:

  • dimDiff: The dimensions present in df1 not present in df2.
  • metricDiff: The metrics present in df1 not present in df2.
  • allcolsDif: The columns present in df1 not present in df2 (metrics + dimensions).
In [8]:
import pdHelpers
import pprint

# Instantiate the pandas helper module
helpers = pdHelpers.Helpers()

# Generate the dataFrames
df1 = helpers.generate_Df()
df2 = (df[['B','D']]
         .assign(H=df.D*3))

# Print the two dataframes
separation = "_" * 40
print("df1\n{0}\n\n{1}\n".format(separation,df1))
print("df2\n{0}\n\n{1}\n".format(separation,df2))

# find the commmon columns
common = helpers.commonCols(df1,df2)
print("Common columns\n{0}\n".format(separation))
pprint.pprint(common)

# find the diverging columns
difference = helpers.diffCols(df1,df2)
print("\nDiverging columns\n{0}\n".format(separation))
pprint.pprint(difference)
df1
________________________________________

     A          B    C  D      E    F
0  1.0 2013-01-02  1.0  3   test  foo
1  1.0 2013-01-02  1.0  3  train  foo
2  1.0 2013-01-02  1.0  3   test  foo
3  1.0 2013-01-02  1.0  3  train  foo

df2
________________________________________

           B  D  H
0 2013-01-02  3  9
1 2013-01-02  3  9
2 2013-01-02  3  9
3 2013-01-02  3  9

Common columns
________________________________________

{'commonCols': ['B', 'D'], 'commonDims': ['B'], 'commonMetrics': ['D']}

Diverging columns
________________________________________

{'allcolsDif': ['A', 'C', 'E', 'F'],
 'dimDiff': ['E', 'F'],
 'metricDiff': ['A', 'C']}

Generate a sample dataFrame

Generate a sample pandas DataFrame with the following column types:

  • float64
  • datetime64[ns]
  • float32
  • int32
  • int32
  • category
  • object
In [1]:
import pdHelpers

# Instantiate the pandas helper module
helpers = pdHelpers.Helpers()

# Generate a sample dataFrame
df = helpers.generate_Df()

df.info()

df
<class 'pandas.core.frame.DataFrame'>
Int64Index: 4 entries, 0 to 3
Data columns (total 6 columns):
A    4 non-null float64
B    4 non-null datetime64[ns]
C    4 non-null float32
D    4 non-null int32
E    4 non-null category
F    4 non-null object
dtypes: category(1), datetime64[ns](1), float32(1), float64(1), int32(1), object(1)
memory usage: 260.0+ bytes
Out[1]:
A B C D E F
0 1.0 2013-01-02 1.0 3 test foo
1 1.0 2013-01-02 1.0 3 train foo
2 1.0 2013-01-02 1.0 3 test foo
3 1.0 2013-01-02 1.0 3 train foo

Delete columns safely

Safely delete a DataFrame column whithout raising an error if the column doesn’t exist.

In [2]:
import pdHelpers

# Instantiate the pandas helper module
helpers = pdHelpers.Helpers()

df = helpers.generate_Df()
df
Out[2]:
A B C D E F
0 1.0 2013-01-02 1.0 3 test foo
1 1.0 2013-01-02 1.0 3 train foo
2 1.0 2013-01-02 1.0 3 test foo
3 1.0 2013-01-02 1.0 3 train foo
In [3]:
# Try to delete a non existing column
helpers.deleteCol(df,'G')
df
Out[3]:
A B C D E F
0 1.0 2013-01-02 1.0 3 test foo
1 1.0 2013-01-02 1.0 3 train foo
2 1.0 2013-01-02 1.0 3 test foo
3 1.0 2013-01-02 1.0 3 train foo

Appending and replacing new dates to a dataFrame

The following method appends new rows to a dataframe from another dataframe based on its date index. Duplicate rows are replaced by rows from the newer dataFrame.

Let’s first create two sample dataFrames:

In [9]:
import pandas as pd
import numpy as np


def generateValues(timeSeries):
    for i in range(len(timeSeries)):
        yield np.random.randint(1, 10)


# Set the time series
ts1 = pd.date_range('2012-04-01', '2012-04-06')
ts2 = pd.date_range('2012-04-03', '2012-04-08')

# Generate the dataFrames
df1 = pd.DataFrame({'Date': ts1, 'Values': list(generateValues(ts1))})
df2 = pd.DataFrame({'Date': ts2, 'Values': list(generateValues(ts2))})

separation = "_" * 40
print("df1\n{0}\n\n{1}\n".format(separation,df1))
print("df2\n{0}\n\n{1}\n".format(separation,df2))

df1
________________________________________

        Date  Values
0 2012-04-01       7
1 2012-04-02       6
2 2012-04-03       8
3 2012-04-04       4
4 2012-04-05       9
5 2012-04-06       9

df2
________________________________________

        Date  Values
0 2012-04-03       6
1 2012-04-04       6
2 2012-04-05       1
3 2012-04-06       7
4 2012-04-07       7
5 2012-04-08       3

You can append the new values based on a column containing datetime values:

In [6]:
import pdHelpers

# Instantiate the pandas helper module
helpers = pdHelpers.Helpers()

# append the new dates
newDf = helpers.appendNewDates(df1,df2,index_column='Date')
newDf
Out[6]:
Date Values
0 2012-04-01 6
1 2012-04-02 6
0 2012-04-03 6
1 2012-04-04 9
2 2012-04-05 4
3 2012-04-06 9
4 2012-04-07 6
5 2012-04-08 8

You can also append the new values based the dataFrame’s date index:

In [7]:
df1_indexed = df1.set_index('Date')
df2_indexed = df2.set_index('Date')

df_indexed = helpers.appendNewDates(df1_indexed, df2_indexed)
df_indexed

Out[7]:
Values
Date
2012-04-01 6
2012-04-02 6
2012-04-03 6
2012-04-04 9
2012-04-05 4
2012-04-06 9
2012-04-07 6
2012-04-08 8

Convert a column to datetime index

Converts a date column to a datetime format and sets it as a sorted index.

In [10]:
import numpy as np
import pandas as pd
import pdHelpers

# Instantiate the pandas helper module
helpers = pdHelpers.Helpers()

# Create the time series
ts = ('2012-04-0{}'.format(x) for x in range(1,9))

# Generate the dataFrame
df = pd.DataFrame(dict(Dates=list(ts), Values=[
                        np.random.randint(1, 10) for x in range(1,9)]))

# convert the date column and set it as index
indexed_df = helpers.setDateIndex(df,'Dates')

# Print the results
separation = "_" * 40
print("The original dataframe dtypes\n{0}".format(separation))
info = df.info()
print("\nThe original dataFrame\n{0}\n\n{1}\n".format(separation,df))
print("\nThe converted dataFrame\n{0}\n\n{1}\n".format(separation,indexed_df))

The original dataframe dtypes
________________________________________
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 8 entries, 0 to 7
Data columns (total 2 columns):
Dates     8 non-null object
Values    8 non-null int64
dtypes: int64(1), object(1)
memory usage: 200.0+ bytes

The original dataFrame
________________________________________

        Dates  Values
0  2012-04-01       2
1  2012-04-02       1
2  2012-04-03       3
3  2012-04-04       3
4  2012-04-05       9
5  2012-04-06       6
6  2012-04-07       3
7  2012-04-08       1


The converted dataFrame
________________________________________

            Values
Date
2012-04-01       2
2012-04-02       1
2012-04-03       3
2012-04-04       3
2012-04-05       9
2012-04-06       6
2012-04-07       3
2012-04-08       1