Installation¶
Simple installation¶
To install this package run the following command
$ pip install git+https://github.com/aminbouraiss/pdhelpers.git
From a cloned repository¶
To install it from a cloned repository:
$ git clone https://github.com/aminbouraiss/pdhelpers.git
Then just run the setup.py file from that directory:
$ sudo python setup.py install
Instantiating the class¶
In [1]:
import pdHelpers
helpers = pdHelpers.Helpers()
Formatting values¶
The following methods changes the format of the displayed values to faciliate data analysis.
Round floats¶
This method automatically rounds values displayed by pandas
In [5]:
import random
import pdHelpers
helpers = pdHelpers.Helpers()
df = helpers.generate_Df() # Generate a test DataFrame
df.C = df.C.apply(lambda x: random.uniform(1, 10))
df.C
Out[5]:
0 3.400638
1 7.984441
2 7.023958
3 7.388083
Name: C, dtype: float64
In [7]:
helpers.round_floats() # Automatically round floats
df.C
Out[7]:
0 3.40
1 7.98
2 7.02
3 7.39
Name: C, dtype: float64
In [8]:
df.C.tolist() # The exported data is untouched
Out[8]:
[3.400637971393313, 7.98444055902644, 7.023957929542492, 7.38808312498648]
Change the columns width¶
Change the maximum column width (defaults to 150 characters wide).
In [1]:
import random
import pdHelpers
helpers = pdHelpers.Helpers()
# Create a dataFrame with a column 150 characters wide
df = helpers.generate_Df() # Generate a test DataFrame
df.C = df.C.apply(lambda x: random.uniform(1, 10))
longDf = (df[['A','C']]
.assign(J='long text ' * 15 )
)
# Truncated values are replaced by an ellipsis.
longDf
Out[1]:
| A | C | J | |
|---|---|---|---|
| 0 | 1.0 | 8.803177 | long text long text long text long text long t... |
| 1 | 1.0 | 9.123195 | long text long text long text long text long t... |
| 2 | 1.0 | 4.611685 | long text long text long text long text long t... |
| 3 | 1.0 | 8.317267 | long text long text long text long text long t... |
In [2]:
# Change the column width to 300 characters
helpers.columnsWidth(300)
# The column is no longer truncated
longDf
Out[2]:
| A | C | J | |
|---|---|---|---|
| 0 | 1.0 | 8.803177 | long text long text long text long text long text long text long text long text long text long text long text long text long text long text long text |
| 1 | 1.0 | 9.123195 | long text long text long text long text long text long text long text long text long text long text long text long text long text long text long text |
| 2 | 1.0 | 4.611685 | long text long text long text long text long text long text long text long text long text long text long text long text long text long text long text |
| 3 | 1.0 | 8.317267 | long text long text long text long text long text long text long text long text long text long text long text long text long text long text long text |
Display more columns¶
Changes the maximum number of untruncated columns (defaults to 40).
In [1]:
import numpy as np
import pandas as pd
import pdHelpers
helpers = pdHelpers.Helpers()
# Create a dataFrame 24 columns wide
cols = np.random.rand(5,24)
df = pd.DataFrame(cols)
# Truncated columns are replaced by an ellipsis.
df
Out[1]:
| 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | ... | 14 | 15 | 16 | 17 | 18 | 19 | 20 | 21 | 22 | 23 | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0.072880 | 0.763096 | 0.813992 | 0.278522 | 0.156192 | 0.376699 | 0.074667 | 0.339499 | 0.464146 | 0.555443 | ... | 0.720152 | 0.791934 | 0.886861 | 0.243501 | 0.912817 | 0.745485 | 0.310526 | 0.596069 | 0.559586 | 0.787782 |
| 1 | 0.243877 | 0.993720 | 0.623229 | 0.119818 | 0.640403 | 0.977420 | 0.237943 | 0.668025 | 0.735992 | 0.464591 | ... | 0.132466 | 0.171589 | 0.839932 | 0.457072 | 0.405125 | 0.507673 | 0.796509 | 0.376811 | 0.439065 | 0.124278 |
| 2 | 0.342922 | 0.614482 | 0.889817 | 0.691211 | 0.156495 | 0.340927 | 0.610162 | 0.754342 | 0.724617 | 0.094956 | ... | 0.603796 | 0.298899 | 0.406135 | 0.563099 | 0.467862 | 0.128168 | 0.975015 | 0.673482 | 0.720828 | 0.110346 |
| 3 | 0.692669 | 0.580550 | 0.305752 | 0.582524 | 0.551636 | 0.796934 | 0.717072 | 0.876030 | 0.913818 | 0.036550 | ... | 0.831621 | 0.526914 | 0.698815 | 0.984493 | 0.339227 | 0.435308 | 0.158343 | 0.702512 | 0.937638 | 0.144690 |
| 4 | 0.401414 | 0.122228 | 0.435908 | 0.509301 | 0.292995 | 0.269782 | 0.573507 | 0.526350 | 0.086608 | 0.906121 | ... | 0.106145 | 0.686782 | 0.770461 | 0.197319 | 0.294803 | 0.231304 | 0.799670 | 0.585498 | 0.068396 | 0.556537 |
5 rows × 24 columns
In [2]:
# Set the maximum columns displayed to 30
helpers.maxColumns(30)
# All the columns are displayed
df
Out[2]:
| 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 | 11 | 12 | 13 | 14 | 15 | 16 | 17 | 18 | 19 | 20 | 21 | 22 | 23 | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0.072880 | 0.763096 | 0.813992 | 0.278522 | 0.156192 | 0.376699 | 0.074667 | 0.339499 | 0.464146 | 0.555443 | 0.476052 | 0.052743 | 0.497787 | 0.900314 | 0.720152 | 0.791934 | 0.886861 | 0.243501 | 0.912817 | 0.745485 | 0.310526 | 0.596069 | 0.559586 | 0.787782 |
| 1 | 0.243877 | 0.993720 | 0.623229 | 0.119818 | 0.640403 | 0.977420 | 0.237943 | 0.668025 | 0.735992 | 0.464591 | 0.887112 | 0.195534 | 0.215006 | 0.651067 | 0.132466 | 0.171589 | 0.839932 | 0.457072 | 0.405125 | 0.507673 | 0.796509 | 0.376811 | 0.439065 | 0.124278 |
| 2 | 0.342922 | 0.614482 | 0.889817 | 0.691211 | 0.156495 | 0.340927 | 0.610162 | 0.754342 | 0.724617 | 0.094956 | 0.494325 | 0.831894 | 0.898089 | 0.097960 | 0.603796 | 0.298899 | 0.406135 | 0.563099 | 0.467862 | 0.128168 | 0.975015 | 0.673482 | 0.720828 | 0.110346 |
| 3 | 0.692669 | 0.580550 | 0.305752 | 0.582524 | 0.551636 | 0.796934 | 0.717072 | 0.876030 | 0.913818 | 0.036550 | 0.317423 | 0.134839 | 0.266308 | 0.411994 | 0.831621 | 0.526914 | 0.698815 | 0.984493 | 0.339227 | 0.435308 | 0.158343 | 0.702512 | 0.937638 | 0.144690 |
| 4 | 0.401414 | 0.122228 | 0.435908 | 0.509301 | 0.292995 | 0.269782 | 0.573507 | 0.526350 | 0.086608 | 0.906121 | 0.435981 | 0.426492 | 0.297920 | 0.094129 | 0.106145 | 0.686782 | 0.770461 | 0.197319 | 0.294803 | 0.231304 | 0.799670 | 0.585498 | 0.068396 | 0.556537 |
Analyzing a dataFrame¶
Return the value count for each column (incuding NaNs)¶
Returns the value count (including NaNs) for each column in a dict.
In [1]:
import numpy as np
import pandas as pd
# Create a dataFrame 24 columns wide
cols = np.random.rand(5,4)
df = pd.DataFrame(cols,columns=list('ABCD'))
df.A[:2] = df.A[4]
df.C[1:4] = df.C[0]
df.D[3] = df.D[2]
df.C[4] = None
df.D[4] = None
df.A[1] = None
df.D[:3] = 5
df
Out[1]:
| A | B | C | D | |
|---|---|---|---|---|
| 0 | 0.565309 | 0.006483 | 0.354735 | 5.000000 |
| 1 | NaN | 0.825698 | 0.354735 | 5.000000 |
| 2 | 0.954067 | 0.725308 | 0.354735 | 5.000000 |
| 3 | 0.153352 | 0.606400 | 0.354735 | 0.846074 |
| 4 | 0.565309 | 0.693732 | NaN | NaN |
In [2]:
import pdHelpers
# Instantiate the module
helpers = pdHelpers.Helpers()
# Print the count for the column C
colCount = helpers.valueCount(df)
print(colCount['C'])
Value Count Column
0 0.354735 4 C
1 NaN 1 C
Display the value count for each column (Including NaNs)¶
Displays the value count (including NaNs) for each column.
In [1]:
import numpy as np
import pandas as pd
# Create a dataFrame 24 columns wide
cols = np.random.rand(5,4)
df = pd.DataFrame(cols,columns=list('ABCD'))
df.A[:2] = df.A[4]
df.C[1:4] = df.C[0]
df.D[3] = df.D[2]
df.C[4] = None
df.D[4] = None
df.A[1] = None
df.D[:3] = 5
print(df)
A B C D
0 0.233378 0.510658 0.909901 5.000000
1 NaN 0.908558 0.909901 5.000000
2 0.000251 0.973638 0.909901 5.000000
3 0.721885 0.664116 0.909901 0.263451
4 0.233378 0.359453 NaN NaN
In [2]:
import pdHelpers
# Instantiate the pandas helper module
helpers = pdHelpers.Helpers()
# Show the value count for each column
helpers.showValueCount(df)
Column A
| Value | Count | Column | |
|---|---|---|---|
| 0 | 0.233378 | 2 | A |
| 1 | 0.721885 | 1 | A |
| 2 | 0.000251 | 1 | A |
| 3 | NaN | 1 | A |
Column B
| Value | Count | Column | |
|---|---|---|---|
| 0 | 0.973638 | 1 | B |
| 1 | 0.359453 | 1 | B |
| 2 | 0.664116 | 1 | B |
| 3 | 0.908558 | 1 | B |
| 4 | 0.510658 | 1 | B |
Column C
| Value | Count | Column | |
|---|---|---|---|
| 0 | 0.909901 | 4 | C |
| 1 | NaN | 1 | C |
Column D
| Value | Count | Column | |
|---|---|---|---|
| 0 | 5.000000 | 3 | D |
| 1 | 0.263451 | 1 | D |
| 2 | NaN | 1 | D |
Find columns matching a Regex pattern¶
In [1]:
import numpy as np
import pandas as pd
# Create a dataFrame 24 columns wide
cols = np.random.rand(5,3)
df = pd.DataFrame(cols,columns=['Foo','Bar','FooBar'])
print(df)
Foo Bar FooBar
0 0.017451 0.331383 0.496636
1 0.950639 0.054829 0.056220
2 0.845740 0.359787 0.809131
3 0.640001 0.264786 0.146170
4 0.308577 0.832595 0.948198
In [2]:
import pdHelpers
# Instantiate the pandas helper module
helpers = pdHelpers.Helpers()
# Return the column names ending with 'ar'
helpers.searchCols(df,'ar$')
Out[2]:
Index([u'Bar', u'FooBar'], dtype='object')
Find a dataFrame’s dimensions and Metrics¶
Find a DataFrame’s dimensions and metrics, the dimensions are columns matching one the following dtypes:
- datetime64
- category
- object
- datetime
In [3]:
import pdHelpers
# Instantiate the pandas helper module
helpers = pdHelpers.Helpers()
# Generate a dataFrame
df = helpers.generate_Df()
print(df.info())
df
<class 'pandas.core.frame.DataFrame'>
Int64Index: 4 entries, 0 to 3
Data columns (total 6 columns):
A 4 non-null float64
B 4 non-null datetime64[ns]
C 4 non-null float32
D 4 non-null int32
E 4 non-null category
F 4 non-null object
dtypes: category(1), datetime64[ns](1), float32(1), float64(1), int32(1), object(1)
memory usage: 260.0+ bytes
None
Out[3]:
| A | B | C | D | E | F | |
|---|---|---|---|---|---|---|
| 0 | 1.0 | 2013-01-02 | 1.0 | 3 | test | foo |
| 1 | 1.0 | 2013-01-02 | 1.0 | 3 | train | foo |
| 2 | 1.0 | 2013-01-02 | 1.0 | 3 | test | foo |
| 3 | 1.0 | 2013-01-02 | 1.0 | 3 | train | foo |
In [2]:
# Find the dataFrame's dimensions
helpers.findDimensions(df)
Out[2]:
['B', 'E', 'F']
In [3]:
# Find the dataFrame's metrics
helpers.findMetrics(df)
Out[3]:
['A', 'C', 'D']
Compare columns between dataFrames¶
commonCols - Get common columns¶
Perform a discrepancy check between two data frames, it compares on the common dimensions and metrics between the two data frames.
This method returns a dict with three keys:
- commonDims: The dimensions present in both columns.both.
- commonMetrics: The metrics present in both columns.
- commonCols: The columns present in df1 not present in df2 (metrics + dimensions)
diffCols - Get diverging columns¶
Get the names of the columns present in the first dataframe specified and absent in the second.
This method returns a dict with three keys:
- dimDiff: The dimensions present in df1 not present in df2.
- metricDiff: The metrics present in df1 not present in df2.
- allcolsDif: The columns present in df1 not present in df2 (metrics + dimensions).
In [8]:
import pdHelpers
import pprint
# Instantiate the pandas helper module
helpers = pdHelpers.Helpers()
# Generate the dataFrames
df1 = helpers.generate_Df()
df2 = (df[['B','D']]
.assign(H=df.D*3))
# Print the two dataframes
separation = "_" * 40
print("df1\n{0}\n\n{1}\n".format(separation,df1))
print("df2\n{0}\n\n{1}\n".format(separation,df2))
# find the commmon columns
common = helpers.commonCols(df1,df2)
print("Common columns\n{0}\n".format(separation))
pprint.pprint(common)
# find the diverging columns
difference = helpers.diffCols(df1,df2)
print("\nDiverging columns\n{0}\n".format(separation))
pprint.pprint(difference)
df1
________________________________________
A B C D E F
0 1.0 2013-01-02 1.0 3 test foo
1 1.0 2013-01-02 1.0 3 train foo
2 1.0 2013-01-02 1.0 3 test foo
3 1.0 2013-01-02 1.0 3 train foo
df2
________________________________________
B D H
0 2013-01-02 3 9
1 2013-01-02 3 9
2 2013-01-02 3 9
3 2013-01-02 3 9
Common columns
________________________________________
{'commonCols': ['B', 'D'], 'commonDims': ['B'], 'commonMetrics': ['D']}
Diverging columns
________________________________________
{'allcolsDif': ['A', 'C', 'E', 'F'],
'dimDiff': ['E', 'F'],
'metricDiff': ['A', 'C']}
Generate a sample dataFrame¶
Generate a sample pandas DataFrame with the following column types:
- float64
- datetime64[ns]
- float32
- int32
- int32
- category
- object
In [1]:
import pdHelpers
# Instantiate the pandas helper module
helpers = pdHelpers.Helpers()
# Generate a sample dataFrame
df = helpers.generate_Df()
df.info()
df
<class 'pandas.core.frame.DataFrame'>
Int64Index: 4 entries, 0 to 3
Data columns (total 6 columns):
A 4 non-null float64
B 4 non-null datetime64[ns]
C 4 non-null float32
D 4 non-null int32
E 4 non-null category
F 4 non-null object
dtypes: category(1), datetime64[ns](1), float32(1), float64(1), int32(1), object(1)
memory usage: 260.0+ bytes
Out[1]:
| A | B | C | D | E | F | |
|---|---|---|---|---|---|---|
| 0 | 1.0 | 2013-01-02 | 1.0 | 3 | test | foo |
| 1 | 1.0 | 2013-01-02 | 1.0 | 3 | train | foo |
| 2 | 1.0 | 2013-01-02 | 1.0 | 3 | test | foo |
| 3 | 1.0 | 2013-01-02 | 1.0 | 3 | train | foo |
Delete columns safely¶
Safely delete a DataFrame column whithout raising an error if the column doesn’t exist.
In [2]:
import pdHelpers
# Instantiate the pandas helper module
helpers = pdHelpers.Helpers()
df = helpers.generate_Df()
df
Out[2]:
| A | B | C | D | E | F | |
|---|---|---|---|---|---|---|
| 0 | 1.0 | 2013-01-02 | 1.0 | 3 | test | foo |
| 1 | 1.0 | 2013-01-02 | 1.0 | 3 | train | foo |
| 2 | 1.0 | 2013-01-02 | 1.0 | 3 | test | foo |
| 3 | 1.0 | 2013-01-02 | 1.0 | 3 | train | foo |
In [3]:
# Try to delete a non existing column
helpers.deleteCol(df,'G')
df
Out[3]:
| A | B | C | D | E | F | |
|---|---|---|---|---|---|---|
| 0 | 1.0 | 2013-01-02 | 1.0 | 3 | test | foo |
| 1 | 1.0 | 2013-01-02 | 1.0 | 3 | train | foo |
| 2 | 1.0 | 2013-01-02 | 1.0 | 3 | test | foo |
| 3 | 1.0 | 2013-01-02 | 1.0 | 3 | train | foo |
Appending and replacing new dates to a dataFrame¶
The following method appends new rows to a dataframe from another dataframe based on its date index. Duplicate rows are replaced by rows from the newer dataFrame.
Let’s first create two sample dataFrames:
In [9]:
import pandas as pd
import numpy as np
def generateValues(timeSeries):
for i in range(len(timeSeries)):
yield np.random.randint(1, 10)
# Set the time series
ts1 = pd.date_range('2012-04-01', '2012-04-06')
ts2 = pd.date_range('2012-04-03', '2012-04-08')
# Generate the dataFrames
df1 = pd.DataFrame({'Date': ts1, 'Values': list(generateValues(ts1))})
df2 = pd.DataFrame({'Date': ts2, 'Values': list(generateValues(ts2))})
separation = "_" * 40
print("df1\n{0}\n\n{1}\n".format(separation,df1))
print("df2\n{0}\n\n{1}\n".format(separation,df2))
df1
________________________________________
Date Values
0 2012-04-01 7
1 2012-04-02 6
2 2012-04-03 8
3 2012-04-04 4
4 2012-04-05 9
5 2012-04-06 9
df2
________________________________________
Date Values
0 2012-04-03 6
1 2012-04-04 6
2 2012-04-05 1
3 2012-04-06 7
4 2012-04-07 7
5 2012-04-08 3
You can append the new values based on a column containing datetime values:
In [6]:
import pdHelpers
# Instantiate the pandas helper module
helpers = pdHelpers.Helpers()
# append the new dates
newDf = helpers.appendNewDates(df1,df2,index_column='Date')
newDf
Out[6]:
| Date | Values | |
|---|---|---|
| 0 | 2012-04-01 | 6 |
| 1 | 2012-04-02 | 6 |
| 0 | 2012-04-03 | 6 |
| 1 | 2012-04-04 | 9 |
| 2 | 2012-04-05 | 4 |
| 3 | 2012-04-06 | 9 |
| 4 | 2012-04-07 | 6 |
| 5 | 2012-04-08 | 8 |
You can also append the new values based the dataFrame’s date index:
In [7]:
df1_indexed = df1.set_index('Date')
df2_indexed = df2.set_index('Date')
df_indexed = helpers.appendNewDates(df1_indexed, df2_indexed)
df_indexed
Out[7]:
| Values | |
|---|---|
| Date | |
| 2012-04-01 | 6 |
| 2012-04-02 | 6 |
| 2012-04-03 | 6 |
| 2012-04-04 | 9 |
| 2012-04-05 | 4 |
| 2012-04-06 | 9 |
| 2012-04-07 | 6 |
| 2012-04-08 | 8 |
Convert a column to datetime index¶
Converts a date column to a datetime format and sets it as a sorted index.
In [10]:
import numpy as np
import pandas as pd
import pdHelpers
# Instantiate the pandas helper module
helpers = pdHelpers.Helpers()
# Create the time series
ts = ('2012-04-0{}'.format(x) for x in range(1,9))
# Generate the dataFrame
df = pd.DataFrame(dict(Dates=list(ts), Values=[
np.random.randint(1, 10) for x in range(1,9)]))
# convert the date column and set it as index
indexed_df = helpers.setDateIndex(df,'Dates')
# Print the results
separation = "_" * 40
print("The original dataframe dtypes\n{0}".format(separation))
info = df.info()
print("\nThe original dataFrame\n{0}\n\n{1}\n".format(separation,df))
print("\nThe converted dataFrame\n{0}\n\n{1}\n".format(separation,indexed_df))
The original dataframe dtypes
________________________________________
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 8 entries, 0 to 7
Data columns (total 2 columns):
Dates 8 non-null object
Values 8 non-null int64
dtypes: int64(1), object(1)
memory usage: 200.0+ bytes
The original dataFrame
________________________________________
Dates Values
0 2012-04-01 2
1 2012-04-02 1
2 2012-04-03 3
3 2012-04-04 3
4 2012-04-05 9
5 2012-04-06 6
6 2012-04-07 3
7 2012-04-08 1
The converted dataFrame
________________________________________
Values
Date
2012-04-01 2
2012-04-02 1
2012-04-03 3
2012-04-04 3
2012-04-05 9
2012-04-06 6
2012-04-07 3
2012-04-08 1