Creating datasets

Creating Datasets¶

create_data function from the corridor package can be used to create datasets by accessing registered objects on the platform

Below is a list of registered objects that are available through the Platform:

DataElement
Feature
Model
ModelTransform
Policy
GlobalFunction

In following examples, we will illustrate accessing objects using create_data. We illustrate following cases :

Create Dataset Using Object Alias
Create Dataset With Specific Versions of DE/Features
Create Dataset With ModelTransform
Create Dataset With Runtime Parameters
Create Dataset with Different Runtime Parameters
Create Dataset From Bureau Attributes
Create Dataset Across Different Platform Entity
Create Dataset from Simple DataElement/Feature Using External Dataset Outside the Platform (Discussed in 3.b Running Simulations)
Create Dataset from Aggregated DataElement/Feature Using External Data Outside the Platform (Discussed in 3.b Running Simulations)
Access registered globalfunctions and use it like python function (Discussed in 2.b. Accessing Registered Objects.ipynb)

Create Dataset Using Object Alias¶

In [2]:

Copied!





# Import Corridor Package Objects
from corridor import create_data

# Create dataset using DE alias (requested_loan_amount) and Feature alias (debt_capacity)
df = create_data('requested_loan_amount','debt_capacity')
df.limit(5).toPandas()
# Import Corridor Package Objects
from corridor import create_data

# Create dataset using DE alias (requested_loan_amount) and Feature alias (debt_capacity)
df = create_data('requested_loan_amount','debt_capacity')
df.limit(5).toPandas()

Out[2]:

	requested_loan_amount	debt_capacity
0	14650.0	0.195333
1	11000.0	0.088000
2	37000.0	0.325888
3	20000.0	0.142857
4	40000.0	0.307692

Create Dataset With Specific Versions of DE/Features¶

In [3]:

Copied!





# Import Corridor Package Objects
from corridor import create_data
from corridor import Feature
from corridor import DataElement


# Build out the DataElement/Feature with specific versions
age_of_credit_file_v2 = Feature('age_of_credit_file', version=2)
annual_income_v1 = DataElement('annual_income', version=1)

# Create dataset with that with version
df = create_data(age_of_credit_file_v2,annual_income_v1)
df.limit(5).toPandas()

# age_of_credit_file_default = Feature('age_of_credit_file')
# age_of_credit_file_v2 = Feature('age_of_credit_file', version=2)
# Import Corridor Package Objects
from corridor import create_data
from corridor import Feature
from corridor import DataElement


# Build out the DataElement/Feature with specific versions
age_of_credit_file_v2 = Feature('age_of_credit_file', version=2)
annual_income_v1 = DataElement('annual_income', version=1)

# Create dataset with that with version
df = create_data(age_of_credit_file_v2,annual_income_v1)
df.limit(5).toPandas()

# age_of_credit_file_default = Feature('age_of_credit_file')
# age_of_credit_file_v2 = Feature('age_of_credit_file', version=2)

Out[3]:

	age_of_credit_file	annual_income
0	231.0	75000.0
1	225.0	125000.0
2	171.0	113536.0
3	314.0	140000.0
4	430.0	130000.0

Create Dataset With ModelTranform¶

In [ ]:

Copied!





# Import Corridor Package Objects
from corridor import create_data
from corridor import DataElement
from corridor import ModelTransform

# Create data tables using create_data
de_fico = DataElement(alias='fico')
fico_transform = ModelTransform(model='PD Model with Transform',alias = 'fico_normalized')
data_table= create_data([de_fico,fico_transform])

# Display top 5 rows from created data table
data_table.show(5)
# Import Corridor Package Objects
from corridor import create_data
from corridor import DataElement
from corridor import ModelTransform

# Create data tables using create_data
de_fico = DataElement(alias='fico')
fico_transform = ModelTransform(model='PD Model with Transform',alias = 'fico_normalized')
data_table= create_data([de_fico,fico_transform])

# Display top 5 rows from created data table
data_table.show(5)

+-----+---------------+
| fico|fico_normalized|
+-----+---------------+
|699.0|          0.699|
|709.0|          0.709|
|749.0|          0.749|
|699.0|          0.699|
|674.0|          0.674|
+-----+---------------+
only showing top 5 rows

In case of conflicting data column names the transform names are suffiex with special keywords to reatin unique constraint on dataframe columns and help users identify columns generated using transforms

In [ ]:

Copied!





tr_fico = ModelTransform(model='PD Model with Transform',alias = 'fico_normalized')
data_table = create_data([de_fico,tr_fico])

# ModelTransform aliased fico is suffiexed with model alias 'prob_default_transform'
# since both DataElement & ModelTransform have conflicting alias
data_table.show(5)
tr_fico = ModelTransform(model='PD Model with Transform',alias = 'fico_normalized')
data_table = create_data([de_fico,tr_fico])

# ModelTransform aliased fico is suffiexed with model alias 'prob_default_transform'
# since both DataElement & ModelTransform have conflicting alias
data_table.show(5)

+-----+---------------------------+
| fico|fico_prob_default_transform|
+-----+---------------------------+
|699.0|                      699.0|
|709.0|                      709.0|
|749.0|                      749.0|
|699.0|                      699.0|
|674.0|                      674.0|
+-----+---------------------------+
only showing top 5 rows

Create Dataset With Runtime Parameters¶

In [4]:

Copied!





# Import Corridor Package Objects
from corridor import create_data
# default_flag_with_rp requires runtime parameter `default_month_period`
loan_spark = create_data('default_flag_with_rp', runtime_params= {'default_month_period': 24})
loan_spark.show(2)
# Import Corridor Package Objects
from corridor import create_data
# default_flag_with_rp requires runtime parameter `default_month_period`
loan_spark = create_data('default_flag_with_rp', runtime_params= {'default_month_period': 24})
loan_spark.show(2)

+--------------------+
|default_flag_with_rp|
+--------------------+
|                 0.0|
|                 0.0|
+--------------------+
only showing top 2 rows

Create Dataset with Different Runtime Parameters¶

In [5]:

Copied!





from corridor import create_data

# define a set of runtime parameters
default_months = [12, 18]

# for pyspark
# create dataset for each score date
loan_spark_dfs = []
for mth in default_months:
    loan_spark = create_data('default_flag_with_rp', 'corridor_loan_id', 
                             runtime_params= {'default_month_period': mth})
    loan_spark = loan_spark.withColumnRenamed('default_flag_with_rp', f'default_within_{mth}mth')
    loan_spark_dfs.append(loan_spark)


starting_table = loan_spark_dfs[0]
for spark_df in loan_spark_dfs[1:]:
    starting_table = starting_table.join(spark_df, on='corridor_loan_id')
    
starting_table.show(2)

# # for python
# # create dataset for each score date
# loan_python_dfs = []
# for mth in default_months:
#     loan_python = create_data('default_flag_with_rp', 'corridor_loan_id', runtime_params= {'default_month_period': mth}).toPandas()
#     loan_python = loan_spark.rename(columns={'default_flag_with_rp': f'default_within_{mth}mth'})
#     loan_python_dfs.append(loan_python)


# starting_table = loan_python_dfs[0]
# for python_df in loan_python_dfs[1:]:
#     starting_table = starting_table.merge(python_df, on='corridor_loan_id')
    
# starting_table.head(2)
from corridor import create_data

# define a set of runtime parameters
default_months = [12, 18]

# for pyspark
# create dataset for each score date
loan_spark_dfs = []
for mth in default_months:
    loan_spark = create_data('default_flag_with_rp', 'corridor_loan_id', 
                             runtime_params= {'default_month_period': mth})
    loan_spark = loan_spark.withColumnRenamed('default_flag_with_rp', f'default_within_{mth}mth')
    loan_spark_dfs.append(loan_spark)


starting_table = loan_spark_dfs[0]
for spark_df in loan_spark_dfs[1:]:
    starting_table = starting_table.join(spark_df, on='corridor_loan_id')
    
starting_table.show(2)

# # for python
# # create dataset for each score date
# loan_python_dfs = []
# for mth in default_months:
#     loan_python = create_data('default_flag_with_rp', 'corridor_loan_id', runtime_params= {'default_month_period': mth}).toPandas()
#     loan_python = loan_spark.rename(columns={'default_flag_with_rp': f'default_within_{mth}mth'})
#     loan_python_dfs.append(loan_python)


# starting_table = loan_python_dfs[0]
# for python_df in loan_python_dfs[1:]:
#     starting_table = starting_table.merge(python_df, on='corridor_loan_id')
    
# starting_table.head(2)

+----------------+--------------------+--------------------+
|corridor_loan_id|default_within_12mth|default_within_18mth|
+----------------+--------------------+--------------------+
|     46123043808|                 0.0|                 0.0|
|     45064090452|                 0.0|                 0.0|
+----------------+--------------------+--------------------+
only showing top 2 rows

Create Dataset From Bureau Attributes¶

In [6]:

Copied!

from corridor import DataElement,Feature
from corridor import DataElement,Feature

In [7]:

Copied!





# get the list of DataElements, Features that on Application level
application_lvl_des = [de_alias for de_alias in DataElement.elements if DataElement(de_alias).group == 'Credit Bureau Data' and DataElement(de_alias).platform_entity == 'Application']
application_lvl_fts = [ft_alias for ft_alias in Feature.elements if Feature(ft_alias).group == 'Credit Bureau' and Feature(ft_alias).platform_entity == 'Application']
application_lvl_bureau_attr = application_lvl_des + application_lvl_fts
application_lvl_bureau_attr
# get the list of DataElements, Features that on Application level
application_lvl_des = [de_alias for de_alias in DataElement.elements if DataElement(de_alias).group == 'Credit Bureau Data' and DataElement(de_alias).platform_entity == 'Application']
application_lvl_fts = [ft_alias for ft_alias in Feature.elements if Feature(ft_alias).group == 'Credit Bureau' and Feature(ft_alias).platform_entity == 'Application']
application_lvl_bureau_attr = application_lvl_des + application_lvl_fts
application_lvl_bureau_attr

Out[7]:

['fico_range_high',
 'earliest_credit_line',
 'num_open_accts',
 'inq_last_6m',
 'revolving_utilization',
 'age_of_credit_file',
 'annual_income_weights_app',
 'feature_test_for_ppv',
 'inquiry_intensity',
 'income_gv',
 'age_of_credit_file_v1',
 'feat_4227_test1']

In [8]:

Copied!





# create dataset using application data on the platform with a subset of bureau data
from corridor import create_data
application_lvl_data = create_data(application_lvl_bureau_attr[:5])
application_lvl_data.limit(2).toPandas()
# create dataset using application data on the platform with a subset of bureau data
from corridor import create_data
application_lvl_data = create_data(application_lvl_bureau_attr[:5])
application_lvl_data.limit(2).toPandas()

Out[8]:

	fico_range_high	earliest_credit_line	num_open_accts	inq_last_6m	revolving_utilization
0	674.0	1996-08-01 07:00:00	7.0	1.0	89.4
1	699.0	1999-10-01 07:00:00	10.0	1.0	54.6

Create Dataset Across Different Platform Entity¶

In [9]:

Copied!





# DataElemnts to get on Application platform entity
des_application = ['requested_loan_amount', 'application_date', 'num_open_accts', 'corridor_application_id']

# Features to get on Loan platform entity
# note default_within_18m is aggregate feature
fts_loan = ['loan_default_in_18m', 'corridor_loan_id']
# DataElemnts to get on Application platform entity
des_application = ['requested_loan_amount', 'application_date', 'num_open_accts', 'corridor_application_id']

# Features to get on Loan platform entity
# note default_within_18m is aggregate feature
fts_loan = ['loan_default_in_18m', 'corridor_loan_id']

In [10]:

Copied!





# create dataset for application level features
app_data = create_data(des_application)

# create dataset for loan level features
loan_data = create_data(fts_loan).limit(100)
 
# get the linking app_to_loan table
from corridor import DataTable
app_to_loan = DataTable('application_to_loan').to_spark()
# create dataset for application level features
app_data = create_data(des_application)

# create dataset for loan level features
loan_data = create_data(fts_loan).limit(100)
 
# get the linking app_to_loan table
from corridor import DataTable
app_to_loan = DataTable('application_to_loan').to_spark()

In [11]:

Copied!

app_data.show(2)
app_data.show(2)

+---------------------+-------------------+--------------+-----------------------+
|requested_loan_amount|   application_date|num_open_accts|corridor_application_id|
+---------------------+-------------------+--------------+-----------------------+
|              14650.0|2015-11-24 07:00:00|           7.0|            20000018440|
|              11000.0|2018-07-18 07:00:00|          10.0|           225770004638|
+---------------------+-------------------+--------------+-----------------------+
only showing top 2 rows

In [12]:

Copied!

loan_data.show(2)
loan_data.show(2)

+-------------------+----------------+
|loan_default_in_18m|corridor_loan_id|
+-------------------+----------------+
|                0.0|     46123043808|
|                0.0|     44038065001|
+-------------------+----------------+
only showing top 2 rows

In [13]:

Copied!

app_to_loan.show(2)
app_to_loan.show(2)

+-----------------------+----------------+
|corridor_application_id|corridor_loan_id|
+-----------------------+----------------+
|            20000018440|     48043066575|
|           225770004638|     45115043587|
+-----------------------+----------------+
only showing top 2 rows

In [14]:

Copied!





# pyspark version
# join the loan_data, app_data and app_to_loan table
final_data = loan_data.join(app_to_loan, on='corridor_loan_id').join(app_data, on='corridor_application_id')
final_data.limit(2).toPandas()

# # python version
# loan_data_pd = loan_data.toPandas()
# app_to_loan_pd = app_to_loan.toPandas()
# app_data_pd = app_data.toPandas()

# # join loan_data, app_data and app_to_loan table
# final_data_pd = loan_data_pd.merge(app_to_loan_pd, on='corridor_loan_id').merge(app_data_pd, on='corridor_application_id')
# final_data_pd.head(2)
# pyspark version
# join the loan_data, app_data and app_to_loan table
final_data = loan_data.join(app_to_loan, on='corridor_loan_id').join(app_data, on='corridor_application_id')
final_data.limit(2).toPandas()

# # python version
# loan_data_pd = loan_data.toPandas()
# app_to_loan_pd = app_to_loan.toPandas()
# app_data_pd = app_data.toPandas()

# # join loan_data, app_data and app_to_loan table
# final_data_pd = loan_data_pd.merge(app_to_loan_pd, on='corridor_loan_id').merge(app_data_pd, on='corridor_application_id')
# final_data_pd.head(2)

Out[14]:

	corridor_application_id	corridor_loan_id	loan_default_in_18m	requested_loan_amount	application_date	num_open_accts
0	217179956781	45064090452	0.0	21000.0	2013-05-19 07:00:00	11.0
1	234359969058	47167031085	0.0	20000.0	2017-09-16 07:00:00	6.0

In [ ]: