Creating datasets
Creating Datasets¶
create_data function from the corridor package can be used to create datasets by accessing registered objects on the platform
Below is a list of registered objects that are available through the Platform:
- DataElement
- Feature
- Model
- ModelTransform
- Policy
- GlobalFunction
In following examples, we will illustrate accessing objects using create_data. We illustrate following cases :
- Create Dataset Using Object Alias
- Create Dataset With Specific Versions of DE/Features
- Create Dataset With ModelTransform
- Create Dataset With Runtime Parameters
- Create Dataset with Different Runtime Parameters
- Create Dataset From Bureau Attributes
- Create Dataset Across Different Platform Entity
- Create Dataset from Simple DataElement/Feature Using External Dataset Outside the Platform (Discussed in
3.b Running Simulations) - Create Dataset from Aggregated DataElement/Feature Using External Data Outside the Platform (Discussed in
3.b Running Simulations) - Access registered globalfunctions and use it like python function (Discussed in
2.b. Accessing Registered Objects.ipynb)
Create Dataset Using Object Alias¶
In [2]:
Copied!
# Import Corridor Package Objects
from corridor import create_data
# Create dataset using DE alias (requested_loan_amount) and Feature alias (debt_capacity)
df = create_data('requested_loan_amount','debt_capacity')
df.limit(5).toPandas()
# Import Corridor Package Objects
from corridor import create_data
# Create dataset using DE alias (requested_loan_amount) and Feature alias (debt_capacity)
df = create_data('requested_loan_amount','debt_capacity')
df.limit(5).toPandas()
Out[2]:
| requested_loan_amount | debt_capacity | |
|---|---|---|
| 0 | 14650.0 | 0.195333 |
| 1 | 11000.0 | 0.088000 |
| 2 | 37000.0 | 0.325888 |
| 3 | 20000.0 | 0.142857 |
| 4 | 40000.0 | 0.307692 |
Create Dataset With Specific Versions of DE/Features¶
In [3]:
Copied!
# Import Corridor Package Objects
from corridor import create_data
from corridor import Feature
from corridor import DataElement
# Build out the DataElement/Feature with specific versions
age_of_credit_file_v2 = Feature('age_of_credit_file', version=2)
annual_income_v1 = DataElement('annual_income', version=1)
# Create dataset with that with version
df = create_data(age_of_credit_file_v2,annual_income_v1)
df.limit(5).toPandas()
# age_of_credit_file_default = Feature('age_of_credit_file')
# age_of_credit_file_v2 = Feature('age_of_credit_file', version=2)
# Import Corridor Package Objects
from corridor import create_data
from corridor import Feature
from corridor import DataElement
# Build out the DataElement/Feature with specific versions
age_of_credit_file_v2 = Feature('age_of_credit_file', version=2)
annual_income_v1 = DataElement('annual_income', version=1)
# Create dataset with that with version
df = create_data(age_of_credit_file_v2,annual_income_v1)
df.limit(5).toPandas()
# age_of_credit_file_default = Feature('age_of_credit_file')
# age_of_credit_file_v2 = Feature('age_of_credit_file', version=2)
Out[3]:
| age_of_credit_file | annual_income | |
|---|---|---|
| 0 | 231.0 | 75000.0 |
| 1 | 225.0 | 125000.0 |
| 2 | 171.0 | 113536.0 |
| 3 | 314.0 | 140000.0 |
| 4 | 430.0 | 130000.0 |
Create Dataset With ModelTranform¶
In [ ]:
Copied!
# Import Corridor Package Objects
from corridor import create_data
from corridor import DataElement
from corridor import ModelTransform
# Create data tables using create_data
de_fico = DataElement(alias='fico')
fico_transform = ModelTransform(model='PD Model with Transform',alias = 'fico_normalized')
data_table= create_data([de_fico,fico_transform])
# Display top 5 rows from created data table
data_table.show(5)
# Import Corridor Package Objects
from corridor import create_data
from corridor import DataElement
from corridor import ModelTransform
# Create data tables using create_data
de_fico = DataElement(alias='fico')
fico_transform = ModelTransform(model='PD Model with Transform',alias = 'fico_normalized')
data_table= create_data([de_fico,fico_transform])
# Display top 5 rows from created data table
data_table.show(5)
+-----+---------------+ | fico|fico_normalized| +-----+---------------+ |699.0| 0.699| |709.0| 0.709| |749.0| 0.749| |699.0| 0.699| |674.0| 0.674| +-----+---------------+ only showing top 5 rows
In case of conflicting data column names the transform names are suffiex with special keywords to reatin unique constraint on dataframe columns and help users identify columns generated using transforms
In [ ]:
Copied!
tr_fico = ModelTransform(model='PD Model with Transform',alias = 'fico_normalized')
data_table = create_data([de_fico,tr_fico])
# ModelTransform aliased fico is suffiexed with model alias 'prob_default_transform'
# since both DataElement & ModelTransform have conflicting alias
data_table.show(5)
tr_fico = ModelTransform(model='PD Model with Transform',alias = 'fico_normalized')
data_table = create_data([de_fico,tr_fico])
# ModelTransform aliased fico is suffiexed with model alias 'prob_default_transform'
# since both DataElement & ModelTransform have conflicting alias
data_table.show(5)
+-----+---------------------------+ | fico|fico_prob_default_transform| +-----+---------------------------+ |699.0| 699.0| |709.0| 709.0| |749.0| 749.0| |699.0| 699.0| |674.0| 674.0| +-----+---------------------------+ only showing top 5 rows
Create Dataset With Runtime Parameters¶
In [4]:
Copied!
# Import Corridor Package Objects
from corridor import create_data
# default_flag_with_rp requires runtime parameter `default_month_period`
loan_spark = create_data('default_flag_with_rp', runtime_params= {'default_month_period': 24})
loan_spark.show(2)
# Import Corridor Package Objects
from corridor import create_data
# default_flag_with_rp requires runtime parameter `default_month_period`
loan_spark = create_data('default_flag_with_rp', runtime_params= {'default_month_period': 24})
loan_spark.show(2)
+--------------------+ |default_flag_with_rp| +--------------------+ | 0.0| | 0.0| +--------------------+ only showing top 2 rows
Create Dataset with Different Runtime Parameters¶
In [5]:
Copied!
from corridor import create_data
# define a set of runtime parameters
default_months = [12, 18]
# for pyspark
# create dataset for each score date
loan_spark_dfs = []
for mth in default_months:
loan_spark = create_data('default_flag_with_rp', 'corridor_loan_id',
runtime_params= {'default_month_period': mth})
loan_spark = loan_spark.withColumnRenamed('default_flag_with_rp', f'default_within_{mth}mth')
loan_spark_dfs.append(loan_spark)
starting_table = loan_spark_dfs[0]
for spark_df in loan_spark_dfs[1:]:
starting_table = starting_table.join(spark_df, on='corridor_loan_id')
starting_table.show(2)
# # for python
# # create dataset for each score date
# loan_python_dfs = []
# for mth in default_months:
# loan_python = create_data('default_flag_with_rp', 'corridor_loan_id', runtime_params= {'default_month_period': mth}).toPandas()
# loan_python = loan_spark.rename(columns={'default_flag_with_rp': f'default_within_{mth}mth'})
# loan_python_dfs.append(loan_python)
# starting_table = loan_python_dfs[0]
# for python_df in loan_python_dfs[1:]:
# starting_table = starting_table.merge(python_df, on='corridor_loan_id')
# starting_table.head(2)
from corridor import create_data
# define a set of runtime parameters
default_months = [12, 18]
# for pyspark
# create dataset for each score date
loan_spark_dfs = []
for mth in default_months:
loan_spark = create_data('default_flag_with_rp', 'corridor_loan_id',
runtime_params= {'default_month_period': mth})
loan_spark = loan_spark.withColumnRenamed('default_flag_with_rp', f'default_within_{mth}mth')
loan_spark_dfs.append(loan_spark)
starting_table = loan_spark_dfs[0]
for spark_df in loan_spark_dfs[1:]:
starting_table = starting_table.join(spark_df, on='corridor_loan_id')
starting_table.show(2)
# # for python
# # create dataset for each score date
# loan_python_dfs = []
# for mth in default_months:
# loan_python = create_data('default_flag_with_rp', 'corridor_loan_id', runtime_params= {'default_month_period': mth}).toPandas()
# loan_python = loan_spark.rename(columns={'default_flag_with_rp': f'default_within_{mth}mth'})
# loan_python_dfs.append(loan_python)
# starting_table = loan_python_dfs[0]
# for python_df in loan_python_dfs[1:]:
# starting_table = starting_table.merge(python_df, on='corridor_loan_id')
# starting_table.head(2)
+----------------+--------------------+--------------------+ |corridor_loan_id|default_within_12mth|default_within_18mth| +----------------+--------------------+--------------------+ | 46123043808| 0.0| 0.0| | 45064090452| 0.0| 0.0| +----------------+--------------------+--------------------+ only showing top 2 rows
Create Dataset From Bureau Attributes¶
In [6]:
Copied!
from corridor import DataElement,Feature
from corridor import DataElement,Feature
In [7]:
Copied!
# get the list of DataElements, Features that on Application level
application_lvl_des = [de_alias for de_alias in DataElement.elements if DataElement(de_alias).group == 'Credit Bureau Data' and DataElement(de_alias).platform_entity == 'Application']
application_lvl_fts = [ft_alias for ft_alias in Feature.elements if Feature(ft_alias).group == 'Credit Bureau' and Feature(ft_alias).platform_entity == 'Application']
application_lvl_bureau_attr = application_lvl_des + application_lvl_fts
application_lvl_bureau_attr
# get the list of DataElements, Features that on Application level
application_lvl_des = [de_alias for de_alias in DataElement.elements if DataElement(de_alias).group == 'Credit Bureau Data' and DataElement(de_alias).platform_entity == 'Application']
application_lvl_fts = [ft_alias for ft_alias in Feature.elements if Feature(ft_alias).group == 'Credit Bureau' and Feature(ft_alias).platform_entity == 'Application']
application_lvl_bureau_attr = application_lvl_des + application_lvl_fts
application_lvl_bureau_attr
Out[7]:
['fico_range_high', 'earliest_credit_line', 'num_open_accts', 'inq_last_6m', 'revolving_utilization', 'age_of_credit_file', 'annual_income_weights_app', 'feature_test_for_ppv', 'inquiry_intensity', 'income_gv', 'age_of_credit_file_v1', 'feat_4227_test1']
In [8]:
Copied!
# create dataset using application data on the platform with a subset of bureau data
from corridor import create_data
application_lvl_data = create_data(application_lvl_bureau_attr[:5])
application_lvl_data.limit(2).toPandas()
# create dataset using application data on the platform with a subset of bureau data
from corridor import create_data
application_lvl_data = create_data(application_lvl_bureau_attr[:5])
application_lvl_data.limit(2).toPandas()
Out[8]:
| fico_range_high | earliest_credit_line | num_open_accts | inq_last_6m | revolving_utilization | |
|---|---|---|---|---|---|
| 0 | 674.0 | 1996-08-01 07:00:00 | 7.0 | 1.0 | 89.4 |
| 1 | 699.0 | 1999-10-01 07:00:00 | 10.0 | 1.0 | 54.6 |
Create Dataset Across Different Platform Entity¶
In [9]:
Copied!
# DataElemnts to get on Application platform entity
des_application = ['requested_loan_amount', 'application_date', 'num_open_accts', 'corridor_application_id']
# Features to get on Loan platform entity
# note default_within_18m is aggregate feature
fts_loan = ['loan_default_in_18m', 'corridor_loan_id']
# DataElemnts to get on Application platform entity
des_application = ['requested_loan_amount', 'application_date', 'num_open_accts', 'corridor_application_id']
# Features to get on Loan platform entity
# note default_within_18m is aggregate feature
fts_loan = ['loan_default_in_18m', 'corridor_loan_id']
In [10]:
Copied!
# create dataset for application level features
app_data = create_data(des_application)
# create dataset for loan level features
loan_data = create_data(fts_loan).limit(100)
# get the linking app_to_loan table
from corridor import DataTable
app_to_loan = DataTable('application_to_loan').to_spark()
# create dataset for application level features
app_data = create_data(des_application)
# create dataset for loan level features
loan_data = create_data(fts_loan).limit(100)
# get the linking app_to_loan table
from corridor import DataTable
app_to_loan = DataTable('application_to_loan').to_spark()
In [11]:
Copied!
app_data.show(2)
app_data.show(2)
+---------------------+-------------------+--------------+-----------------------+ |requested_loan_amount| application_date|num_open_accts|corridor_application_id| +---------------------+-------------------+--------------+-----------------------+ | 14650.0|2015-11-24 07:00:00| 7.0| 20000018440| | 11000.0|2018-07-18 07:00:00| 10.0| 225770004638| +---------------------+-------------------+--------------+-----------------------+ only showing top 2 rows
In [12]:
Copied!
loan_data.show(2)
loan_data.show(2)
+-------------------+----------------+ |loan_default_in_18m|corridor_loan_id| +-------------------+----------------+ | 0.0| 46123043808| | 0.0| 44038065001| +-------------------+----------------+ only showing top 2 rows
In [13]:
Copied!
app_to_loan.show(2)
app_to_loan.show(2)
+-----------------------+----------------+ |corridor_application_id|corridor_loan_id| +-----------------------+----------------+ | 20000018440| 48043066575| | 225770004638| 45115043587| +-----------------------+----------------+ only showing top 2 rows
In [14]:
Copied!
# pyspark version
# join the loan_data, app_data and app_to_loan table
final_data = loan_data.join(app_to_loan, on='corridor_loan_id').join(app_data, on='corridor_application_id')
final_data.limit(2).toPandas()
# # python version
# loan_data_pd = loan_data.toPandas()
# app_to_loan_pd = app_to_loan.toPandas()
# app_data_pd = app_data.toPandas()
# # join loan_data, app_data and app_to_loan table
# final_data_pd = loan_data_pd.merge(app_to_loan_pd, on='corridor_loan_id').merge(app_data_pd, on='corridor_application_id')
# final_data_pd.head(2)
# pyspark version
# join the loan_data, app_data and app_to_loan table
final_data = loan_data.join(app_to_loan, on='corridor_loan_id').join(app_data, on='corridor_application_id')
final_data.limit(2).toPandas()
# # python version
# loan_data_pd = loan_data.toPandas()
# app_to_loan_pd = app_to_loan.toPandas()
# app_data_pd = app_data.toPandas()
# # join loan_data, app_data and app_to_loan table
# final_data_pd = loan_data_pd.merge(app_to_loan_pd, on='corridor_loan_id').merge(app_data_pd, on='corridor_application_id')
# final_data_pd.head(2)
Out[14]:
| corridor_application_id | corridor_loan_id | loan_default_in_18m | requested_loan_amount | application_date | num_open_accts | |
|---|---|---|---|---|---|---|
| 0 | 217179956781 | 45064090452 | 0.0 | 21000.0 | 2013-05-19 07:00:00 | 11.0 |
| 1 | 234359969058 | 47167031085 | 0.0 | 20000.0 | 2017-09-16 07:00:00 | 6.0 |
In [ ]:
Copied!