Accessing Raw Data
Raw data can be accessed through DataTables registered on the platform or directly from the Data Lake. In this notebook, we illustrate both approaches under the following sections:
- Accessing Registered DataTables
- Accessing External Data
Accessing Registered DataTables
In [1]:
# Import Corridor Package Objects
from corridor import DataTable
# Import Other Packages
import pandas as pd
In [2]:
# Initiate a DataTable object using its alias
# This gives you a `loan` DataTable object backed by the registered table
loan = DataTable('loan')
In [3]:
# Convert the table to a Spark DataFrame
df_loan = loan.to_spark()
# Display a sample of the data in the dataframe
df_loan.limit(10).select(['int_rate','loan_amnt']).show()
+--------+---------+
|int_rate|loan_amnt|
+--------+---------+
|  0.1699|  15650.0|
|  0.1797|  12000.0|
|  0.1205|  35000.0|
|  0.0532|  20000.0|
|  0.0531|  40000.0|
|  0.1714|   3000.0|
|  0.0726|   6500.0|
|  0.0697|  11200.0|
|  0.1602|  21000.0|
|  0.0917|  20000.0|
+--------+---------+
Accessing External Data
In [4]:
# Import and initialize Spark
import findspark; findspark.init()
import pyspark
spark = pyspark.sql.SparkSession.builder.getOrCreate()
# Read in the Application Level Dataset
application_table = spark.read.parquet('s3a://corridor.dev/master/sampleAppData.parquet')
# Get the first 1000 records
application_table = application_table.limit(1000)
# Take a look at the first 5 rows of the application_table dataframe by running the line below:
application_table.limit(5).toPandas()
Out[4]:
| | corridor_application_id | acc_now_delinq | open_acc_6m | acc_open_past_24mths | addr_state | zip_code | annual_inc | corridor_application_date | application_type | simulated_age | ... | total_bal_ex_mort | total_bc_limit | tot_coll_amt | tot_cur_bal | tot_hi_cred_lim | total_bal_il | total_il_high_credit_limit | total_rev_hi_lim | all_util | __index_level_0__ |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 20000018440 | 0 | 0.0 | 2.0 | AZ | 852xx | 75000.0 | 2015-11-24 07:00:00 | Individual | 41 | ... | 100088.0 | 32000.0 | 73.0 | 444844.0 | 476098.0 | 67536.0 | 91314.0 | 36400.0 | 89.0 | 2122946 |
| 1 | 225770004638 | 0 | 2.0 | 8.0 | OH | 452xx | 125000.0 | 2018-07-18 07:00:00 | Individual | 36 | ... | 28524.0 | 20300.0 | 1216.0 | 166257.0 | 199832.0 | 13993.0 | 21625.0 | 26600.0 | 59.0 | 1810773 |
| 2 | 20000077834 | 0 | NaN | 3.0 | MA | 018xx | 113536.0 | 2015-10-23 07:00:00 | Individual | 45 | ... | 275761.0 | 36600.0 | 0.0 | 496816.0 | 559526.0 | NaN | 253426.0 | 51600.0 | NaN | 2182340 |
| 3 | 20000267750 | 0 | NaN | 7.0 | MD | 216xx | 140000.0 | 2015-05-22 07:00:00 | Individual | 37 | ... | 53771.0 | 100700.0 | 0.0 | 323232.0 | 439972.0 | NaN | 8831.0 | 136300.0 | NaN | 2372256 |
| 4 | 225769878100 | 0 | 1.0 | 8.0 | NC | 282xx | 130000.0 | 2018-04-25 07:00:00 | Individual | 47 | ... | 65131.0 | 56900.0 | 0.0 | 168988.0 | 247844.0 | 17643.0 | 26658.0 | 104300.0 | 50.0 | 1684235 |
5 rows × 89 columns