Accessing Raw Data
Raw data can be accessed through DataTables registered on the platform or directly from the Data Lake. In this notebook, we illustrate both approaches under the following sections:
- Accessing Registered DataTables
- Accessing External Data
Accessing Registered DataTables
In [1]:
# Import Corridor Package Objects
from corridor import DataTable
# Import Other Packages
import pandas as pd
In [2]:
# Initiate a DataTable object using its alias
# This gives you a `loan` DataTable object backed by the registered table
loan = DataTable('loan')
In [3]:
# Convert the table to a Spark DataFrame
df_loan = loan.to_spark()
# Display a sample of the data in the dataframe
df_loan.limit(10).select(['int_rate','loan_amnt']).show()
+--------+---------+
|int_rate|loan_amnt|
+--------+---------+
|  0.1699|  15650.0|
|  0.1797|  12000.0|
|  0.1205|  35000.0|
|  0.0532|  20000.0|
|  0.0531|  40000.0|
|  0.1714|   3000.0|
|  0.0726|   6500.0|
|  0.0697|  11200.0|
|  0.1602|  21000.0|
|  0.0917|  20000.0|
+--------+---------+
Accessing External Data
In [4]:
# Import and initialize Spark
import findspark; findspark.init()
import pyspark
spark = pyspark.sql.SparkSession.builder.getOrCreate()
# Read in the Application Level Dataset
application_table = spark.read.parquet('s3a://corridor.dev/master/sampleAppData.parquet')
# Get the first 1000 records
application_table = application_table.limit(1000)
# Take a look at the first 5 rows of the application_table dataframe by running the line below:
application_table.limit(5).toPandas()
Out[4]:
| | corridor_application_id | acc_now_delinq | open_acc_6m | acc_open_past_24mths | addr_state | zip_code | annual_inc | corridor_application_date | application_type | simulated_age | ... | total_bal_ex_mort | total_bc_limit | tot_coll_amt | tot_cur_bal | tot_hi_cred_lim | total_bal_il | total_il_high_credit_limit | total_rev_hi_lim | all_util | __index_level_0__ |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 20000018440 | 0 | 0.0 | 2.0 | AZ | 852xx | 75000.0 | 2015-11-24 07:00:00 | Individual | 41 | ... | 100088.0 | 32000.0 | 73.0 | 444844.0 | 476098.0 | 67536.0 | 91314.0 | 36400.0 | 89.0 | 2122946 |
| 1 | 225770004638 | 0 | 2.0 | 8.0 | OH | 452xx | 125000.0 | 2018-07-18 07:00:00 | Individual | 36 | ... | 28524.0 | 20300.0 | 1216.0 | 166257.0 | 199832.0 | 13993.0 | 21625.0 | 26600.0 | 59.0 | 1810773 |
| 2 | 20000077834 | 0 | NaN | 3.0 | MA | 018xx | 113536.0 | 2015-10-23 07:00:00 | Individual | 45 | ... | 275761.0 | 36600.0 | 0.0 | 496816.0 | 559526.0 | NaN | 253426.0 | 51600.0 | NaN | 2182340 |
| 3 | 20000267750 | 0 | NaN | 7.0 | MD | 216xx | 140000.0 | 2015-05-22 07:00:00 | Individual | 37 | ... | 53771.0 | 100700.0 | 0.0 | 323232.0 | 439972.0 | NaN | 8831.0 | 136300.0 | NaN | 2372256 |
| 4 | 225769878100 | 0 | 1.0 | 8.0 | NC | 282xx | 130000.0 | 2018-04-25 07:00:00 | Individual | 47 | ... | 65131.0 | 56900.0 | 0.0 | 168988.0 | 247844.0 | 17643.0 | 26658.0 | 104300.0 | 50.0 | 1684235 |
5 rows × 89 columns