Skip to content

Creating Aggregates

Overview

Aggregations allow Users to summarize information at a group level (for instance User may want to calculate a variable that sums up all the payments received for a given loan account; in this case user is aggregating payments by Account ID)

Besides the common steps of creating DataElements, Features, below are the 3 main additional considerations to think about when creating Aggregations on the platform.

  • Type of Aggregation
  • Tables and columns needed
  • Definition format (python or pandas)
Types of Aggregation?

There are 2 types of Aggregations on the platform.

  • DataElement Aggregation
  • Feature Aggregation

DataElement Aggregation
DataElement Aggregation allows us to summarize information from a detailed table to the associated platform entity level.
For instance, a variable like default within last 12 month summarizes information from the account performance table detailed table to the Account platform entity level.

Feature Aggregation
Feature Aggregation allows us to summarize information from one platform entity level to another higher platform entity level. For instance, a variable like total number of accounts for customer summarizes information from the Account platform entity level to the Customer platform entity level.

Different tables available on Platform

Based on the type of aggregations, available tables, columns can differ.

DataElement Aggregation
For DataElement Aggregation, the following tables can be accessed:

  • One platform entity table at the level the Aggregation is registered
  • All the detailed tables that are associated with the platform entity level of the Aggregation

for instance, the default within last 12 month aggregation is registered at the Account platform entity level. It can access Account platform entity table and the account performance detailed table.

Feature Aggregation
For Feature Aggregation, the following tables can be accessed:

  • The platform entity table at the level the Aggregation is registered
  • All platform entity tables that at a lower level than the platform entity level of the Aggregation

for instance, the total number of accounts for customer aggregation is registered at the Customer platform entity level. It can access Customer platform entity table and all the platform entity tables that below the Customer platform entity level, namely the Prospect, Application, Account platform entity tables.

Note
One key aspect of writing Aggregates on the platform is that the groupby step is handled by the platform.
Normally in python, to create total number of accounts for customer variable, we would do:
account_table.groupby('customer_id')['account_id'].nunique().
On our platform, you can think in a way that the account_table only contains accounts for one single customer, hence no groupby is needed, we can simply do:
account_table['account_id'].nunique()

Different Aggregation Format to write deinition.

There are 2 different formats we can choose from to write the Aggregation definition.

  • pandas
  • python

Differences between the two formats:

Format of the available tables and column

  • pandas: each table is a pandas dataframe
  • python: each table is a python dictionary, each key of the dictionary represents a column in the table
# python format
account_table = {
    'account_id': [1, 2, 3] # account_id column
}

Performance

  • pandas: usually slower
  • python: generally faster, because there is no need to create a pandas dataframe

Note:

  • When defining aggregates in python format, it's not recommended to convert tables to pandas dataframe due to performance concern
Example
Example 1

Total number of accounts for customer

Choose Aggregation Type:
Feature Aggregation: Summarize information from Account platform entity level to Customer platform entity level

Select Tables, Columns needed
Inputs: Account:account_id

Define the Aggregation

  • pandas format
# account is a pandas dataframe
return account['account_id'].nunique()
  • python format
# account is a data dictionary
return len(set(account['account_id']))
Example 2

Default in the next 12 month of application

Choose Aggregation Type:
DataElement Aggregation: Summarize information from Account Performance detail table to Account level

Select Tables, Columns needed
Source Tables: account_performance, account
Source Columns: account_performance:record_date, account_performance:charge_off_amount, account:account_application_date

Define the Aggregation

  • pandas format
import datetime

# get the account records with `record_date` within 1 year of `account_application_date`
application_date = account['account_application_date'].iat[0]
account_perf_within_1_year = account_performance[
    (application_date < account_performance['record_date']) &
    (account_performance['record_date'] - application_date < datetime.timedelta(days=365))
    ]

# drop the records that doesn't have charge off amount
account_perf_with_co_amt = account_perf_within_1_year.dropna(subset=['charge_off_amount'])

if account_perf_with_co_amt.empty:
    return 0

if account_perf_with_co_amt.sum() > 0:
    return 1
else:
    return 0
  • python format
import datetime

# sum charge_off_amount for the records with `record_date` within 1 year of `account_application_date`
application_date = account['account_application_date'][0]
total_charge_off_amount = 0
for date, co_amt in zip(account_performance['record_date'], account_performance['charge_off_amount']):
    if (date >= application_date) and (date - application_date < datetime.timedelta(days = 365)) and co_amt:
        total_charge_off_amount += co_amt

# default if total_charge_off_amount > 0
if not total_charge_off_amount:
    return 0
elif total_charge_off_amount > 0:
    return 1
else:
    return 0
Example 3

Total credit card balance

Choose Aggregation Type:
DataElement Aggregation: Summarize information from Account Performance detail table to Account level

Select Tables, Columns needed
Source Tables: account_performance
Source Columns: account_performance:account_type, account_performance:balance

Define the Aggregation

  • pandas format
# drop the records that doesn't have either account_type or balance
account_performance = account_performance.dropna(subset=['account_type', 'balance'])
if account_performance.empty:
    return None

# create credit card account flag
account_performance['is_credit_card'] = account_performance['account_type'] == 'credit_card'

# return total credit card balance
return (account_performance['is_credit_card'] * account_performance['balance']).sum()
  • python format
# loop through the type, balance pair, add up balance if type is credit card
total_balance = 0
none_flag = True
for type_, bal in zip(account_performance['account_type'], account_performance['balance']):
    if None in (type_, bal):
        continue
    else:
        none_flag = False
        if type_ == 'credit_card':
            total_balance += bal

if none_flag:
    return None
else:
    return total_balance