Data in transactional format

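| USER_ID | CAT | STATE_CD | CAMPAIGN_DATE | CAMPAIGN_CD | RESPONSE |
| --- | --- | --- | --- | --- | --- |
| 1001 | A | MA | 01-MAY-12 | A | N |
| 1001 | A | MA | 08-MAY-12 | A | N |
| 1001 | A | MA | 15-MAY-12 | A | Y |
| 1001 | A | MA | 22-MAY-12 | A | N |
| 1001 | A | MA | 29-MAY-12 | A | N |
| 1001 | A | MA | 06-JUN-12 | A | N |
| 1001 | B | CT | 06-JUN-12 | A | N |
| 1002 | B | CT | 01-MAY-12 | A | N |
| 1002 | B | CT | 08-MAY-12 | A | N |
| 1002 | B | CT | 15-MAY-12 | A | Y |
| 1002 | B | CT | 22-MAY-12 | A | N |
| 1002 | B | CT | 29-MAY-12 | A | Y |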

If all the independent variables are categorical, we can convert the transactional data into a more compact format by summarizing it with a SQL script similar to the following. We count the number of responses and non-responses for each unique combination of independent variable values. Continuous variables can first be transformed into categorical ones, if we want, using techniques such as binning.

*select cat, state_cd, campaign_cd,*

*sum(case when response='Y' then 1 else 0 end) num_response,*

*sum(case when response='N' then 1 else 0 end) num_no_response*

*from tbl_txn group by cat, state_cd, campaign_cd;*
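The binning step mentioned above for continuous variables can be sketched in a few lines of Python. This is only an illustration: the `AGE` variable, the cut points, the labels, and the `bin_value` helper are all assumptions made for the example and do not come from the data shown here.

```python
from bisect import bisect_right

# Hypothetical cut points and labels for an illustrative AGE variable
# (neither the variable nor the boundaries appear in the sample data).
AGE_EDGES = [25, 35, 50]
AGE_LABELS = ["<25", "25-34", "35-49", "50+"]

def bin_value(value, edges=AGE_EDGES, labels=AGE_LABELS):
    """Map a continuous value to the label of the interval it falls in."""
    # bisect_right counts how many edges are <= value; that count is the bin index.
    return labels[bisect_right(edges, value)]

ages = [19, 25, 34, 35, 49, 50, 72]
binned = [bin_value(a) for a in ages]
print(binned)
```

Once a continuous column has been replaced by its bin label, it can be included in the `group by` list like any other categorical variable.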

Data in the summary format

| CAT | STATE_CD | CAMPAIGN_CD | NUM_RESPONSE | NUM_NO_RESPONSE |
| --- | --- | --- | --- | --- |
| A | MA | A | 125 | 1025 |
| B | CT | C | 75 | 2133 |
| ... | ... | ... | ... | ... |

Summarizing the data first can greatly reduce its size and save memory when building the model. This is particularly useful if we use memory-based modeling tools such as R.
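To make the size reduction concrete, here is a minimal Python sketch of the same grouping logic as the SQL script, applied to the twelve sample rows of transactional data shown at the top. The `summarize` function name is an assumption made for the example; only the standard library is used.

```python
from collections import defaultdict

# The twelve sample rows of transactional data:
# (USER_ID, CAT, STATE_CD, CAMPAIGN_DATE, CAMPAIGN_CD, RESPONSE)
rows = [
    (1001, "A", "MA", "01-MAY-12", "A", "N"),
    (1001, "A", "MA", "08-MAY-12", "A", "N"),
    (1001, "A", "MA", "15-MAY-12", "A", "Y"),
    (1001, "A", "MA", "22-MAY-12", "A", "N"),
    (1001, "A", "MA", "29-MAY-12", "A", "N"),
    (1001, "A", "MA", "06-JUN-12", "A", "N"),
    (1001, "B", "CT", "06-JUN-12", "A", "N"),
    (1002, "B", "CT", "01-MAY-12", "A", "N"),
    (1002, "B", "CT", "08-MAY-12", "A", "N"),
    (1002, "B", "CT", "15-MAY-12", "A", "Y"),
    (1002, "B", "CT", "22-MAY-12", "A", "N"),
    (1002, "B", "CT", "29-MAY-12", "A", "Y"),
]

def summarize(rows):
    """Count responses and non-responses per (CAT, STATE_CD, CAMPAIGN_CD)."""
    counts = defaultdict(lambda: [0, 0])  # [num_response, num_no_response]
    for _user, cat, state, _date, campaign, response in rows:
        key = (cat, state, campaign)
        if response == "Y":
            counts[key][0] += 1
        elif response == "N":
            counts[key][1] += 1
    return {key: tuple(pair) for key, pair in counts.items()}

summary = summarize(rows)
for (cat, state, campaign), (num_yes, num_no) in sorted(summary.items()):
    print(cat, state, campaign, num_yes, num_no)
print(len(rows), "transactional rows ->", len(summary), "summary rows")
```

On this sample the twelve transactional rows collapse into two summary rows: `A MA A` with 1 response and 5 non-responses, and `B CT A` with 2 responses and 4 non-responses. `USER_ID` and `CAMPAIGN_DATE` are dropped because they are not used as independent variables.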

If we use R to build the logistic regression model, the script for training data in transactional format is similar to the following.

*glm(formula=RESPONSE~CAT+STATE_CD+CAMPAIGN_CD,*

*data=train.set1,family = binomial(link = "logit")) ->model1*

The R script for building a logistic regression model from the summary data is shown below.

*glm(formula= cbind(NUM_RESPONSE,NUM_NO_RESPONSE) ~CAT+STATE_CD+CAMPAIGN_CD,*

*data=train.set1,family = binomial(link = "logit")) ->model2*
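The two calls describe the same binomial model. For a group with y responses and n non-responses that share one predicted probability p, the row-level log-likelihood is y·log(p) + n·log(1-p). The grouped binomial log-likelihood adds only the term log C(y+n, y), which does not depend on the coefficients. When the summary table is built from the same rows, the coefficient estimates should therefore agree, although goodness-of-fit figures such as the deviance generally differ. The Python sketch below checks this identity numerically for the first summary row (125 responses, 1025 non-responses). The function names and the probabilities tried are illustrative assumptions, not part of the original scripts.

```python
import math

def row_level_loglik(num_yes, num_no, p):
    """Bernoulli log-likelihood summed over individual rows (one term per row)."""
    total = 0.0
    for _ in range(num_yes):
        total += math.log(p)
    for _ in range(num_no):
        total += math.log(1.0 - p)
    return total

def log_choose(n, k):
    """log of the binomial coefficient C(n, k), computed via log-gamma."""
    return math.lgamma(n + 1) - math.lgamma(k + 1) - math.lgamma(n - k + 1)

def grouped_loglik(num_yes, num_no, p):
    """Binomial log-likelihood of a single summary row."""
    n = num_yes + num_no
    return (log_choose(n, num_yes)
            + num_yes * math.log(p)
            + num_no * math.log(1.0 - p))

# Counts from the first summary row; the probabilities are arbitrary values.
num_yes, num_no = 125, 1025
for p in (0.05, 0.10, 0.30):
    gap = grouped_loglik(num_yes, num_no, p) - row_level_loglik(num_yes, num_no, p)
    # The gap is the same for every p: it is just log C(1150, 125).
    print(p, round(gap, 6))
```

Because the gap between the two log-likelihoods is constant in p, both are maximized by the same coefficients.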

