Big Data

Author: LJHh • March 17, 2019 • Lab Report


1. Introduction

To increase market share and diversify loan risk, banks issue credit cards in large quantities, often regardless of whether the applicant meets the requirements. At the same time, influenced by the idea of spending ahead of income, many cardholders use their credit cards heavily and ignore their ability to repay. Although this has increased bank income, it has also increased the default risk borne by lenders. In our case, excessive issuance by Taiwan's credit card issuing banks raised the credit risk of cash and bank cards and triggered a debt crisis. According to bank forecasts, the credit card default rate was expected to reach its highest value in the third quarter of 2006. Risk itself is not a bad thing; our main responsibility is to manage it. The worst situation is one in which risks are not correctly understood and are mismanaged. With the development of science and technology, effective methods and tools are now available to mine, organize and analyze massive data. The purpose of our experiment is to use a quantitative management tool and an econometric model to analyze the credit risk of loan applicants. Using data mining techniques and statistical analysis methods, we analyze individual customer data from loan applications, explore the relationship between customer characteristics and credit risk, and develop a predictive model that scores customers' future credit performance.

2. Data description

The data for the study were collected from a card-issuing bank in Taiwan in October 2005; the bank's credit card holders are the subjects of the study. The raw data contain 30,000 observations. For this study, 8,000 instances were randomly selected from the total, of which 1,743 cardholders (21.79%) were considered to be in default.
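As a small illustration of the sampling step and the default share quoted above, the following sketch draws 8,000 distinct row indices from 30,000 and recomputes the percentage from the two counts given in the text. The helper names `draw_sample` and `percent` and the seed are assumptions made for this example; the report does not state how the random selection was actually carried out.

```python
import random
from decimal import Decimal, ROUND_HALF_UP


def draw_sample(n_total, n_sample, seed=0):
    """Pick n_sample distinct row indices out of n_total, without replacement."""
    rng = random.Random(seed)  # the seed is an arbitrary, assumed value
    return rng.sample(range(n_total), n_sample)


def percent(part, whole):
    """Share of `part` in `whole` as a percentage rounded to two decimals."""
    value = Decimal(part) / Decimal(whole) * 100
    return value.quantize(Decimal("0.01"), rounding=ROUND_HALF_UP)


rows = draw_sample(30_000, 8_000)
print(len(rows), len(set(rows)))  # sample size and number of distinct indices
print(percent(1743, 8000))        # default share among the sampled cardholders
```

Decimal arithmetic is used here only so that the rounding to two decimals does not depend on binary floating-point representation.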

The study uses 23 independent variables:

[pic 1]

Figure 2.2 Data dictionary

3. Model building and evaluation

3.1 SPSS Modeler

IBM SPSS Modeler is a data mining application for building predictive models quickly and obtaining predictive information from them. In SPSS Modeler, the first step is to read in the data source and adjust the variable types. The types are revised as shown in Figure 3.1.1.

Next, a data audit is a necessary step to examine the validity of the data. Figure 3.1.2 shows that the percentage of complete values is 100 for every field, so there are no missing data.
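The completeness check that the audit reports can be sketched outside SPSS Modeler as follows. This is a minimal illustration, not the Data Audit node's implementation: the function name `completeness`, the field names `LIMIT_BAL` and `AGE`, and the toy rows are all hypothetical, and a value is treated as missing when it is `None` or an empty string.

```python
def completeness(rows, columns):
    """Return the percentage of non-missing values for each column."""
    total = len(rows)
    result = {}
    for col in columns:
        # Count a value as present unless it is None or an empty string.
        present = sum(1 for row in rows if row.get(col) not in (None, ""))
        result[col] = 100.0 * present / total if total else 0.0
    return result


# Toy rows with hypothetical field names; one AGE value is missing on purpose.
toy_rows = [
    {"LIMIT_BAL": 20000, "AGE": 24},
    {"LIMIT_BAL": 120000, "AGE": None},
    {"LIMIT_BAL": 90000, "AGE": 34},
    {"LIMIT_BAL": 50000, "AGE": 37},
]
audit = completeness(toy_rows, ["LIMIT_BAL", "AGE"])
print(audit)
```

A field with no missing values reports 100, which is the situation described for every field in this study.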

To guard against overfitting, the dataset is divided into two parts: 70% of the data are used as training data and the remaining 30% as testing data. The setting is shown in Figure 3.1.3.
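The 70/30 partition can be sketched as a shuffle followed by a cut. This is an assumed, simplified version: the function name `train_test_split`, the seed, and the exact-size cut are choices made for this example, and the partitioning in SPSS Modeler may assign records differently.

```python
import random


def train_test_split(rows, train_fraction=0.7, seed=42):
    """Shuffle the rows and cut them into a training part and a testing part."""
    rng = random.Random(seed)  # fixed seed so the split is reproducible
    shuffled = list(rows)
    rng.shuffle(shuffled)
    cut = int(round(len(shuffled) * train_fraction))
    return shuffled[:cut], shuffled[cut:]


# Row indices stand in for the 8,000 sampled cardholders.
train, test = train_test_split(range(8000))
print(len(train), len(test))
```

The model is fitted on `train` only; performance on `test` then indicates how well it generalizes to records it has not seen.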

3.2 Decision Tree, Logistic Regression and K-Nearest Neighbors

Decision Tree

Decision tree algorithms are often used in data mining as classification and prediction methods. Like a real tree, a decision tree is made up of three parts: leaf nodes, decision nodes, and branches. In business applications, the predicted probability is often more important than the classification itself. This study uses the C5.0 model, which uses an information-based measure to build the decision tree. At each node it makes a decision, generates branches, and splits the data, continuing until the data cannot be split any further. It is suitable for a nominal or ordinal target variable and for continuous independent variables.
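To make the idea of an information-based split criterion concrete, the sketch below computes entropy and plain information gain for a candidate split. It is an illustration under stated assumptions, not the C5.0 implementation: C5.0's exact criterion, pruning, and boosting are not reproduced, and the labels `"default"` and `"paid"` and the function names are invented for the example.

```python
from collections import Counter
from math import log2


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    return sum(-(c / total) * log2(c / total) for c in Counter(labels).values())


def information_gain(labels, groups):
    """Parent entropy minus the size-weighted entropy of the child groups."""
    total = len(labels)
    remainder = sum(len(g) / total * entropy(g) for g in groups)
    return entropy(labels) - remainder


# Toy node with hypothetical labels: four defaulters and four non-defaulters.
parent = ["default"] * 4 + ["paid"] * 4

# A split that separates the classes perfectly, and one that separates nothing.
perfect = [["default"] * 4, ["paid"] * 4]
useless = [["default", "default", "paid", "paid"]] * 2

print(information_gain(parent, perfect))
print(information_gain(parent, useless))
```

A tree builder of this kind evaluates each candidate split at a node and keeps the one with the highest score, then repeats the procedure on each resulting branch.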

...
