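Thursday, January 15, 2009
Chinese Checkers and Experiments
I think the most fun part of this trip was playing Chinese checkers together. I probably had not played this game in 20 years, and I even had to ask Aunt 素倩 about some of the rules. Playing gave me a lot of new insights. One key to winning is right there in the name: jump. Others: get all your pieces out of the starting area quickly; the difficulties differ between two-player and three-player games; and above all, stay focused... keep yourself leaping forward... when your planned route is blocked, quickly find an alternative route... Is my research path not just the same...
The little girl played checkers from all sorts of different angles... so adorable...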
---
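And today, my thesis experiment chalked up yet another tally mark in the failure column...
I so, so want a successful experimental result before the Lunar New Year...
These brick walls are not obstacles; they are only there to show me how badly I want this diploma...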
---
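Today I discussed 永德's data problem with him and found that I have gotten faster at reading and thinking through data...
I only hope that my speed at devising an experiment that succeeds will break through soon...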
Tuesday, January 13, 2009
ROC Curve
Receiver Operating Characteristic (ROC) Curve
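This technique was originally developed to improve the ability of military radar to distinguish friend from foe.
Lusted (1971) described the ROC curve as a plot of points in which the X axis and Y axis represent false-positive and true-positive diagnoses, respectively. [Lusted LB: Signal detectability and medical decision-making. Science 1971;171:1217-9]
Simpson and Fitter (1973) used the area under the ROC curve as an index of a diagnostic tool's discriminating ability. To find out whether one diagnostic tool is better than another, one only needs to compare their areas under the ROC curve. [Simpson AJ, Fitter MJ: What is the best index of detectability? Psychol Bull 1973;80:481-8]
A clinically usable diagnostic tool has an ROC curve that bows toward the upper left, and the farther it departs from the 45-degree diagonal the better (Murphy and Berwick, 1987).
The 45-degree diagonal (Figure 2) is called the "line of no information." This line represents a diagnostic tool whose results give the physician no useful information for judging whether a patient is ill; in other words, running the test is as useful as tossing a coin (a fair one, with heads and tails equally likely) to decide ill or not ill. For this reason, the early criterion for judging whether a diagnostic tool was usable was how far its ROC curve departed from the 45-degree diagonal. [Murphy JM, Berwick DM, Weinstein MC: Performance of screening and diagnostic tests. Arch Gen Psychiatry 1987;44:550-555]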
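To make the idea of the curve and the diagonal concrete, here is a minimal Python sketch, not taken from any of the cited papers. It assumes each case has a numeric test score (higher means more likely ill) and a label of 1 for ill or 0 for not ill; the function names `roc_points` and `auc_trapezoid` are my own. It sweeps the cutoff over the scores to get the points of the curve, then approximates the area under it with the trapezoid rule.

```python
import random

def roc_points(scores, labels):
    """Return the (false-positive rate, true-positive rate) points of the ROC curve.

    labels: 1 = ill, 0 = not ill. A higher score is assumed to mean "more likely ill".
    """
    pos = sum(labels)
    neg = len(labels) - pos
    # Rank cases from highest to lowest score; lowering the cutoff
    # calls more and more cases positive.
    ranked = sorted(zip(scores, labels), key=lambda pair: -pair[0])
    points = [(0.0, 0.0)]
    tp = fp = 0
    i = 0
    while i < len(ranked):
        cutoff = ranked[i][0]
        # Cases that share the same score cross the cutoff together.
        while i < len(ranked) and ranked[i][0] == cutoff:
            if ranked[i][1] == 1:
                tp += 1
            else:
                fp += 1
            i += 1
        points.append((fp / neg, tp / pos))
    return points

def auc_trapezoid(points):
    """Area under the curve through the given points, by the trapezoid rule."""
    area = 0.0
    for (x0, y0), (x1, y1) in zip(points, points[1:]):
        area += (x1 - x0) * (y0 + y1) / 2.0
    return area

# A "coin toss" test: scores unrelated to the labels (made-up data).
rng = random.Random(0)
coin_scores = [rng.random() for _ in range(2000)]
coin_labels = [rng.randint(0, 1) for _ in range(2000)]
print(auc_trapezoid(roc_points(coin_scores, coin_labels)))  # close to 0.5
```

A test that separates the two groups perfectly gives an area of 1.0, while the coin-toss test stays near the diagonal with an area of about 0.5.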
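The area under the ROC curve represents the probability that the diagnostic tool guesses correctly; more precisely, it is the probability that the tool ranks a randomly chosen ill subject above a randomly chosen healthy one. The larger this probability, the better the diagnostic tool. [Hanley JA, McNeil BJ: The meaning and use of the area under a receiver operating characteristic (ROC) curve. Radiology, 1982;143:29-36.][Centor: Signal detectability: the use of ROC curves and their analyses. Med Decis Making 1991; 11:102-106.] In clinical practice, new testing techniques keep appearing, and if a new technique reports its result as a number, a "normal value" range has to be set as the basis for interpretation by medical staff. The ROC curve is exactly the tool many researchers use to decide this "normal range." Understanding the principle of the ROC curve therefore helps clinical staff understand and interpret all kinds of test data (盧誌明 and 藍守仁, 1997, http://www.geocities.com/shinyuanclub/update97/lucm0115.html).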
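The two ideas above, the area as a probability of guessing right and the curve as a way to pick a cutoff, can be sketched in a few lines of Python. This is only an illustration under my own assumptions: the test values are made up, the function names are hypothetical, and the cutoff rule used here (maximizing sensitivity + specificity - 1, known as Youden's index) is one common choice that the sources above do not prescribe.

```python
def auc_pairwise(ill, healthy):
    """Fraction of (ill, healthy) pairs in which the ill subject has the higher value.

    Ties count as half a correct guess.
    """
    wins = 0.0
    for d in ill:
        for h in healthy:
            if d > h:
                wins += 1.0
            elif d == h:
                wins += 0.5
    return wins / (len(ill) * len(healthy))

def best_cutoff(ill, healthy):
    """Return (cutoff, J, sensitivity, specificity) for the cutoff with the largest J.

    A value >= cutoff is read as "abnormal". J = sensitivity + specificity - 1
    (Youden's index) is an assumed criterion, not one taken from the cited papers.
    """
    best = None
    for cutoff in sorted(set(ill) | set(healthy)):
        sensitivity = sum(v >= cutoff for v in ill) / len(ill)
        specificity = sum(v < cutoff for v in healthy) / len(healthy)
        j = sensitivity + specificity - 1.0
        if best is None or j > best[1]:
            best = (cutoff, j, sensitivity, specificity)
    return best

# Made-up test values for five ill and five healthy subjects.
ill_values = [6.1, 7.4, 5.6, 8.0, 4.9]
healthy_values = [3.8, 4.4, 5.0, 4.1, 5.2]

print(auc_pairwise(ill_values, healthy_values))    # 23 of 25 pairs are ranked correctly
print(best_cutoff(ill_values, healthy_values)[0])  # the cutoff with the largest J
```

With these numbers, 23 of the 25 pairs are ordered correctly (an area of 0.92), and a cutoff of 5.6 gives the largest index, so values below 5.6 would be read as the "normal range" in this toy example.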
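Wednesday, January 7, 2009
[Assignment] Data Mining Seminar: End-of-Term Reflection Report
This semester's classes covered three main topics: distributed algorithms, rough sets, and Petri nets.
Growing alongside the professor and my classmates, I found that every sentence I heard in discussion could give me boundless inspiration. Through our classmates' presentations, the professor could expound on or clarify the essential points. With a few pointers from the professor, many things that had puzzled me gradually fell into place, and I slowly developed a capacity for self-directed learning and learned what to look at when I run into a problem. Learning is a kind of self-help, and it is self-help learned under the professor's guidance: exploring a problem, an unknown field, together with the professor.
Once a research goal is chosen, reading, understanding the data, and cleaning the data take up most of the time, and the professor constantly reminded us not to lose the ability to think. Why does the Fu Bell at National Taiwan University ring 21 times? It turns out that Professor Fu Ssu-nien (傅斯年), when he was president of the university, once said: "A person has only 21 hours in a day; the other 3 hours are for thinking." Think about what you have seen, and jump to a higher level to think.
The English word "search" means to look for, and "research" means to look again, over and over. If a paper is not rigorous, revising it later takes a lot of time, from beginning to end, which is exhausting. Every time I have an idea I want to write it up well, so why have I still not produced a successful piece of work? I keep asking myself this question. Do I think too much? In too much detail? Is the layout unclear? Are the experimental results not good enough? Are there problems I cannot overcome? Do I only think and never act? I believe there are many reasons, and each stage brings different ones. At the current stage I am stuck on two main causes: (1) my time is not concentrated, similar to the question the professor raised of 12 concentrated days versus 1 day a month; (2) life is finite but knowledge is infinite: I need to know where to start and also where to let go, rather than chasing things endlessly. Once the cause of the illness is known it should be easier to treat, and I will treat myself well.
When I run into a problem, I always try to find the answer myself first, and I only go to the professor when I really cannot figure it out, because I do not like going to the professor the moment a question comes up. Although I took longer detours in some places, I believe I have accumulated a lot of experience and gained a lot along the way. I am not afraid of being slow, only of standing still (and I will try to run a bit faster). I still have many blind spots and shortcomings in research and will need more of the professor's guidance. Thank you, professor.
Tuesday, January 6, 2009
[Assignment] Ten English Sentences [Template]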
Decision tree learning is one of the most widely used and practical methods for inductive inference.
sentence (2).
To construct a decision tree, we have to select appropriate attributes as the tree nodes.
sentence (3).
Many methods are available for attribute selection, such as entropy-based methods [3,4], Bayesian networks [5], Gini index methods [6,7], etc.
sentence (4.1).
Rough set theory, proposed by the Polish mathematician Pawlak in 1982, is a new mathematical tool for dealing with vagueness and uncertainty [9].
sentence (4.2).
Pfeifer and Carraway (2000) proposed Markov Chain Models for modeling customer relationships.
sentence (4.3)
Colombo and Jiang (1999) developed a stochastic Recency Frequency Monetary model to rank customers in terms of their expected contribution.
sentence (4.4)
A more detailed discussion of the process of rough set theory can be found in Walczak and Massart (1999).
sentence (5.1)
However, the proposed approach also has its limitations. It does well only in exact classification, where objects are strictly classified according to equivalence classes; hence the induced classifiers lack the ability to tolerate possible noise in real-world data sets.
sentence (5.2)
However, the long-term value is not suitable for an industry with stiff competition and rapid changes in the market environment. In particular, it is not easy to evaluate the LTV of customers in the wireless communication industry, which is very sensitive to the external environment and to customer defections. Hence, this study focuses on the short-term value of customers in the wireless communication industry.
sentence (6.1)
Customer value is classified into three categories: current value, potential value, and customer loyalty.
sentence (6.2)
Customers are segmented according to three types of customer value.
sentence (6.3)
Verhoef and Donkers (2001) used two dimensions, current value and potential value, to segment the customers of an insurance company.
sentence (6.4)
We use three dimensions, current value, potential value, and customer loyalty, to consider the customer definition in this study.
sentence(7.1)
We describe three popular models used in building credit scoring models. The first model is logistic regression, which is mostly used for classification problems in the area of statistics. The second model is ANN, which is known for its excellent ability to learn non-linear relationships in a system. The third model is rough sets, which is a kind of induction-based algorithm and has been widely used in classification problems since the 1990s.
sentence(8.1)
Two numerical examples will be employed here to compare the error rate with that of other credit scoring models, including ANN, decision trees, rough sets, and logistic regression.
sentence(8.2)
In this section, GP is compared to MLP, classification and regression tree (CART), C4.5, rough sets, and logistic regression (LR) using two real-world data sets.
The first data set includes Australian credit scoring data with 307 examples of creditworthy customers and 383 examples of credit-unworthy customers. It contains 14 attributes, of which six are continuous and eight are categorical.