*************************************************************
NLPLAB No.: NLPLAB2010T003
Release Date: May 28, 2010
Corpus: Chinese Short Message Service
Abbreviation: CSMS
Version: 1.0
Copyright: Wuying Liu
Contact:
(1)email: nlplab@163.com; <Natural Language Processing Laboratory>
(2)mobile phone: 13787784974
(3)qq: 44631423
(4)web: http://nlplab.webhop.net
Data Type: Text, UTF-8 code
Language: Chinese
Application: SMS Spam Filtering, Short Text Processing
Introduction:
(1)The CSMS corpus is made up of real-world Chinese mobile messages in chronological order, collected from volunteers and manually labeled with one of two categories {spam, ham} according to the volunteers' feedback.
(2)The CSMS corpus consists of 85,870 messages: 21,099 spam messages and 64,771 ham messages.
(3)Each message includes FromPhoneNumber, ToPhoneNumber and BodyText fields; for privacy protection, the phone numbers have been replaced in a way that preserves the communication relation network.
(4)The SMS texts and category labels are stored separately: the SMS texts are stored in the directory "csms/data/", which contains 85,870 text files, and the category labels are stored in the directory "csms/full/".
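The counts in item (2) imply a noticeably imbalanced class distribution, which matters when choosing a baseline for spam filtering. The short sketch below only re-derives the shares from the figures stated above; the function name `class_shares` is hypothetical and not part of the corpus.

```python
# Counts as stated in the corpus description (item 2 of the Introduction).
TOTAL = 85_870
SPAM = 21_099
HAM = 64_771


def class_shares(spam, ham):
    """Return the (spam, ham) fractions of the combined message count."""
    total = spam + ham
    return spam / total, ham / total


# The two class counts should add up to the stated corpus size.
assert SPAM + HAM == TOTAL

spam_share, ham_share = class_shares(SPAM, HAM)
print(f"spam: {spam_share:.2%}, ham: {ham_share:.2%}")
# prints "spam: 24.57%, ham: 75.43%"
```

A classifier that always answers "ham" would therefore already be right on roughly three quarters of the messages, so plain accuracy alone is a weak measure on this corpus.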
Example:
(1)The SMS file "csms/data/csms.1" is shown below:
13910000001
13810000002
$$$$$$$$ 这八个金钱符转发给八个好朋友.你这一年就会财源滚滚.如果删除不发.那你这一年就会破财.发吧!我也是被逼的,谁叫你人缘好呢
(2)The category label file "csms/full/index" is shown below:
spam ../data/csms.1
ham ../data/csms.2
ham ../data/csms.3
...
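The two file layouts shown in the example can be read with a few lines of Python. This is a minimal sketch rather than an official loader, and it rests on assumptions drawn only from the example above: the first line of an SMS file is FromPhoneNumber, the second is ToPhoneNumber, everything after that is BodyText, and the paths in the index are relative to the directory holding the index file. The function names `parse_message`, `parse_index_line` and `load_corpus` are hypothetical.

```python
from pathlib import Path


def parse_message(text):
    """Split the text of one SMS file into (from_number, to_number, body)."""
    lines = text.splitlines()
    if len(lines) < 2:
        raise ValueError("expected at least a sender line and a recipient line")
    # Assumption: line 1 is FromPhoneNumber, line 2 is ToPhoneNumber,
    # and all remaining lines form BodyText.
    body = "\n".join(lines[2:]).strip()
    return lines[0].strip(), lines[1].strip(), body


def parse_index_line(line):
    """Split an index line such as 'spam ../data/csms.1' into (label, path)."""
    label, rel_path = line.split(maxsplit=1)
    if label not in {"spam", "ham"}:
        raise ValueError(f"unexpected label: {label!r}")
    return label, rel_path.strip()


def load_corpus(index_path):
    """Yield (label, from_number, to_number, body) for each indexed message."""
    index_path = Path(index_path)
    with index_path.open(encoding="utf-8") as index_file:
        for line in index_file:
            if not line.strip():
                continue  # skip blank lines
            label, rel_path = parse_index_line(line)
            # Assumption: index paths are relative to the index file's directory.
            message_path = index_path.parent / rel_path
            text = message_path.read_text(encoding="utf-8")
            yield (label, *parse_message(text))
```

Usage, assuming the corpus has been unpacked into the current directory:

```python
for label, sender, recipient, body in load_corpus("csms/full/index"):
    print(label, sender, recipient, body[:20])
    break
```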
*************************************************************