Corpus: Chinese Short Message Service

Release Date: May 28, 2010

Abbreviation: CSMS

Version: 1.0

Copyright: Wuying Liu

(1)email:; <Natural Language Processing Laboratory>
(2)mobile phone: 13787784974
(3)qq: 44631423

Data Type: Text, UTF-8 encoded

Language: Chinese

Application: SMS Spam Filtering, Short Text Processing

(1)The CSMS corpus is made up of real-world Chinese mobile messages in chronological order, collected from volunteers and manually labeled with one of two categories {spam, ham} according to the volunteers' feedback.
(2)The CSMS corpus consists of 85,870 messages: 21,099 spam messages and 64,771 ham messages.
(3)Each message includes FromPhoneNumber, ToPhoneNumber and BodyText fields. For privacy protection, the phone numbers have been replaced in a way that preserves the communication relation network.
(4)The SMS texts and category labels are stored separately. The SMS texts are stored under the dir "csms/data/", which contains 85,870 text files. The category labels are stored under the dir "csms/full/".

(1)The SMS file "csms/data/csms.1" is shown below
$$$$$$$$ 这八个金钱符转发给八个好朋友.你这一年就会财源滚滚.如果删除不发.那你这一年就会破财.发吧!我也是被逼的,谁叫你人缘好呢
(2)The category label file "csms/full/index" is shown below
spam ../data/csms.1
ham ../data/csms.2
ham ../data/csms.3
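
The index format shown above (one "label path" pair per line) can be read with a few lines of Python. The following is a minimal sketch, not part of the corpus distribution: the function names parse_index and load_corpus are hypothetical, and it assumes that each path in the index is relative to the directory holding the index file, and that every message file decodes as UTF-8.

```python
from collections import Counter
from pathlib import Path


def parse_index(lines):
    """Yield (label, relative_path) pairs from lines such as 'spam ../data/csms.1'."""
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        label, rel_path = line.split(maxsplit=1)
        yield label, rel_path


def load_corpus(index_file):
    """Yield (label, text) for every message listed in the index file."""
    index_file = Path(index_file)
    with index_file.open(encoding="utf-8") as fh:
        for label, rel_path in parse_index(fh):
            # Assumption: paths are relative to the directory of the index file,
            # so "../data/csms.1" under "csms/full/" points into "csms/data/".
            msg_path = index_file.parent / rel_path
            yield label, msg_path.read_text(encoding="utf-8")


# Demonstration on the three index lines quoted above (no files needed).
sample = ["spam ../data/csms.1", "ham ../data/csms.2", "ham ../data/csms.3"]
pairs = list(parse_index(sample))
label_counts = Counter(label for label, _ in pairs)
print(label_counts)
```

With the corpus unpacked, load_corpus("csms/full/index") would iterate over all labeled messages. The sketch returns each file's content as a single string and does not attempt to split it into FromPhoneNumber, ToPhoneNumber and BodyText, since the field layout inside a message file is not specified here.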


