Topic: Corpus: Chinese Short Message Service

posted 10/14/2010  10:43


Release Date: May 28, 2010

Corpus: Chinese Short Message Service

Abbreviation: CSMS

Version: 1.0

Copyright: Wuying Liu

(1)email:; <Natural Language Processing Laboratory>
(2)mobile phone: 13787784974
(3)qq: 44631423

Data Type: Text, UTF-8 encoded

Language: Chinese

Application: SMS Spam Filtering, Short Text Processing

(1)The CSMS corpus consists of real-world Chinese mobile messages in chronological order, collected from volunteers and manually labeled as one of two categories, {spam, ham}, according to the volunteers' feedback.
(2)The CSMS corpus contains 85,870 messages: 21,099 spam and 64,771 ham.
(3)Each message includes FromPhoneNumber, ToPhoneNumber and BodyText fields. For privacy protection, the phone numbers have been replaced with substitutes that preserve the communication relationship network.
(4)The SMS texts and category labels are stored separately. The SMS texts are stored in the directory "csms/data/", which contains 85,870 text files; the category labels are stored in the directory "csms/full/".
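
For anyone planning spam-filtering experiments, the class balance implied by the counts above is worth checking first. The following is a minimal sketch that uses only the figures stated in this post; the constant and function names are illustrative and not part of the corpus distribution.

```python
# Counts as stated in the corpus description.
SPAM_COUNT = 21_099
HAM_COUNT = 64_771
TOTAL_MESSAGES = 85_870


def class_shares(spam: int, ham: int) -> tuple[float, float]:
    """Return the (spam, ham) fractions of the whole corpus."""
    total = spam + ham
    return spam / total, ham / total


# The two class counts should add up to the stated corpus size.
assert SPAM_COUNT + HAM_COUNT == TOTAL_MESSAGES

spam_share, ham_share = class_shares(SPAM_COUNT, HAM_COUNT)
print(f"spam: {spam_share:.2%}, ham: {ham_share:.2%}")
```

Roughly a quarter of the messages are spam, so a filter that always answers "ham" would already be right about three times out of four. Accuracy alone is therefore a weak measure on this corpus.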

(1)The SMS file "csms/data/csms.1" is shown below
$$$$$$$$ 这八个金钱符转发给八个好朋友.你这一年就会财源滚滚.如果删除不发.那你这一年就会破财.发吧!我也是被逼的,谁叫你人缘好呢
(2)The category label file "csms/full/index" is shown below
spam ../data/csms.1
ham ../data/csms.2
ham ../data/csms.3
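
The index layout shown above can be read with a few lines of Python. This is only a sketch under assumptions taken from the sample: each index line is a label and a path separated by whitespace, each path is relative to the directory holding the index file, and every message file is UTF-8 text. The function names `parse_index_line` and `iter_labeled_messages` are hypothetical helpers, not something shipped with the corpus.

```python
from pathlib import Path
from typing import Iterator, Tuple

VALID_LABELS = {"spam", "ham"}


def parse_index_line(line: str) -> Tuple[str, str]:
    """Split one index line such as 'spam ../data/csms.1' into (label, path)."""
    label, rel_path = line.split(maxsplit=1)
    if label not in VALID_LABELS:
        raise ValueError(f"unexpected label: {label!r}")
    return label, rel_path.strip()


def iter_labeled_messages(index_path) -> Iterator[Tuple[str, str]]:
    """Yield (label, message_text) pairs in the order listed in the index."""
    index_path = Path(index_path)
    base_dir = index_path.parent  # assumed: paths are relative to the index file
    with index_path.open(encoding="utf-8") as index_file:
        for line in index_file:
            if not line.strip():
                continue  # skip blank lines
            label, rel_path = parse_index_line(line)
            text = (base_dir / rel_path).read_text(encoding="utf-8")
            yield label, text
```

A typical call would be `for label, text in iter_labeled_messages("csms/full/index"): ...`. Because the pairs come back in index order, the chronological sequence of the corpus is preserved as long as the index itself lists the messages chronologically.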

