|posted 7/16/2007 03:44|
|I am doing some "research" on language processing and need massive amounts of simple English text; the more the better. I was wondering how I might find such a huge source of text that can be (at least somewhat) easily accessed by machine.|
|posted 7/16/2007 05:42|
|Try public domain books: some 20,000 of them can be obtained in text format, and they are good in that the texts can be freely used, even without crediting the author. What is bad is that these texts are mostly old, and thus don't reflect today's knowledge well.|
| Project Gutenberg|
|posted 8/22/2007 01:38|
|Excuse me, colleagues, please. I'm afraid I have to interject some views here. |
To give you a specific -- and therefore useful -- answer to aid in your "research", you probably need to go to any of the Project Gutenberg related sites to acquire the texts you need. Do a Google search. There you will find virtually every book ever written that is now in the public domain, in e-text form, meaning free of any copyright restrictions. For your use, you may need to credit your source. See the so-called Project Gutenberg fine print, attached to the heading of each Project Gutenberg text. Contact the site hosts and you might be able to acquire many or all of these files already loaded on CDs or DVDs as zip files, which might save you a great deal of time otherwise spent downloading each of them individually.
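Incidentally, if you do bulk-download the e-texts, you will want to strip that fine-print before feeding the files to your program. Here is a minimal Python sketch; it assumes the e-text uses the common "*** START/END OF ... PROJECT GUTENBERG EBOOK ***" marker lines (older files vary in wording, so treat the pattern as a heuristic, not a guarantee):

```python
import re

# Marker lines vary across e-texts; this pattern covers the common
# "*** START/END OF THE/THIS PROJECT GUTENBERG EBOOK ... ***" form.
_START = re.compile(r"\*\*\*\s*START OF TH(?:E|IS) PROJECT GUTENBERG EBOOK[^\n]*", re.IGNORECASE)
_END = re.compile(r"\*\*\*\s*END OF TH(?:E|IS) PROJECT GUTENBERG EBOOK[^\n]*", re.IGNORECASE)

def strip_gutenberg_boilerplate(text):
    """Return only the body between the start/end markers.

    If a marker is missing (some older files use different wording),
    that side of the text is left untouched rather than guessed at.
    """
    start = _START.search(text)
    end = _END.search(text)
    lo = start.end() if start else 0
    hi = end.start() if end else len(text)
    return text[lo:hi].strip()
```

Run it over each downloaded file once and cache the results; the legal header alone is a few kilobytes per book, which adds up over 20,000 texts.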
Your statement "But what is bad is that these texts are mostly old, and thus doesn't well reflect the today's knowledge" strikes me as sadly typical of much of the current attitude toward the enormously valuable heritage that is the sum total of history and literature. Maybe you presume that WITASH is mainly interested in recent treatments of mostly technical subjects, and maybe he is. But if you are implying -- as I'm afraid you are -- that those books in public domain -- like Shakespeare, Chaucer, Milton, Dante, H.G. Wells, Poe, Melville, etc. etc. etc. -- are "mostly old" and do not "well reflect... today's knowledge", I am very sorry to report to you that, of ALL books published to date, those in public domain are, in fact, not only the solid foundation but all the important structural supports of our crumbling culture. These precious works, though "old" compared to magazines on your local newsstand, not only "reflect... today's knowledge", they are the main portion of it, the majority of it, the cream of it, and the purest concentrated core-essence of it. If I have misjudged you, I sincerely apologize. But I shudder to think what short-sighted inheritors of our legacies will one day be the only ones who survive to archive the ruined edifice of our collective wisdom.
|posted 11/14/2007 04:11|
Download Wikipedia... that's 3+ GB of text for you right there :D
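(A practical note on that: the Wikipedia dump is one enormous XML file, so you can't just load it into memory. A rough Python sketch of a streaming read, assuming the MediaWiki XML export format with its per-revision `<text>` elements; the function name is mine, and stripping the wiki markup itself is a separate job left out here:)

```python
import xml.etree.ElementTree as ET

def iter_article_text(xml_file):
    """Yield the raw wiki-markup text of each page in a MediaWiki XML export.

    iterparse streams the file, and clearing each element after use keeps
    memory flat even on a multi-gigabyte dump.
    """
    for event, elem in ET.iterparse(xml_file, events=("end",)):
        # Real dumps namespace their tags, e.g. '{http://...}text',
        # so compare only the local part of the tag name.
        if elem.tag.rsplit("}", 1)[-1] == "text":
            yield elem.text or ""
            elem.clear()
```

Each yielded string is still wiki markup (templates, links, tables), so expect a second cleanup pass before it looks like plain English.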
|posted 2/11/2008 23:02|
I feel like these answers are a little flippant... I'm working on developing an algorithm to build grammar rules, and it requires my program (ALEX--Artificial Learning by Erudition eXperiment) to begin with simple texts, just as a child would. Sources like Wikipedia proved too difficult as initial material. Once the structure of simple sentences was better understood, I could move on to more complicated sentence structures.
Witash wrote @ 7/16/2007 3:44:00 AM:
I am doing some "research" on language processing and need massive amounts of simple English text. the more the better. I was wondering how i might find such a huge source of text that can be (at least somewhat) easily accessed by machine.
This site has children's stories online: http://www.magickeys.com/books/
This website links to resources for a variety of sentence complexity levels: http://www.ucalgary.ca/~dKBrown/stories.html
I had found others, but I can't recall them off the top of my head. I don't know what to say about using these stories as far as copyright goes. I just stripped them for grammar. :) But once I get to the point where ALEX is learning facts, I'll probably look at it more closely.
Good luck with your 'research'!
|posted 10/22/2008 15:05|
|If you are looking to emulate how children learn language, what you might need is something like what children are actually exposed to.|
One example which I have seen used on connectionist models of language learning is the Bernstein-Ratner corpus.
You might want to check out some of Morten H. Christiansen's work, and of course Elissa Newport's work on limiting search space through maturational constraints. You should be able to find both these scholars' work in your local academic library. Christiansen is currently an associate professor at Cornell, and Newport teaches at Rochester, NY, if I remember correctly.
Not sure if it's any help, but it might be :)
| Bernstein-Ratner corpus|