Ai Forums Home    Saturday, July 30, 2016
Ai Site > Ai Forums > The Artificial Intelligence Forum > Natural Language?
Topic: Natural Language?

Daxamite
[Guest]
posted 9/17/2003  23:25
Aoccdrnig to rscheearchat by a Biritsh uinervtisy, it deosn't mttaer
in waht oredr the ltteers in a wrod are. The olny iprmoatnt tihng is
taht the frist and lsat ltteer is at the rghit pclae. The rset can be a toatl mses and you can sitll raed it wouthit a porbelm. Tihs is bcuseae we do not raed ervey lteter by itslef but the wrod as a wlohe.

Just thought I would post this to show how slippery the phrase "natural language" really is.


Geust
[Guest]
posted 9/18/2003  01:02
Tihs maens taht gruoping si mroe imoptrant tahn sqeuence.



Rob Hoogers
posted 9/18/2003  06:33
Not entirely true. The first and last must remain the same. I imagine something of the same happens if you change the word order in the sentence....

On the other hand, the first bot to be able to read this easily will be very fault-tolerant with typos, believe me.


Ted Warring
posted 9/19/2003  03:52
Is anybody interested in using this approach for a project? We put together a system that utilizes a similar tactic to how we think the brain interprets words. The system uses fuzzy logic.

If anyone is interested let me know and I will put the demo up on our website. It is just over 500k or I would post it here. The demo's weights are tweaked for the kind of sample given here; they would probably be different for real-world typo patterns, since words are not usually so thoroughly mangled. The system can even be trained up.

Here is the example input and output that I am cutting and pasting from the demo:

Input:

Aoccdrnig to rscheearch by a Biritsh uinervtisy, it
deosnt mttaer in waht oredr the ltteers in a wrod are.
The olny iprmoatnt tihng is taht the frist and lsat
ltteer is at the rghit pclae. The rset can be a toatl
mses and you can sitll raed it wouthit a porbelm.
Tihs is bcuseae we do not raed ervey lteter by itslef
but the wrod as a wlohe.

Output:

According To Research By A British University
It Doesn't Matter In What Order The Letters In A
Word Are The Only Important Thing Is That The First
And Last Letter Is At The Right Place The Rest Can
Be A Total Mess And You Can Still Read It Without A
Problem This Is Because We Do Not Read Every
Letter By Itself But The Word As A Whole


If you are interested just drop me an email.

 Artificial Ingenuity

Rob Hoogers
posted 9/19/2003  16:34
How long does it take? Might be an issue... and it still has the advantage of the first and last letter. How does it fare with longer, slightly rarer words, like pmihpaercooaa?


Ted Warring
posted 9/19/2003  18:25
 
Rob Hoogers wrote @ 9/19/2003 4:34:00 PM:
How long does it take? Might be an issue... and it still has the advantage of the first and last letter. How does it fare with longer, slightly rarer words, like pmihpaercooaa?

 
I haven't started optimizing it yet, so it is not blindingly fast with the 110,000 word dictionary I am testing it with. I expect to improve the performance by a factor of at least 5 with a couple of simple optimizations.

With a 10,000 word common English word list it works fast as it is, but recognizes less of course.

It came up with:

Input:

pmihpaercooaa

Output:

Pharmacopoeia


Ted Warring
posted 9/19/2003  23:11
Here is an attached zip with the demo program and a 47,000 word list. The word list comes with a README.txt about its copyright, which is also in the zip file.

This smaller list recognizes the above example text in just under 4 seconds on a 600 MHz machine. The bigger 110,000 word list (which can catch Rob's tough example above) takes about 8 seconds. As you can see, the algorithm's performance is roughly linear in the size of the word list.

The system has not been optimized yet, so I expect to get the example down to 2 seconds or less with the big word list.

The weights included in the example are optimized for the rather extreme case we are recognizing. I suspect that "normal" use will be spelling errors and simple typos, which would probably have more weight on the correct part of each word.

I will also post a separate zip of the big word file.


 FR_small.zip

Ted Warring
posted 9/19/2003  23:14
Here is a zip of the bigger (110,000) word list.

The demo program will load FuzWrds.txt from the directory you launch in, so you can switch between word lists by copying or renaming the list you want to the default dictionary name.

 BigFuzWrds.zip

Rob Hoogers
posted 9/20/2003  13:57
Nice one. But you see there are words that would stump even a human. Also: did you 'fix' the first and last letter, or do those still need to be given correctly?

I wouldn't mind giving you a hand optimizing it. As I said earlier, I had thought of doing something similar (and had it roughly worked out). 8 seconds seems a bit longish... What lingo did you use? Do you look up every permutation?


Sam Fentress
posted 9/22/2003  18:14
Ted, the program works very well, and I am impressed. It would be interesting if it could learn, though. I figure this could work either by training, i.e. giving it both the input and the output, or else, better, by letting it decide on a most likely output by itself, based on how many words are in the list and which words are more common. That would of course require a value for each word, which couldn't be assigned by hand - maybe a program that just does word counts on large amounts of text?

After much playing with weights I got

"naw is thhe wetnir of our disnoctent, ade glorius smmr by tihs snu of yrok"

transformed into

"NAW Is The Winter Of Our Discontent Ae Glorious SMMR By This SNU Of York."

If this tuning of weights could be done by the program itself, and it could make guesses at NAW, SMMR and SNU, and figure out that Made was more likely than Ae, this would be worth its weight in gold.

PS: Good to see many of the old crowd are still around! It's been a while for me.

PPS: Added: ooh! Getting better:

"Now Is The Winter of Our Discontent Ae Glorious Simmer By This Sou Of York"

Last edited by Sam Fentress @ 9/22/2003 6:21:00 PM
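Sam's word-count idea is straightforward to sketch. The snippet below is a hypothetical illustration (not part of Ted's demo) of counting word frequencies over a body of text so that each dictionary word could carry a usage value:

```python
import re
from collections import Counter

def word_frequencies(text):
    """Case-insensitive word counts over a body of text."""
    return Counter(re.findall(r"[a-z']+", text.lower()))

freqs = word_frequencies("Now is the winter of our discontent "
                         "made glorious summer by this sun of York")
```

Relative counts like `freqs["of"]` could then act as a prior when two candidate words score equally in the matcher.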

Ted Warring
posted 9/22/2003  20:43
 
Rob Hoogers wrote @ 9/20/2003 1:57:00 PM:
Nice one. But you see there are words that would stomp a human, even. Also: did you 'fix' the first and last letter, or is that still necessary to be given correctly?

I wouldn't mind giving you a hand optimizing it. As I said earlier, I had thought of doing something similar (and had it worked out in scratch). 8 seconds seems a bit longish... What lingo did you use? Do you look up every permutation?

 
Thanks. It doesn't need to have both first and last. The default weights in the demo are set for this sample type, meaning it expects the first and last letters to be the same, the length to be close, and most of the letters to be out of position.

As far as performance, I agree that 8 seconds is too long. There are several optimizations that I can think of immediately, so that should be cut down by a factor of 4-5.

But also keep in mind that correctly spelled words are almost instantaneous. If you feed the corrected text back through it only takes a tenth of a second or so.

This was coded in Delphi 5.0, but could just as easily be coded in C++ or C#.

No, it doesn't look at permutations at all. It does a fuzzy analysis of the most probable intended word.
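For readers curious what such a fuzzy analysis might look like, here is a hypothetical Python sketch. The weight names mirror the demo's settings (RightLength, RightLetter, WrongLetter), but the scoring rules themselves are guesses, not Ted's actual algorithm:

```python
from collections import Counter

# Default weights, taken from the values quoted later in the thread.
WEIGHTS = {"RightLength": 5, "RightLetter": 10, "WrongLetter": 10}

def score(jumbled, candidate, w=WEIGHTS):
    """Higher score = more likely the intended word."""
    j, c = jumbled.lower(), candidate.lower()
    s = w["RightLength"] if len(j) == len(c) else 0
    # Reward shared letters regardless of position (multiset overlap).
    s += w["RightLetter"] * sum((Counter(j) & Counter(c)).values())
    # Penalize letters in the jumble that the candidate lacks.
    s -= w["WrongLetter"] * sum((Counter(j) - Counter(c)).values())
    # Bonus for matching first and last letters, per the original effect.
    if j[0] == c[0]:
        s += w["RightLetter"]
    if j[-1] == c[-1]:
        s += w["RightLetter"]
    return s

def best_match(jumbled, dictionary):
    """Pick the dictionary word with the highest fuzzy score."""
    return max(dictionary, key=lambda word: score(jumbled, word))
```

With this scheme `best_match("wrod", ["word", "ward", "wood"])` picks "word", since it shares all four letters plus the first and last positions.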


Ted Warring
posted 9/22/2003  21:04
 
Sam Fentress wrote @ 9/22/2003 6:14:00 PM:

It would be interesting if it could learn, though. I figure this could work either by training, ie it is given both the input and the output, or else, better, it could decide on a most likely output by itself, based on how many words are in the list, and also which words are more common (which would of course require values to be given to each word, which couldn't be done by hand - maybe a program that just does word counts on large amounts of text?).

If this playing of weights could be done by the program, could make guesses at NAW, SMMR and SNU, and figure out that Made was more likely than Ae, this would be worth its weight in gold.



 
Glad to see you back Sam!

One training method would be to give it a list of mangled and a list of corrected words, and then use a GA to evolve the best weights.

The "real" version that I am working on has several SETS of weights, each of which produces a "confidence" level for the word. That way the set of weights that is most convinced it is correct provides the solution. Each agent runs as its own process to minimize the performance hit.

The commercial version will also have a Soundex and Bayesian probability function. The latter would assign additional weight to a candidate word based upon its probability of use. Perhaps even contextually, but that takes quite a bit more overhead.

I am considering providing this as a web service eventually.
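Soundex is a standard phonetic code, so the mention above can be made concrete. This is a minimal sketch of the classic American Soundex rules, purely illustrative, not the commercial implementation being described:

```python
def soundex(word):
    """American Soundex: first letter plus three digits coding the consonants."""
    codes = {**dict.fromkeys("bfpv", "1"), **dict.fromkeys("cgjkqsxz", "2"),
             **dict.fromkeys("dt", "3"), "l": "4",
             **dict.fromkeys("mn", "5"), "r": "6"}
    word = word.lower()
    digits = []
    prev = codes.get(word[0], "")
    for ch in word[1:]:
        code = codes.get(ch, "")
        if code and code != prev:  # skip runs of the same code
            digits.append(code)
        if ch not in "hw":         # h and w are transparent: keep the previous code
            prev = code
    return (word[0].upper() + "".join(digits) + "000")[:4]
```

Words that sound alike get the same code - soundex("Robert") and soundex("Rupert") both give R163 - which is a cheap extra signal alongside the fuzzy weights.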


Rob Hoogers
posted 9/22/2003  21:39
Hi Sam, good to see you back indeed. ;)




Sam Fentress
posted 9/24/2003  03:30
 
Ted Warring wrote @ 9/22/2003 9:04:00 PM:

One training method would be to give it a list of mangled and a list of corrected words, and then use a GA to evolve the best weights.


 
So would you be thinking of creating a large population of these programs with random weights and then reproducing from the best ones? In this situation would that be more efficient than just allowing the program to slightly adjust its weights for each new problem, iterating through the entire set until they are all as close as possible to being right?

I just ask because when using neural networks those are usually your two options (evolution or training), and was wondering if you had a specific reason for using a GA. It would seem to me that by training it, it would continue to improve during use, even after someone was using the end product. Of course, that would only work if you let it come up with the correct solution by itself, instead of having it explicitly told to it.

By the way, do commercial spell-checkers use fuzzy-logic at all?

PS Good to see you too, Rob, and cheers to Yaki and Raphael if they are reading these posts.

Last edited by Sam Fentress @ 9/24/2003 3:32:00 AM

Raphael
posted 9/24/2003  05:38
We missed ya, Sam. :-)


Rob Hoogers
posted 9/24/2003  09:37
 
Sam Fentress wrote @ 9/24/2003 3:30:00 AM:

By the way, do commercial spell-checkers use fuzzy-logic at all?


 
Difficult to get data on that. I've found some products that claim to (and probably do), others seem more reticent about their software's inner workings.

One that does:

http://www.componentsource.com/Catalog/PolarSpellChecker_509982.htm


-
[Guest]
posted 9/24/2003  10:41
Iltnsegnetiry I'm sdutynig tihs crsrootaivnel pnoheenmon at the Dptmnearet of Liuniigctss at Absytrytewh Uivsreitny and my exartrnairdoy doisiervecs waleoetderhlhy cndairotct the picsbeliud fdnngiis rrgdinaeg the rtlvaeie dfuictlify of ialtnstny ttalrisanng sentences. My rsceeerhars deplveeod a cnionevent ctnoiaptorn at hnasoa/tw.nartswdbvweos/utrtek:p./il taht dosnatterems that the hhpsteyios uuiqelny wrtaarns criieltidby if the aoussmpitn that the prreoecandpne of your wrods is not eendetxd is uueniqtolnabse. Aoilegpos for aidnoptg a cdocianorttry vwpiienot but, ttoheliacrley spkeaing, lgitehnneng the words can mnartafucue an iocnuurgons samenttet that is vlrtiauly isbpilechmoenrne.

Or, if you prefer...

Interestingly I'm studying this controversial phenomenon at the Department of Linguistics at Aberystwyth University and my extraordinary discoveries wholeheartedly contradict the publicised findings regarding the relative difficulty of instantly translating sentences. My researchers developed a convenient contraption at http://www.aardvarkbusiness.net/tool that demonstrates that the hypothesis uniquely warrants credibility if the assumption that the preponderance of your words is not extended is unquestionable. Apologies for adopting a contradictory viewpoint but, theoretically speaking, lengthening the words can manufacture an incongruous statement that is virtually incomprehensible. :)



nzilla
[Guest]
posted 9/24/2003  16:30
 
- wrote @ 9/24/2003 10:41:00 AM:
Interestingly I'm studying this controversial phenomenon at the Department of Linguistics at Aberystwyth University and my extraordinary discoveries wholeheartedly contradict the publicised findings regarding the relative difficulty of instantly translating sentences. My researchers developed a convenient contraption at http://www.aardvarkbusiness.net/tool that demonstrates that the hypothesis uniquely warrants credibility if the assumption that the preponderance of your words is not extended is unquestionable. Apologies for adopting a contradictory viewpoint but, theoretically speaking, lengthening the words can manufacture an incongruous statement that is virtually incomprehensible. :)


 
Concerning the tool mentioned in the abstract, I would say that the preponderance of a word should be a homogeneous continuum throughout the translation process; its instantaneousness is indeed relative to the cognitive processing of the translated text. Considering that relation is a step further towards genuine real-time translation.



Ted Warring
[Guest]
posted 9/24/2003  16:38
 
- wrote @ 9/24/2003 10:41:00 AM:
Or, if you prefer...

Interestingly I'm studying this controversial phenomenon at the Department of Linguistics at Aberystwyth University and my extraordinary discoveries wholeheartedly contradict the publicised findings regarding the relative difficulty of instantly translating sentences. My researchers developed a convenient contraption at http://www.aardvarkbusiness.net/tool that demonstrates that the hypothesis uniquely warrants credibility if the assumption that the preponderance of your words is not extended is unquestionable. Apologies for adopting a contradictory viewpoint but, theoretically speaking, lengthening the words can manufacture an incongruous statement that is virtually incomprehensible. :)


 
How is this for output:

interestingly I M studying this controversial phenomenon at the department of linguistics at aftermath university and my extraordinary discoveries wholeheartedly contradict the published
findings regarding the relative difficulty of instantly translating sentences my researchers developed A convenient contraption at Honda TW narrowingness utter P IL that demonstrates that the hypothesis uniquely warrants credibility if the assumption that
the preponderance of your words is not extended is unquestionable apologies for adopting A contradictory viewpoint but theoretically speaking lengthening the words can manufacture an incongruous statement that is virtually incomprehensible

It mangled the URL, but oh well. I used the small word list, but added these words to it (it is just a text file of known words):

wholeheartedly
preponderance
unquestionable
incongruous

and here are the 3 weights to change for your sample type:

RightLength=5
RightLetter=10
WrongLetter=10

I don't think that your findings contradict the validity of our approach at all.


Ted Warring
[Guest]
posted 9/24/2003  16:45
 
Sam Fentress wrote @ 9/24/2003 3:30:00 AM:
So would you be thinking of creating a large population of these programs with random weights and then reproducing from the best ones? In this situation would that be more efficient than just allowing the program to slightly adjust it's weights for each new problem, iterating through the entire set until it has them all as close as possible to being right?

I just ask because when using neural networks those
By the way, do commercial spell-checkers use fuzzy-logic at all?



 
The only thing that need mutate is the set of weights, so the code could be relatively similar to a simple string mutation GA. My thought is to refine a different set of weights for each of a handful of jumble types. The algorithm need not change.

So I would feed through an example jumble set and the fitness test would be based upon 100% accurate recognition comparing to the manually corrected set. The set of training (fitness really) words need not be large, just a couple dozen or so.

As our unknown guest's example nicely shows, minor variations in the weights allow correct interpretation of different cases, such as added extra letters. The weights set as default in the demo were specifically for the jumble class from the original example.
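The training scheme described above - mutate only the weight vector, score each candidate by accuracy on a small hand-corrected set - can be sketched as a simple elitist GA. Here `accuracy(weights, jumbled, corrected)` is a hypothetical stand-in for running the fuzzy matcher over the training words and measuring how many come out right; it is not part of the actual demo:

```python
import random

def evolve_weights(accuracy, jumbled, corrected,
                   pop_size=20, generations=50, sigma=2.0):
    """Evolve a 3-element weight vector by keeping and mutating the fittest."""
    population = [[random.uniform(0, 20) for _ in range(3)]
                  for _ in range(pop_size)]
    for _ in range(generations):
        scored = sorted(population,
                        key=lambda w: accuracy(w, jumbled, corrected),
                        reverse=True)
        elite = scored[:pop_size // 4]          # keep the top quarter
        population = list(elite)
        while len(population) < pop_size:       # refill with mutated copies
            parent = random.choice(elite)
            child = [max(0.0, g + random.gauss(0, sigma)) for g in parent]
            population.append(child)
    return max(population, key=lambda w: accuracy(w, jumbled, corrected))
```

Because the elite survive unchanged each generation, the best fitness never decreases, and a couple dozen training words would be enough to drive the search, as suggested above.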
