Speculative Vision - science fiction and fantasy

Searching the internet for strangeness

"What would happen if...?" has always been a staple of Science Fiction. What do you wonder about?

Moderators: Bmat, Qray


Postby interestingdave » Mon Jun 27, 2011 11:34 am

Not sure if this is the right forum for this post …

I was wondering how to write a program to search the internet for strangeness. One idea: copy the archive of a broadsheet newspaper, such as the London Times, and record all the nouns and verbs in every story, leaving out pronouns and very common words. I would then correlate the words across stories and find the words that are least likely to appear in the same story. That would give me a benchmark for searching for weirdness.
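A minimal sketch of the benchmark idea, under some loud assumptions: the stop list is a tiny hypothetical placeholder, no noun/verb filtering is attempted, and "least likely to appear together" is scored with a crudely smoothed pointwise mutual information, which is one possible measure and not something the post specifies. All function names are made up for illustration.

```python
import math
import re
from collections import Counter
from itertools import combinations

# Hypothetical stop list; a real one would be far longer.
STOP_WORDS = {"the", "a", "an", "and", "of", "in", "to", "is", "it",
              "he", "she", "they", "on", "at", "was"}

def content_words(text):
    """Return the set of lowercase words in text, minus the stop list."""
    return {w for w in re.findall(r"[a-z]+", text.lower())
            if w not in STOP_WORDS}

def build_benchmark(stories):
    """Count how many stories contain each word and each word pair."""
    word_counts, pair_counts = Counter(), Counter()
    for story in stories:
        words = sorted(content_words(story))
        word_counts.update(words)
        pair_counts.update(combinations(words, 2))
    return word_counts, pair_counts, len(stories)

def strangeness(text, word_counts, pair_counts, n_stories):
    """Average surprise of the known word pairs in text.

    For each pair, compare how often the two words shared a story with
    how often they would if they were independent. Pairs that co-occur
    less than expected push the score up.
    """
    words = sorted(w for w in content_words(text) if w in word_counts)
    scores = []
    for a, b in combinations(words, 2):
        # Add-one smoothing so unseen pairs do not give log(0).
        together = (pair_counts[(a, b)] + 1) / (n_stories + 1)
        expected = (word_counts[a] / n_stories) * (word_counts[b] / n_stories)
        scores.append(-math.log(together / expected))
    return sum(scores) / len(scores) if scores else 0.0
```

With a benchmark built from a newspaper archive, a text that pairs words which rarely share a story should score higher than one that pairs words which often do. Words the archive has never seen are ignored here, which is a design choice worth revisiting, since unknown words may themselves be a sign of strangeness.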

But if anyone has a better idea ...
interestingdave
Just Registered
 
Posts: 1
Joined: Mon Jun 27, 2011 11:24 am
 

 

Re: Searching the internet for strangeness

Postby Bmat » Mon Jun 27, 2011 2:13 pm

Welcome to Speculative Vision! I don't have any suggestions about writing a program. Maybe someone else here will.
Bmat
Super Moderator
 
Posts: 5763
Joined: Tue Apr 05, 2005 5:31 pm
Location: East coast US
 

 

Re: Searching the internet for strangeness

Postby nightlock » Mon Jun 27, 2011 5:26 pm

Such a programme isn't easy to write. First off, you will need input from the internet. I doubt you want to sit there manually entering articles, so you will have to make the program able to read RSS syndication feeds.
Chances are, however, that you will simply receive HTML rather than the bare article (I doubt sites offer an HTML-free option for downloading their articles), so you will have to run a parser that cleans the extraneous HTML out of the articles, which is fiddly in itself. And that is just to get the bare text.
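The two steps above (reading a feed, then stripping the HTML) could be sketched with Python's standard library alone. This assumes a plain RSS 2.0 feed with `<item>` elements and un-namespaced `<title>`, `<link>` and `<description>` children; Atom feeds and namespaced extensions would need more work. The helper names `html_to_text`, `feed_items` and `fetch` are hypothetical.

```python
import xml.etree.ElementTree as ET
from html.parser import HTMLParser
from urllib.request import urlopen

class TextExtractor(HTMLParser):
    """Collects text content, skipping the insides of <script> and <style>."""
    SKIP = {"script", "style"}

    def __init__(self):
        super().__init__()
        self.parts = []
        self.skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIP:
            self.skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIP and self.skip_depth:
            self.skip_depth -= 1

    def handle_data(self, data):
        if not self.skip_depth:
            self.parts.append(data)

def html_to_text(html):
    """Strip tags from an HTML fragment and collapse runs of whitespace."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    return " ".join(" ".join(parser.parts).split())

def feed_items(rss_xml):
    """Yield (title, link, plain-text description) for each RSS item."""
    root = ET.fromstring(rss_xml)
    for item in root.iter("item"):
        yield (item.findtext("title", ""),
               item.findtext("link", ""),
               html_to_text(item.findtext("description", "")))

def fetch(url):
    """Download a URL and decode it as UTF-8 (an assumption about the site)."""
    with urlopen(url) as response:
        return response.read().decode("utf-8", errors="replace")
```

Typical use would be `for title, link, summary in feed_items(fetch(feed_url))`, followed by `html_to_text(fetch(link))` to get the full article. Joining text fragments with a space keeps adjacent paragraphs apart but will also split a word that has inline markup in the middle of it, so this is a rough cleaner rather than a faithful renderer.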

Once there you will have to count words and word combinations and store the results. Identifying words can be done by simply looking for the spaces or punctuation that surround them. Storing the results sounds easy, but it isn't. Since you can't define every word in the English language in a database up front, you will need to design the database so that arbitrary words can be counted per article, which requires several levels of normalisation. This is probably still the easiest part. I have no idea how you would go about defining word combinations, or how to filter on nouns and verbs without hacking your way at least partway through a spell checker. At that point you'll end up specifying a lot of words in advance anyway, which seems to me to run counter to your quest to find strangeness.
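As a rough illustration of the counting and storage step, here is one possible normalised layout using SQLite: one table of articles, one table of words that grows as new words turn up, and a link table holding the per-article counts. The table and column names are invented for this sketch, and the tokenizer is a simple regular expression, not a proper linguistic one.

```python
import re
import sqlite3
from collections import Counter

# Hypothetical schema: words are added on first sight, never predefined.
SCHEMA = """
CREATE TABLE IF NOT EXISTS articles (
    id  INTEGER PRIMARY KEY,
    url TEXT UNIQUE NOT NULL
);
CREATE TABLE IF NOT EXISTS words (
    id   INTEGER PRIMARY KEY,
    word TEXT UNIQUE NOT NULL
);
CREATE TABLE IF NOT EXISTS article_words (
    article_id  INTEGER NOT NULL REFERENCES articles(id),
    word_id     INTEGER NOT NULL REFERENCES words(id),
    occurrences INTEGER NOT NULL,
    PRIMARY KEY (article_id, word_id)
);
"""

def open_db(path=":memory:"):
    """Open (or create) the database and make sure the tables exist."""
    conn = sqlite3.connect(path)
    conn.executescript(SCHEMA)
    return conn

def tokenize(text):
    """Lowercase words, allowing one internal apostrophe (e.g. "goat's")."""
    return re.findall(r"[a-z]+(?:'[a-z]+)?", text.lower())

def store_article(conn, url, text):
    """Record an article and its per-word counts; return the article id."""
    counts = Counter(tokenize(text))
    with conn:  # one transaction per article
        article_id = conn.execute(
            "INSERT INTO articles (url) VALUES (?)", (url,)).lastrowid
        for word, n in counts.items():
            conn.execute(
                "INSERT OR IGNORE INTO words (word) VALUES (?)", (word,))
            (word_id,) = conn.execute(
                "SELECT id FROM words WHERE word = ?", (word,)).fetchone()
            conn.execute(
                "INSERT INTO article_words VALUES (?, ?, ?)",
                (article_id, word_id, n))
    return article_id
```

Word pairs can then be derived with a self-join on `article_words` instead of being stored separately, which sidesteps having to define the combinations in advance.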

Any programming language should do, as long as it supports internet access (which they all should), so you can read the syndication feeds and download the articles.
nightlock
Site Regular
 
Posts: 460
Joined: Fri Sep 05, 2008 1:28 pm
Location: Netherlands
 

