Searching the internet for strangeness

"What would happen if...?" has always been a staple of Science Fiction. What do you wonder about?

Moderator: Bmat

interestingdave
Just Registered
Posts: 1
Joined: Mon Jun 27, 2011 11:24 am

Searching the internet for strangeness

Post by interestingdave »

Not sure if this is the right forum for this post …

I was wondering how to write a program to search the internet for strangeness. I thought about copying the archive of a broadsheet newspaper, such as the London Times, going through all the stories and recording all the nouns and verbs (except pronouns and very common words), then correlating the words within each story to find the pairs of words that are least likely to appear in the same story. That would give me a benchmark for searching for weirdness.
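
A rough sketch of that benchmark idea in Python, in case it helps make it concrete. It assumes each story has already been reduced to a set of content words, and it scores word pairs with smoothed pointwise mutual information, which is one possible reading of "least likely to appear in the same story" rather than the only one. The function names (build_benchmark, pair_surprise, strangeness) are made up for illustration.

```python
import math
from collections import Counter
from itertools import combinations

def build_benchmark(stories):
    """Count, over a corpus, how many stories contain each word and each word pair.

    stories: iterable of word collections, one per story (assumed pre-filtered).
    """
    word_df = Counter()  # number of stories containing each word
    pair_df = Counter()  # number of stories containing both words of a pair
    n = 0
    for words in stories:
        n += 1
        words = set(words)
        word_df.update(words)
        # Sort so each pair is always stored in the same order.
        pair_df.update(combinations(sorted(words), 2))
    return word_df, pair_df, n

def pair_surprise(a, b, word_df, pair_df, n):
    """Negative pointwise mutual information with add-one smoothing.

    Higher values mean the pair shows up together less often than the
    individual word frequencies would suggest.
    """
    a, b = sorted((a, b))
    p_a = (word_df[a] + 1) / (n + 1)
    p_b = (word_df[b] + 1) / (n + 1)
    p_ab = (pair_df[(a, b)] + 1) / (n + 1)
    return -math.log(p_ab / (p_a * p_b))

def strangeness(words, word_df, pair_df, n):
    """Average surprise over all word pairs in one new story."""
    pairs = list(combinations(sorted(set(words)), 2))
    if not pairs:
        return 0.0
    total = sum(pair_surprise(a, b, word_df, pair_df, n) for a, b in pairs)
    return total / len(pairs)
```

With a benchmark built from the newspaper archive, a new story whose words rarely share a page (say "bank" and "goat") would get a higher score than one with a familiar pairing (say "bank" and "loan"). Counting every pair in every story grows quickly with story length, so a real run would probably need to cap the vocabulary.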

But if anyone has a better idea ...

Bmat
Super Moderator
Posts: 5897
Joined: Tue Apr 05, 2005 5:31 pm
Location: East coast US

Re: Searching the internet for strangeness

Post by Bmat »

Welcome to Speculative Vision! I don't have any suggestions about writing a program. Maybe someone else here will.

nightlock
Site Regular
Posts: 460
Joined: Fri Sep 05, 2008 1:28 pm
Location: Netherlands

Re: Searching the internet for strangeness

Post by nightlock »

Such a program isn't easy to write. First off, you will need input from the internet. I doubt you want to sit there manually entering articles, so you will have to make the program able to read from RSS syndication feeds.
Chances are, however, that you will simply receive HTML rather than the bare article (I doubt sites have an HTML-free option for downloading their articles), so you would have to run a parser that cleans the articles of extraneous HTML, which is fiddly in itself. And all of this is just to get bare-bones text.
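
For what it's worth, here is a minimal sketch of those two steps using only the Python standard library. It assumes the feed is plain RSS 2.0 with <item>, <title> and <link> elements, and that the feed and page have already been downloaded into strings (for example with urllib.request.urlopen). The names rss_items, html_to_text and TextExtractor are made up for illustration.

```python
import xml.etree.ElementTree as ET
from html.parser import HTMLParser

def rss_items(rss_xml):
    """Yield (title, link) for each <item> in an RSS 2.0 document string."""
    root = ET.fromstring(rss_xml)
    for item in root.iter("item"):
        yield item.findtext("title", default=""), item.findtext("link", default="")

class TextExtractor(HTMLParser):
    """Collect text content, ignoring anything inside <script> or <style>."""
    SKIP = {"script", "style"}

    def __init__(self):
        super().__init__()
        self.parts = []
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIP:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIP and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        if not self._skip_depth:
            self.parts.append(data)

def html_to_text(html):
    """Return the visible text of an HTML string with whitespace collapsed."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    # Joining with spaces keeps words in adjacent tags apart; it can also
    # split a word that is styled partway through, which is acceptable here.
    return " ".join(" ".join(parser.parts).split())
```

This strips tags but does not separate the article body from menus, adverts and comment sections on the same page, which is where most of the fiddliness lives.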

Once you have that, you will need to count words and word combinations and store the results. Identifying words can be done by simply looking for the spaces or periods that surround them. Storing results sounds easy, but it isn't. Since you can't define every word in the English language in a database up front, you will need to design the database so that arbitrary words can be counted per article, which requires several levels of normalisation. This is probably still the easiest part. I have no idea how you would go about defining word combinations, or how to filter on nouns and verbs without hacking your way through a spell checker at least partway. At that point you'll end up specifying a lot of words in advance anyway, which seems to me to run counter to your quest to find strangeness.
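
One possible way to sketch the counting and storage step, again with only the Python standard library. It assumes a single SQLite table keyed on (article, word), so no word has to be defined in advance, and it uses a tiny hand-written stop-word list as a stand-in for real noun and verb filtering, which it does not attempt. The table name word_counts and the helper names are invented for this example.

```python
import re
import sqlite3
from collections import Counter

# Lowercase runs of letters, allowing one internal apostrophe ("don't").
WORD_RE = re.compile(r"[a-z]+(?:'[a-z]+)?")

# Assumption: a toy stop-word list; a real one would be much longer.
STOPWORDS = {"the", "a", "an", "and", "of", "to", "in", "it", "is",
             "he", "she", "they"}

def tokenize(text):
    """Split text into lowercase words and drop the stop words."""
    return [w for w in WORD_RE.findall(text.lower()) if w not in STOPWORDS]

def open_db(path=":memory:"):
    """Open (or create) the database with one row per article and word."""
    db = sqlite3.connect(path)
    db.execute(
        "CREATE TABLE IF NOT EXISTS word_counts ("
        " article_id TEXT NOT NULL,"
        " word TEXT NOT NULL,"
        " n INTEGER NOT NULL,"
        " PRIMARY KEY (article_id, word))"
    )
    return db

def store_article(db, article_id, text):
    """Count the words of one article and write the counts to the database."""
    counts = Counter(tokenize(text))
    with db:  # commits on success, rolls back on error
        db.executemany(
            "INSERT OR REPLACE INTO word_counts VALUES (?, ?, ?)",
            [(article_id, word, n) for word, n in counts.items()],
        )
    return counts
```

Word combinations could then be derived with a self-join on article_id instead of being stored separately, though that query gets expensive on a large archive.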

Any programming language should do, as long as it supports internet access (which they all should), so you can read the syndication feeds and download articles.
