
Re: [ontologizing] Draft Taxo-Thesaurus Facets

To: ontologizing@xxxxxxxxxxxxxxxx
From: Lisa <lisadawncolvin@xxxxxxxxx>
Date: Mon, 25 Jun 2007 22:12:48 -0700 (PDT)
Message-id: <283071.47477.qm@xxxxxxxxxxxxxxxxxxxxxxxxxxx>

Hi,

Reposting to "ontologizing" for visibility into process.

Lisa

> 
> I did a quick scan of the URLs to see which of the HTML pages could be useful for
> concept-extraction consideration and which could be left out.
> 
> Some domains appear to be spam:
> 
> blogs.ihola.net
> sapo.pt 
> {carinsurance.home.sapo.pt
> hydrocone.home.sapo.pt
> ...} 
> directory.planetdns.net
> ephedra.*.* 
> {ephedra.eu.gg
> ephedra.guest.de
> ephedra.guests.de
> ephedra.jixx.de
> ephedra.us.gg
> ephedra.jix.net
> ephedra.web.gg }
> 
> Perhaps you can add some stopwords such as "foreclosure", "carinsurance", "rx", and various
> pharmaceutical names to exclude the spam URLs?
> 
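A minimal sketch of the stopword idea above, in Python. The function name `is_spam_url` and the choice to match stopwords against whole tokens of the host and path are assumptions for illustration, not part of the existing extraction process; the host lists and stopwords come from the scan above, and pharmaceutical names would need to be added by hand.

```python
import re
from urllib.parse import urlparse

# Hosts flagged as spam in the manual scan; subdomains are matched too.
SPAM_HOST_SUFFIXES = ("blogs.ihola.net", "sapo.pt", "directory.planetdns.net")
# The "ephedra.*.*" pattern: any host whose first label is "ephedra".
SPAM_HOST_PREFIXES = ("ephedra.",)
# Suggested stopwords; extend with pharmaceutical names as needed.
STOPWORDS = {"foreclosure", "carinsurance", "rx"}


def is_spam_url(url: str) -> bool:
    """Return True if the URL matches a flagged host or contains a stopword."""
    parsed = urlparse(url)
    host = (parsed.hostname or "").lower()
    if any(host == s or host.endswith("." + s) for s in SPAM_HOST_SUFFIXES):
        return True
    if host.startswith(SPAM_HOST_PREFIXES):
        return True
    # Compare whole tokens so a short stopword like "rx" does not match
    # inside unrelated longer words.
    tokens = set(re.split(r"[^a-z0-9]+", (host + parsed.path).lower()))
    return bool(tokens & STOPWORDS)
```

Matching on tokens rather than substrings is a deliberate choice here: a bare substring test for "rx" would likely produce false positives.
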
> The rest of the URLs appear to be a mix of organization pages (public and private),
> individual pages, standards committees, and information sources (Wikipedia), which could
> be useful.
> 
> For the ontolog-specific URLs, these appear to be "useful":
> 
>  . event planning documents
>    (http://ontolog.cim3.net/file/work/Ontolog-planning/Ontolog_Event_Plan_2006_20060309d.doc)
>  . media supporting the website or presentations (gif, mp3 and ppt files)
>  . pages generated by some queries that point to "interesting" content (have the "wiki.pl"
>    tag; includes things like individual WikiWord pages, projects, Conference Calls)
>  . presentation/working files (often marked with metadata/name of file)
> 
> and these appear "less useful":
>  . pages generated by some queries that point to "non-interesting" content, such as DIFF
>    pages (have the "wiki.pl" tag but also have "diff" in the URL), edit-mode pages (have
>    "action=edit" in the URL) and login pages (have "action=login" in the URL)
>  . time and date files (have "timeanddate.com" in the URL)
>  . individual e-mail pages ("mailto:someone@...")
> 
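The "useful" / "less useful" split above can be sketched as a small rule-based classifier. This is only an illustration under stated assumptions: the function name `classify_url` and the two label strings are hypothetical, and the rules are plain substring tests taken from the URL markers listed above.

```python
def classify_url(url: str) -> str:
    """Label a URL "useful" or "less useful" using the markers from the manual pass."""
    lowered = url.lower()
    # Individual e-mail links.
    if lowered.startswith("mailto:"):
        return "less useful"
    # Time and date files.
    if "timeanddate.com" in lowered:
        return "less useful"
    # Edit-mode and login pages.
    if "action=edit" in lowered or "action=login" in lowered:
        return "less useful"
    # DIFF pages: generated by wiki.pl and carrying "diff" in the URL.
    if "wiki.pl" in lowered and "diff" in lowered:
        return "less useful"
    # Everything else (documents, media, other wiki.pl pages) is kept.
    return "useful"
```

Because the "diff" rule is a substring test, a wiki page whose name happens to contain "diff" would also be dropped; a stricter version would need to know the exact query parameters that wiki.pl uses.
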
> I'm sure I missed some things. This is a first manual pass.
> 
> Please forward to <ontologizing> if you think this should go there. Thanks!
> 
> :) Lisa
>


_________________________________________________________________
Msg Archives: http://ontolog.cim3.net/forum/ontologizing/ 
Subscribe/Unsubscribe/Config: http://ontolog.cim3.net/mailman/listinfo/ontologizing/
Community Portal: http://ontolog.cim3.net/
Community Files: http://ontolog.cim3.net/file/work/OntologizingOntolog/
Community Wiki: http://ontolog.cim3.net/cgi-bin/wiki.pl?OntologizingOntolog