Re: [syndication] Re: RSS feed filtered by keywords?



* Bill Kearney (ml_yahoo@ideaspace.net) [030204 14:58]:
> > 2) Authors (almost physically) placing items in a designated context -
> > in other words the author assigns his item to a category.   If the
> > reader finds the context, they can find the items they want.  I think
> > this is the case that you are thinking about and are forgetting about
> > case (1) above.  This is where the author specifies categories in their
> > own feed - the author is then in total control of the context.
> 
> But if you call the category "work" what does that mean to me?  I have a
> category called "clients", what would that mean to you?  Without larger maps
> between your ideas of categories and topics it's as worthless as calling it "ham
> sandwich".  Even using terms like "new, important, old" and the like are
> semantically worthless without context.  WordNet is one possible out.  You make
> categories, I make categories and we both go pick some WordNet URLs to further
> define what we /mean/ within a given category or keyword.  THAT starts to make
> it much more informative.  This is not just simple keyword filtering and yet a
> keyword can be the 'easy to apply' metadata.  It's the background stuff that
> makes it truly start to /mean/ something.

I've made this comment elsewhere (and it's not intended to be
incendiary, though it probably takes a very liberal reading to hear
me as jovial and constructive):

If we ever reach a point in time where an automated system can ascribe
"meaning" accurately to portions of an information base then I believe
it is almost certain that there will be annotations in that information
base.  A useful system for navigating such a corpus will almost
certainly make use of such "semantic" annotations.  They will be part
of a Smart system.

We are, however, despite the efforts of some of the best minds of the
past century, nowhere near realizing that system.

In the current vacuum, annotation of information is still useful, and we
can already make use of such annotation to aid with manual
classification (and automated classification based on manual
techniques), as well as keyword-based indexing and searching.  Google's
technologies, however, show that the benefits of annotation information
in the described vacuum are primarily benefits in degree (as in
efficiency) rather than in kind.

I.e., there are benefits to annotating information with metadata, but do
not be fooled into thinking that even a complete and systematic
annotation of a corpus gets us any closer to extracting "meaning" from
information in an automated manner.  

Metadata will be necessary but is nothing like sufficient.

See also below.

> Think about it.  Most people are in the stone age with their understanding of
> what it takes to effectively search for something.  They think just fobbing it
> off onto Google is all the effort they need to apply.  That is until Google
> tanks or becomes little more than a worthless pile of everything without
> relevance.  I've used /real/ searching before on things like Lexis.  There's two
> polar opposites, for the most part.  Google gives you almost no control and
> Lexis demands too much.  There's a somewhere in the middle that users want.
> 
> It's my suggestion that since weblogs are, by and large, driven by individual
> expression that efforts to utilize that aspect will be of more value than most
> existing synthetic search tools.  The fact that google and what not /can/ be
> used to search through this doesn't mean they should.
> 
> But without locally applied metadata using stuff like Google is the crude club
> we'd be stuck using.

I think I agree.  The Google box as accepted by the mass market (simple
search please, thanks) is constrained by what it can infer from what the
average user will plug in.  The Lexis search is powerful but expects an
expert user.  The right software, however, should be able to do more for
the novice and, ideally, get the results expected by the professional.
I believe this is possible but it's a number of years down the road.


To my mind there is a hierarchy of scalability (actually it's probably
just a partial order) with respect to needle-finding technologies.  The
parts of it I see (in rough order of scalability -- note that some
strategies work better concurrently) look like this:

 - personally handling every haystack that you may want to search and
   trying to remember where in the haystack the needles were

 - manually "categorizing" haystacks and needles within -- possibly by 
   massive group effort (the Amish Model?); annotating haystacks and
   needles with relevant attributes

 - automatically categorizing haystacks and their needles, using
   criteria set by people; annotating haystacks and needles with
   relevant attributes

 - extracting keyword indices for needles automatically from haystacks;
   using annotations where available

 - using relationships between needles and haystacks to help with
   categorization and indexing of haystacks and needles (essentially
   the Google Model; a rough sketch follows this list)

 - using "features" suitable for automatic classification (e.g., by a
   support vector machine or a boosting (e.g., via AdaBoost) classifier)

 - deriving "features" from "context" relevant to both the haystacks
   (and needles) and the user of a haystack searching engine
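
As a rough illustration of the "Google Model" rung above, here is a toy
sketch in Python of link-based scoring (a stripped-down, PageRank-style
power iteration).  The link graph is invented for illustration; the
real thing is, of course, vastly more involved.

  # Toy link-based scoring over a made-up link graph (not real data).
  links = {                      # page -> pages it links to
      "a": ["b", "c"],
      "b": ["c"],
      "c": ["a"],
      "d": ["c"],
  }
  damping = 0.85
  rank = {page: 1.0 / len(links) for page in links}
  for _ in range(50):            # power iteration until roughly stable
      new = {page: (1 - damping) / len(links) for page in links}
      for page, outlinks in links.items():
          share = damping * rank[page] / len(outlinks)
          for target in outlinks:
              new[target] += share
      rank = new
  print(sorted(rank.items(), key=lambda kv: -kv[1]))   # best-linked first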


The first four make the farmer the slave to the haystack.  The fifth is
known to scale well under sufficient VC and well-handled PR.  The latter
two methods help not only on the haystack preparation side of our search
but also on the needle-finding side.  Note that annotation is only a
difference in degree, not in kind:  we know how to construct efficient
text indexes; adding keywords simply helps with weighting certain
keywords over others.
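
To make that concrete, here is a toy inverted index in Python where
author-supplied keyword annotations go into the same index as the body
text, just with a heavier weight.  The documents, annotations, and the
weighting constant are all invented for illustration.

  from collections import defaultdict

  # doc id -> (body text, author-supplied keyword annotations)
  docs = {
      1: ("notes on rss feed filtering by keyword", ["rss", "metadata"]),
      2: ("lunch with a client over a ham sandwich", ["clients"]),
  }

  index = defaultdict(dict)           # term -> {doc id: weight}
  for doc_id, (body, keywords) in docs.items():
      for term in body.split():
          index[term][doc_id] = index[term].get(doc_id, 0) + 1
      for term in keywords:           # annotations: same index, more weight
          index[term][doc_id] = index[term].get(doc_id, 0) + 5

  def search(query):
      scores = defaultdict(int)
      for term in query.split():
          for doc_id, weight in index.get(term, {}).items():
              scores[doc_id] += weight
      return sorted(scores, key=scores.get, reverse=True)

  print(search("rss keyword"))        # -> [1]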

The methods are in the literature.  Look at the work done by the MIT
Media Lab on wearable and context-aware computing -- inferring "context"
breaks the Google searchbox limitation on crafting good searches
(shouldn't you be able to return better results if you know someone is
on their lunch break, or just finished reading an email containing a
curry recipe?), but requires highly organized pre-processed haystacks to
work -- and requires smarts on the desktop of the needle-searcher to be
effective (and to maintain privacy).  Look at the literature on text
classification, particularly support vector machines, and particularly
"boosting" systems.

It is quite possible to have the system figure out how to organize the
haystack *and* figure out which specific needles are relevant to the
searcher.  The biggest technical issues [0] have to do with flexible
feature extraction, i.e., which textual and context features are
important to the most context-aware search systems.  We may find that
there is no common utile subset of extractable features, in which case
for a time our personal needle-finding systems will take an active
role in haystack pre-processing, until such time as a 3rd-party market
in custom feature extraction develops. [1]
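
As a rough sketch of what such flexible feature extraction might look
like -- plain word features plus "context" features from the
searcher's side -- consider the following.  The particular context
attributes (hour of day, topic of the last email read) are
hypothetical examples, not a proposal:

  # Mix textual features with (hypothetical) user-context features.
  def extract_features(item_text, user_context):
      features = {}
      for word in item_text.lower().split():
          key = "word:" + word
          features[key] = features.get(key, 0) + 1
      features["ctx:hour"] = user_context.get("hour", -1)
      topic = user_context.get("last_topic", "none")
      features["ctx:last_topic:" + topic] = 1
      return features

  print(extract_features("curry recipe for lunch",
                         {"hour": 12, "last_topic": "curry"}))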


[0] perhaps "biggest interesting technical issues" is appropriate -- the
    bulk of the traffic on this list (for example) deals with technical
    issues in such simple domains as RSS, 90+% of which should be
    resolvable through reference to a couple of Kb of documentation.
[1] I can guarantee there will be demand, due to the convertible
    value of access to quality information; it only remains to be seen
    whether the technologies at hand are amenable to generalization (my
    guess is "no", hence a market)

Rick
-- 
 http://www.rickbradley.com    MUPRN: 337
                       |  least_ connections to
   random email haiku  |  a server just Depends on
                       |  your environment.