Re: [AMBER-Developers] reflector searching

From: Mark Williamson <mjw.sdsc.edu>
Date: Thu, 05 Feb 2009 17:28:42 -0800

Scott Brozell wrote:
> Hi,
>
> Searching the reflector is becoming ever more unproductive.
> It is difficult to quickly find good hits because
> the results can contain too many bad hits, the results are
> not sorted by date, and the results can contain too few hits
> due to spelling errors etc.
> (Googggle helps on the spelling errors unless the post is mispelled;-).
>
> For example, this reflector verbatim subject title:
> Xleap - solvatebox - counterions
> does not find the Jan 30 2009 thread:
> http://archive.ambermd.org/200901/0419.html
> Maybe it is too close to the post date ?
>
> And this search
> MG2
> finds lots of junk and is not date sorted.
>
> I dont expect this type of searching to be as productive as grepping
> my reflector mail file, but adding an option to sort by date seems
> very attractive result-wise and can be done by googgle advanced search;
> can we automagically do this ?
> This could be our first step to offering better searching.

Hi Scott,

First off, I agree with what you are saying. I've been quietly noticing
for a while (~2-3 years) that google is no longer as effective as it was
in its early days of the late 1990s. This, I have found, applies to
google in general, not just its searching of the AMBER mail archive. I
sense the core reason behind this is that there are so many people out
to scam it (gain adword rank, etc) by generating tonnes of random web
pages with no meaningful content. The upshot is that the signal to noise
ratio for search hits drops.

Back to the question: Google results are, by default, sorted by
relevance. You can instruct it to show the pages it has indexed over the
last day by adding &as_qdr=d to the search string. This is essentially
filtering by date, not sorting by date. There is more information here:

http://www.mattcutts.com/blog/useful-google-feature-better-date-search/


Using some of the information within that link, one can query google to
see what it indexed from the amber archive over the past 7 days:

  http://www.google.com/search?q=site:archive.ambermd.org&as_qdr=d7

Which is not much, only one hit.
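
For anyone wanting to script such queries, here is a minimal sketch in
Python. The function name google_recent_url and its arguments are my own
invention for illustration; only the q and as_qdr parameters come from
the URL above, and I am assuming google keeps honouring as_qdr in this
form.

```python
from urllib.parse import urlencode

def google_recent_url(site, terms="", days=7):
    """Build a Google search URL limited to `site` and the last `days` days."""
    query = f"site:{site} {terms}".strip()
    # as_qdr=d<N> filters to pages indexed in the last N days; it does not
    # sort the results by date.
    params = {"q": query, "as_qdr": f"d{days}"}
    return "http://www.google.com/search?" + urlencode(params)

print(google_recent_url("archive.ambermd.org"))
print(google_recent_url("archive.ambermd.org", terms="MG2", days=30))
```

Note that urlencode percent-encodes the colon in "site:", which is
equivalent to the hand-written URL as far as the query is concerned.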

This information can be used to answer your question about the mail
being too close to the post date.

We'll refer to the email in this one hit from the past 7 days as (A)
(Compilation and usage of NAB in parallel, by Marek Mal)

  This email, according to what is in my mbox, was at:
  Date: Wed, 28 Jan 2009 23:25:24 +0100 (2:25pm PST)


and the non-indexed email to which Scott refers ([AMBER] Xleap -
solvatebox - counterions, by arnaud), which we will refer to as (B), was at:
  Fri, 30 Jan 2009 11:44:58 +0100 (2:44am PST)

Given that: (from the crontab)

  the amber archiver checks its mail box at:
        1:07am PST {mon,wed,fri,sun}
  the amber archiver builds the webpage from mbox at:
        5:48am PST {mon,wed,fri,sun}

Email (A) must have been picked up at 1:07am PST on Friday 30th and then
presented to the world at 5:48am that same day.

Email (B) must have been picked up at 1:07am PST on Sunday 1st Feb and
then presented to the world at 5:48am that same day.

Hence google must have last indexed the site at some point after 5:48am
  Friday 30th Jan (it has A) and before 5:48am Sunday 1st Feb (it does
  not have B).
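
The pickup and publication times above can be checked mechanically. This
is only a sketch: the schedule is copied from the crontab lines quoted
above, the function name next_publish is made up, all times are treated
as naive PST values, and I assume a mail arriving exactly at 1:07am is
still caught by that run.

```python
from datetime import datetime, time, timedelta

# Schedule from the crontab quoted above (all times PST).
RUN_DAYS = {0, 2, 4, 6}    # datetime weekday numbers: mon, wed, fri, sun
PICKUP = time(1, 7)        # archiver checks its mail box
PUBLISH = time(5, 48)      # archiver builds the web pages from the mbox

def next_publish(arrival_pst):
    """Return (pickup, publish) datetimes for a mail arriving at arrival_pst."""
    day = arrival_pst.date()
    while True:
        pickup = datetime.combine(day, PICKUP)
        # The first scheduled run at or after the arrival time picks the mail up.
        if day.weekday() in RUN_DAYS and pickup >= arrival_pst:
            return pickup, datetime.combine(day, PUBLISH)
        day += timedelta(days=1)

email_a = datetime(2009, 1, 28, 14, 25, 24)   # (A), Wed 2:25pm PST
email_b = datetime(2009, 1, 30, 2, 44, 58)    # (B), Fri 2:44am PST
for label, arrival in (("A", email_a), ("B", email_b)):
    pickup, publish = next_publish(arrival)
    print(label, pickup.strftime("%a %d %b %H:%M"),
          publish.strftime("%a %d %b %H:%M"))
```

Run on the two arrival times, it gives Friday 30th Jan for (A) and
Sunday 1st Feb for (B), matching the reasoning above.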


As for telling google to sort by date, I'm not sure. I remember there
being a toggle on the results page which let one switch between
"Sort by Date" and "Sort by Relevance" for the results the search had
just returned. This does not seem to be there any more and I've no idea
where it has gone. I will look into this some more.



As for improving the hits, one could focus on the quality of emails
making it to the list. I was toying with the idea of pruning "bad"
posts to the list. First off, any emails with no subject line would be
rejected and sent back to the user for attention. Next, emails with only
one word in the subject line would also be returned to the sender.
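
To make the two subject rules concrete, here is a rough sketch in
Python. Nothing like this exists on the list server; the function name
subject_verdict and the wording of the verdicts are hypothetical, and I
am assuming "one word" simply means one whitespace-separated token.

```python
def subject_verdict(subject):
    """Return 'accept' or a rejection reason for a subject line."""
    # Treat a missing header the same as an empty subject.
    words = (subject or "").split()
    if not words:
        return "reject: no subject"
    if len(words) == 1:
        return "reject: one-word subject"
    return "accept"

for subject in (None, "", "help", "Xleap - solvatebox - counterions"):
    print(repr(subject), "->", subject_verdict(subject))
```

As written, a reply such as "Re: help" counts as two words and would
get through, so a real filter would have to decide how to treat reply
and list prefixes.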

The next criterion could be a spell check; if a certain percentage of the
email contains spelling errors, it too would be sent back. However, this
one is tricky since the spell checker would flag up false positives if
someone pastes some computer-related output.
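
A crude sketch of such a check, again in Python, shows the problem. The
word list, the 30% threshold and both function names are assumptions
made up for illustration; a real check would need a proper dictionary.

```python
import re

def misspelled_fraction(body, known_words):
    """Fraction of alphabetic words in body that are not in known_words."""
    words = re.findall(r"[A-Za-z]+", body.lower())
    if not words:
        return 0.0
    unknown = sum(1 for word in words if word not in known_words)
    return unknown / len(words)

def should_bounce(body, known_words, threshold=0.3):
    """True if the share of unknown words exceeds the (hypothetical) threshold."""
    return misspelled_fraction(body, known_words) > threshold

# Tiny stand-in for a real dictionary.
known = {"the", "box", "is", "too", "small", "for", "my", "system"}
print(should_bounce("The box is too small for my system", known))
print(should_bounce("ATOM N MET CA CB", known))
```

The second call is the false positive mentioned above: pasted program
output contains hardly any dictionary words, so it would be bounced even
though nothing in it is misspelled.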

Overall, filtering emails before they make it to the list is tricky and
subjective and a consensus should be sought before implementing
anything. The Orwellian slope in this situation is very steep and
slippy, but I do think rejecting subject-less posts is fair game.



With reference to using a local search engine, I looked at this ages
ago, but nothing I found was of much use, and that lack was the
motivation for letting google do the hard work on the searching front.

I think I may have to have a look at this option again or even consider
hacking up a web interface to `grep $search_string mbox` with some nice
parsing on the output.
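
As a starting point for the parsing side, here is a minimal sketch using
Python's standard mailbox module rather than grep itself. The function
name search_mbox is hypothetical, the match is a plain case-insensitive
substring test on subject and body, multipart bodies are ignored, and I
assume every Date header carries a timezone offset so the dates can be
compared.

```python
import mailbox
from email.utils import parsedate_to_datetime

def search_mbox(path, needle):
    """Return (date, subject) pairs for matching messages, newest first."""
    needle = needle.lower()
    hits = []
    # create=False: do not create an empty mbox if the path is wrong.
    for msg in mailbox.mbox(path, create=False):
        subject = str(msg.get("Subject", ""))
        # get_payload(decode=True) returns None for multipart messages.
        payload = msg.get_payload(decode=True) or b""
        text = subject + "\n" + payload.decode("utf-8", errors="replace")
        if needle not in text.lower():
            continue
        try:
            date = parsedate_to_datetime(msg["Date"])
        except (TypeError, ValueError):
            continue  # skip messages with a missing or unparseable Date
        hits.append((date, subject))
    return sorted(hits, key=lambda hit: hit[0], reverse=True)
```

Calling search_mbox("mbox", "counterions") would then list the matching
subjects with the most recent post first, which is the date sorting
Scott asked for.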

Anyway, these are just the ramblings of a random ;)

regards,

Mark

_______________________________________________
AMBER-Developers mailing list
AMBER-Developers.ambermd.org
http://lists.ambermd.org/mailman/listinfo/amber-developers
Received on Fri Feb 06 2009 - 01:25:25 PST