
SKA news


    #11
    Re: Relevance

    Perhaps Fiddle.

    It would be a shame for AtW to optimise his code now for speed, add in the other billion sites and then find the results are not what he expects.

    Comment


      #12
      Re: Relevance

      You can guess what I am going to say...

      Comment


        #13
        Re: Relevance

        > It would be a shame for AtW to optimise his code now for speed, add in the other billion sites and then find the results are not what he expects.

        Certainly it would be worth his while taking a look at the examples you have given and checking out why the results returned were what they were.

        My guess is that he has no pages indexed where the search phrase you entered appears as a phrase, so it has taken each word separately and used the rankings based on the first word.

        If you are searching for non-existent pages then Google can give some pretty stupid results - try finding one of threaded's companies, for instance. Or maybe the page which shows 50% of UK drivers have an unspent conviction from a speed camera.

        *edit*

        Swindon College ?? - Looks reasonable to me..
        majestic12.kicks-ass.org:8888/search.jhh?q=Swindon+College

        I tried another hoping to be able to say I was disappointed by the results but even that looks vaguely close to requirements...

        majestic12.kicks-ass.org:8888/search.jhh?q=snatch+shot

        Comment


          #14
          Re: Relevance

          Searching for The Times gives 3 results. This site, and two of Atw’s sites.

          edit...
          And it clearly has indexed The Times newspaper site, because if you type thetimes (without the space) you get the Times Online site halfway down the list (after a reference on AtW's site).

          Comment


            #15
            Re: Relevance

            I was under the impression that, with the software AtW had on the machines of people like Mordac, indexing was supposed to take place quickly.

            Have you stopped indexing or would you like a few more of us to give you some processing power?

            Comment


              #16
              Re: Relevance

              The Times looks like a bug for sure - it's ignoring the word "the" (majestic12.kicks-ass.org:8888/tools/wordsearch.jhh?q=the%20times) and has 437k documents containing "times", but as you say it returns only 3, which are irrelevant. Searching just for times is better, but even then URLs with "times" first rank no higher than blah-di-blah-times.
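
For anyone wondering how a query like "the times" ends up as just "times": a minimal sketch of stop-word filtering, assuming that is roughly what is happening here. The word list and the function name `query_terms` are made up for illustration and are not taken from AtW's code.

```python
# Hypothetical stop-word list; the real engine's list is unknown.
STOP_WORDS = {"the", "a", "an", "of", "and"}


def query_terms(query, stop_words=STOP_WORDS):
    """Lower-case and split a query, dropping stop words.

    If every word is a stop word, the original words are kept so
    the query never becomes empty.
    """
    words = query.lower().split()
    kept = [w for w in words if w not in stop_words]
    return kept or words


print(query_terms("the times"))  # the stop word "the" is dropped
print(query_terms("thetimes"))   # a single token, so nothing is dropped
```

This only shows why "the" can vanish from the query; it does not explain why just 3 of the 437k "times" documents come back.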

              Comment


                #17
                relevance

                Relevance is the key -- no doubt about it. But Fiddle is spot on stating a simple fact -- my search engine right now simply has limited data indexed, so it gives the best matches from what's available in it.

                Specifically, current events are unlikely to be present because I index data starting from the earliest crawled (Oct-Dec): we started from the top sites and then went deeper, first breadth then depth.

                Speed is an important issue -- nobody is going to use a search engine that takes many seconds to return possibly crappy results that require changing the query. I have code that can index and merge all 600 mln pages we have (it will take a month to index), but the current searching and relevance algorithms are not scalable to that level yet.

                Note -- you may see more of "my sites" than necessary because some tw@t face w@nkers thought it would be funny to have a domain name resolving to 127.0.0.1. Because I had my own dev version of the site on localhost, the fecking crawler crawled MY data for their domain >:

                I have fixed the crawler now, but some historical data is affected, so I will need to weed that crap out.
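
One way a crawler can guard against domains that resolve to 127.0.0.1 is to check the resolved address before fetching. This is a sketch only, not the actual fix; the function names `is_crawlable_address` and `safe_to_crawl` are hypothetical, and refusing private and link-local ranges as well as loopback is an extra assumption.

```python
import ipaddress
import socket


def is_crawlable_address(ip_text):
    """Return False for loopback, private, link-local or unspecified addresses."""
    ip = ipaddress.ip_address(ip_text)
    return not (ip.is_loopback or ip.is_private
                or ip.is_link_local or ip.is_unspecified)


def safe_to_crawl(hostname, resolve=socket.gethostbyname):
    """Resolve a hostname and refuse it if it points back at local machines.

    `resolve` is injectable so the check can be exercised without DNS.
    """
    try:
        ip_text = resolve(hostname)
    except OSError:
        # Resolution failed (socket.gaierror is an OSError): skip the host.
        return False
    return is_crawlable_address(ip_text)


# A domain deliberately pointed at localhost is rejected.
print(safe_to_crawl("prank.example", resolve=lambda host: "127.0.0.1"))
```

Pages already stored under such a domain would still need to be cleaned out of the historical data separately.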

                I have only been working on the actual search engine for a month: just 4 weeks ago it had 1 mln pages, and now it's 11 mln. I am particularly pleased that I took care of the very tricky part of getting anchor text associated with pages that may not even have been crawled yet.
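
The anchor text idea can be sketched by keying link text on the target URL rather than on the page it was found on, so a page picks up descriptions before it is ever fetched. This is an illustration under that assumption; the `AnchorIndex` class and its methods are hypothetical and say nothing about the real index format.

```python
from collections import defaultdict


class AnchorIndex:
    """Stores anchor text against the URL a link points to."""

    def __init__(self):
        self.anchors = defaultdict(list)  # target URL -> anchor texts seen
        self.crawled = set()              # URLs actually fetched

    def add_page(self, url, links):
        """Record a crawled page and its outgoing (target_url, text) links."""
        self.crawled.add(url)
        for target, text in links:
            self.anchors[target].append(text)

    def anchor_text(self, url):
        """Anchor texts pointing at url, whether or not it has been crawled."""
        return list(self.anchors.get(url, []))

    def is_crawled(self, url):
        return url in self.crawled


index = AnchorIndex()
index.add_page("http://a.example/", [("http://b.example/", "The Times")])
print(index.anchor_text("http://b.example/"))  # text known for an uncrawled page
print(index.is_crawled("http://b.example/"))
```

With this layout an uncrawled URL can still be matched on what other pages call it.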

                I hope to have 100 mln pages in the search engine in 2 weeks. At that point the indexing formats should be fixed enough to start indexing the whole data corpus, which I expect to have incorporated into the search engine by the end of summer (going for a 2-week break in early August).

                --

                Searching for thetimes gives TheTimesOnline as a link in #2 place. The site itself was not indexed because theTimes tw@ts only allow Google to index their site >:

                I could have had the link on top if I had tweaked the scoring assigned -- it will be possible for users to tweak it themselves shortly.

                Comment


                  #18
                  bug

                  > contractor uk fiddleabout

                  BobHope2 -- thanks, you found a bug: it actually died with an exception. I am going to fix it today, thanks for finding it!

                  Comment


                    #19
                    code

                    > It would be a shame for AtW to optimise his code now for
                    > speed, add in the other billion sites and then find the results
                    > are not what he expects.

                    I agree in principle -- that's why I am slowly scaling the code up, starting from 1 mln, then 2 mln, now 11 mln, and I will be doubling that regularly. The reason is to make sure the code scales, particularly relevance-wise.

                    But speed can't be ignored - it has to be fast (it pisses me off when a search engine is slow, so I can expect the same from others), and fast means less load on search -- more capacity.

                    Relevance is certainly important. I have done only basic work in this area because other, more pressing issues had to be solved first, mainly getting more URLs into the database: with too few URLs you won't find the relevant items you expect because they might not be in there in the first place.

                    Comment


                      #20
                      Re: bug

                      AtW,

                      when you have burnt through the few grand you saved as a permy to build the SKA...

                      which wheel will you reinvent next time ?

                      Milan.

                      Comment
