
Tool to inspect a website structure?



    Are there any tools which will generate a nice report of pages on a specific site/domain... i.e. finding pages which are publicly accessible but not linked from the main site?
    Originally posted by MaryPoppins
    I'd still not breastfeed a nazi
    Originally posted by vetran
    Urine is quite nourishing

    #2
    Have you tried this?
    Best Forum Advisor 2014
    Work in the public sector? You can read my FAQ here
    Click here to get 15% off your first year's IPSE membership



      #3
      Yeah, Marillionfan..but he is expensive

      Sorted.
      'CUK forum personality of 2011 - Winner - Yes really!!!!



        #4
        Originally posted by TheFaQQer View Post
        Have you tried this?
        Any particular reason you felt justified in using LMGTFY when the search phrase contains a technical term?

        Those tools are also NOT what I asked for. They seem to work by crawling recursively from the homepage, meaning they'd miss pages that aren't reachable by following links.
        Last edited by d000hg; 18 January 2012, 08:51.


          #5
          Originally posted by d000hg View Post
          Are there any tools which will generate a nice report of pages on a specific site/domain... i.e. finding pages which are publicly accessible but not linked from the main site?
          Look to see if the site has a robots.txt file. Often in there you can find references to parts of the site the owner doesn't want crawled by search engines.

          Besides clues like this, unless directory browsing is enabled with no default page, I'm not sure how you can find pages not linked from the site.
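For what it's worth, the robots.txt idea can be sketched in a few lines of Python. This is only a rough illustration: the sample file contents and the example.com address are made up, and the function names are mine rather than from any particular tool. It pulls out the Disallow paths and turns them into URLs worth looking at by hand.

```python
from urllib.parse import urljoin


def disallowed_paths(robots_txt: str) -> list[str]:
    """Return the paths named in Disallow: lines, in order, without duplicates."""
    paths = []
    for raw in robots_txt.splitlines():
        line = raw.split("#", 1)[0].strip()  # drop trailing comments
        key, sep, value = line.partition(":")
        if sep and key.strip().lower() == "disallow":
            path = value.strip()
            if path and path not in paths:  # an empty Disallow means "allow all"
                paths.append(path)
    return paths


def candidate_urls(base_url: str, robots_txt: str) -> list[str]:
    """Turn each disallowed path into a full URL to inspect manually."""
    return [urljoin(base_url, path) for path in disallowed_paths(robots_txt)]


# Hypothetical robots.txt contents, for illustration only.
SAMPLE = """\
User-agent: *
Disallow: /admin/
Disallow: /old-site/  # legacy
Allow: /
"""

for url in candidate_urls("https://example.com", SAMPLE):
    print(url)
```

In practice you would download the real file from the site's /robots.txt first. Bear in mind it only lists what the owner chose to mention, so it gives clues rather than a complete inventory.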



            #6
            If it's not your site and they've got security blocking folder/directory browsing then it doesn't appear to be a simple task.

            You could compare older versions of the site via the Wayback Machine.

            Have a search for tools that locate orphaned web pages/files as a reasonable starting point, assuming you want to identify pages that are still accessible but not via normal link navigation so using a website spidering tool won't work.
            Feist - 1234. One camera, one take, no editing. Superb. How they did it
            Feist - I Feel It All
            Feist - The Bad In Each Other (Later With Jools Holland)



              #7
              If the pages aren't public, then what is going to know that they are there?

              Search engines aren't going to find them, since they aren't anything that you can crawl through.

              If you use something that will download the entire site, then it will follow links to find the pages, so that's not going to be any use.

              If you own the site, then there are tools you can use to find the orphaned pages, but for a site which you have nothing to do with, where a directory is secured in any way, you aren't going to get anything from there.
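To make the "if you own the site" case concrete, here is a minimal sketch of what an orphan finder boils down to, assuming you already have a full list of pages on the server and a map of which page links to which. Both are hard-coded, made-up data here; a real tool would build them from the file system and by parsing the HTML. Orphans are simply the pages you cannot reach by following links from the homepage.

```python
from collections import deque


def reachable(links: dict[str, list[str]], start: str) -> set[str]:
    """Breadth-first walk of the link graph, the way a spider would crawl it."""
    seen = {start}
    queue = deque([start])
    while queue:
        page = queue.popleft()
        for target in links.get(page, []):
            if target not in seen:
                seen.add(target)
                queue.append(target)
    return seen


def orphans(all_pages: list[str], links: dict[str, list[str]], start: str) -> list[str]:
    """Pages that exist on the server but are not reachable from the start page."""
    return sorted(set(all_pages) - reachable(links, start))


# Hypothetical data: page -> pages it links to, plus the server's full file list.
LINKS = {
    "/index.html": ["/about.html", "/news.html"],
    "/news.html": ["/index.html"],
    "/secret.html": ["/index.html"],  # links out, but nothing links in
}
ALL_PAGES = ["/index.html", "/about.html", "/news.html", "/secret.html", "/old/promo.html"]

print(orphans(ALL_PAGES, LINKS, "/index.html"))
```

This also shows why a spider alone can't answer the original question: without the full file list from the server side there is nothing to subtract the crawl results from.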


                #8
                Originally posted by PAH View Post
                Have a search for tools that locate orphaned web pages/files as a reasonable starting point, assuming you want to identify pages that are still accessible but not via normal link navigation so using a website spidering tool won't work.
                Exactly

                Originally posted by TheFaQQer View Post
                If the pages aren't public, then what is going to know that they are there?
                That is the question being asked. When a new site goes up Google finds it and crawls the home-page... how does it find the home-page in the first place?

                I always thought if I put up a page mysite.com/some_random_page.html, Google would find it and index it even if my homepage doesn't link to it. Not the case?


                  #9
                  Originally posted by d000hg View Post
                  I always thought if I put up a page mysite.com/some_random_page.html, Google would find it and index it even if my homepage doesn't link to it. Not the case?

                  Nope. Google uses links to find pages. A new site needs to be linked to from another site for Google to find it, or you can manually submit a site or page to Google for adding to their index. There's a special page on Google somewhere to do that.

                  The only way a page that's not linked to may be found is if it uses dynamic URLs where there's something on the querystring to identify the page content to return, such as 'page=1'. Then it may be possible some search engines would use an incrementer to find all possible entries, but I wouldn't rely on it.
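As a rough sketch of the incrementer idea, assuming a URL where a numeric query string parameter such as 'page' selects the content (the example.com address and the function name are made up for illustration), this generates the candidate URLs by counting the parameter up:

```python
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit


def enumerate_urls(url: str, param: str, start: int, stop: int):
    """Yield copies of url with the query parameter set to start..stop inclusive."""
    parts = urlsplit(url)
    # Keep every other query parameter as it was; replace only the counter.
    others = [(k, v) for k, v in parse_qsl(parts.query, keep_blank_values=True) if k != param]
    for n in range(start, stop + 1):
        query = urlencode(others + [(param, str(n))])
        yield urlunsplit(parts._replace(query=query))


# Hypothetical URL, for illustration only.
for candidate in enumerate_urls("https://example.com/view?lang=en&page=1", "page", 1, 3):
    print(candidate)
```

It only builds the list of addresses. Whether any of them return real content is something you would have to check by requesting each one, and whether a search engine bothers to do this is, as said above, not something to rely on.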


                    #10
                    Originally posted by d000hg View Post
                    Exactly

                    That is the question being asked. When a new site goes up Google finds it and crawls the home-page... how does it find the home-page in the first place?

                    I always thought if I put up a page mysite.com/some_random_page.html, Google would find it and index it even if my homepage doesn't link to it. Not the case?
                    Google can only find it if it has been told where it is. It does this by...

                    1) Having the page submitted manually to Google, which you can do here: Overview - Submit your content
                    Make it a page with either an XML sitemap or a lot of links throughout (like a sitemap page). Google then crawls all the links. Submit a single page with no links in or out and it will take that page alone once, bugger off and never return.

                    2) Having links from other pages that Google rates (for faster and more frequent crawling), so its spider will come and visit you at some point. Paid links or relevant content links. The more relevant the link, the better Google will deem it and the more likely it is to rate higher.

                    3) Submitting to user-generated sites like DMOZ, but because it is user authenticated it can take forever.

                    Google AFAIK does not document new pages that appear out of the blue. A page has to be connected for the spiders to find it. No linkey no likey....
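For the sitemap route in point 1, here is a minimal sketch of building a sitemap file with Python's standard library. The page addresses are made up, and this covers only the bare minimum of the sitemaps.org format (a urlset containing url/loc entries), so treat it as a starting point rather than a complete generator.

```python
import xml.etree.ElementTree as ET

# Namespace defined by the sitemaps.org protocol.
SITEMAP_NS = "http://www.sitemaps.org/schemas/sitemap/0.9"


def build_sitemap(urls: list[str]) -> str:
    """Return a minimal XML sitemap listing each URL in a <url><loc> entry."""
    urlset = ET.Element("urlset", xmlns=SITEMAP_NS)
    for address in urls:
        entry = ET.SubElement(urlset, "url")
        ET.SubElement(entry, "loc").text = address
    return ET.tostring(urlset, encoding="unicode")


# Hypothetical pages, including one that nothing on the site links to.
PAGES = [
    "https://example.com/",
    "https://example.com/some_random_page.html",
]

print(build_sitemap(PAGES))
```

Listing an otherwise unlinked page in the sitemap is what gives the spider a way to learn that the page exists.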
                    Last edited by northernladuk; 18 January 2012, 13:29.