Tool to inspect a website structure?

**TheFaQQer** · 17 January 2012, 22:05

Have you tried this?

**northernladuk** · 17 January 2012, 22:20

Yeah, Marillionfan..but he is expensive

Sorted.

**d000hg** · 18 January 2012, 08:47

Originally posted by TheFaQQer

Have you tried this?

Any particular reason you felt justified in using LMGTFY when the search phrase contains a technical term?

Those tools are also NOT what I asked for, they seem to work by crawling recursively from the homepage... meaning they'd miss pages that aren't reachable by following links?
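To make the concern concrete, here is a rough sketch of how a link-following crawler behaves. The link graph, the page names and the `crawl` helper are all made up for illustration and are not taken from any of those tools:

```python
from collections import deque

# Hypothetical site: each page maps to the pages it links to.
LINKS = {
    "/": ["/about", "/products"],
    "/about": ["/"],
    "/products": ["/products/widget"],
    "/products/widget": [],
    "/old-promo": ["/"],  # publicly accessible, but nothing links to it
}


def crawl(start, links):
    """Breadth-first walk over links, the way a spider explores a site."""
    seen = {start}
    queue = deque([start])
    while queue:
        page = queue.popleft()
        for target in links.get(page, []):
            if target not in seen:
                seen.add(target)
                queue.append(target)
    return seen


found = crawl("/", LINKS)
missed = set(LINKS) - found  # pages that exist but were never reached
print(sorted(missed))
```

Anything with no inbound link from the crawled set, like `/old-promo` here, never enters the queue, so a crawl-based report cannot list it.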

**Joeman** · 18 January 2012, 09:34

Originally posted by d000hg

Are there any tools which will generate a nice report of pages on a specific site/domain... i.e. finding pages which are publicly accessible but not linked from the main site?

Look to see if the site has a robots.txt file. Often you can find references in there to parts of the site the owner doesn't want crawled by search engines.

Besides clues like this, unless directory browsing is enabled with no default page, I'm not sure how you can find pages not linked from the site.
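A minimal sketch of pulling those clues out of a robots.txt file, assuming you have already downloaded its text (for example from `/robots.txt` on the site). The `disallowed_paths` helper and the sample content are hypothetical:

```python
def disallowed_paths(robots_txt):
    """Return the paths named in Disallow lines of a robots.txt body."""
    paths = []
    for line in robots_txt.splitlines():
        line = line.split("#", 1)[0].strip()  # drop trailing comments
        if ":" not in line:
            continue
        field, value = line.split(":", 1)
        # An empty Disallow means "nothing is blocked", so skip it.
        if field.strip().lower() == "disallow" and value.strip():
            paths.append(value.strip())
    return paths


# Made-up robots.txt content for illustration.
SAMPLE = """\
User-agent: *
Disallow: /admin/
Disallow: /staging/  # not ready yet
Disallow:
Allow: /public/
"""

print(disallowed_paths(SAMPLE))
```

Each path it returns is only a hint about where to look; it says nothing about which pages actually exist under that path.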

**PAH** · 18 January 2012, 12:15

If it's not your site and they've got security blocking folder/directory browsing then it doesn't appear to be a simple task.

You could compare older versions of the site via the Wayback Machine.
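One way to sketch that comparison, assuming the Wayback Machine's CDX query endpoint and the parameters shown below (this is my understanding of that interface and worth checking against the archive's own documentation; the helper names are hypothetical):

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen

# Assumed endpoint for the Wayback Machine's CDX index.
CDX_ENDPOINT = "http://web.archive.org/cdx/search/cdx"


def cdx_query_url(domain):
    """Build a query asking for every captured URL under a domain."""
    params = {
        "url": domain,
        "matchType": "domain",
        "output": "json",
        "fl": "original",      # only return the original URL field
        "collapse": "urlkey",  # one row per distinct URL
    }
    return CDX_ENDPOINT + "?" + urlencode(params)


def parse_cdx_rows(body):
    """JSON output is a list of rows; the first row holds the field names."""
    rows = json.loads(body)
    return {row[0] for row in rows[1:]}


def archived_urls(domain):
    """Fetch and parse the archived URL list (needs network access)."""
    with urlopen(cdx_query_url(domain)) as resp:
        return parse_cdx_rows(resp.read().decode("utf-8"))


# Offline demonstration with a made-up response body.
sample_body = '[["original"], ["http://example.com/"], ["http://example.com/old"]]'
archived = parse_cdx_rows(sample_body)
crawled_today = {"http://example.com/"}
print(sorted(archived - crawled_today))
```

URLs that show up in the archive but not in a fresh crawl are candidates to request directly and see whether they still respond.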

Have a search for tools that locate orphaned web pages/files as a reasonable starting point, assuming you want to identify pages that are still accessible but not via normal link navigation, in which case a website spidering tool won't work.

**TheFaQQer** · 18 January 2012, 12:40

If the pages aren't public, then what is going to know that they are there?

Search engines aren't going to find them, since they aren't anything that you can crawl through.

If you use something that will download the entire site, then it will follow links to find the pages, so that's not going to be any use.

If you own the site, then there are tools you can use to find the orphaned pages, but for a site which you have nothing to do with, where a directory is secured in any way, you aren't going to get anything from there.
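For the case where you do own the site, the idea behind such tools can be sketched as a set difference between the files on the server and the URLs a crawl reached. This assumes a static site where files under the document root map directly to URL paths; the helper names and file names are made up:

```python
import os
import tempfile


def served_paths(doc_root):
    """List every file under doc_root as a URL path like /dir/page.html."""
    paths = set()
    for folder, _dirs, files in os.walk(doc_root):
        for name in files:
            rel = os.path.relpath(os.path.join(folder, name), doc_root)
            paths.add("/" + rel.replace(os.sep, "/"))
    return paths


def orphaned(doc_root, crawled):
    """Files that exist on disk but were never reached by the crawl."""
    return served_paths(doc_root) - set(crawled)


# Demonstration against a throwaway directory standing in for the web root.
with tempfile.TemporaryDirectory() as root:
    for name in ("index.html", "about.html", "old_promo.html"):
        open(os.path.join(root, name), "w").close()
    result = orphaned(root, {"/index.html", "/about.html"})

print(sorted(result))
```

The crawled set would come from whatever spidering tool you already use; the file listing is the part an outsider has no access to.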

**d000hg** · 18 January 2012, 13:07

Originally posted by PAH

Have a search for tools that locate orphaned web pages/files as a reasonable starting point, assuming you want to identify pages that are still accessible but not via normal link navigation, in which case a website spidering tool won't work.

Exactly

Originally posted by TheFaQQer

If the pages aren't public, then what is going to know that they are there?

That is the question being asked. When a new site goes up Google finds it and crawls the home-page... how does it find the home-page in the first place?

I always thought if I put up a page mysite.com/some_random_page.html, Google would find it and index it even if my homepage doesn't link to it. Not the case?

**PAH** · 18 January 2012, 13:19

Originally posted by d000hg

I always thought if I put up a page mysite.com/some_random_page.html, Google would find it and index it even if my homepage doesn't link to it. Not the case?

Nope. Google uses links to find pages. A new site needs to be linked to from another site for Google to find it, or you can manually submit a site or page to Google for adding to their index. There's a special page on Google somewhere to do that.

The only way a page that's not linked to may be found is if it uses dynamic URLs where there's something on the querystring to identify the page content to return, such as 'page=1'. Then it may be possible some search engines would use an incrementer to find all possible entries, but I wouldn't rely on it.
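A minimal sketch of what such an incrementer might look like, purely as an illustration. The base URL, the parameter name and the `candidate_urls` helper are assumptions, and nothing here claims any search engine actually works this way:

```python
from urllib.parse import urlencode


def candidate_urls(base, param, start, stop):
    """Yield base?param=N for every N from start to stop inclusive."""
    for n in range(start, stop + 1):
        yield base + "?" + urlencode({param: n})


# Hypothetical dynamic page that selects content via 'page=N'.
urls = list(candidate_urls("http://example.com/view.php", "page", 1, 3))
for url in urls:
    print(url)
```

Each candidate would still have to be requested to see whether it returns real content, which is why this only helps when the identifiers are guessable.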

**northernladuk** · 18 January 2012, 13:26

Originally posted by d000hg

Exactly

That is the question being asked. When a new site goes up Google finds it and crawls the home-page... how does it find the home-page in the first place?

I always thought if I put up a page mysite.com/some_random_page.html, Google would find it and index it even if my homepage doesn't link to it. Not the case?

Google can only find it if it has been told where it is. It does this by...

1) Having the page submitted manually to Google, which you can do here: Overview - Submit your content
Make it a page with either a sitemap XML or a lot of links through your page (like a sitemap page). Google then crawls all the links. Submit a single page with no links in or out and it will take that page alone once, bugger off and never return.

2) Have links from other pages that Google rates (for faster and more frequent crawling) and its spider will come visit you at some point. Paid links or relevant content links. The more relevant the link, the better Google will deem it and the more likely you are to rate higher.

3) Submit to user-generated sites like DMOZ, but because it is user authenticated it can take forever.

Google AFAIK does not document new pages that appear out of the blue. It has to be connected for the spiders to find it.. No linkey no likey....
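For option 1, a sitemap XML is just a list of URLs in the sitemaps.org format. A minimal sketch of generating one, assuming you already know the URLs you want listed (the `build_sitemap` helper and the example URLs are hypothetical):

```python
import xml.etree.ElementTree as ET

# Namespace defined by the sitemaps.org protocol.
SITEMAP_NS = "http://www.sitemaps.org/schemas/sitemap/0.9"


def build_sitemap(urls):
    """Return a sitemap document with one <url><loc> entry per URL."""
    urlset = ET.Element("urlset", xmlns=SITEMAP_NS)
    for address in urls:
        entry = ET.SubElement(urlset, "url")
        ET.SubElement(entry, "loc").text = address
    return ET.tostring(urlset, encoding="unicode")


xml_text = build_sitemap([
    "http://mysite.com/",
    "http://mysite.com/some_random_page.html",  # not linked from the homepage
])
print(xml_text)
```

Listing an otherwise unlinked page here is what gives the spider a path to it.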
