Indexed SERP pages
Search Engines, SEO

Create a List of a Domains Indexed Pages in Google

When putting a new web site or redesign project live that previously had a site, it is very important to redirect the indexed pages in Google to the new URL’s. This process is usually done by adding 301 (permanent redirects) to a .htaccess file within the root directory of your web server that contains your web site files.

The reason for doing these redirects is for a combination of reasons, firstly so that visitors from paid, natural results or links are directed to the relevant page and not just given a 404 page. Secondly and more importantly is for the SEO benefits to reduce the amount of links lost and so that Search Engines such as Google and Bing can identify and re-index the new links to replace the current SERP’s.

Identifying Indexed Pages in Search Engines

Now we know the basic reasoning for redirecting old indexed pages to new ones, we can look at how to identify what is currently indexed, which should preferably be done prior to a new site going live. To display your domains indexed pages in your preferred search engine, which in this article is Google as there are some great tricks for exporting the reports later in this article, simply open your web browser and navigate to Google. Once the search engine is open in your browser enter the following in the search box and click the search button:

site:www.yourdomainhere.com

The search results will now be a complete list of all indexed pages for the domain you searched on (replace; www.yourdomainamehere.com with your actual domain name).

Exporting Indexed Pages

The results may only return a handful of results which you can quite easily ctreate your 301 redirects from, but for larger sites, especially E-Commerce sites, you may want to eport or save the results to a CSV or Excel file. Luckily with the help of Google Docs, known now as Google Drive, you can do exactly that.

Navigate to Google Drive, login to your Google account and then create a new spreadsheet document. In the first cell on the document enter the following code, obviously replacing; www.yourdomainnamehere.com with your web sites actual domain name:

=importXml("https://www.google.com/search?q=site:www.yourdomainnamehere.com&num=100&start=1"; "//cite")

This will give you the first 100 results in Google’s index for your domain. To get the next 100 results, simply increase the ‘&start=1′ to ‘&start=100′ and paste this into cell 101 right after the last lot of returned results, full code below:

=importXml("https://www.google.com/search?q=site:www.yourdomainnamehere.com&num=100&start=100"; "//cite")

A maximum of a 1000 results can be pulled using this method, simply by clicking on the last available cell after the current results and incrementally increasing the ‘&start=’ number each time.

Once you have all of the results in your Google Drive, Spreadsheet Document, simply export the file as a CSV file or other type of file for a full list of indexed URL’s. From this exported file you can then systematically go through and add any required redirects to a .htacces file.


22 Responses

  1. Steffen
    April 15, 2013

    Works without a doubt. Thank you for sharing this…

    Reply
  2. luke
    April 30, 2013

    hi this is very cool method thanks! I find on some it truncates them with a … if they are longer urls though – as does the search page itself in its “displayed” value, altough the links on search pages do work properly to full URL. Any ideas?

    Reply
    • Porter
      April 30, 2013

      In the actual search results you can click the ‘cached’ link which opens the page in a new window with an information box at the top of the browser window which details the full URL for you. As for seeing this in the Google Docs method, I havent seen the shorter URL’s.

      Reply
      • Luke
        September 10, 2013

        hi Mathew sorry I never replied back ! yeah the shorter urls are when a long url is captured, it inserts a … for part of the URL so it breaks (in the spreadsheet I mean). Still a useful technique for us Web Developers! Thanks.

        Reply
  3. Porter
    September 10, 2013

    I have also noticed that it shortens url’s sometimes… Will have to have a look into an alternative method.

    Reply
  4. luke
    September 19, 2013

    hey Mathew any joy with the URL shortned ones?

    I got a nice tip, in the spreadsheet if you want to keep a ref of the Page title also beside your URLs. in the second column, run the command again. This time swap out //cite for //h3 – that way you get the title. This is useful for example if you are doing 301 s and want to know what a URL like e.g. content.php?id=204 would be.

    Reply
    • Porter
      September 26, 2013

      No Luke, it seems like an issue that a few people have encountered but unfortunately Google a year or 2 ago changed the attributes / data that you can export.

      Reply
  5. Will Wilson
    October 14, 2013

    Excellent article and helped me get at the data I was after. Thanks!

    Reply
  6. Chris
    October 21, 2013

    Only workable solution I’ve found – why on earth can’t Google just include an export button in Webmaster Tools?

    Reply
  7. Kirill
    November 9, 2013

    Excellent & helpful command, thanks!

    Reply
  8. Joe
    November 27, 2013

    This doesn’t work – I copied and pasted code and got “error: The xPath query did not return any data.”

    Reply
    • Porter
      December 1, 2013

      I dont know why you are having issue Joe, everyone else seems to have success? I will look at the error code and tried to come back to you with what may have caused it. Ensure its the first cell in the first column and first row that you paste the code in a new Google Docs spreadsheet.

      Reply
    • Porter
      April 3, 2014

      Hi Joe, I have also encountered the same error recently and its been since Google updated search to use an SSL and can simply be fixed by adding https:// in the query. The code has now been updated in the article and has been tested.

      Reply
  9. Malachy McConnell
    December 13, 2013

    To avoid the URL shortening you can access the full url by changing the ‘xPath’ of the importXml function like so:

    =importXml(“www.google.com/search?q=site:www.somedomain.com&num=100&start=1″; “//h3/a/@href”)

    That is replace “//cite” with “//h3/a/@href”

    There is still some cleaning up to do on the href attribute, which can be done with a simple text function to extract just the normal url instead of the Google contrived url. I used this function:

    =mid(A1,search(“?q=”,A1)+3,search(“&sa=”,A1)-(search(“?q=”,A1)+3))

    References:
    xPath : http://www.w3schools.com/xpath/xpath_syntax.asp

    Reply
  10. Anne Stahl
    January 10, 2014

    Great article and very helpful. Really love the simplicity of this.
    Thank you!
    I don’t suppose anyone knows how to get sites with over 1000 results indexed?

    Reply
    • Porter
      January 17, 2014

      Hi Anne, unfortunately using this method the limit is 1000. However using an application such as Scrapebox would pull back a full list that you can save as a spreadsheet for you.

      Reply
  11. D.Varga Tamas
    January 26, 2014

    Hi Matthew,

    I have Scrapebox and all of its plugins, but was unable to find a good tutorial on how to use it to fetch all indexed pages. Unfortunately some of our large Joomla sites generate duplicate content and we need to deindex them, but for this we need the list of indexed pages.

    I would appreciate if you could explain how to do it with Scrapebox.

    Many thanks in advance.

    Reply
    • Porter
      January 31, 2014

      Hi, I think this is the video tutorial for using ScrapeBox to find indexed pages and produce a file for export: http://www.youtube.com/watch?v=nEIYzcPFfvI

      Reply
      • D.Varga Tamas
        February 1, 2014

        Thanks Matthew, I know this feature, but its function is different from the solution you described above. I will continue experiment with this importXML function so that I can get the real urls that are in the site index.

        Reply
  12. Jake
    April 10, 2014

    Hi Matthew,

    Is there a way of adding ommitted results to the spreadsheet?

    Jake

    Reply
    • Porter
      April 12, 2014

      Hi Jake, I will have to have a look into this for you buddy, a quick search mentioned replacing site: with allinurl: but I would assume they both would still bring back omitted results if your content is dynamic, so I will keep searching.

      Reply

Leave a Reply