Create a List of a Domain's Indexed Pages in Google
When putting a new web site or redesign live where a site previously existed, it is very important to redirect the pages indexed in Google to the new URLs. This is usually done by adding 301 (permanent) redirects to a .htaccess file within the root directory of the web server that contains your web site files.
These redirects serve two purposes. Firstly, visitors arriving from paid results, natural results or links are directed to the relevant page rather than a 404 page. Secondly, and more importantly for SEO, they reduce the number of links lost and allow search engines such as Google and Bing to identify and re-index the new links in place of the current SERPs.
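As a brief sketch, the 301 rules in such a .htaccess file might look like the following (the paths and domain are placeholders, not taken from any real site):

```apache
# Permanent (301) redirects from old indexed URLs to their new homes.
# Replace the placeholder paths and domain with your own.
Redirect 301 /old-page.html http://www.yourdomainhere.com/new-page/
Redirect 301 /products/old-range/ http://www.yourdomainhere.com/shop/new-range/
```

Each rule maps one old path to its new absolute URL; the Redirect directive from mod_alias handles this without needing mod_rewrite.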
Identifying Indexed Pages in Search Engines
Now that we know the basic reasoning for redirecting old indexed pages to new ones, we can look at how to identify what is currently indexed, which should preferably be done before a new site goes live. This article uses Google, as there are some great tricks for exporting the results later on. To display your domain's indexed pages, open your web browser, navigate to Google, enter the following in the search box and click the search button:
site:www.yourdomainhere.com
The search results will now be a complete list of all indexed pages for the domain you searched on (replace www.yourdomainhere.com with your actual domain name).
Exporting Indexed Pages
The search may only return a handful of results, from which you can quite easily create your 301 redirects, but for larger sites, especially e-commerce sites, you may want to export or save the results to a CSV or Excel file. Luckily, with the help of Google Docs, now known as Google Drive, you can do exactly that.
Navigate to Google Drive, log in to your Google account and create a new spreadsheet document. In the first cell of the document enter the following code, replacing www.yourdomainnamehere.com with your web site's actual domain name:
=importXml("https://www.google.com/search?q=site:www.yourdomainnamehere.com&num=100&start=1"; "//cite")
This will give you the first 100 results in Google's index for your domain. To get the next 100 results, simply change '&start=1' to '&start=100' and paste the formula into the cell directly after the last returned result (cell 101). Full code below:
=importXml("https://www.google.com/search?q=site:www.yourdomainnamehere.com&num=100&start=100"; "//cite")
A maximum of 1,000 results can be pulled using this method: click the first empty cell after the current results and paste the formula again, incrementally increasing the '&start=' number each time.
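The copy-and-increment steps above can be sketched with a short script that prints each formula ready to paste. The start values mirror the article (1 for the first page, then 100, 200 and so on); the domain is a placeholder:

```python
# Generate the importXml formulas for each 100-result page, up to the
# 1,000-result cap this method allows.
def build_formulas(domain, pages=10, page_size=100):
    # First page starts at 1, later pages at 100, 200, ... as in the article.
    starts = [1] + [page_size * i for i in range(1, pages)]
    return [
        f'=importXml("https://www.google.com/search'
        f'?q=site:{domain}&num={page_size}&start={start}"; "//cite")'
        for start in starts
    ]

for formula in build_formulas("www.yourdomainnamehere.com"):
    print(formula)
```

Each printed line can be pasted into the next free cell after the previous batch of results.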
Once you have all of the results in your Google Drive spreadsheet, simply export the file as a CSV (or another file type) for a full list of indexed URLs. From this exported file you can then systematically add any required redirects to a .htaccess file.
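As one possible sketch of that last step, the exported CSV can be turned into 301 rules with a few lines of Python. The export text and the old-to-new mapping below are hypothetical examples; build yours from your own export:

```python
import csv
import io
from urllib.parse import urlparse

def redirects_from_csv(csv_text, new_locations):
    """Turn a one-column CSV of indexed URLs into Redirect 301 rules."""
    rules = []
    for row in csv.reader(io.StringIO(csv_text)):
        if not row:
            continue
        # Keep only the path portion of the old URL for the redirect source.
        path = urlparse(row[0].strip()).path or "/"
        if path in new_locations:
            rules.append(f"Redirect 301 {path} {new_locations[path]}")
    return rules

# Hypothetical export and mapping, for illustration only.
exported = "http://www.example.com/old-page.html\nhttp://www.example.com/about.html\n"
new_locations = {"/old-page.html": "http://www.example.com/new-page/"}

for rule in redirects_from_csv(exported, new_locations):
    print(rule)  # Redirect 301 /old-page.html http://www.example.com/new-page/
```

The printed rules can then be pasted straight into the .htaccess file described earlier.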
42 Responses
Steffen
April 15, 2013
Works without a doubt. Thank you for sharing this…
luke
April 30, 2013
Hi, this is a very cool method, thanks! I find on some it truncates longer URLs with a '…' though, as does the search page itself in its displayed value, although the links on search pages do resolve properly to the full URL. Any ideas?
Porter
April 30, 2013
In the actual search results you can click the 'cached' link, which opens the page in a new window with an information box at the top of the browser window that details the full URL for you. As for seeing this in the Google Docs method, I haven't seen the shortened URLs.
Luke
September 10, 2013
Hi Mathew, sorry I never replied back! Yeah, the shorter URLs occur when a long URL is captured; it inserts a '…' for part of the URL, so it breaks (in the spreadsheet, I mean). Still a useful technique for us web developers! Thanks.
Porter
September 10, 2013
I have also noticed that it shortens URLs sometimes… I will have to look into an alternative method.
luke
September 19, 2013
Hey Mathew, any joy with the shortened URLs?
I have a nice tip: in the spreadsheet, if you want to keep a reference to the page title beside your URLs, run the command again in the second column, this time swapping out //cite for //h3; that way you get the title. This is useful, for example, if you are doing 301s and want to know what a URL like content.php?id=204 would be.
Porter
September 26, 2013
No Luke, it seems to be an issue a few people have encountered; unfortunately Google changed the attributes/data that you can export a year or two ago.
Will Wilson
October 14, 2013
Excellent article and helped me get at the data I was after. Thanks!
Chris
October 21, 2013
Only workable solution I’ve found – why on earth can’t Google just include an export button in Webmaster Tools?
Kirill
November 9, 2013
Excellent & helpful command, thanks!
Joe
November 27, 2013
This doesn’t work – I copied and pasted the code and got “error: The xPath query did not return any data.”
Porter
December 1, 2013
I don't know why you are having an issue, Joe; everyone else seems to have success. I will look at the error code and try to come back to you with what may have caused it. Ensure you paste the code into the first cell (first column, first row) of a new Google Docs spreadsheet.
Porter
April 3, 2014
Hi Joe, I have also encountered the same error recently; it has appeared since Google updated search to use SSL and can be fixed simply by adding https:// to the query. The code in the article has now been updated and tested.
Malachy McConnell
December 13, 2013
To avoid the URL shortening you can access the full URL by changing the xPath of the importXml function like so:
=importXml("www.google.com/search?q=site:www.somedomain.com&num=100&start=1"; "//h3/a/@href")
That is, replace "//cite" with "//h3/a/@href".
There is still some cleaning up to do on the href attribute, which can be done with a simple text function to extract just the normal URL instead of the Google-contrived one. I used this function:
=mid(A1,search("?q=",A1)+3,search("&sa=",A1)-(search("?q=",A1)+3))
References:
xPath : http://www.w3schools.com/xpath/xpath_syntax.asp
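For readers scripting this outside of a spreadsheet, Malachy's MID/SEARCH cleanup can be sketched in Python. The sample href below is illustrative of Google's wrapper format at the time (/url?q=&lt;real-url&gt;&amp;sa=U&amp;…), not a captured result:

```python
# Pull the real URL out of Google's wrapper href, the same extraction
# Malachy's MID/SEARCH formula performs between "?q=" and "&sa=".
def extract_url(href):
    start = href.index("?q=") + len("?q=")
    end = href.index("&sa=", start)
    return href[start:end]

# Illustrative wrapper href, not a real captured result.
href = "/url?q=http://www.somedomain.com/content.php%3Fid%3D204&sa=U&ved=0ab"
print(extract_url(href))  # http://www.somedomain.com/content.php%3Fid%3D204
```

Note the extracted URL may still be percent-encoded, as in the example above; urllib.parse.unquote can decode it if needed.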
Porter
April 3, 2014
Great follow up Malachy.
Kim McCann
October 1, 2017
Works great. Just be sure to change left and right quotation marks to regular vertical quotes if you copy and paste from the examples above; otherwise you will get an error message. This saved me hours recording existing indexed pages, especially when making SEO reports or performing website redesigns and remodels where I want a record of the original pages for any redirects I might have to perform.
Mathew
October 27, 2017
Thanks Kim. Yes, the character encoding for ” marks is off in the post.
Anne Stahl
January 10, 2014
Great article and very helpful. Really love the simplicity of this.
Thank you!
I don’t suppose anyone knows how to get sites with over 1000 results indexed?
Porter
January 17, 2014
Hi Anne, unfortunately the limit using this method is 1,000. However, an application such as Scrapebox can pull back a full list that you can save as a spreadsheet.
D.Varga Tamas
January 26, 2014
Hi Matthew,
I have Scrapebox and all of its plugins, but I was unable to find a good tutorial on how to use it to fetch all indexed pages. Unfortunately some of our large Joomla sites generate duplicate content that we need to deindex, but for this we need the list of indexed pages.
I would appreciate if you could explain how to do it with Scrapebox.
Many thanks in advance.
Porter
January 31, 2014
Hi, I think this is the video tutorial for using ScrapeBox to find indexed pages and produce a file for export: http://www.youtube.com/watch?v=nEIYzcPFfvI
D.Varga Tamas
February 1, 2014
Thanks Matthew, I know this feature, but its function is different from the solution you described above. I will continue experimenting with the importXML function so that I can get the real URLs that are in the site index.
Jake
April 10, 2014
Hi Matthew,
Is there a way of adding omitted results to the spreadsheet?
Jake
Porter
April 12, 2014
Hi Jake, I will have to look into this for you, buddy. A quick search mentioned replacing site: with allinurl:, but I would assume both would still bring back omitted results if your content is dynamic, so I will keep searching.
Jon Burr
May 12, 2015
Mathew – thank you for this incredibly useful post.
I found that the syntax has changed – the query URL as posted no longer works.
Currently working version:
=IMPORTXML("https://www.google.com/search?q=site:yourdomainhere.com&num=100&start=1&gws_rd=ssl", "//cite")
Notice the comma at the end of the first argument instead of the semicolon, and the appended '&gws_rd=ssl'.
Porter
May 13, 2015
Hi Jon, there is a new method that works well which I am currently writing a follow up post detailing its usage, so please check back soon for the solution.
George
March 8, 2016
This worked best for me; the others seem to be out of date (relevant March 2016).
Rich
June 9, 2015
I got the original =importXml("https://www.google.com/search?q=site:www.yourdomainnamehere.com&num=100&start=1"; "//cite") to work for me yesterday (June 8, 2015) and now today it doesn't work. Did something change? This was an awesome tool for listing all of my site's pages in CSV.
Porter
October 23, 2015
Try my updated guide on our company blog, as I am not sure all the updated steps are detailed in this post; I need to review it again. Here is the link: https://www.twdg.co.uk/blog/create-list-pages-urls-indexed-google-domain/
Brian
June 17, 2015
Thanks Jon,
That helped a lot. My importXml had just stopped working despite working a month ago.
Porter
October 23, 2015
Hi guys, since the recent Google updates, I have followed up with a new overview on my company's blog: https://www.twdg.co.uk/blog/create-list-pages-urls-indexed-google-domain/
Jacky
September 22, 2015
That's very helpful. Do you have a solution for sites with hundreds of thousands of indexed pages?
Porter
October 23, 2015
The easiest thing to start with would be to place removal requests from within your Google Search Console account for the main directories that URLs are based on; a removal request for the directory mydomain.com/first-directory/ would remove all subsequent URLs based off that directory, such as mydomain.com/first-directory/link1/ and mydomain.com/first-directory/link2/, etc.
Diwali Wishes 2015
October 12, 2015
WOW just what I was searching for. Came here by searching for google indexed pages
Ralf
March 23, 2016
As of now it doesn't seem to work anymore.
Porter
April 6, 2016
It should still work, providing your IP address hasn't been flagged for too many lookups, but you may need to follow my updated guide here: https://www.twdg.co.uk/blog/create-list-pages-urls-indexed-google-domain/
Pritesh
April 6, 2016
Hi,
I tried this but it is not working in my case. I used the following:
=importXml("https://www.google.com/search?q=site:nearfox.com&num=100&start=1", "//h3/a/@href")
Am I doing anything wrong here, or has Google made some changes?
Many thanks!
Porter
April 28, 2016
Hi Pritesh, follow the updated guide on my company site here: https://www.twdg.co.uk/blog/create-list-pages-urls-indexed-google-domain/. Ensure that the quotation marks are not encoded in any other format, as in the code you pasted; it should be: =importXml("https://www.google.com/search?q=site:nearfox.com&num=100&start=1", "//h3/a/@href"). Also ensure that the query uses the search engine index TLD that your site is indexed in, such as google.co.uk, google.com, google.jp, etc.
Begona
May 7, 2016
Hi,
I tried this but it is not working in my case. I can only see #N/A.
The error is: can't obtain URL
Many thanks
Porter
June 15, 2016
The most common issues are: 1) your site is not indexed in the Google index of the TLD being queried, so update the query to google.com, google.co.uk, etc., whichever index your site is present in; 2) your IP address has been blocked for exceeding Google's query limits; 3) your query may contain incorrect characters, usually the quotation marks. Please adjust and trial these points, and ensure that you are following the latest guide, available on our company blog: https://www.twdg.co.uk/blog/create-list-pages-urls-indexed-google-domain/.
Paolo
October 24, 2017
Doesn't work. It says it can't get the URL https://www.google.com etc. 🙁
Mathew
October 27, 2017
Hi Paolo, try the most recent version of this process over on my company blog: https://www.twdg.co.uk/blog/create-list-pages-urls-indexed-google-domain/. If retrieval errors persist, ensure that the quotation marks are not encoded and that your daily IP crawl limits for querying Google's index have not been reached.