Literature review

Revision as of 19:20, 16 December 2019

Aims and Scope

Data collection and cleaning procedure

We will be using Publish or Perish (PoP) to conduct all queries, which will access Google Scholar, Scopus and Web of Science. Each has its own specifications, so the keywords will be defined in a spreadsheet and then concatenated to generate usable strings. These databases impose character limits on search strings, so we will be generating many small queries rather than lengthier strings that contain multiple keywords. This will generate a massive amount of data with lots of overlap, but there are ways of dealing with this (which is the purpose of this page). One benefit is that we will get a more granular sense of which concepts relate to each item in the query results, which may be very valuable during the analytical stage.
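
The concatenation step could look something like the sketch below. It pairs every keyword from one spreadsheet column with every keyword from another, one pair per query, and sets aside any string that exceeds the character limit. The function name, the column contents and the quoting style are placeholders for illustration; the 100-character default is the Google Scholar limit noted further down this page.

```python
from itertools import product


def build_queries(concepts, contexts, max_chars=100):
    """Pair each concept keyword with each context keyword, one pair per query.

    Returns two lists: queries that fit within max_chars and queries that do not.
    """
    kept, too_long = [], []
    for concept, context in product(concepts, contexts):
        query = f'"{concept}" "{context}"'
        if len(query) <= max_chars:
            kept.append(query)
        else:
            too_long.append(query)
    return kept, too_long


# Placeholder keywords standing in for two spreadsheet columns.
concepts = ["open data", "data reuse"]
contexts = ["archaeology", "heritage"]

kept, too_long = build_queries(concepts, contexts)
for query in kept:
    print(query)
```

Keeping each query to a single keyword pair is what later lets us trace every result back to the concepts that produced it.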

Search parameters

Here are some notes pertaining to the search parameters that we will be using:

Google Scholar

  • Character limit is 100 per query
  • To specify keyword in publication title - source:<xx> ()
  • 1000 results per query by default
  • Does not provide DOIs
  • Can specify to search titles only
  • No API, must use Publish or Perish

Scopus

  • 200 results per query by default
  • Implicit synonym-matching/autocorrect (archaeology/archeology are interchangeable, but archaeological/archeological may not be)
  • To specify a keyword in the publication (source) title: SRCTITLE()
  • To specify a keyword in the article title only: TITLE()
  • In PoP, journal keywords are entered in a separate field
  • Journal field (SRCTITLE()) does not allow wildcards or boolean terms (AND/OR/NOT/*)
  • Limited recent publications
  • Punctuation is ignored: heart-attack or heart attack return the same results
  • The hyphen is treated as punctuation and therefore ignored if it is not in an exact phrase
  • Wildcards must be attached to a word; they cannot stand alone
  • When a hyphen is placed between a wildcard and a word, the wildcard will be dropped, e.g.:
    • title-abs-key (*-art) will be searched as title-abs-key(art)
    • abs(iwv-*) will be searched as abs(iwv)
  • To find documents that contain an exact phrase, enclose the phrase in braces: {oyster toadfish}
    • {heart-attack} and {heart attack} will return different results because the dash is included.
    • Wildcards are searched as actual characters, e.g. {health care?} returns results such as: Who pays for health care?
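
The Scopus rules above can be folded into a small query builder so that strings generated from the spreadsheet do not silently break them. This is only a sketch: the function names and the example journal are hypothetical, and treating lower-case and/or/not as operators in the journal field is a cautious assumption on my part.

```python
import re


def has_hyphenated_wildcard(term):
    """True if a wildcard touches a hyphen, in which case the wildcard is dropped."""
    return bool(re.search(r"\*-|-\*", term))


def scopus_query(keyword, journal=None, exact=False):
    """Build a TITLE-ABS-KEY query, optionally restricted to one journal."""
    # Braces request an exact phrase, punctuation included.
    term = "{" + keyword + "}" if exact else keyword
    query = f"TITLE-ABS-KEY({term})"
    if journal:
        # The journal field accepts no wildcards or boolean terms.
        # Assumption: operators are matched regardless of case.
        if re.search(r"[*?]|\b(?:and|or|not)\b", journal, flags=re.IGNORECASE):
            raise ValueError(f"journal field cannot contain wildcards or operators: {journal}")
        query += f" AND SRCTITLE({journal})"
    return query


print(scopus_query("oyster toadfish", exact=True))
print(scopus_query("heart attack", journal="Example Journal"))
print(has_hyphenated_wildcard("*-art"))
```

Since PoP takes journal keywords in a separate field, the combined string is only relevant when querying Scopus directly; in PoP the two parts would be entered separately.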

Web of Science

  • Can only search titles
  • Each search term in the query must be explicitly tagged with a field tag. Different fields must be connected with search operators.
  • Extraneous spaces are ignored by the product. For example, extra spaces around opening and closing parentheses ( ) and equal (=) signs are ignored.
  • The dollar sign ($) is useful for finding both the British and American spellings of the same word. For example, flavo$r finds flavor and flavour.
  • The search engine treats hyphens (-) and apostrophes (') in names as spaces. For example:
    • AU=O Brien returns the same number of results as AU=O'Brien.
  • More info:


API access in R


Web of Science:


Google Scholar: N/A

Data cleaning

Some of the keywords remain vague and will generate too many irrelevant results, so we need a methodical way to weed out irrelevant items in bulk. Costis suggested removing a set number of the items with the lowest PageRank whenever fewer than a threshold share of that subset are deemed relevant by a human reviewer. However, this remains somewhat unclear to me, and it would be very helpful if Costis could write it out in more detail.

Verify that the results are sorted by relevance, and then export them from PoP as BibTeX (.bib) files. We will then import them into Zotero as independent collections.
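
Until the threshold idea is written out properly, here is one possible reading of it, offered as a sketch to react to and not as Costis's confirmed procedure. The results are assumed to be sorted from highest to lowest rank; the lowest-ranked block is reviewed, dropped if too few of its items are relevant, and the process repeats on the next block up until a block passes. The block size, the threshold and the function name are all placeholders.

```python
def prune_tail(items, is_relevant, block_size=50, min_share=0.2):
    """Drop low-ranked blocks until one contains enough relevant items.

    items: results sorted from highest to lowest rank.
    is_relevant: callback standing in for the human reviewer's judgement.
    """
    kept = list(items)
    while kept:
        block = kept[-block_size:]
        share = sum(1 for item in block if is_relevant(item)) / len(block)
        if share >= min_share:
            break  # this block is relevant enough, so stop pruning here
        del kept[-len(block):]
    return kept


# Toy example: ten ranked items, of which only the top four are relevant.
ranked = list(range(10))
survivors = prune_tail(ranked, lambda item: item < 4, block_size=2, min_share=0.5)
print(survivors)
```

As written, a discarded block is removed in full, including any items the reviewer marked as relevant; we would probably want to keep those, which is one of the details that needs settling.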

Zotero does not have batch import functionality, so I'm trying to figure out a workaround that would save us time and energy. Here's what I propose:

1. Use the Web API to create the collections.

2. Go through the collections and import the contents of bib files via the clipboard (control + shift + command + i on a mac)

  • We can't realistically import the actual bibliographic items using the API, because it is limited to 50 items per write request.
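
Step 1 could be sketched as follows. The code only prepares the requests and does not send them. The endpoint and header names are the ones I understand the Zotero Web API (version 3) to use, and the library path, API key and collection names are placeholders; check them against the API documentation before running anything against a real library.

```python
import json
import urllib.request

API_ROOT = "https://api.zotero.org"
MAX_PER_REQUEST = 50  # write limit noted above


def collection_batches(names, batch_size=MAX_PER_REQUEST):
    """Turn collection names into JSON-ready payloads of at most batch_size each."""
    payload = [{"name": name, "parentCollection": False} for name in names]
    return [payload[i:i + batch_size] for i in range(0, len(payload), batch_size)]


def build_request(library_path, api_key, batch):
    """Prepare (but do not send) a POST that creates one batch of collections."""
    return urllib.request.Request(
        url=f"{API_ROOT}/{library_path}/collections",
        data=json.dumps(batch).encode("utf-8"),
        headers={
            "Zotero-API-Key": api_key,
            "Zotero-API-Version": "3",
            "Content-Type": "application/json",
        },
        method="POST",
    )


# Placeholder names: one collection per query string.
names = [f"query-{n:03d}" for n in range(120)]
batches = collection_batches(names)
requests = [build_request("users/0", "PLACEHOLDER_KEY", batch) for batch in batches]
print([len(batch) for batch in batches])
```

Each prepared request would then be passed to `urllib.request.urlopen`, ideally with a pause between calls.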

We will use Zotero's merge function to combine items with matching metadata across all collections. Each item's association with its collections will be maintained, but there will be only one copy of the item shared across them. We will therefore be able to combine the different sets of metadata provided by each database for overlapping items. This will be crucial, since Google Scholar does not include DOIs in its results, but Scopus and Web of Science do. For the remaining items without DOIs, we will use the zotero-shortdoi plugin.
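
To make the intended outcome of the merge concrete, here is a toy illustration. It is not a description of Zotero's own duplicate detection; it simply groups records by a normalised title, keeps the union of their collections, and fills in the DOI from whichever source provides one. The record fields and the example DOI are placeholders.

```python
import re


def normalise_title(title):
    """Lower-case a title and collapse punctuation so near-identical titles match."""
    return re.sub(r"[^a-z0-9]+", " ", title.lower()).strip()


def merge_records(records):
    """Merge records that share a normalised title, pooling collections and DOIs."""
    merged = {}
    for record in records:
        key = normalise_title(record["title"])
        target = merged.setdefault(
            key, {"title": record["title"], "doi": None, "collections": set()}
        )
        target["collections"].add(record["collection"])
        if not target["doi"] and record.get("doi"):
            target["doi"] = record["doi"]
    return list(merged.values())


records = [
    {"title": "An Example Article", "collection": "scholar-q1", "doi": None},
    {"title": "An example article.", "collection": "scopus-q1", "doi": "10.1000/example"},
]
result = merge_records(records)
print(result)
```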

After this is done, we will export a .bib file for each collection and pass them into an R script I wrote that uses the DOI to query the CrossRef database, obtain article abstracts when they are available, and export a new .bib file with that metadata included. We will then re-import those .bib files into Zotero and begin weeding out irrelevant items based on their abstracts. If an abstract is not included in the CrossRef database, we may need to look up the article and obtain the abstract manually. This final stage of sorting and cleaning the data will generate a list of around 100-120 articles, whose full-text PDFs will be imported into MaxQDA for qualitative coding.
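
For anyone who wants to follow the CrossRef step without reading the R script, the core of the lookup is sketched below in Python. It is a stand-in for illustration, not a port of the script. It assumes the public CrossRef REST API, where a work is fetched by DOI and the abstract, when present, sits in the `abstract` field of the `message` object wrapped in JATS markup; the helper names and the example DOI are placeholders.

```python
import json
import re
import urllib.parse
import urllib.request


def crossref_url(doi):
    """Build the CrossRef works URL for a DOI."""
    return "https://api.crossref.org/works/" + urllib.parse.quote(doi)


def extract_abstract(response):
    """Return the plain-text abstract from a parsed CrossRef reply, or None."""
    raw = response.get("message", {}).get("abstract")
    if not raw:
        return None  # no abstract deposited: look it up manually
    text = re.sub(r"<[^>]+>", " ", raw)  # strip JATS tags such as <jats:p>
    return re.sub(r"\s+", " ", text).strip()


def fetch_abstract(doi):
    """Query CrossRef for one DOI (needs network access)."""
    with urllib.request.urlopen(crossref_url(doi)) as reply:
        return extract_abstract(json.load(reply))


# Offline example using a hand-written reply in the assumed shape.
sample = {"message": {"abstract": "<jats:p>Some  text.</jats:p>"}}
print(extract_abstract(sample))
```

Items for which `extract_abstract` returns `None` are the ones that would need a manual lookup.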