Advanced Product Research on Amazon: Personalized Keyword Scraping Techniques
Overview
Do you use Amazon.com as your main source of data when conducting product research? Imagine you can locate the products you are looking for on Amazon.com, but you cannot analyze the data shown on the website because it is not on your computer. You have two options: hire an expensive freelancer to guarantee high-quality results, or take on the laborious, time-consuming task of writing scraping code yourself. Outsourcing can become very costly in the long run, particularly if you need to collect data frequently, since every additional batch of data adds to the bill.
Personalized Product Scanning on Amazon
The process starts with you entering the desired keyword into a customized Amazon product scraping service built for you. The service searches Amazon for that keyword and displays a preview of the results, similar to an Amazon page. When the results appear, you can check whether the information matches your expectations and use more precise keywords to locate specific items. Finally, you can download the data as a CSV or Excel file.
Allow me to explain this project to you through an example. Suppose our goal is to locate a pair of socks on Amazon. Type the term and press Enter. This is what will appear on the Amazon page.
Just as on Amazon, all you have to do in this customized scraping application is enter the same keyword on its first page:
Let us now contrast the outcome of this project with that of the actual website:
As you can see, the outcome is essentially the same. To access the additional files, click the download button and select your preferred format.
Now you can get data such as product name, price, rating, image, and product URL in Excel and CSV format.
excel result
Although these scraping results may be helpful for your product research, there may be other data that we need to scrape. Additionally, we may require another web source besides Amazon. I can show you how to customize this program to meet your specific needs.
To proceed, we’ll need to discuss more technical details. It’s important that you have intermediate web scraping skills using Python. We’ll also be using the Flask library and Bootstrap for web development.
Create Your Own Custom Scraping Program
There are a lot of libraries that can help us scrape. For the Amazon website, we can use the httpx library. Its API is almost the same as Requests, but it is faster and supports extra features such as HTTP/2 and async requests.
Our first piece of code creates a Scraper object. This object opens a session that we will reuse for every scraping request. Reusing the same session is much faster and more efficient than creating a new one for each request.
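The Scraper class itself is not spelled out in this article, so here is a minimal sketch of what it might look like with httpx. The browser-like User-Agent header and the timeout value are assumptions you can adjust; only the class name and the get_html() method are taken from how the object is used later in main().

import httpx

class Scraper:
    """Keeps one httpx session and reuses it for every request."""
    def __init__(self):
        # The User-Agent and timeout below are assumptions; tune them as needed.
        self.session = httpx.Client(
            headers={"User-Agent": "Mozilla/5.0"},
            follow_redirects=True,
            timeout=30,
        )

    def get_html(self, url: str) -> str:
        # Fetch the page and return its raw HTML text
        response = self.session.get(url)
        response.raise_for_status()
        return response.text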
Remember, scraping a website needs a URL. Rather than using a fixed URL, we will build the URL from user input, so we need a function that converts the input string into a valid URL.
def keyword_to_url(str_input: str) -> str:
    # Replace spaces with "+" so the keyword fits Amazon's search URL format
    text = str_input.replace(" ", "+")
    url = f"https://www.amazon.com/s?k={text}"
    return url
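As a quick sanity check, here is what the function returns for a multi-word keyword when called from a Python shell:

>>> keyword_to_url("socks for women")
'https://www.amazon.com/s?k=socks+for+women'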
What if we need to scrape another website? You need to analyze the changes on the URL tab on your desired websites. For the Amazon case, if the user inputs “sock for women,” the URL will change to https://www.amazon.com/s?k=socks+for+women&crid=1GWE9J4KHX1T9&sprefix=socks%2Caps%2C388&ref=nb_sb_ss_ts-doa-p_2_5
Only the first part of that URL, https://www.amazon.com/s?k=socks+for+women, is what Amazon actually needs; you can ignore the rest. If your website is https://www.yourcustomwebsite.com/, the code will look like this:
def keyword_to_url(str_input: str) -> str:
    # Same idea as before, only the base URL changes
    text = str_input.replace(" ", "+")
    url = f"https://www.yourcustomwebsite.com/s?k={text}"
    return url
Remember, the variable text holds the user input, and if the input is more than one word, the whitespace is replaced with "+". Now add the following lines at the bottom of our code to parse the HTML.
import re  # make sure these imports are at the top of the file
from bs4 import BeautifulSoup

def check_url(url: str) -> str:
    # Shorten the product URL to its /.../dp/... part and make sure it is absolute
    prefix = "https://www.amazon.com"
    pattern = r"(/[^/]+/dp/[^/]+)/"
    match = re.search(pattern, url)
    if match:
        shorten_url = match.group(1)
    else:
        shorten_url = url
    if prefix not in shorten_url:
        complete_url = prefix + shorten_url
        return complete_url
    else:
        return shorten_url

def find_tag_list(html: str) -> list[BeautifulSoup]:
    # Collect the result tags, trying both layout classes Amazon uses
    soup = BeautifulSoup(html, "html.parser")
    items = soup.find("div", "s-main-slot s-result-list s-search-results sg-row")
    class_16_data = "sg-col-20-of-24 s-result-item s-asin sg-col-0-of-12 sg-col-16-of-20 sg-col s-widget-spacing-small sg-col-12-of-16"
    class_48_data = "sg-col-4-of-24 sg-col-4-of-12 s-result-item s-asin sg-col-4-of-16 sg-col s-widget-spacing-small sg-col-4-of-20"
    if items.find_all("div", class_=class_16_data):
        list_of_tag = items.find_all("div", class_=class_16_data)
    elif items.find_all("div", class_=class_48_data):
        list_of_tag = items.find_all("div", class_=class_48_data)
    else:
        raise ValueError("No tag class found")
    return list_of_tag

def parse(soup: BeautifulSoup) -> dict:
    # Extract the fields we care about for a single product
    product_name = find_tag(soup, "a", "a-link-normal s-underline-text s-underline-link-text s-link-style a-text-normal")
    product_url = find_tag(soup, "a", "a-link-normal s-underline-text s-underline-link-text s-link-style a-text-normal", "href")
    product_image = find_tag(soup, "img", "s-image", "src")
    price = find_tag(soup, "span", "a-offscreen")
    rating = soup.find("div", "a-row a-size-small")
    rating = find_tag(soup=rating, tag="span", attr="aria-label")
    item = {"product_name": product_name,
            "price": price.replace("$", ""),
            "rating": rating.replace(" out of 5 stars", ""),
            "image": product_image,
            "product_url": check_url(product_url)}
    return item

def find_tag(soup: BeautifulSoup, tag: str, class_name="", attr="") -> str:
    # Return the tag's text (or a chosen attribute) and fall back to "" if it is missing
    try:
        tag = soup.find(tag, class_=class_name)
        if attr == "":
            text = tag.text.strip()
        else:
            text = tag[attr]
    except AttributeError:
        text = ""
    return text

def extract_html(html: str) -> list[dict]:
    # Parse every product tag on the page into a dictionary
    list_of_tag = find_tag_list(html)
    all_data = [parse(item) for item in list_of_tag]
    return all_data
Let me walk you through these functions one by one. check_url() shortens and validates the product URL. find_tag() finds the data by tag and handles the AttributeError raised when the wanted tag cannot be found. parse() extracts the data for a single product, while extract_html() extracts all the items on the page and returns them as a list.
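For example, given a relative product link like the one below (the path is made up purely for illustration), check_url() keeps only the /.../dp/... part and prepends the Amazon domain:

>>> check_url("/Example-Socks/dp/B000000000/ref=sr_1_1?keywords=socks")
'https://www.amazon.com/Example-Socks/dp/B000000000'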
Now let’s combine our scraper object that generates HTML content and our parsing function that extracts the result in the main() function.
def main(keyword):
    url = keyword_to_url(keyword)
    scraper = Scraper()
    html = scraper.get_html(url)
    result = extract_html(html)
    return result

if __name__ == "__main__":
    keyword = input("Input keyword: ")
    print(main(keyword))
Now, run the script to check that it works. Input "socks", and the output will look like this:
terminal output
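What you get back is a list with one dictionary per product. The structure below is illustrative only, with placeholder values instead of live Amazon data:

[
    {
        "product_name": "...",
        "price": "...",
        "rating": "...",
        "image": "...",
        "product_url": "...",
    },
    # ...one dictionary for every product on the results page
]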
Building Web App with Flask
Now that the scraping works, we will display the output in a web app using Flask. Create the HTML templates, three files in total.
index.html will contain a header and footer template that we will use on the other page as well.
homepage.html will contain our keyword search input.
scraping.html will display the data like a real online store.
Now add two more Flask files. The first one will handle the routing and the view pages. We call this one views.py:
from flask import Blueprint, render_template, request, send_file
from scrap import main
import pandas

bp = Blueprint("main", __name__)

@bp.route('/')
def index():
    return render_template("homepage.html")

@bp.route('/scraping', methods=["GET", "POST"])
def scraping():
    print(request.form)
    keyword = request.form["keyword"]
    print(keyword)
    items = main(keyword)
    df = pandas.DataFrame(items)
    df.to_csv("result.csv", index=False)
    df.to_excel("result.xlsx", index=False)
    return render_template("scraping.html", items=items)

@bp.route('/download_csv')
def download_csv():
    return send_file("result.csv")

@bp.route('/download_excel')
def download_excel():
    return send_file("result.xlsx")
Before we move on, take note of the scraping() function. It takes the user input, calls the main() function from the scrap.py file, and returns the scraping result. The result is then exported to CSV and Excel files.
Want to customize your output? Sure! Pandas supports exporting to an SQL database, a Google BigQuery table, or even an HDF5 file using HDFStore. Check the pandas documentation for the complete list.
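As one example, here is a minimal sketch that writes the same DataFrame into a local SQLite database. The database file name and table name are arbitrary choices, and the tiny placeholder DataFrame only stands in for the one already built in scraping():

import sqlite3
import pandas

# Placeholder frame; in views.py you would reuse the df built from the scraped items.
df = pandas.DataFrame([{"product_name": "example", "price": "9.99"}])

# Write the results into a local SQLite database (file and table names are arbitrary).
with sqlite3.connect("result.db") as conn:
    df.to_sql("products", conn, if_exists="replace", index=False)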
The second Flask file will be the core of the app. We name it app.py:
from flask import Flask
from views import bp

def create_app():
    app = Flask(__name__)
    app.register_blueprint(bp)
    return app

if __name__ == "__main__":
    app = create_app()
    app.run(debug=True)
Now run your app on your computer by typing "flask run" in the terminal.
In summary
Our goal was to use web scraping to collect data for our specific requirements. The key is to identify the most productive and efficient scraping technique, library, and tools. The most popular HTTP library is Requests, but HTTPX offers a nearly identical API with extra features such as HTTP/2 support and async requests, and it is generally faster.
You must be aware of the structure of each website if you wish to scrape more than one. The layout and structure of every website vary, which has an impact on the data extraction process. To find the precise elements that hold the desired data, you must examine the HTML structure of the websites you are targeting.
Making a web application is advantageous. Your options for a tech stack or framework are not restricted when creating a custom scraping web application. You can easily and quickly create your web application with Flask. The primary factor in our decision to use Flask as our web development framework was its simplicity.
Any queries?
To discuss your intricate product research or to request a project source, please leave a comment below. I would be happy to assist.