How to Scrape Images From the Web in Python

A Python image scraper isn’t just a tool for sharpening your programming skills. You can also use it to source images for a machine learning project, or generate site thumbnails. While there may be other ways to do similar things, nothing can beat the control you have using tools you build yourself.

Learn how to scrape images from any website using Python and the BeautifulSoup library.

Like more generalized web scraping, image scraping is a method for downloading website content. It’s not illegal, but there are some rules and best practices you should follow. First, you should avoid scraping a website if it explicitly states that it does not want you to. You can find this out by looking for a /robots.txt file on the target site.
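If you'd rather check those rules in code than by eye, Python's standard library includes a robots.txt parser. The following is a minimal sketch, assuming a made-up robots.txt file and example.com addresses; the can_scrape helper is a hypothetical name used only for illustration.

```python
from urllib.robotparser import RobotFileParser


def can_scrape(robots_txt: str, page_url: str, user_agent: str = "*") -> bool:
    """Return True if the given robots.txt text permits fetching page_url."""
    parser = RobotFileParser()
    # parse() takes the file's lines, so no network access happens here.
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, page_url)


# Hypothetical robots.txt content, made up for this example.
SAMPLE_ROBOTS = """\
User-agent: *
Disallow: /private/
"""

print(can_scrape(SAMPLE_ROBOTS, "https://example.com/gallery/"))           # True
print(can_scrape(SAMPLE_ROBOTS, "https://example.com/private/photo.png"))  # False
```

For a live site, the same parser can load the file itself through its set_url() and read() methods before you call can_fetch().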

Most websites allow web crawling because they want search engines to index their content. You can scrape such websites since their images are publicly available.

However, just because you may download an image, that doesn’t mean you can use it as if it were your own. Most websites license their images to prevent you from republishing them or reusing them in other ways. Always assume that you cannot reuse images unless there is a specific exemption.

Python Package Set Up

You’ll need to install a few packages before you begin. If you don’t have Python installed on your computer, visit the official python.org website to download and install the latest version.

Next, open your terminal to your project folder and activate a Python virtual environment to isolate your dependencies.

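Finally, install the requests and BeautifulSoup packages using pip. BeautifulSoup is published on PyPI under the name beautifulsoup4, so the command is pip install requests beautifulsoup4.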

Image Scraping With Python

For this image scraping tutorial, you’ll use the requests library to fetch a web page containing the target images. You’ll then pass the response from that website into BeautifulSoup to grab all image link addresses from img tags. Finally, you’ll write each image file into a folder to download the images.

How to Fetch Image URLs With Python’s BeautifulSoup

Now go ahead and create a Python file in your project root folder. Ensure that you append the .py extension to the filename.

Each code snippet in this tutorial continues from the previous one.

Open the Python file with any good code editor and use the following code to request a web page:
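A minimal sketch of this request step might look like the following. The URL is a placeholder, and the fetch_page helper and its timeout value are assumptions made for this example rather than anything the requests library requires.

```python
import requests

# Placeholder address; swap in the page you actually want to scrape.
URL = "https://example.com/"


def fetch_page(url: str, timeout: float = 10.0) -> requests.Response:
    """Request a web page and return the response object."""
    # A timeout keeps the script from hanging on an unresponsive server.
    return requests.get(url, timeout=timeout)


# Uncomment to send the request and print the response code:
# print(fetch_page(URL).status_code)
```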

If the above program outputs a 200 response code, the request was successful. Otherwise, you might want to ensure that your network connection is stable. Also, ensure that you’ve supplied a valid URL.

Now use BeautifulSoup to read the content of the web page with the aid of the html.parser parser:
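As a rough sketch of the parsing step, the snippet below feeds a small made-up HTML string to BeautifulSoup so it can run without a network connection. In your own script, you would pass the text of the response you fetched with requests instead.

```python
from bs4 import BeautifulSoup

# Stand-in markup for this sketch; in the tutorial this would be the
# text of the response fetched with requests.
SAMPLE_HTML = """
<html><body>
  <img src="/images/cat.png" alt="A cat">
  <img src="https://cdn.example.com/dog.jpg" alt="A dog">
  <p>No image here.</p>
</body></html>
"""

soup = BeautifulSoup(SAMPLE_HTML, "html.parser")

# find_all returns a list-like collection of Tag objects, one per <img>.
images = soup.find_all("img")
print(images)
```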

This code creates a list of objects, each representing an image from the web page. However, what you need from this data is the text of each image’s src attribute.

To extract the source from each img tag:
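One way to do this is with a list comprehension over the img tags, as in the sketch below. The sample markup is made up for illustration, and skipping tags that have no src attribute is an extra precaution this example adds.

```python
from bs4 import BeautifulSoup

# Made-up markup, including one <img> that has no src attribute.
SAMPLE_HTML = """
<html><body>
  <img src="/images/cat.png" alt="A cat">
  <img alt="Placeholder with no source">
  <img src="https://cdn.example.com/dog.jpg" alt="A dog">
</body></html>
"""

soup = BeautifulSoup(SAMPLE_HTML, "html.parser")

# Tag.get returns None when the attribute is missing, so the condition
# filters out tags that have nothing to download.
imageSources = [
    img.get("src") for img in soup.find_all("img") if img.get("src")
]
print(imageSources)
```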

Rerun your code, and the image addresses should now appear in a new list (imageSources). You’ve successfully extracted each image source from the target web page.

How to Save the Images With Python

First, create a download destination folder in your project root directory and name it images.

For Python to successfully download the images, their paths need to be full absolute URLs. In other words, they need to include the “http://” or “https://” prefix, plus the full domain of the website. If the web page references its images using relative URLs, you’ll need to convert them into absolute URLs.

In the easy case, when the URL is absolute, initiating the download is just a case of requesting each image from the earlier extracted sources:
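A download loop along these lines should do the job. This is a sketch under a few assumptions: the download_images helper is a hypothetical name, the timeout and the status check are additions for robustness, and every URL passed in is already absolute.

```python
import os

import requests


def download_images(image_urls, folder="images"):
    """Download each absolute image URL into folder; return the saved paths."""
    os.makedirs(folder, exist_ok=True)
    saved = []
    for image in image_urls:
        response = requests.get(image, timeout=10)
        if response.status_code != 200:
            continue  # skip images the server refuses to serve
        # Name the file after the last path segment of the URL.
        path = os.path.join(folder, image.split("/")[-1])
        with open(path, "wb") as file:
            file.write(response.content)
        saved.append(path)
    return saved


# Example call (performs real network requests):
# download_images(["https://example.com/images/cat.png"])
```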

The image.split('/')[-1] expression splits the image link at every forward slash (/) and takes the last element, which is the image file name (including any extension).

Bear in mind that, in rare cases, image filenames might clash, resulting in download overwrites. Feel free to explore solutions to this problem as an extension to this example.
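As one possible starting point, you could append a counter to the filename whenever a file with that name already exists. The unique_path helper below is a hypothetical sketch of that idea, not the only way to handle clashes.

```python
from pathlib import Path


def unique_path(folder: Path, filename: str) -> Path:
    """Return a path in folder that does not collide with an existing file."""
    candidate = folder / filename
    stem, suffix = candidate.stem, candidate.suffix
    counter = 1
    # Keep trying cat_1.png, cat_2.png, ... until the name is free.
    while candidate.exists():
        candidate = folder / f"{stem}_{counter}{suffix}"
        counter += 1
    return candidate
```

You would call it just before opening the file for writing, for example unique_path(Path("images"), "cat.png").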

Resolving relative URLs can get quite complicated, with lots of edge cases to cover. Fortunately, there’s a useful function in the requests.compat module called urljoin. It returns a full URL, given a base URL and a URL which may be relative. It allows you to resolve values you’ll find in href and src attributes.
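To see how urljoin treats the common cases, here is a short sketch. The page address and the src values are made up for illustration.

```python
from requests.compat import urljoin

# Hypothetical page address used as the base for resolution.
PAGE_URL = "https://example.com/gallery/page.html"

# Values of the kind you might find in src attributes.
sources = [
    "images/cat.png",                   # relative to the page's folder
    "/static/dog.jpg",                  # relative to the site root
    "//cdn.example.org/bird.webp",      # protocol-relative
    "https://cdn.example.org/fox.gif",  # already absolute
]

absolute_urls = [urljoin(PAGE_URL, src) for src in sources]
for url in absolute_urls:
    print(url)
# https://example.com/gallery/images/cat.png
# https://example.com/static/dog.jpg
# https://cdn.example.org/bird.webp
# https://cdn.example.org/fox.gif
```

A URL that is already absolute passes through unchanged, so you can safely run every source through urljoin.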

The final code looks like this:
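Putting the steps together, a complete script might look like the sketch below. The URL is a placeholder, and the function names, the timeout, and the check that skips non-HTTP sources (such as inline data: URIs) are assumptions made for this example.

```python
import os

import requests
from bs4 import BeautifulSoup
from requests.compat import urljoin

# Placeholder page address; replace it with your target page.
URL = "https://example.com/gallery/"


def get_image_sources(page_url, html):
    """Return an absolute URL for every <img> in html that has a src."""
    soup = BeautifulSoup(html, "html.parser")
    sources = []
    for img in soup.find_all("img"):
        src = img.get("src")
        if src:
            # Resolve relative paths against the address of the page.
            sources.append(urljoin(page_url, src))
    return sources


def scrape_images(page_url, folder="images"):
    """Download every image referenced by page_url into folder."""
    page = requests.get(page_url, timeout=10)
    page.raise_for_status()
    os.makedirs(folder, exist_ok=True)
    saved = []
    for image in get_image_sources(page_url, page.text):
        # Skip sources requests cannot fetch, such as inline data: URIs.
        if not image.startswith(("http://", "https://")):
            continue
        response = requests.get(image, timeout=10)
        if response.status_code != 200:
            continue
        # Name the file after the last path segment of the URL.
        path = os.path.join(folder, image.split("/")[-1])
        with open(path, "wb") as file:
            file.write(response.content)
        saved.append(path)
    return saved


# Example call (performs real network requests):
# scrape_images(URL)
```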

Never Go Short of Image Data

Many image recognition projects hit a brick wall because there are too few images to train a model. But you can always scrape images from websites to boost your data repository. And thankfully, Python is a powerful tool for image scraping that you can use continuously without the fear of getting priced out.

If you’re interested in fetching other types of data from the web, you might want to find out how to use Python for general web scraping.
