Intro to Web Scraping with Python and Beautiful Soup


Hello, ladies and gentlemen of the Internet.
My name is Phuc Duong, Senior Data Engineer at Data Science Dojo, and I'm here to teach
you how to web scrape with Python. So in front of you is the storefront of a website
called Steam. Steam sells video games, and the cool thing about Steam is that they
do flash sales every day, so the user has to come back every day and study this page:
what is a good deal, what is not a good deal? It's a lot of information. This is how
they've gamified shopping online. Now, there's a website that actually scrapes
Steam's front page in real time, shows you the best deals, and ranks them. OK. So
a lot of people ask me: how do I get all of my data? In the absence of APIs,
web scraping is actually a very important tool for data scientists
and data engineers to know, because the entire Internet becomes your database. I can scrape any storefront: Nordstrom,
Macy’s. Study the sales. Web scrape reviews. I can web scrape baseball stats, baseball
players in real time. Wikipedia is also a good place to web scrape. For example, you
can see that this frame over here of this Harry Potter character, Ron Weasley, it’s
very standardized. I could write a web scrape script and then loop over every single Harry
Potter character, very quickly, and create a data set. All right. Today we’re going to learn how
to do that. So, today I'm on Windows. If you're on Linux you can install Python normally,
but if you're on Windows I highly recommend installing Anaconda instead. So
if you go to Google and just type in Anaconda, it should be at continuum.io, and you should
just download it based upon your operating system. OK. The next thing I'll be using is a
text editor called Sublime Text. So you can just go ahead and go to Google and type in
Sublime Text and then install that. I like using Sublime Text 3. OK, that's
where you get those things. One warning: if you're using Anaconda, it's actually a pretty big
download, like 500 megabytes. All right. So what I'm going to do is, I'm
going to go ahead and open up my command line. And for those of you who don’t know, if you
go to a folder, any folder, and then just hold down the Shift button and right click
and say Open Command Window Here, this opens up the command line for you. And this is where
you can work with Python. So if you type in python right here, and you've installed
either Python or Anaconda, this will show up. Notice that I'm using Python
3.5 with Anaconda. And if I do a very quick two plus two, it should equal 4; that's
how I know I'm inside of my console. Next thing: now I know
that if I hold down Control and hit C (Control plus C, basically what does a copy on Windows),
it will exit this console and get me back to the Windows command line. So what I'm going to do now is, I'm going
to go ahead and install a package called Beautiful Soup. That's the package we're going
to use to web scrape. It's a very powerful package, and I encourage those of you
who want to go beyond this introduction to learn it in depth. All
you've got to do is run pip install bs4; bs4 stands for Beautiful Soup 4. So here we are: Beautiful Soup has been
installed. And how do I know it's been installed? Well, if I type in python and
then import bs4, it should just not error. Awesome. That's how I know
the package is installed and ready to go.
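From the command line, the whole check looks roughly like this (assuming pip and python are on your PATH):

    pip install bs4

    python
    >>> import bs4    # no error means Beautiful Soup is installed
    >>> 2 + 2
    4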
The next thing I need is a web client. Beautiful Soup is a good way to parse HTML text;
that's all it is, a way to traverse HTML text within Python. I still need a web client
to actually grab something from the Internet, and the way you do that in Python is with a
package called urllib. Inside of urllib there is a module called request, and inside of that
module is a function called urlopen. I know it's a lot to take in, but settle down, we're
going to do it step by step.
step by step. I’m going to do a really quick import all-in-one
line kind of step. All right. So I can do from URL lib dot request. So I’m calling a
package called URL lib. If you’re on Python 2, this is a different package. It’s called
URL lib 2. So I’m a calling module within that. So notice, I’m importing only what I
need. I don’t need all of URL lib. I just need the request module. And I’m going to
import out of that. OK, URL open, the one, basically, function
that I need. And it’s going to import all the basic dependencies, as well. And I’m going
to give it a name because I don’t want to type in URL open every time. I want to say
U request, uReq for short. That’s how I tend to do things. And also, I can also modularize
the import of Beautiful Soup, as well. So I can do from BS4 import. And this is important,
capital B Beautiful, and then capital S for soup. And then I’m going to just call it as
soup. So I don’t have to call out Beautiful Soup again every time I want to use this package. And this is me working in the console. This
All of this so far is me working in the console, just playing around. So if you want to,
you can start typing it into a script. In this case, I have Sublime open; I'll do a
Control Shift P to open up the command palette and say Set Syntax: Python. Beautiful.
Now I can use the same commands in here: if I select a line in the command line and
hit Enter, that copies it, so I can paste it into my script. So there you have it, the first
two lines of this. So now I'm ready to go: Beautiful Soup
is going to parse the HTML text, and urllib is actually going to grab the page
itself. But what do we want to web scrape? Well, I like graphics cards, so I'm going to
web scrape graphics cards off newegg.com. Some of you might know it; it's basically
Amazon, but for hardware and electronics. So I'm going to type in, for example,
"graphics cards." These are a bunch of graphics cards that have shown up in my search, and it
would be nice to tabularize them and turn them into a data set. And notice that
if a new graphics card is introduced tomorrow, or if ratings change tomorrow, or prices change
tomorrow, I run the script again and it updates whatever I load it into:
a database, a CSV file, an Excel file, it doesn't matter. So in this case, I'm going to grab this URL.
OK. That’s all I’m going to do. So basically I’m going to copy this URL, and I’ll pasted
into my script. So, in this case, I can do my URL is equal to– so that is the URL I
want to use of this. And in this case, I will actually run it in
my console. So when I’m web scraping, I like to also prototype it into the command line,
as well, so I know that the script is going to work. And then once I know that it works,
I will go ahead and paste that back into my Sublime. OK so this is my URL. So I’ve gone
ahead and called a variable and placed a string of the URL into it. Now this is going to be
good. So now I will actually open up my web client. So in this case, I would do U request,
right? So notice I’m calling you URL lib, and I’m
calling it from the shorthand variable that I called it earlier. So notice I called from
URL lib dot request import URL open as U request. So I’m actually calling the function called
URL open right now, inside of a module called request, inside of a package called URL lib. So the next thing is, I’m going to throw my
URL into this thing. What this is going to do is open up a connection, grab the web page,
and basically just download it. So it's a client: uClient = uReq(my_url). It can take
a while depending on your Internet connection, because it's actually downloading the web
page. OK, it's done. Now, to get the contents, I can do a read: uClient.read(). But read
dumps everything out of the client right away, and I can't reuse it, so before it gets
dumped I want to store it in a variable. Since this is the raw HTML, I'll just call it
page_html: page_html = uClient.read(). I could show it to you here, but depending on how
big the HTML file is, it might crash the console, so I'll show it to you once it's inside
of Beautiful Soup. Bear with me. And as with any web client, since this is an open Internet
connection, I want to close it when I'm done: uClient.close(). Knowing that all of these
lines of code have worked so far, I can copy them into my script and add some documentation:
opening up the connection and grabbing the page, offloading the content into a variable,
closing the client.
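In the script, that block looks like this (the URL is the Newegg search from the video; any product-search URL will do):

    my_url = 'https://www.newegg.com/Video-Cards-Video-Devices/Category/ID-38?Tpk=graphics%20cards'

    # opening up connection, grabbing the page
    uClient = uReq(my_url)
    page_html = uClient.read()
    uClient.close()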
The next thing I need to do is parse the HTML, because right now it's one big jumble
of text. So I call the soup function I aliased earlier (remember, from bs4 import
BeautifulSoup as soup); calling soup as a function calls the BeautifulSoup function
within the bs4 package. In this case, I do soup of my page_html, and then, after a
comma, I have to tell it how to parse, because it could be an XML file; here I tell it
to parse as HTML. And I need to store the result in a variable or else it gets lost,
so I'll call it page_soup. I know it's kind of weird to call it a soup, but it's standard
notation: when you say soup, people understand that this is the data type, derived from
the Beautiful Soup package. All right. So this does my HTML parsing.
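As a line of code:

    # html parsing
    page_soup = soup(page_html, "html.parser")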
OK. So now, if I go to page_soup and just look at the h1 tag, page_soup.h1, I should
see the header of the page. It does say "Video Cards & Video Devices," and notice that
it grabbed this header right here. Just for good measure, let's see what else is in
there; maybe there's a p tag I can look at. page_soup.p gives "newegg.com, a great
place to buy computers." I think that might be at the very bottom. Actually, no, it
might be something hidden, maybe just a tagline. All right. But I am on this page. So now what
we need to do is traverse the HTML. Basically, I'm going to convert every graphics card
I see into a line item in a CSV file. And now that I have a Beautiful Soup data type,
I can actually traverse the DOM elements of this HTML page. Let me show you how to do
that real quickly. If I inspect an element of this page, say the body tag, I can do
page_soup.body, and then I can keep going, dot by dot: this body tag goes even further,
into an a tag or a span tag. So if I type body.span, I should find this span tag.
See that? A span with class "noCSS skipTo." That's awesome.
Next, let me make this HTML a little bit bigger so you can see it. I'm in Chrome; you
can also use Firefox's Firebug to inspect the HTML elements of a page. I'm going to
select the name of this graphics card right here and inspect that element. It jumps me
directly to this a tag. But I want to grab the entire container the graphics card is
in, because I know that container holds other goodies, such as the original price,
the sale price, the make, the review type, and the card image itself. So I go out. Since HTML is a nested
kind of tagging language, I can go out until I find what it is that is containing all of
this. So notice that this div right here, with the class item-container, houses all of
the item's contents. So basically I need a loop: I'll write my script first on how to
parse one graphics card, and once I'm done with that, I can loop through all of the
containers and parse every single graphics card into my data file. So I want to grab
everything that has this class, and I want to do that right now. On page_soup there is
a function called findAll (that's find all with a capital A). What do I want to find?
All divs that have the class item-container. So I say: find me all divs, comma, and then
I feed it an object saying what attribute I'm looking for. It's a class here; if it were
an ID, I would put ID. Then I paste in the class name, item-container. I'll feed the
result into a variable called containers, named after the class, and copy this into my
script as well, with the comment "grabs each product." Notice that even though I'm
writing this for graphics cards, I'm betting that Newegg has standardized its HTML
enough that this can parse any product page on Newegg.
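That call looks like this:

    # grabs each product: every div with class "item-container"
    containers = page_soup.findAll("div", {"class": "item-container"})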
Now let's check the length of containers to see how many things it found. len(containers)
comes back as 12, so it found 12 graphics cards on this page. OK, so let's look at the
first one. If I
go to containers[0], I should see the HTML for this thing. I'm actually going to copy
this out into my text file and read it there, because sometimes when you load a page
there is post-load rendering done via JavaScript, so some things will show up and some
things won't. Just to be sure, I paste it into Sublime: Control N for a new file,
paste it in. But notice it's not very pretty. I'll set the syntax to HTML; OK, it's
HTML now, but still not pretty. So I want to use an external service called JS
Beautifier: you paste in ugly code, and it adds the spacing where there needs to be
spacing. See that? Everything is now nicely spaced and delimited. Here we are. Now let's read what's actually
in this thing. So if I open this up now, I know it’s going to be a little bit hard to
read. What kind of things do we want out of this? Going through, we can see some pretty
useful things: the items have ratings, and each has a product name, which we want to
grab for sure. There's also its brand. Notice that they put the name of the brand in
the image: the image itself says EVGA, but that's an image. I can grab the image; I
just can't parse what it says unless I use image recognition. But the title attribute
of the image encodes the brand for us, which is very convenient. So that's something
we want to grab. I also want to be sure to grab
things that are true of everything; if not, I'm going to run into corner cases and need
if-else statements. Notice that this item right here is special: it doesn't have any
egg reviews. So if I wrote something to parse reviews, I'd need an if-else statement,
or a try/except that catches an index-out-of-range error. And notice it doesn't even
have this number, which I think is the number of reviews.
it’s the number of reviews here. So I’ll let you guys go ahead and handle the
scraping of that, but I’m going to scrape things that are present in all of them. Notice
that I’m going to scrape the names. All of them seem to have the names of the brand or
the names of the product. And then I’m going to go ahead and scrape the product itself.
And not all of them have a price. You see that? I have to add it to the cart to see
the price. And let’s see what else is good. And they
all seem to have shipping. So I’m going to grab shipping to see how much they all cost.
So once you learn how to scrape one, it’s the same really for all of it. Now if you
want to loop through all of it, you have to do those if else statements to catch all the
loose cases that aren’t there. So notice that if I do a container right now,
a container of zero a container of zero– going to throw container 0 into just a variable
called container. Later I’m going to do a for loop that says for every container in
containers. Right so right now I’m prototyping the loop before I want to build the loop.
So I want to make sure it works once before I even build the loop. So this container contains a single graphics
card in it. I will call it container instead of contain. So container dot, dot what? Let’s
see what is in here. Notice that container dot A will bring me this thing back. So if
I do container dot A, this brings me back exactly what I thought it would. It would
bring me the item image. So the item image, not that useful to us. Let’s see if there’s anything that we can
redeem in here. We might be able to redeem the title here, but it seems we can also
grab it down below, which I think is the more efficient way; that's also what the
customer sees when they visit the page, so let's get it from there instead. So instead
of doing .a, we will do .div: we jump from this a directly into this div. I'll push up
and say container.div, and that jumps me into this div right here and everything inside
of it. Boom. I'll assume this is the right one. I know web scraping HTML tends to be
hard because it hurts your eyes, unless you know how to read HTML very well, but it's
something you just get used to. So I know that I'm in this div, and I want to go into
another div, called item-branding: container.div.div. Inside of that div there is an
a tag, and this a tag contains something we want: the make of this graphics card. So,
container.div.div.a, and there we have it. Here's the href of the link; what I'm
grabbing is this EVGA element right here. Notice when I hover, it's a clickable link.
But what I really want is the title of this link. So I grab the img tag inside: notice
I'm just chaining these handles, referencing them as if it were a JSON object, and now
I'm inside of the image. The title is an attribute inside the img tag, and you grab an
attribute by referencing it like a dictionary key: img["title"], which comes back as 'EVGA'.
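Prototyped in the console (using the page structure as it was at recording time; see the comments below for how Newegg's markup has since changed):

    container = containers[0]

    # the brand lives in the title attribute of the branding logo image
    brand = container.div.div.a.img["title"]    # 'EVGA'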
So now that I have prototyped it, I can add that to my script: copy this right here and
paste it in. Inside the script is where I write the loop I planned earlier: for
container in containers. It's going to loop through, and container.div.div.a.img["title"]
gets assigned to brand, the make. So who makes this graphics card? That's the first
thing it's going to do. So what else do I want to grab while I'm inside of this
thing? So let’s grab two more things. All right. Just grab two more things just to have
a really good file, because a CSV file with one column seems a little tiny bit pointless. All right the next thing I want to do is,
I want to go ahead and grab the name of this graphics card, which is right here. Notice
that it’s embedded within this A tag, and this A tag is embedded within this div tag.
And this div tag is embedded within this div tag. In theory, container.div.div.a should
reach it, but it actually seems to bring out the item-branding a tag instead, which is
not the a we wanted. So dot traversal is having trouble finding this particular a tag.
What I can do instead is a findAll, searching for the exact class I want: find me all
the a tags that have the class item-title. In this case: container.findAll of the a
tag, comma, and then an object that says look for the class item-title.
start with item title. So this will give me a data structure back
that has everything that it found. So hopefully should only be one thing so that we don’t
have to loop over it. So in this case, container equals that which would be title underscore
container. If I look at the title underscore container, I should have what I’m looking
for. Beautiful. So the name of the graphics card is somewhere
in this thing. I’m going to put this and I’m going to throw it into my script so I can
run it later. So going back– So the title container, notice this isn’t the actual title
yet. I still have to extract the title out of this thing. So in my title container — notice
that it’s inside of the bracket bracket, which means it’s inside of an array, or in this
case it’s a list if you’re in Python. So in this case, if I go to zero, I want to
grab the first object. And inside of that first object, I want to grab, nope it’s not
inside of the I tag, it’s actually a text inside of the A tag. So if I do dot text,
this should get me what I want. Yes. So I do title dot of zero dot text, and that gives
me exactly what I want. So I’m going to place that in there, and I
want to call this the title, so the product name. So product name is equal to title container
dot text. So that is that. So I’ve got the brand, the make of the graphics
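The two lines for the name, then, are:

    # dot traversal would hit the branding <a> first, so search by class instead
    title_container = container.findAll("a", {"class": "item-title"})
    product_name = title_container[0].text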
So I've got the brand, the make of the graphics card, and the name of the graphics card.
Now we can grab shipping, because shipping seems like something else they might all
have. So what we're going to do is figure out where
this shipping tag sits inside it all. How much does shipping cost? I think some of them
cost differently; yes, this one is $4.99 shipping. So I need to find all li tags (li
stands for list item) with the class price-ship. I'll copy that class name and do
container.findAll of li, comma, class equals price-ship. This gives me, hopefully, a
shipping_container, and hopefully there's only one tag in this thing that has shipping
in it. I close the function, and then look at my shipping_container. You will see that it gives me back an array
of things that qualify. In this case, only one thing came back, so I can do the same
thing I did earlier: reference the first element, and the value is in .text again.
But this comes back with a lot of open space; notice there's a return and then a
newline on each side. So I want to clean it up, because I just want the text. For that
I use strip(), which removes whitespace, newlines, all that good stuff, from before
and after the text. Now it just says "Free Shipping." So I can grab this and throw it
into my script as well.
So now I've grabbed three things. In the script I also need the findAll calls I did
earlier, so I scroll up a few times, place the shipping container in, and close the
findAll function. There we go: now there are the three things that I want, the product
name, the brand, and the shipping. OK, so cool. So now this is ready to be looped
through. But before that, I want to print it out, which is a chance to show you why
Sublime is my favorite editor: it does multi-line editing. So I enter three blank
lines, copy my three variables, paste them in, and make it nice and formatted. I will
print all of these things out to the console, just so I can see. For each one I write
print, quote a label, and then use a plus for string concatenation. It's going to
print each of these three things out for me: the brand, the product name, and the shipping.
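The prototype loop, with the prints, comes out to:

    for container in containers:
        brand = container.div.div.a.img["title"]

        title_container = container.findAll("a", {"class": "item-title"})
        product_name = title_container[0].text

        shipping_container = container.findAll("li", {"class": "price-ship"})
        shipping = shipping_container[0].text.strip()

        print("brand: " + brand)
        print("product_name: " + product_name)
        print("shipping: " + shipping)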
Basically, before I throw this into a CSV file, I want to make sure this loop works.
So I save the script; I'll call it my_first_web_scrape.py. If I open the folder, the
file is there, and if I right-click and open up another console (notice I have a
console from before, but that one is running Python), I want to open up a fresh one. Notice that I'm
inside of the file path that contains this script. So I type python, to tell it to run
Python, and then the script to execute: python my_first_web_scrape.py. Hit Enter and,
look at that, it went through. It did the loop, and it grabbed every graphics card for
me. So all I have to do now is throw this into
a CSV file, which I can then open in Excel. So let's do that real quick and finish up
our code; I don't really need the console prototype for this part, because I know the
script works now. To open a file, you just use the built-in open. I need a file name,
so filename = "products.csv", and I need to pass a mode, in this case "w" for write,
because I want to open a new file and write into it. I'll call the result f, the
normal convention for a file writer. And I want to write the headers to this thing.
A CSV file usually has headers, so in this case headers will be "brand", then
"product_name" (I call it product_name rather than name, because if you load this into
a SQL database later, name is a keyword in SQL), and then "shipping". And I also need
to add a newline at the end, because CSV rows are delimited by newlines.
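The file setup, then:

    filename = "products.csv"
    f = open(filename, "w")    # "w" = write mode: create (or overwrite) the file

    headers = "brand,product_name,shipping\n"
    f.write(headers)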
So I'm telling it to write the first line as a header. The next thing is, every time it
loops through, I want it to write to the file. So on top of printing to the console,
which I'll let it keep doing, I do f.write of these three things: brand, product name,
shipping. I paste them in there, but I need to concatenate them together with a comma
in the middle. And let me double-check something real quick, to see if my strings are
clean. No, they are not: notice that the product names have commas inside of them,
which is going to create extra columns inside my CSV file. So before I write the
product names out, I
actually need to do a string replace. I call the replace function so that every time
you see a comma, you replace it with something else. I like to use a pipe, but you can
delimit with anything you want; this is programming, you can do whatever you want as
long as it doesn't error. And also, don't forget that each row needs to end with a newline.
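So the write line inside the loop becomes:

    # pipes stand in for commas inside product names so the columns stay aligned
    f.write(brand + "," + product_name.replace(",", "|") + "," + shipping + "\n")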
So every time it loops through, it grabs and parses all of the data points, and then
writes them to the file as a line. And once it's done looping, I have to close the
file with f.close(), because if you don't close the file, you can't open it; only one
thing can open the file at a time.
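Putting it all together, the finished script looks roughly like this (the URL is the Newegg search from the video; Newegg's markup has changed since recording, so the class names may need updating):

    from urllib.request import urlopen as uReq
    from bs4 import BeautifulSoup as soup

    my_url = 'https://www.newegg.com/Video-Cards-Video-Devices/Category/ID-38?Tpk=graphics%20cards'

    # opening up connection, grabbing the page
    uClient = uReq(my_url)
    page_html = uClient.read()
    uClient.close()

    # html parsing
    page_soup = soup(page_html, "html.parser")

    # grabs each product
    containers = page_soup.findAll("div", {"class": "item-container"})

    filename = "products.csv"
    f = open(filename, "w")

    headers = "brand,product_name,shipping\n"
    f.write(headers)

    for container in containers:
        brand = container.div.div.a.img["title"]

        title_container = container.findAll("a", {"class": "item-title"})
        product_name = title_container[0].text

        shipping_container = container.findAll("li", {"class": "price-ship"})
        shipping = shipping_container[0].text.strip()

        print("brand: " + brand)
        print("product_name: " + product_name)
        print("shipping: " + shipping)

        f.write(brand + "," + product_name.replace(",", "|") + "," + shipping + "\n")

    f.close()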
All right, so I will run the script again; notice that if I just push up, it reruns the
command. You have to save the script first, so I do a quick Control S. And... syntax
error! I forgot the plus before the "\n"; the newline needs to be concatenated on. I
add the plus, save, and run python my_first_web_scrape.py again. It went through.
So after running that script, it has scraped everything and printed everything to the
console. But more importantly, it wrote everything to the CSV file I told it to. If I
open that file up right now, you can see that it has scraped the entire page and
thrown every product in as a row. So you can go ahead and scrape the other details,
like whether or not there is a sale price, or what the image tag might be. And there
are multiple pages: if you go to Amazon, for example, there are multiple pages of
products, so you can start looping through them. Usually up here in the URL there's a
page= parameter, so you can just do a loop and say: do page two instead of page one, and so on.
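A hypothetical sketch of that pagination loop, assuming the site exposes the page number in the query string (the parameter name varies by site):

    # loop over result pages by rewriting the query string
    for page in range(1, 4):
        page_url = my_url + "&page=" + str(page)
        uClient = uReq(page_url)
        page_soup = soup(uClient.read(), "html.parser")
        uClient.close()
        containers = page_soup.findAll("div", {"class": "item-container"})
        # ...then parse each container exactly as before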
And that concludes today's lesson on how to web scrape with Python. I hope you learned
a lot and had fun doing it. Now, I really want to know from you: did you enjoy this
kind of video? Do you want more coding videos? More data science videos? And if
there's a better way to code something, let me know; I'm always happy to hear from
you. What do you enjoy? I want to make this content for you.
All right, I'll see you later, and happy coding.

Danny Hutson

100 thoughts on "Intro to Web Scraping with Python and Beautiful Soup"

  1. MINOR SUGGESTION
    As of 10/03/2019, if you are following along with this tutorial: "container.div" won't give you the div with the "item-info" class. Instead it will give you the div with the "item-badges" class, because the latter now occurs before the former, and accessing a tag with the dot (.) operator just returns the first instance of that tag. I had a problem following along until I figured this out. To solve it, use the "find()" method to find exactly the div that contains the information you want, e.g. divWithInfo = containers[0].find("div", "item-info")

  2. I have a question: how do I web scrape with a lot of pagination? Please give me a solution. Thanks.

  3. I've been watching your video and trying to replicate it, but the problem is that the web page has a different structure now. There is a <div> inside that first <a>, so when I type container.div it does not go to the <div> below that <a> but rather shows the <div> inside that first <a>. How can I tell Python to show me the <div> outside that first <a>? Thanks!

  4. Hello everybody, why do some pages give a 404 error like this?

     uClient = uReq(my_url)

     Traceback (most recent call last):
       File "<stdin>", line 2, in <module>
       File "C:\ProgramData\Anaconda3\lib\urllib\request.py", line 222, in urlopen
         return opener.open(url, data, timeout)
       File "C:\ProgramData\Anaconda3\lib\urllib\request.py", line 531, in open
         response = meth(req, response)
       File "C:\ProgramData\Anaconda3\lib\urllib\request.py", line 641, in http_response
         'http', request, response, code, msg, hdrs)
       File "C:\ProgramData\Anaconda3\lib\urllib\request.py", line 569, in error
         return self._call_chain(*args)
       File "C:\ProgramData\Anaconda3\lib\urllib\request.py", line 503, in _call_chain
         result = func(*args)
       File "C:\ProgramData\Anaconda3\lib\urllib\request.py", line 649, in http_error_default
         raise HTTPError(req.full_url, code, msg, hdrs, fp)
     urllib.error.HTTPError: HTTP Error 404: Not Found

  5. This is very helpful. You make this technical information easy to understand. Thank you very much.

  6. Excellent. The video is really very amusing, and most importantly the way you code is excellent.
    I just feel that coding is very joyous if we can code the way you do.
    Really liked it, and thank you a lot.

  7. I wrote this script using Python/BeautifulSoup. Scrape any cryptocurrency's historical data by simply putting in the name. 😀
    https://github.com/gitFaisal/crypto_currency_scraper

  8. Great tutorial. Can you please help me with how to get the data (corn and oil relative strengths) from this website: https://finviz.com/futures.ashx?
    I greatly appreciate it.

  9. Does the command line have to be in a particular folder? Or must it be in the same directory as the IDE and text interpreters?

  10. You are a great teacher. I was not much into data science, but after watching your video it seems simple and easy. Thank you, I took 100% from it. 🙂 Requesting more videos from you.

  11. I am just starting web scraping and I can honestly say that this video clearly explained everything. I watched it at 1.5 speed and it made sense. I would love more videos like this. I loved how you made it generic so it can apply to more than one website!

  12. This is also helpful if you want written code to follow along with when scraping: https://websitescrapingtutorials.wordpress.com/2019/06/23/how-to-scrape-yelp/

  13. Damn, your tutorial was very good. It helped me make a CSV of all the apartments that accept dogs in the area I want to move to. As a side note, I had a problem with the open file: if it gives you an error about encoding or about a char, add this to the "open()": open(filename, "w", encoding='utf-8'). You need the encoding part and it's all good.

  14. brand = container.div.div.a.img["title"]

    AttributeError: 'NoneType' object has no attribute 'a'

    Can anybody help me solve this error? Thanking you in anticipation.

  15. for container in containers:
    brand = container.div.a.img["title"]
    title_container = container.findAll("a", {"class:": "item-title"})
    product_name = title_container[0].text.strip

    when I run this I get an error saying "IndexError: list index out of range." Do you have any idea why this is happening?

  16. Excellent explanation compared to others. I hope that through this channel my concepts in data science will become clear shortly. This is also the only channel where I got good concepts on R.

  17. The syntax pagesoup.findALL is not working; it's giving me a NoneType error, but find_all is working. Could anyone explain why?

  18. SOURCE CODE:

     from urllib.request import urlopen as uReq
     from bs4 import BeautifulSoup as soup
     import requests

     my_url = 'https://www.newegg.com/Video-Cards-Video-Devices/Category/ID-38?Tpk=graphics%20cards'

     uClient = uReq(my_url)
     page_html = uClient.read()
     uClient.close()

     page_soup = soup(page_html, 'html.parser')
     containers = page_soup.findAll('div', {"class": "item-container"})

     csv_file_export = open('product_info2.csv', 'w')
     csv_file_export.write('brand, product_name, shipping_price\n')

     for container in containers:
         try:
             brand = container.find('div', 'item-info').find('div', 'item-branding').a.img["title"]
             title = container.find('a', class_ = 'item-title').text
             shipping_price = container.find('li', {'class': 'price-ship'}).text.strip()
         except Exception:
             brand = 'Brand not specified'
             title = 'Title not specified'
             shipping_price = 'Shipping price not specified'

         csv_file_export.write(brand + ',' + title.replace(',', '|') + ',' + shipping_price + '\n')

  19. A one-liner can replace the following code:

    uClient = uReq(my_url)

    page_html = uClient.read()

    uClient.close()

    instead:
    import requests
    page_html = requests.get(my_url).text

  20. These three lines of code are equivalent:

    shipping_price = container.find('li', {'class':'price-ship'}).text.strip()

    shipping_price = container.find('li', 'price-ship').text.strip()

    shipping_price = container.find('li', class_='price-ship').text.strip()

  21. Amazing stuff, man. Very well explained. Could you do a tutorial on parsing XML and text files in Python? That would be really helpful too, especially where we have to analyze log files, for instance.

  22. Can someone post their final script code, please? I got lost because the code in this video doesn't work anymore.

  23. IT WAS SO FUCKING FRUSTRATING. I did it in Sublime Text and Anaconda just like he did, and it didn't work. I tried it in Spyder and it worked!

  24. I take it this is an example. When I do it, len(containers) shows zero. Can you please give me the solution?

  25. Thank you for the video. I followed along but ran into issues when attempting this on another site. I found a set of containers I wanted to use, but when I ran len(containers) it returned 0. This was using findAll. I also tried page_soup.div.div.div to navigate down to the containers, with no such luck. I was wondering what I would need to do differently on this site: https://www.twitch.tv/speedgaming/videos?filter=archives&sort=time

  26. I've successfully been able to set up my web scrape, but my only problem is that every time I run it, it dumps the data into my Excel sheet a million times. As the code runs in my command prompt it shows that the data is collected only once, so I don't see why it's collected numerous times in the Excel sheet… Please help!

  27. So I'm using containers = page_soup.find_all("div", class_='modals-container'), and no matter which syntax I use, it always gives me zero. Any solution?

  28. Since you installed Anaconda, why don't you use one of the many IDEs available there? I guess Spyder is the most comfortable in the set…

  29. Below is also a great article on web scraping using Python & Beautiful Soup and what we can do with scraped data. Complete source code is also provided.
    https://www.opencodez.com/web-development/web-scraping-using-beautiful-soup-part-1.htm

  30. This was very good. I'm a beginner to Python, and this web scraping tutorial left me with very few questions.

  31. When I print containers[0], only a small part of the HTML is shown. All the contents of the item-info and item-branding classes are hidden. How can I solve this?

  32. Thank you! Your video helped me greatly on my way to learn webscraping. I analysed your tutorial in depth on my wiki: http://wiki.devliegendebrigade.nl/Webscraping#Voorbeeld_NewEgg. In the end, this is the code that I used:

    #! /usr/bin/python3
    #
    # Newegg webcrawling-example – Compact
    ###################################################################
    #
    # Load libraries
    ###################################################################
    #
    from bs4 import BeautifulSoup
    import requests

    # Fetch webpage
    ###################################################################
    #
    url = 'https://www.newegg.com/global/nl-en/p/pl?d=graphics+card'
    p_html = requests.get(url).text
    p_soup = BeautifulSoup(p_html, "html.parser")

    # Process webpage
    ###################################################################
    #
    cs = p_soup.findAll("div", {"class": "item-container"})

    i = 0
    for c in cs:
        i = i + 1
        print("")
        print(i)

        if (c.find(class_="item-brand") is not None):
            c_brand = c.find(class_="item-brand").img['alt']
            print("Brand: " + c_brand)

        if (c.find(class_="item-title") is not None):
            c_name = c.find(class_="item-title").text
            print("Name: " + c_name)

        if (c.find(class_="price-current") is not None):
            c_price = c.find(class_="price-current").strong.text
            print("Price: " + c_price)

  33. I need to scrape data from different websites, so how can I do that in this script?
    Also, is there any method to pass a text document to "my_url =", so I can put all the websites in that text document, pass them to the my_url variable, scrape those sites, and store the results in a CSV file?

  34. Well presented!
    [1] If the code could be downloaded, it would be wonderful.
    [2] Can Chinese characters be read as they appear on the screen? If not, what code should be added?

  35. Hi, this is really cool! Absolute legend 😀
    My request: I would love to see a tutorial on how to scrape hotel prices for London, for example. I would also like to know how to loop through dates in order to show seasonality patterns in the data.

  36. Table of Contents:
    0:00 – Introduction
    1:28 – Setting up Anaconda
    3:00 – Installing Beautiful Soup
    3:43 – Setting up urllib
    6:07 – Retrieving the Web Page
    10:47 – Evaluating Web Page
    11:27 – Converting Listings into Line Items
    16:13 – Using JS Beautifier
    16:31 – Reading Raw HTML for Items to Scrape
    18:34 – Building the Scraper
    22:11 – Using the "findAll" Function
    27:26 – Testing the Scraper
    29:07 – Creating the .csv File
    32:18 – End Result

  37. Is there a better way to find a specific item than Find / Find All? Searching text this way isn't efficient, especially if it's already been processed and extracted into an array or a list. Is there a direct way to get at each element, e.g. by the ["TagType"] or another indexed way? Also, you should show how to extract the price 🙂 Finally, there's no word "deliminated". Thanks!

  38. Excellent tutorial. Thank you so much.
    May I ask for help: when you are on page 1 of a product listing on Amazon and you want to scrape pages 2, 3, 4, etc., is there code to scrape those pages, or do you have to scrape them one by one? Thank you for your help!

  39. ashop.findAll("p", {"class":"adr"})

    [<p class="adr"><span class="street-address">9205 Skillman St Ste 134</span><span class="locality">Dallas, </span><span>TX</span> <span>75243</span></p>]

    >>>
    How do I grab the city, state, and zip from this p tag (Dallas, TX 75243) to place in separate columns?
    I was able to grab the street with your method using ashop.findAll("span", {"class":"street-address"}), but I also need the city, state, and zip from this p tag. Thanks.

  40. Great tutorial. I wish you'd used Jupyter Notebook instead of the command line; it would be easier to follow along.
