How to separate title from desc (scraping data IMDB-Coming Soon Movie)

scraping

#1

I have a problem importing data from IMDb and exporting it to Excel in the right way.
This is my code:

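library(rvest)
# write.xlsx() below needs the xlsx or openxlsx package (the post does not say which)

comingsoon = 'https://www.imdb.com/movies-coming-soon/?ref_=nv_mv_cs_4'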

webpage = read_html(comingsoon)
#RANK DATA
datacoming = html_nodes(webpage, '.nm-title-overview-widget-layout')
datacomingg = html_text(datacoming)
datacomingg = gsub("\n","",datacomingg)#remove\n
datacomingg = gsub(" ","",datacomingg)# remove space
datacomingg = gsub(",.*","",datacomingg) #remove ,.
datacomingg<-as.factor(datacomingg)
head(datacomingg)

data = data.frame(datacomingg)

write.xlsx(data,'D:/coming_soon.xlsx')

But the data exported to Excel does not look right.

Thank you for helping me.


#2

You can use rvest and tidyverse tools to create the table you want directly in R before exporting it to a file.
Use CSS selectors to get exactly the items you want from the page. Your browser's developer tools (F12) or SelectorGadget can help you.

Here is how you can create a table with the useful information

library(tidyverse)
# not in the core tidyverse
library(rvest)
#> Loading required package: xml2
#> 
#> Attaching package: 'rvest'
#> The following object is masked from 'package:purrr':
#> 
#>     pluck
#> The following object is masked from 'package:readr':
#> 
#>     guess_encoding

url = "https://www.imdb.com/movies-coming-soon/?ref_=nv_mv_cs_4"

# get list of film coming soon
coming_soon <- url %>%
  read_html() %>%
  html_nodes(".list_item")

# create a table to contain information: one line per film
coming_movies <- tibble::tibble(
  # get the title (unique)
  title = coming_soon %>% 
    html_node(".overview-top h4[itemprop='name'] a") %>% 
    html_text() %>%
    str_trim(),
  # get the genre (several per film)
  genre = coming_soon %>%
    # use purrr::map to get one list per film (otherwise html_nodes gets you a vector too big)
    map(~ html_nodes(.x, ".cert-runtime-genre span[itemprop='genre']") %>% 
          html_text()),
  # get time in min of the film if any
  time_in_min = coming_soon %>%
    html_node("time") %>%
    html_text() %>%
    # parse the number
    parse_number() %>%
    as.integer(),
  # get the description (unique)
  description = coming_soon %>%
    html_node(".outline[itemprop='description']") %>%
    html_text() %>%
    # trim whitespace and newlines on both sides
    str_trim(),
  # get the directors (several possible per film)
  director = coming_soon %>%
    map(~ html_nodes(.x, ".txt-block span[itemprop='director'] span[itemprop='name'] a") %>% html_text()),
  # get the starring actors (several possible per film)
  stars = coming_soon %>%
    map(~ html_nodes(.x, ".txt-block span[itemprop='actors'] span[itemprop='name'] a") %>% html_text())
) %>%
  # extract year from the title
  mutate(
    year = str_extract(title, "\\(\\d{4}\\)") %>% str_remove_all("[\\(\\)]"),
    title = str_remove(title, "\\(\\d{4}\\)$") %>% str_trim()
  )
coming_movies
#> # A tibble: 41 x 7
#>    title   genre  time_in_min description             director stars year 
#>    <chr>   <list>       <int> <chr>                   <list>   <lis> <chr>
#>  1 Un Rac~ <chr ~          NA After the disappearanc~ <chr [1~ <chr~ 2018 
#>  2 The St~ <chr ~          NA A family staying in a ~ <chr [1~ <chr~ 2018 
#>  3 Hurric~ <chr ~         100 Thieves attempt a mass~ <chr [1~ <chr~ 2018 
#>  4 Gringo  <chr ~         110 GRINGO, a dark comedy ~ <chr [1~ <chr~ 2018 
#>  5 Thorou~ <chr ~          92 Two upper-class teenag~ <chr [1~ <chr~ 2017 
#>  6 L'écha~ <chr ~         112 A runaway couple go on~ <chr [1~ <chr~ 2017 
#>  7 Leanin~ <chr ~          93 Leaning into the Wind ~ <chr [1~ <chr~ 2017 
#>  8 Tomb R~ <chr ~          NA Lara Croft, the fierce~ <chr [1~ <chr~ 2018 
#>  9 Love, ~ <chr ~         109 Everyone deserves a gr~ <chr [1~ <chr~ 2018 
#> 10 Entebbe <chr ~         106 Inspired by the true e~ <chr [1~ <chr~ 2018 
#> # ... with 31 more rows
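
To illustrate why the code above mixes html_node() and map() with html_nodes(), here is a minimal sketch on a toy page. The HTML and the class names in it are made up for the example and are not IMDb's real markup.

```r
library(rvest)
library(purrr)

# Toy HTML standing in for the real page (hypothetical markup)
page <- read_html(paste0(
  "<html><body>",
  "<div class='list_item'><h4>Film A</h4>",
  "<span class='genre'>Action</span><span class='genre'>Drama</span>",
  "<time>100 min</time></div>",
  "<div class='list_item'><h4>Film B</h4>",
  "<span class='genre'>Horror</span></div>",
  "</body></html>"
))
items <- html_nodes(page, ".list_item")

# html_node() (singular) returns exactly one result per item,
# with NA when the item has no match
titles  <- items %>% html_node("h4") %>% html_text()
runtime <- items %>% html_node("time") %>% html_text()

# html_nodes() (plural) on the whole set flattens all matches into one vector,
# so you lose track of which genre belongs to which film
genres_flat <- items %>% html_nodes(".genre") %>% html_text()

# map() keeps one character vector per film
genres_by_film <- map(items, ~ html_text(html_nodes(.x, ".genre")))
```

Here runtime has one entry per film (missing for the second one), genres_flat has three entries for two films, and genres_by_film is a list of length two that fits in a tibble column.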

You'll get a table with some list columns containing character vectors. You can either

  1. manipulate the list columns in R with purrr's help,
  2. unnest the table as necessary (but not all list columns have the same length), or
  3. paste the characters together.
# example for option 3
coming_movies %>%
  modify_if(is.list, ~ map_chr(.x, paste, collapse = ","))
#> # A tibble: 41 x 7
#>    title   genre   time_in_min description          director stars   year 
#>    <chr>   <chr>         <int> <chr>                <chr>    <chr>   <chr>
#>  1 Un Rac~ Advent~          NA After the disappear~ Ava DuV~ Gugu M~ 2018 
#>  2 The St~ Horror           NA A family staying in~ Johanne~ Christ~ 2018 
#>  3 Hurric~ Action~         100 Thieves attempt a m~ Rob Coh~ Toby K~ 2018 
#>  4 Gringo  Action~         110 GRINGO, a dark come~ Nash Ed~ Joel E~ 2018 
#>  5 Thorou~ Drama,~          92 Two upper-class tee~ Cory Fi~ Anya T~ 2017 
#>  6 L'écha~ Advent~         112 A runaway couple go~ Paolo V~ Helen ~ 2017 
#>  7 Leanin~ Docume~          93 Leaning into the Wi~ Thomas ~ Andy G~ 2017 
#>  8 Tomb R~ Action~          NA Lara Croft, the fie~ Roar Ut~ Alicia~ 2018 
#>  9 Love, ~ Comedy~         109 Everyone deserves a~ Greg Be~ Nick R~ 2018 
#> 10 Entebbe Crime,~         106 Inspired by the tru~ José Pa~ Daniel~ 2018 
#> # ... with 31 more rows
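
For the first two options, a rough sketch on a small hand-made tibble could look like this. The toy data is invented for illustration, and unnest(cols = ...) assumes tidyr 1.0.0 or later.

```r
library(tibble)
library(dplyr)
library(purrr)
library(tidyr)

# Hand-made stand-in for coming_movies (invented data)
toy <- tibble(
  title = c("Film A", "Film B"),
  genre = list(c("Action", "Drama"), "Horror")
)

# Option 1: keep the list column and compute on it with purrr
counted <- toy %>%
  mutate(n_genres = map_int(genre, length))

# Option 2: unnest to get one row per (film, genre) pair
long <- toy %>%
  unnest(cols = genre)
```

Unnesting one list column at a time avoids the length mismatch between columns such as genre and stars.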

Hope this example helps
Created on 2018-02-27 by the reprex package (v0.2.0).


#3

Thank you, sir, but when I run your code I get an error:

Error in mutate_impl(.data, dots) : 
  Evaluation error: could not find function "str_remove_all".

I don't know why this happens.

Thanks.


#4

You need to load stringr:
library(stringr)

This would normally load with tidyverse, but you may have an older version.


#5

Thank you, sir, but now the error has changed:

Error in eval(lhs, parent, parent) : object ‘coming_soon’ not found

I don't know how to resolve this.

thanks


#6

I tried running it, but it still gives an error. Please help me, sir.

Thanks for your help.


#7

Sorry, I did not specify it, but str_remove_all is a new function in the latest version of stringr.

You need to install the latest CRAN version of stringr if you do not have it: install.packages("stringr")

As I posted a reprex, if you start from a clean session and copy and paste the code, it should work. coming_soon is assigned in

coming_soon <- url %>%
  read_html() %>%
  html_nodes(".list_item")

so if you execute these lines, you should have the object in your environment.


#8

OK, I tried to run this code from you:

library(tidyverse)
library(rvest)
library(reprex)
library(stringr)


url = "https://www.imdb.com/movies-coming-soon/?ref_=nv_mv_cs_4"

# get list of film coming soon
coming_soon <- url %>%
  read_html() %>%
  html_nodes(".list_item")

# create a table to contain information: one line per film
coming_movies <- tibble::tibble(
  
  # get the title (unique)
  title = coming_soon %>% 
    html_node(".overview-top h4[itemprop='name'] a") %>% 
    html_text() %>%
    str_trim(),
  
  # get the genre (several per film)
  genre = coming_soon %>%
    # use purrr::map to get one list per film (otherwise html_nodes gets you a vector too big)
    
    map(~ html_nodes(.x, ".cert-runtime-genre span[itemprop='genre']") %>% 
          html_text()),
  
  # get time in min of the film if any
  time_in_min = coming_soon %>%
    html_node("time") %>%
    html_text() %>%
    
  # parse the number
    parse_number() %>%
    as.integer(),
  
  # get the description (unique)
  description = coming_soon %>%
    html_node(".outline[itemprop='description']") %>%
    html_text() %>%
    
  # trim whitespace and newlines on both sides
    str_trim(),
  
  # get the directors (several possible per film)
  director = coming_soon %>%
    map(~ html_nodes(.x, ".txt-block span[itemprop='director'] span[itemprop='name'] a") %>% html_text()),
  
  # get the starring actos (several possible per film)
  stars = coming_soon %>%
    map(~ html_nodes(.x, ".txt-block span[itemprop='actors'] span[itemprop='name'] a") %>% html_text())) %>%
  
  # extract year from the title
  mutate(
    year = str_extract(title, "\\(\\d{4}\\)") %>%str_remove_all("[\\(\\)]"),
    title = str_remove(title, "\\(\\d{4}\\)$") %>% str_trim()
  )

and it still fails. The error is:

Error in mutate_impl(.data, dots) : 
  Evaluation error: could not find function "str_remove_all".
Error: object 'coming_movies' not found

Thank you for helping me, sir. Please help me again so that I can understand.


#9

As mentioned earlier, have you installed stringr version 1.3.0? Can you check which version you have? Thanks.
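
One way to check the installed version from the console, as a small sketch (the variable names are only for illustration):

```r
# Version of stringr currently installed in this library
installed <- packageVersion("stringr")
print(installed)

# TRUE if the installed version is at least 1.3.0
has_str_remove <- installed >= "1.3.0"
print(has_str_remove)
```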

If you have stringr < 1.3.0, it won’t work. So, if you can’t or don’t want to update, just replace the last mutate

  mutate(
    year = str_extract(title, "\\(\\d{4}\\)") %>% str_remove_all("[\\(\\)]"),
    title = str_remove(title, "\\(\\d{4}\\)$") %>% str_trim()
  )

by this one

  mutate(
    year = str_extract(title, "\\(\\d{4}\\)") %>%str_replace_all("[\\(\\)]", ""),
    title = str_replace(title, "\\(\\d{4}\\)$", "") %>% str_trim()
  )
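
To see what this replacement does, here is a quick sketch on a single made-up title string, using only functions that exist in older stringr versions:

```r
library(stringr)

# Example input in the "Title (year)" shape used on the page (made-up value)
raw_title <- "Tomb Raider (2018)"

# Pull out "(2018)", then drop the parentheses
year <- str_replace_all(str_extract(raw_title, "\\(\\d{4}\\)"), "[\\(\\)]", "")

# Remove the trailing "(2018)" and trim the leftover space
title <- str_trim(str_replace(raw_title, "\\(\\d{4}\\)$", ""))

print(year)   # "2018"
print(title)  # "Tomb Raider"
```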

Is this ok for you ?

Some comments about posting here and helping us help you:
Try to take care of the style of your code. Currently, your last answer is unreadable and not useful because the code is not highlighted properly.
In the answer box editor, you can select some text and click the code-formatting button to turn the block into code syntax.
You can check the result in the preview on the left.
Also,

Thanks.


#10

thanks sir, my problem is solved


#11

Glad it works for you! Could you please mark your topic as solved?

Thanks !


#12

How do I mark it, sir?


#13

Did you open the FAQ I linked to in my previous answer?


#14

Thank you for your help. I have marked the solution for this topic.


#15

Not sure you marked the correct message as the solution, but your topic now appears as solved. Thanks!

Glad I could help.


#16

Thanks, sir Cristopher. Because of your help, my problem is clear now.

Thank you so much.