Data structure/format issue: Is the problem my function?

Hi there, each morning I get an email listing the sales of fruit and veg for the day.
The structure of the emails is as follows:

Date of report:,22-JAN-2022
Time report produced:,23-JAN-2022 00:11


Service: 

$$s
Apples,5
Grapes,8
Grapefruit,6
Lemons,5
Oranges,1
Pears,2
Total,27

I've recently been able to put together some code that goes through my emails, finds values in the body and puts them into a data table:

OutApp <- COMCreate("Outlook.Application")
outlookNameSpace = OutApp$GetNameSpace("MAPI")

search <- OutApp$AdvancedSearch("Inbox", "urn:schemas:httpmail:subject = 'FRUIT SALES'")
Sys.sleep(5)
results <- search$Results()
results$Count()

get_vals <- function(email) {
  body <- email$body()
  date <- str_extract(body, "\\d{2}-[:alpha:]{3}-\\d{4}") %>% 
    as.character()
  data <- read.table(text = body, sep = ",", skip = 10, strip.white = T) %>% 
    row_to_names(1) %>% 
    mutate("Date" = date)

  return(data)
}
info <- sapply(results, get_vals, simplify = F) %>% 
  bind_rows()

print(info)

If I write 'info' to excel, I end up with a s/s with the following format (| represents division between A and B column, Service and Date and emboldened at the top of the s/s):

Service| Date
$$s | 22-JAN-2022
5 | 22-JAN-2022
8 | 22-JAN-2022
6 | 22-JAN-2022
5 | 22-JAN-2022
1 | 22-JAN-2022
2 | 22-JAN-2022
27 | 22-JAN-2022

What I'd like to do is have the data structured differently so that it doesn't drop the fruit it relates to. I'd like to achieve this format (dropping the $$s entirely):

Date | Apples | Grapes | Grapefruit | Lemons | Oranges | Pears | Total
 22-JAN-2022| 5 | 8 | 6 | 5 | 1 | 2 | 27

Can anyone advise on how to achieve this result?

I made a text file containing the data sample you posted and modified your code to get the desired result. I expect the object returned by readLines is not exactly like the contents of email$body in your version but I hope this helps you. Notice I changed the skip argument to 7, which might be caused by copy and paste errors. I also had to select the first element of the vector returned by str_extract.

library(tidyverse)
#> Warning: package 'tibble' was built under R version 4.1.2
rawText <- readLines("~/R/Play/Dummy.csv")
#Following two lines used to see the result of read.table
#TEXT <- read.table(text=rawText, sep=",", skip=7, strip.white = TRUE)
#TEXT

get_vals <- function(body) {
  #body <- email$body()
  date <- str_extract(body, "\\d{2}-[:alpha:]{3}-\\d{4}")[1] %>% 
    as.character()
  data <- read.table(text = body, sep = ",", skip = 7, strip.white = T) %>% 
    #row_to_names(1) %>% #where does row_to_names come from?
    mutate("Date" = date) %>%
    pivot_wider(names_from = "V1", values_from = "V2")
  
  return(data)
}
get_vals(rawText)
#> # A tibble: 1 x 8
#>   Date        Apples Grapes Grapefruit Lemons Oranges Pears Total
#>   <chr>        <int>  <int>      <int>  <int>   <int> <int> <int>
#> 1 22-JAN-2022      5      8          6      5       1     2    27

Created on 2022-02-01 by the reprex package (v2.0.1)

Hi there, thanks very much for your advice thus far.

The additions you've made have certainly moved everything forward to extent. However, unfortunately the format of the output is still pretty much unchanged. If it's any assistance, when I print I get the following in the console (even with pivot_wider(names_from = V1, values_from = V2):

$$s                                Date
Apples...1                         5 22-SEP-2021
Grapes...2                         8 22-SEP-2021
Grapefruit...3                     6 22-SEP-2021
Lemons...4                         5 22-SEP-2021
Oranges...5                        1 22-SEP-2021
Pears...6                          2 22-SEP-2021
Total...7                          27 22-SEP-2021

Thanks for everything thus far

The quickest way to get a good result is probably for me to have a sample of your data. It seems that results is a list. Could you post the output of something like

dput(results[[1]]$body)

I would then have a copy of the object produced in the function by

body <- email$body

Any element from results will work, it does not have to be the first one.

There is a simple way to process the output you are getting already into the desired form but it would be better to properly handle the data instead of getting a bad intermediate result and then fixing it with more code.

The results from "dput(results[[1]]$body)" is:

{
    .COM(x, name, ...)
}

Which seems a bit strange to me.

Also what seems strange is that I can produce the spreadsheet while having the "body <- email$body()" commented out.

Well, that gave me a laugh this morning! I didn't expect that at all. I just noticed that the line in your code is not body <- email$body but body <- email$body(), meaning that the body element of email is a function. Try running

dput(email)

and

dput(email$body())

and post one or both of those, depending on what to make sense.

I'm sorry I am floundering on this problem. I have never worked with this Outlook interface you are using.

dput(email$body()) delivers the following:

**********************\r\n\r\n\r\nDate of report:,22-JAN-2022\r\nTime report produced:,23-JAN-2022 00:11\r\n\r\n\r\nService: \r\n\r\n$$s\r\nApples,5\r\nGrapes,8\r\nGrapefruit,6\r\nLemons,5\r\nOranges,1\r\nPears,2\r\nTotal,27\r\n

I processed the string you posted for dput(email$body()) and my previous code works with some minor tweaks.

library(stringr)
library(tidyr)
library(dplyr)
#> 
#> Attaching package: 'dplyr'
#> The following objects are masked from 'package:stats':
#> 
#>     filter, lag
#> The following objects are masked from 'package:base':
#> 
#>     intersect, setdiff, setequal, union
EmailBody <- "**********************\r\n\r\n\r\nDate of report:,22-JAN-2022\r\nTime report produced:,23-JAN-2022 00:11\r\n\r\n\r\nService: \r\n\r\n$$s\r\nApples,5\r\nGrapes,8\r\nGrapefruit,6\r\nLemons,5\r\nOranges,1\r\nPears,2\r\nTotal,27\r\n"

get_vals <- function(body) {
  date <- str_extract(body, "\\d{2}-[:alpha:]{3}-\\d{4}")
  data <- read.table(text = body, sep = ",", skip = 10, strip.white = T) %>% 
    mutate("Date" = date) %>%
    pivot_wider(names_from = "V1", values_from = "V2")
  
  return(data)
}

get_vals(EmailBody)
#> # A tibble: 1 x 8
#>   Date        Apples Grapes Grapefruit Lemons Oranges Pears Total
#>   <chr>        <int>  <int>      <int>  <int>   <int> <int> <int>
#> 1 22-JAN-2022      5      8          6      5       1     2    27

Created on 2022-02-02 by the reprex package (v2.0.1)

Based on that get_vals(results[[1]]$body() should also work with the above version of get_vals(). Does it?

1 Like

I feel we're making some progress possibly, but I've got a new error (wonder if the dates are throwing it off):

Error in as.character.default(new("COMIDispatch", ref = <pointer: 0x000001ae678deeb8>)) : 
  no method for coercing this S4 class to a vector 

Thanks for your time looking into this :slight_smile:

That error means some function is receiving an S4 object and it doesn't know how to process that. I would debug it by setting body <- Whatever_you_are_passing_to_get_vals() and then stepping through the lines of get_vals() individually, running each piece between pipes (%>%). Find which step is throwing the error and check what kind of object the previous step is returning.

1 Like

I was able to fix the error, and now the format is almost exactly as I wanted it (thanks again)

I've made some small changes to the function (fill = T was necessary to avoid an error):

get_vals <- function(body) {
  date <- str_extract(body, "\\d{2}-[:alpha:]{3}-\\d{4}")[1]
  data <- read.table(text = body, sep = ",", skip = 7, strip.white = T, fill = T) %>% 
    mutate("Date" = dmy(date)) %>% 
    pivot_wider(names_from = "V1", values_from = "V2")
  
  
  return(data)
}

The code now outputs this and no longer pulls data from all the emails:

Date	Date of report:	Time report produced:	Service:	$	Apples	Grapes	Grapefruit	Lemons	Oranges	Pears	Total
22-Jan-22	22-Jan-22	23/01/2022 00:11		            $s	5	    8	    6	        5	     1       	2	27

If you've got any further advice, I'd greatly appreciate it.

This topic was automatically closed 7 days after the last reply. New replies are no longer allowed.

If you have a query related to it or one of the replies, start a new topic and refer back with a link.