Merging two data frames while looking up specific text

First time in the forums, but basically i've been looking for a way to merge or create a new dataframe while doing a loop to look up for specific words.

Libraries used:
data.table
tidyverse
datasets
readxl

Below is an example of the data im using and what im trying to do:

This is the main dataframe from which im trying to do a "VLookUp"
Numbers_and_Text
1 流動資産
2 商品 1,115,181
3 流動資産合計 2,899,233
4 建物及び構築物(純額 ※1、※2 427,281
5 機械装置及び運搬具(純額 ※1 77,015
6 工具、器具及び備品(純額 ※1 11,897

Into the second dataframe which is acting as a dictionary:

A tibble: 6 x 3

日本語 English Espanol

1 投資活動によるキャッシュ・フロー Cash Flow from investing activities Flujos de caja de las acti~
2 財務活動によるキャッシュ・フロ Cash Flows from financing activities Flujos de caja de las acti~
3 営業活動によるキャッシュ・フロ Cash Flows from operating activities Flujos de efectivo por act~
4 出資金 Capital Capital
5 長期借入金の返済による支出 Repayment of Long term debt Reembolso de la deuda a la~
6 有形固定資産の取得による支出 Payments for purchase of property, plant and equipment Pagos por compra de propie~

The purpose of the second dataframe is to act as a dictionary, then look through the dictionary on to the first dataframe to then extract only the words im looking for into a new dataframe.

Select only the languages I want to use in the dictionary

jp <- language %>% select(日本語)

en <- language %>% select(English)

txt_joint1 <- merge(data, jp, by.x = "Numbers_and_Text", by.y = "日本語")

view(txt_joint1)

Doing this only gets me certain words and not all the words in the dictionary.

I've tried using the "str_replace_all" and putting all the words into a vector but its not working.

Is this the correct way to basically do a VLookUp from two dataframes and then putting the results into a new one? Is doing a loop a better and more efficient way?

Thanks and hope someone can help me with this.

Hi @Mago,
Your input data is quite "messy" so a direct string match is not going to be very helpful.
I tried "fuzzy matching" which gave a reasonable result - see what you think.
p.s. I can't read Japanese so cannot check if the matching makes sense.

library(fuzzyjoin)
library(stringr)
suppressPackageStartupMessages(library(tidyverse))

jp <- read.delim(header=TRUE, text=c("
流動資産
商品 1,115,181
流動資産合計 2,899,233
建物及び構築物(純額 ※1、※2 427,281
機械装置及び運搬具(純額 ※1 77,015
工具、器具及び備品(純額 ※1 11,897"))

# Remove the number characters from the strings to aid matching
jp$new_text <- gsub(pattern="[0-9,[:space:]]", 
                  replacement="",
                  x=jp$流動資産) 
           
jp$new_numbers <- parse_number(jp$流動資産)
jp
#>                                流動資産                     new_text
#> 1                        商品 1,115,181                         商品
#> 2                流動資産合計 2,899,233                 流動資産合計
#> 3 建物及び構築物(純額 ※1、※2 427,281 建物及び構築物(純額※1、※2
#> 4   機械装置及び運搬具(純額 ※1 77,015  機械装置及び運搬具(純額※1
#> 5   工具、器具及び備品(純額 ※1 11,897  工具、器具及び備品(純額※1
#>   new_numbers
#> 1     1115181
#> 2     2899233
#> 3      427281
#> 4       77015
#> 5       11897

data <- read.csv(header=TRUE, sep=";", text=c("
日本語; English; Espanol
投資活動によるキャッシュ・フロー; Cash Flow from investing activities; Flujos de caja de las acti
財務活動によるキャッシュ・フロ; Cash Flows from financing activities; Flujos de caja de las acti
営業活動によるキャッシュ・フロ; Cash Flows from operating activities; Flujos de efectivo por act
出資金; Capital; Capital
長期借入金の返済による支出; Repayment of Long term debt; Reembolso de la deuda a la
有形固定資産の取得による支出; Payments for purchase of property, plant and equipment; Pagos por compra de propie"))

# Work with Japanese characters only
data_jp <- data[-c(2,3)]

# The arguments 'method=' and 'max_dist=' can be modified to change the
# matching criteria
combined <- stringdist_join(data_jp, jp, by=c("日本語" = "new_text"), 
                mode="full", method="osa", max_dist=15,
                ignore_case=TRUE, distance_col="Distance") 
  
combined %>% 
  distinct(new_numbers, .keep_all=TRUE)
#>                             日本語                              流動資産
#> 1 投資活動によるキャッシュ・フロー                流動資産合計 2,899,233
#> 2   財務活動によるキャッシュ・フロ                        商品 1,115,181
#> 3   財務活動によるキャッシュ・フロ 建物及び構築物(純額 ※1、※2 427,281
#> 4   財務活動によるキャッシュ・フロ   機械装置及び運搬具(純額 ※1 77,015
#> 5   財務活動によるキャッシュ・フロ   工具、器具及び備品(純額 ※1 11,897
#>                       new_text new_numbers Distance
#> 1                 流動資産合計     2899233       15
#> 2                         商品     1115181       15
#> 3 建物及び構築物(純額※1、※2      427281       15
#> 4  機械装置及び運搬具(純額※1       77015       15
#> 5  工具、器具及び備品(純額※1       11897       15

Created on 2022-08-24 by the reprex package (v2.0.1)

I hope this is helpful.

1 Like

This topic was automatically closed 21 days after the last reply. New replies are no longer allowed.

If you have a query related to it or one of the replies, start a new topic and refer back with a link.