Copyright/license info for the IBM Telco churn dataset

wlandau · August 27, 2020, 6:19pm

I would like to use the IBM Telco customer churn dataset for educational purposes in the industry sector, but I am having trouble finding the license and copyright info. Does anyone know where I can find it? The dataset seems ubiquitous in machine learning circles.

wlandau · August 29, 2020, 3:48am

IBM's official page on the Telco churn dataset does not appear to mention license info: https://community.ibm.com/community/user/businessanalytics/blogs/steven-macko/2019/07/11/telco-customer-churn-1113

jrk · August 29, 2020, 4:01am

Pure data cannot be copyrighted:
https://libguides.library.kent.edu/data-management/copyright#:~:text=Data%20are%20considered%20"facts"%20under,not%20created%20as%20original%20works.&text=Although%20data%20itself%20cannot%20be,the%20compilation%20of%20the%20data.

wlandau · August 29, 2020, 4:54am

Interesting. So since the data file is publicly available and ubiquitous on Kaggle, etc., I can assume it is in the public domain?

jrk · August 29, 2020, 1:53pm

As I tell my students when I give my presentation on web scraping:

I am not a lawyer, and more importantly I am not your lawyer...

That said, my understanding of the issue is this.

Raw data cannot be protected by copyright, though other protections may exist. For example, if you pay for a membership to a website which holds data, you may need to agree to terms to which you could be legally bound (generic terms of service you do not explicitly agree to would generally not be sufficiently binding).

Only the creative expression of data can be protected by copyright. The canonical example (and, to the best of my understanding, the case law which established the precedent) is a directory of telephone numbers.

The names and numbers themselves are simply data with zero creative merit. Printing them on a page, having made decisions about font type and size, the page layout, and other things constitutes some modicum of creative expression.

So, you could, in theory, scan and OCR a telephone book (or manually re-key it as was likely done in the original case), and use that data to print your own creative expression of a telephone book. What you could not do is photocopy the pages of someone else's telephone book and sell copies.

I hope that clarifies things.

Again, this is all to my best understanding, and I welcome all corrections.

Lastly,

I am not a lawyer, and more importantly I am not your lawyer.

Best.

wlandau · August 29, 2020, 3:45pm

Thanks, that helps me understand the general issue.

Over the past several days, I have been trying to contact IBM for solid answers and confirmation about the Telco churn dataset specifically. So far, it has been a gratuitously long corporate runaround.

fcas80 · August 29, 2020, 3:58pm

I am also not a lawyer, but I have done some studying of copyright. I basically agree with jrk that data itself is not copyrightable. Company data might have a different problem - I can't just take my company customer list and post it on Kaggle because this is company property, but presumably the dataset in question is not in the same category. That Telco data does not have identifiable customer names.

The federal copyright act has special protection for educators, but you mentioned you are in the industrial sector, so I would not rely on those protections. However, do you have a corporate attorney you can chat with, just to be super safe?

wlandau · August 29, 2020, 4:58pm

Thanks for chiming in. I think I can track down a copyright attorney where I work.

jrk · August 29, 2020, 9:33pm

A quick Google search for telco churn dataset license landed me at this IBM GitHub page:

Note: This license is specifically for the AI code patterns in the repository (the code) rather than the data, which I believe is, again, because data is typically not afforded copyright protection.

Since this is the official IBM GitHub and they share the data with no mention of a license, it is reasonable to assume the data is license-free (even if you believed data could have protections). The worst-case scenario would be that it is covered by the same license as the code, which is the super-permissive Apache license.

I personally wouldn't even bother trying to track down a lawyer, but you should do whatever due diligence you feel is necessary.

wlandau · August 29, 2020, 10:24pm

Awesome, thanks so much for tracking down an official IBM repo! I will still try to follow up in person, but this really puts me at ease.

nfultz · September 2, 2020, 3:58pm

Just chiming in to add that copyright probably doesn't apply in this case, but depending on the country, there may be special protections given to databases / data sets.

See https://en.wikipedia.org/wiki/Database_right for a quick description of the UK / EU rules.

I think your usage is probably fine, but if you are still concerned, I would recommend just providing your students a link to the data set, rather than hosting your own copy for them.
