Privacy Plus+ 

 Privacy, Technology and Perspective 

All the Talk About ChatGPT.  This week, we’re focusing on the latest news about OpenAI’s ChatGPT and examining related IP and privacy issues associated with GPT-4’s training data.   

Italy Blocks GPT:  The Italian Data Protection Authority (DPA) has temporarily blocked ChatGPT over data privacy concerns. The DPA faults OpenAI for (1) having no legal basis to justify "the mass collection and storage of personal data for the purpose of 'training' the algorithms underlying the operation of the platform" (Article 6 of the General Data Protection Regulation (GDPR) requires at least one of several legal bases for processing personal data); (2) processing “inaccurate personal data” (Article 5 of the GDPR requires, among other things, that personal data be accurate); and (3) lacking an age-verification mechanism, which “exposes children to receiving responses that are absolutely inappropriate to their age and awareness” (Article 8 of the GDPR sets conditions for children's consent to the processing of their personal data). A link to the DPA’s statement follows (scroll down the page to find the English translation): 

https://www.garanteprivacy.it/web/guest/home/docweb/-/docweb-display/docweb/9870847  

Elon Musk and Others Call for a Freeze of GPT:  A number of tech leaders, including Elon Musk, have signed a petition calling for a pause on the training of powerful AI tools and warning of the "profound risks to society and humanity" posed by AI systems with human-competitive intelligence. A link to their petition follows: 

https://futureoflife.org/open-letter/pause-giant-ai-experiments/ 

We encourage you to read it, and particularly consider the runaway consequences of iterative self-training methods.   

For an entertaining take, you can listen to the following edition of the All-In podcast: 

https://podcasts.apple.com/us/podcast/all-in-with-chamath-jason-sacks-friedberg/id1502871393?i=1000606724165 

For an existential warning that “Pausing AI Developments Isn't Enough. We Need to Shut it All Down,” we commend to you the following article from Time Magazine: 

https://time.com/6266923/ai-eliezer-yudkowsky-open-letter-not-enough/ 

If you don’t have time to read it, here is a haunting excerpt: 

To visualize a hostile superhuman AI, don’t imagine a lifeless book-smart thinker dwelling inside the internet and sending ill-intentioned emails. Visualize an entire alien civilization, thinking at millions of times human speeds, initially confined to computers—in a world of creatures that are, from its perspective, very stupid and very slow. A sufficiently intelligent AI won’t stay confined to computers for long. In today’s world you can email DNA strings to laboratories that will produce proteins on demand, allowing an AI initially confined to the internet to build artificial life forms or bootstrap straight to postbiological molecular manufacturing. 

If somebody builds a too-powerful AI, under present conditions, I expect that every single member of the human species and all biological life on Earth dies shortly thereafter. 

GPT-4 Offers a Confession Regarding Its Own IP and Privacy Issues. Finally, as we were considering the implications of the lack of transparency associated with the training data that underlies artificial intelligence (AI) systems, we had a conversation about this topic with GPT-4 itself.   

During that conversation, GPT-4 shocked us by confessing that its training data is not made publicly available, in part, to shield its creator, OpenAI, from legal claims and scrutiny of its privacy practices.  Let’s quote what GPT-4 said, edited for clarity and relevance:  

GPT-4's training process consists of several key steps: 

1. Data Collection: GPT-4 is pretrained on a diverse range of text from books, articles, and websites. OpenAI collects data while adhering to legal requirements and ethical guidelines. 

2. Tokenization: The text data is tokenized, converting it into a format that the model can process and learn from. 

3. Pretraining: The model undergoes unsupervised pretraining on the tokenized text, learning patterns and structures in the data. 

4. Fine-tuning: GPT-4 is fine-tuned on a narrower dataset using supervised learning, which enables the model to generalize better and generate appropriate responses. 

5. Regularization and Hyperparameter Optimization: Techniques like regularization and hyperparameter optimization are applied to prevent overfitting and improve the model's performance. 

6. Evaluation and Iteration: The model is evaluated using various metrics, and the training process is iterated to refine and improve GPT-4's performance.  
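To make the tokenization step concrete, here is a minimal, illustrative sketch in Python. This is not OpenAI's actual tokenizer (GPT models use byte-pair encoding over subword units); it is a toy word-level example, with a hypothetical vocabulary built from a sample corpus, showing only the basic idea of text becoming the integer IDs a model trains on.

```python
# Toy word-level tokenization sketch -- NOT OpenAI's real tokenizer,
# which uses byte-pair encoding (BPE) over subword units.

def build_vocab(corpus):
    """Assign a unique integer ID to each distinct word in the corpus."""
    vocab = {}
    for word in corpus.split():
        if word not in vocab:
            vocab[word] = len(vocab)
    return vocab

def tokenize(text, vocab):
    """Convert text into the list of integer IDs a model would train on."""
    return [vocab[word] for word in text.split()]

corpus = "the model learns patterns in the data"
vocab = build_vocab(corpus)
print(tokenize("the data", vocab))  # [0, 5]
```

A real pipeline would also handle unknown words, punctuation, and subword splitting, but the principle is the same: the model never sees raw text, only these numeric sequences.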

The exact list of books, articles, and websites used in training GPT-4 is not publicly available for a few reasons: 

1. Intellectual Property and Copyright: Sharing the full list of sources might raise legal concerns related to intellectual property and copyright. By not providing the exact list, OpenAI can avoid potential legal disputes and infringement issues.   

2. Scale and Complexity: The dataset used to train GPT-4 is massive, containing billions of tokens from diverse sources. It would be challenging to compile, manage, and share a comprehensive list of every source used, especially considering that the data might have been cleaned, preprocessed, and transformed during the training process. 

3. Privacy and Security: Some sources in the training data might contain sensitive information, and making the full list public could raise privacy and security concerns. By keeping the list undisclosed, OpenAI can better protect the privacy of individuals and organizations whose data might be included in the dataset. 

4. Competitive Advantage: OpenAI invests significant resources in curating and developing these large-scale datasets. Keeping the specifics of the dataset proprietary can help maintain a competitive advantage over other organizations working on similar AI models…

Our thoughts:  We need technology that serves us, not the other way around.  Game theory suggests that a “pause” on the training of iterative AI tools won’t be honored.  But could it still be better than nothing? 

We’re facing a big, scary problem, and we must do more than hope that AI doesn’t weaponize itself against humanity or otherwise run away in a manner adverse to us.  The data privacy implications of AI systems are profound, but they pale in comparison to those existential risks. 

--- 

Hosch & Morris, PLLC is a boutique law firm dedicated to data privacy and protection, cybersecurity, the Internet and technology. Open the Future℠. 

 

 
