Not really into the technicalities but I'm wondering if it's accurate to say PDF is a text format.
I suppose it depends on what you mean by text format. Let's put it this way. You can open a PDF file with Notepad and see and interpret the formatting codes. The structural keywords are standard ASCII characters. You can also see where the embedded images are, because they look like a stream of Greek in Notepad. The "stream of Greek" is actually pure binary data which encodes the images. In modern PDF files the page content is usually compressed as well, so a lot more of the file looks like Greek than it did in the early days. Back then, techie people would sometimes create simple PDF files by hand using a plain text editor like Notepad. I was one of those people. You obviously couldn't create embedded images with a simple text editor. But you could create the text and control portions of a PDF file by hand.
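To give a feel for what "creating a PDF by hand" involves, here is a minimal sketch in Python that assembles a one-page PDF purely from ASCII pieces. The function name build_minimal_pdf and the sample text are my own illustration, not anything from a library. It assumes the text contains no parentheses or backslashes (those would need escaping), and it relies on Helvetica being one of the standard fonts a viewer supplies. I would expect common viewers to open the result, but treat it as a sketch rather than a reference implementation.

```python
def build_minimal_pdf(text: str = "Hello, PDF") -> bytes:
    """Assemble a tiny one-page PDF from plain ASCII pieces."""
    # Page content: begin text, pick font and size, move to a position, show text.
    content = f"BT /F1 24 Tf 72 720 Td ({text}) Tj ET".encode("ascii")
    objects = [
        b"<< /Type /Catalog /Pages 2 0 R >>",
        b"<< /Type /Pages /Kids [3 0 R] /Count 1 >>",
        b"<< /Type /Page /Parent 2 0 R /MediaBox [0 0 612 792] "
        b"/Contents 4 0 R /Resources << /Font << /F1 5 0 R >> >> >>",
        b"<< /Length %d >>\nstream\n%s\nendstream" % (len(content), content),
        b"<< /Type /Font /Subtype /Type1 /BaseFont /Helvetica >>",
    ]
    out = bytearray(b"%PDF-1.4\n")
    offsets = []
    for number, body in enumerate(objects, start=1):
        offsets.append(len(out))  # byte offset of each object, needed by the xref table
        out += b"%d 0 obj\n%s\nendobj\n" % (number, body)
    xref_start = len(out)
    out += b"xref\n0 %d\n" % (len(objects) + 1)
    out += b"0000000000 65535 f \n"  # entry for the reserved object 0
    for offset in offsets:
        out += b"%010d 00000 n \n" % offset  # each entry is exactly 20 bytes
    out += b"trailer\n<< /Size %d /Root 1 0 R >>\n" % (len(objects) + 1)
    out += b"startxref\n%d\n%%%%EOF\n" % xref_start
    return bytes(out)


if __name__ == "__main__":
    pdf = build_minimal_pdf()
    print(pdf.decode("ascii"))  # every byte is readable ASCII
```

The tedious part when doing this in a text editor was the cross-reference table, since every byte offset had to be counted by hand. The script does that bookkeeping for you.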
An HTML file, by comparison, is a text format through and through. You can open an HTML file with Notepad and see and interpret all the formatting codes. Normally there are no embedded images. Instead, there are links to the images. Therefore, you will see no "streams of Greek" if you open an HTML file with Notepad.
Other examples abound, and I suppose there are some gray areas about what is and what is not a text file. You can open a JPG file or a PNG file or a TIF file or a GIF file with Notepad or any simple text editor. You will see a "stream of Greek" and no meaningful text because the encoding is all binary. A really smart text editor might be able to interpret some of the format for you, or at least show you the "stream of Greek" in hex. But what you have is a bunch of binary codes which encode the pixels, and there is no text.
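If you want a program to make the same judgment you make by eye in Notepad, a rough sketch is below. It looks at the first few bytes of a file (the so-called magic number) and, failing that, checks whether the bytes decode cleanly as UTF-8. The function name sniff and the table of signatures are my own illustration and cover only the formats mentioned in this thread. One known limitation: text saved as UTF-16 contains zero bytes, so this heuristic would call it binary.

```python
# Leading byte signatures for the formats discussed here (not an exhaustive list).
MAGIC = {
    b"\x89PNG\r\n\x1a\n": "PNG image",
    b"\xff\xd8\xff": "JPEG image",
    b"GIF87a": "GIF image",
    b"GIF89a": "GIF image",
    b"II*\x00": "TIFF image (little-endian)",
    b"MM\x00*": "TIFF image (big-endian)",
    b"%PDF-": "PDF document",
    b"PK\x03\x04": "ZIP container (DOCX among others)",
    b"\xd0\xcf\x11\xe0\xa1\xb1\x1a\xe1": "OLE compound file (legacy DOC among others)",
    b"{\\rtf": "RTF document",
}


def sniff(data: bytes) -> str:
    """Guess what kind of file the given bytes came from."""
    for magic, label in MAGIC.items():
        if data.startswith(magic):
            return label
    # No known signature: fall back to a crude text-versus-binary test.
    if b"\x00" in data:
        return "unknown binary"
    try:
        data.decode("utf-8")
    except UnicodeDecodeError:
        return "unknown binary"
    return "plain text"


if __name__ == "__main__":
    print(sniff(b"<html><body>hello</body></html>"))
    print(sniff(b"\x89PNG\r\n\x1a\n\x00\x00\x00\rIHDR"))
```

To try it on a real file, read the first few kilobytes in binary mode, for example with open(path, "rb"), and pass them to sniff.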
Where you might get into the gray areas is with some rich text file formats. An RTF file which works fine with Microsoft Word is perfectly legible with Notepad because all the text and all the formatting codes are plain ASCII characters. All the formatting codes make sense to a human reader. An RTF file is like a PDF file in that it can have embedded images, but it stores them differently. In RTF the image data is normally written out as hexadecimal digits, so in Notepad an embedded image looks like a very long run of characters such as "89504e47..." rather than a "stream of Greek". It is still ASCII, just not anything a human can read as a picture.
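As a small illustration of how RTF keeps picture data in text form, the sketch below wraps some image bytes in a \pict group using hexadecimal digits. The function name rtf_with_picture is hypothetical, and the output leaves out the size and scaling keywords a real word processor would normally write, so take it as a picture of the idea rather than a file Word is guaranteed to render.

```python
def rtf_with_picture(png_bytes: bytes) -> str:
    """Embed PNG bytes in a bare-bones RTF document as hexadecimal text."""
    hex_data = png_bytes.hex()  # two ASCII hex digits per byte of image data
    return "{\\rtf1\\ansi Some text {\\pict\\pngblip " + hex_data + "}}"


if __name__ == "__main__":
    # The first four bytes of a PNG signature stand in for a real image.
    document = rtf_with_picture(b"\x89PNG")
    print(document)
    print(document.isascii())
```

The cost of this approach is size: the hex form takes two characters for every byte of the original image.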
DOC and DOCX files (native Microsoft Word files) are a little tricky. I definitely think of them as text files, albeit as rich text files. DOC files are the older Microsoft Word format. A DOC file is a binary container. Some of the text is often legible in Notepad, but the formatting codes are binary and look like "streams of Greek". DOCX files are the current Microsoft Word format. The entirety of a DOCX file looks like a stream of Greek in Notepad, but not because of the character encoding. A DOCX file is actually a ZIP archive, and what you see in Notepad is compressed data. If you unzip it (renaming the file to .zip is enough on Windows), you find a set of XML files inside, with the main text in word/document.xml. Those XML files are plain text, typically UTF-8, and both the text and the formatting codes are legible in Notepad, although the formatting codes are verbose XML tags. UTF-8, by the way, is not limited to English and Western European languages. It can encode all of Unicode, including Greek (real Greek!), Russian, Arabic, Hebrew, Chinese, Japanese, etc.
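For what it's worth, a DOCX file is a ZIP container holding XML parts, and a few lines of Python can demonstrate why it looks like Greek from the outside but is ordinary text on the inside. The sketch below builds a stand-in container in memory with only a word/document.xml part. It is not a complete DOCX that Word would open, since a real one carries several other required parts, and the helper names make_container and read_document_xml are my own. It also shows that UTF-8 handles Greek and Russian without trouble.

```python
import io
import zipfile

# A simplified stand-in for the main document part of a DOCX file.
DOCUMENT_XML = (
    '<?xml version="1.0" encoding="UTF-8"?>'
    '<w:document xmlns:w="http://schemas.openxmlformats.org/wordprocessingml/2006/main">'
    "<w:body><w:p><w:r><w:t>Γειά σου, Привет</w:t></w:r></w:p></w:body>"
    "</w:document>"
)


def make_container(document_xml: str) -> bytes:
    """Zip the XML up the way a DOCX stores it (compressed, UTF-8 encoded)."""
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, "w", zipfile.ZIP_DEFLATED) as archive:
        archive.writestr("word/document.xml", document_xml.encode("utf-8"))
    return buffer.getvalue()


def read_document_xml(container: bytes) -> str:
    """Unzip the container and return the main document part as text."""
    with zipfile.ZipFile(io.BytesIO(container)) as archive:
        return archive.read("word/document.xml").decode("utf-8")


if __name__ == "__main__":
    raw = make_container(DOCUMENT_XML)
    print(raw[:4])                 # the ZIP signature that Notepad shows as "PK" plus junk
    print(read_document_xml(raw))  # readable XML, Greek and Russian included
    print("Ω".encode("utf-8"))     # UTF-8 encodes Greek in two bytes
```

To look inside a real DOCX, replacing the in-memory bytes with the contents of the file (read in binary mode) should work the same way, since the standard zipfile module only cares that the input is a ZIP archive.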