If you merge the three versions of DataSet 9 that have been found so far:

DataSet%209.zip : https://github.com/yung-megafone/Epstein-Files

Data Set 9.tar.xz : https://archive.org/details/data-set-9.tar.xz

dataset9-more-complete.tar.zst : https://github.com/yung-megafone/Epstein-Files

You will end up with 531,282 IMAGES files (PDFs). That might suggest a lot is missing, but the partially corrupted DataSet%209.zip includes a DAT and an OPT file that let us work out which files are still missing.

The DAT file shows that the archive is supposed to contain only 531,307 IMAGES files (PDFs), which means only 25 PDF files are actually missing.
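A minimal sketch of that comparison, assuming the expected file names have already been pulled out of the DAT file into a plain list (parsing the DAT itself is not shown). The function name `find_missing` and the directory layout are hypothetical, not something taken from the archives:

```python
from pathlib import Path


def find_missing(expected_ids, images_dir):
    """Return expected IDs that have no matching PDF under images_dir.

    expected_ids: file stems taken from the DAT file (assumed to be
                  extracted already; DAT parsing is not shown here).
    images_dir:   root of the merged IMAGES tree (hypothetical layout).
    """
    present = {
        path.stem.upper()
        for path in Path(images_dir).rglob("*")
        if path.is_file() and path.suffix.lower() == ".pdf"
    }
    # Compare case-insensitively so "efta..." and "EFTA..." count as the same file.
    return sorted(i for i in set(expected_ids) if i.upper() not in present)
```

With the counts above, the returned list should have 531,307 - 531,282 = 25 entries if the three merged copies contain no stray PDFs.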

Those 25 PDFs can’t possibly account for the roughly 80 GB still missing from the original DataSet 9, but the DAT file doesn’t reveal how many NATIVES there were.

NATIVES are media files such as video and audio. You can see an example if you have a full DataSet 10. DataSet 10 also shows that every NATIVE has a PDF placeholder, which is always 4670 bytes.

Searching for all files of exactly that size shows that about 135 NATIVES (media files) are missing, which would account for the rest of the missing 80 GB.
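The size search can be sketched roughly like this. The function name `find_placeholders` and the directory argument are assumptions for illustration; only the 4670-byte figure comes from DataSet 10:

```python
from pathlib import Path

# Placeholder size in bytes, as observed in DataSet 10.
PLACEHOLDER_SIZES = {4670}


def find_placeholders(images_dir, sizes=PLACEHOLDER_SIZES):
    """Return sorted paths of PDFs whose size exactly matches a placeholder size."""
    hits = []
    for path in Path(images_dir).rglob("*"):
        if not path.is_file() or path.suffix.lower() != ".pdf":
            continue
        if path.stat().st_size in sizes:
            hits.append(path)
    return sorted(hits)
```

An exact size match is only a heuristic: an ordinary PDF could happen to have the same size, so hits are candidates rather than confirmed placeholders.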

I have listed below which IMAGES (PDF) and NATIVES (media) files are missing, so that it is easy to coordinate tracking down the remaining files needed for a complete DataSet 9.

(Though the remaining PDFs could be placeholders for up to 25 more natives, which would have to be checked once they are found.)

Update 1 (February 6):

In my original post (https://lemmy.world/post/42700643), I found that NATIVEs have a placeholder that is 4670 bytes.

However, by comparing every NATIVE in DataSet 10 to its placeholder, I have discovered a second placeholder size of 2433 bytes.
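One way such a comparison could be done is to tally the sizes of the PDFs that belong to known natives. This is only a sketch: it assumes a placeholder PDF shares its file stem with the native it stands in for, and `placeholder_size_histogram` and both directory arguments are hypothetical:

```python
from collections import Counter
from pathlib import Path


def placeholder_size_histogram(natives_dir, images_dir):
    """Count the sizes of PDFs whose stem matches a file in natives_dir."""
    native_stems = {p.stem for p in Path(natives_dir).rglob("*") if p.is_file()}
    sizes = Counter()
    for pdf in Path(images_dir).rglob("*"):
        if pdf.is_file() and pdf.suffix.lower() == ".pdf" and pdf.stem in native_stems:
            sizes[pdf.stat().st_size] += 1
    return sizes
```

Any size that shows up many times in the result is a candidate placeholder size worth searching for in DataSet 9.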

The NATIVEs estimate is now 2542 (up from the previous 135).

I have attached the updated NATIVEs list, along with the same list of 25 missing IMAGES (since they could also be native placeholders).

NEW_MISSING_EFTA_NATIVES.txt

MISSING_EFTA_IMAGES.txt

Update 2 (February 6):

I have found that 1983 of the 2542 NATIVEs are directly downloadable from the DOJ.

1983_NATIVES_URLS.txt

If anyone wants to attempt the remaining natives, I have tried the following extensions: ".avi",".mp4",".mov",".mp3",".wav",".m4a",".m4v",".wmv",".ts",".vob",".3gp",".amr",".opus",".csv",".xlsx",".xls",".docx",".doc",".pluginpayloadattachment"
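For anyone scripting this, here is a minimal sketch of generating the candidate URLs to try. The base URL is a made-up placeholder, since the real DOJ URL pattern is not given here, and `candidate_urls` is a hypothetical helper:

```python
from itertools import product

# Extensions already tried, copied from the list above.
EXTENSIONS = [
    ".avi", ".mp4", ".mov", ".mp3", ".wav", ".m4a", ".m4v", ".wmv", ".ts",
    ".vob", ".3gp", ".amr", ".opus", ".csv", ".xlsx", ".xls", ".docx",
    ".doc", ".pluginpayloadattachment",
]


def candidate_urls(base_url, file_ids, extensions=EXTENSIONS):
    """Yield (file_id, url) for every combination of ID and extension."""
    base = base_url.rstrip("/")
    for file_id, ext in product(file_ids, extensions):
        yield file_id, f"{base}/{file_id}{ext}"
```

Usage would look like `list(candidate_urls("https://example.invalid/natives", ids))`, where `ids` is read from the missing NATIVEs list. Each missing ID produces one URL per extension, so new extensions multiply the number of requests quickly.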

  • TropicalDingdong@lemmy.world
    6 days ago

    Hi @ermstein@lemmy.world

    I was following your thread on Dataset 9 on GitHub. I’m still having issues with the various versions and torrents I’ve tried to use to get this data.

    I was wondering if you have a viable, high-quality link or torrent where the most complete Dataset 9 might be available?

  • Coelacanth@feddit.nu
    12 days ago

    That’s good news. With the number of people interested in these files and in data preservation, it’s bound to be only a matter of time until the whole dataset 9 is restored. Someone out there’s gotta have the rest of the files.

  • iByteABit@lemmy.ml
    12 days ago

    I know it would be ridiculously ironic, but if any CSAM is in there could the authorities get you in trouble over its possession or even distribution?

    • Coelacanth@feddit.nu
      12 days ago

      Probably. CSAM is CSAM, I’m not sure the law would differentiate. Probably one of the reasons Dataset 9 has taken time to get restored, as I believe it was said it had some accidentally unredacted/uncensored CSAM in it?

      • Dhoard@lemmy.world
        11 days ago

        Don’t believe them. I have proof that in one of the documents, the “allen oren tal” brothers were redacted after they re-released dataset9.

  • CapableStaircase@lemmy.zip
    12 days ago

    You rock. I didn’t realize NATIVEs had a placeholder PDF. I’ll try and scrape the media files tonight to add to the existing dataset 9 more complete archive.

    • CapableStaircase@lemmy.zip
      11 days ago

      I could only grab ~44 of the NATIVEs you’ve listed, and they total up to a tiny portion of the expected 80 GB remaining. The hard part is guessing what file extension these files will have without getting rate limited by the DOJ. I was hoping to get a copy of the zip file’s EOCD (end of central directory record), but it’s still down.

      If anyone ever sees that zip come back please try and download the last 150-200MB. That’s where the zip archive’s table of contents is gonna live.
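As a rough illustration of why the tail matters, here is a minimal sketch that finds and decodes the end of central directory (EOCD) record in the trailing bytes of a zip. The function name `parse_eocd` is hypothetical, and the sketch only reads the classic 22-byte record. An archive this large would almost certainly use ZIP64, in which case these fields hold placeholder maximum values and the real numbers sit in the separate ZIP64 end of central directory record, which is not handled here.

```python
import struct

EOCD_SIGNATURE = b"PK\x05\x06"
# signature, disk number, disk with central directory, entries on this disk,
# total entries, central directory size, central directory offset, comment length
EOCD_FORMAT = "<4sHHHHIIH"
EOCD_SIZE = struct.calcsize(EOCD_FORMAT)  # 22 bytes


def parse_eocd(tail):
    """Locate and decode the last EOCD record in the trailing bytes of a zip.

    Returns None if no complete record is found in `tail`.
    """
    pos = tail.rfind(EOCD_SIGNATURE)
    if pos < 0 or pos + EOCD_SIZE > len(tail):
        return None
    fields = struct.unpack_from(EOCD_FORMAT, tail, pos)
    _, _, _, _, total_entries, cd_size, cd_offset, comment_len = fields
    return {
        "total_entries": total_entries,
        "cd_size": cd_size,
        "cd_offset": cd_offset,
        "comment_len": comment_len,
    }
```

If the server honors HTTP range requests, the tail could be fetched with a suffix range header such as `Range: bytes=-209715200` (the last 200 MB) and passed to this function as bytes.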

      • ermstein@lemmy.world (OP)
        11 days ago

        One thing you could try is looking at the file extensions from DataSet 10’s Natives so you have fewer to guess from.

        The rest of the natives still could be that large but I’ll double check if there are other placeholders.

      • ermstein@lemmy.world (OP)
        11 days ago

        I have updated the post with a list of 2542 NATIVEs instead of 135 after finding a second placeholder size of 2433 bytes.

  • CapableStaircase@lemmy.zip
    8 days ago

    I took the list provided in this post and added a few more extensions to the search. In doing so I was able to successfully download 2327/2542 NATIVE files. I performed the search by making a HEAD request for each URL before trying to download it with a GET request. This method also turned up 3 additional files that returned Content-Type and Content-Length in the HEAD response but ultimately “disappeared” and returned a 404 on the GET request.
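The HEAD-before-GET idea can be sketched as follows. This is not the script that was actually used. `head_status` and `probe` are hypothetical names, the one-second delay is an arbitrary guess at staying under the rate limit, and connection failures (`URLError`) are deliberately left unhandled:

```python
import time
import urllib.error
import urllib.request


def head_status(url, timeout=30):
    """Send a HEAD request; return (status, content_type, content_length)."""
    req = urllib.request.Request(url, method="HEAD")
    try:
        with urllib.request.urlopen(req, timeout=timeout) as resp:
            headers = resp.headers
            return resp.status, headers.get("Content-Type"), headers.get("Content-Length")
    except urllib.error.HTTPError as err:
        # 404 and other HTTP error statuses end up here.
        return err.code, None, None


def probe(urls, head=head_status, delay=1.0):
    """Return (url, content_type, content_length) for URLs whose HEAD is 200."""
    found = []
    for url in urls:
        status, ctype, length = head(url)
        if status == 200:
            found.append((url, ctype, length))
        time.sleep(delay)  # pause between requests to reduce rate limiting
    return found
```

A 200 on HEAD is only a hint that the file exists. As noted above, a few URLs answered HEAD and then returned 404 on GET, so each hit still needs a real download to confirm it.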

    NOTE:

    • All MS office files (.doc(x), .xls(x), .ppt(x)) are exactly ZERO bytes long.

    • There are two sqlite .db files which are password protected and I have not yet tried to crack.

    • Lots of jail footage

    • I think the very small .avi videos with many sequential Bates numbers are actually single frames that need to be recombined into the original video. I have not done so.

    Extensions I tried:

    dataset10:
    
    avi, mp4, mov, mp3, wav, m4a, m4v, wmv, ts, vob, 3gp, amr, opus, csv, xlsx, xls, docx, doc, pluginpayloadattachment
    
    common-audio:
    
    m4a, mp3, wav, aac, flac, ogg, wma, aiff, opus, m4b
    
    common-video:
    
    mp4, mov, avi, wmv, mkv, webm, m4v, mpg, mpeg, 3gp
    
    uncommon-audio:
    
    ac3, amr, mka, au, ra, mid, aif, dts, caf, gsm, ape, wv, spx, mpc, snd, voc, tta, tak, dsf, dff
    
    uncommon-video:
    
    flv, vob, ts, ogv, m2ts, mts, asf, 3g2, f4v, divx, rm, rmvb, m2v, dv, xvid, swf, m4s, hevc, h264, h265
    
    rare-audio:
    
    8svx, amb, au, avr, cda, cvs, cvsd, cvu, dss, dvms, fap, fssd, gsrt, hcom, htk, ima, ircam, maud, nist, paf, prc, pvf, sd2, sds, sf, smp, sou, txw, vms, w64, wve, xa, aifc, al, ul, la, sb, sw, ub, uw
    
    rare-video:
    
    264, 265, 302, 3p2, 787, 890, aec, aep, aepx, ajp, ale, am, amc, amv, arcut, arf, avb, avc, avd, avp, avs, awlive, axm, bdm, bdmv, bik, bix, bmk, bnp, box, bs4, bsf, bu, camproj, camrec, ced, cine, cip, clpi, cmmp, cmmtpl, cmproj, cmrec, cpi, cst, cx3, d2v, d3v, dash, dat, dce, dck, dcr, ddat, dif, dir, dlx, dmb, dmsd, dmsd3d, dmsm, dmsm3d, dmss, dnc, dpa, dpg, dream, dsy, dv4, dvdmedia, dvr, dvr-ms, dvx, dxr, dzm, dzp, dzt, edl, evo, eye, f4p, fbr, fbz, fcp
    
    documents:
    
    pdf, doc, docx, txt, rtf, odt, xls, xlsx, csv, ppt, pptx, odp, html, htm, xml, json, md, tex, epub, mobi
    
    images:
    
    jpg, jpeg, png, gif, bmp, tiff, tif, webp, svg, ico, raw, cr2, nef, orf, sr2, psd, ai, eps, heic, heif
    
    archives:
    
    zip, rar, 7z, tar, gz, bz2, xz, iso, dmg, cab, lz, lzma, zst, lz4, sz, z, tgz, tbz2, txz, tlz, tar.gz, tar.bz2, tar.xz, tar.zst, tar.lz, tar.lzma, tar.lz4, tar.z, tar.sz
    
    epstein:
    
    apmaster, apversion, attr, bmp, bup, dat, data, db, db-journal, doc, ds_store, f catalog, f_catalog, ifo, images #1, images #2, iphoto, ivc, mpg, NULL, pdf, pps, ps, psb, psd, raf, tif, tiff, tropez, txt, xml
    

    Torrent file: https://archive.org/details/data-set-9-native.tar.xz

    NOTE: See INFO folder for more information.