Apologies if crossposting is against the rules; I’m not entirely sure where the lines are drawn here yet.

I posted this in lemmy.fosscad (fosscad@lemmy?), but realize that may not be the most active venue.

I grabbed archives of fosscad and took a look at the contents of the zst’s. I think I could probably rebuild the contents of the subreddit in some manner or another; the question is scale and hosting. How would we make the posts easily searchable, where would they live, what endpoint can we upload hundreds of thousands of comments into in a reasonable time frame… all that fun stuff.

The archives don’t contain pictures, but contain links to the pictures and the ones I’ve checked are currently still live (meaning the pics are still hosted on reddit). Dunno how long that will remain the case.

I have no idea what the size of the archives would be with pics downloaded; gigs, a TB, no clue. I’m posting this to gauge public interest and I haven’t done much preliminary work (oh, these are json. Yep, dictionaries work. Wingo.)

Is there any interest in making this more publicly available? I’ve run into an issue with a particular build and I’ll be diving through the archives to fix it for myself. It seems like a shame that all this information would be inaccessible to everyone who isn’t able to, or interested in, trawling through their own local archives.

I’m not a programmer by trade, but work in an adjacent space. I can plink along on this if other people are interested (and if anyone is interested enough to help pitch in, even better).

  • hoglegOP
    4 upvotes · edited · 1 day ago

    Had to rearrange some things, but I’m pulling data from end of 2022 through 2024. It’s a chonker. This is everything, so will need to parse through and find anything id’d as fosscad. Not sure how long it would take to iterate through all of that; it’s over 1TB.
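A rough sketch of that filtering pass, assuming the dumps decompress to one JSON object per line and that each object carries a `subreddit` field (worth checking against a real record first). Decompressing the .zst itself needs a third-party zstd reader and is left out here; the function takes any iterable of text lines, so it can be fed from a streaming decompressor without holding 1TB in memory.

```python
import json
from typing import Iterable, Iterator


def iter_subreddit(lines: Iterable[str], subreddit: str = "fosscad") -> Iterator[dict]:
    """Yield decoded records whose 'subreddit' field matches, case-insensitively."""
    wanted = subreddit.lower()
    for line in lines:
        line = line.strip()
        if not line:
            continue
        try:
            record = json.loads(line)
        except json.JSONDecodeError:
            # Skip truncated or corrupt lines instead of aborting a long run.
            continue
        if not isinstance(record, dict):
            continue
        # 'subreddit' as the field name is an assumption about the dump format.
        if str(record.get("subreddit", "")).lower() == wanted:
            yield record


# Tiny in-memory stand-in for a decompressed dump.
sample = [
    '{"id": "a1", "subreddit": "fosscad", "title": "example post"}',
    '{"id": "b2", "subreddit": "other", "title": "unrelated"}',
    "not json",
]
matches = list(iter_subreddit(sample))
```

Because it is a generator, the matching records can be written straight back out to a much smaller file as they are found.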

    I’ll be back this evening to update progress; download speed is pretty decent so if no big changes, should have the raw files tonight.

    Edit: Happy surprise; it’s everything from 2023 to 06/2025. So we’re only losing the last handful of months of data (unless more gets added later). Still a pretty huge win.

    Thanks to Grey Summit Gear for kicking the shit out of this, and the folks who pulled all these dumps!

    • Grey Summit Gear@fosscad.io
      3 upvotes · 1 day ago

      Flipping fantastic. Can’t stop the signal!

      I need to make my alt-github so I can PR the scripts and code I wrote to upload all this.

      When you get the new zsts posted lmk and I’ll start another upload session.

      Amazing work.

      • hoglegOP
        2 upvotes · 4 hours ago

        Just a heads up; I was able to finish pulling down the archives, but it’s going to take a while to parse; wasn’t expecting to need this much storage touching my compute lol. I’m hoping I can have those ready for upload tonight or early tomorrow AM.

      • hoglegOP
        3 upvotes · 23 hours ago

        Amen to that!

        Looks like the download speed dipped a bit while I was out; around halfway through now. So another 12 hours or so before the torrents are done, sooner if it picks back up, then I’ll need to parse.

        I’ll ding you as soon as I’ve got the new zst’s up.

        And thank you! Would have taken me a week to put that scripting together.

        Go team venture.

          • hoglegOP
            2 upvotes · 21 hours ago

            Hey, if you hadn’t built this platform for us, it wouldn’t be happening. Seeing a bit of the old internet come together really warms my cold, dead heart.

  • Grey Summit Gear@fosscad.io
    7 upvotes · edited · 2 days ago

    EDIT:

    We have all 25k posts from the archive uploaded.

    We’re now streaming up comments. Already have 80k/380k.

    Join the ark! !ark@fosscad.io


    !ark@fosscad.io - now has over 14,000 posts.

    From 2015 - 2022.

    Many have pictures that we’ve saved.

    I have 380,000 comments I’m writing a script to post as well.
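One wrinkle with posting comments in bulk is that replies have to go up after their parents, and the new platform hands out its own IDs. A minimal sketch of that bookkeeping, assuming Reddit-style `link_id` / `parent_id` fullnames (`t3_` for posts, `t1_` for comments). The `post_comment` callable is hypothetical; it stands in for whatever actually talks to the server and is not a real Lemmy API call.

```python
def upload_comments(comments, post_id_map, post_comment):
    """Post comments oldest-first, translating archive IDs to new IDs.

    post_id_map: archive post fullname (e.g. "t3_p1") -> new post id.
    post_comment(post_id, parent_id, body) -> new comment id (hypothetical).
    """
    comment_id_map = {}
    skipped = []
    # A reply cannot predate its parent, so oldest-first puts parents first.
    for c in sorted(comments, key=lambda c: int(c["created_utc"])):
        new_post = post_id_map.get(c["link_id"])
        if new_post is None:
            skipped.append(c["id"])  # the thread itself was never uploaded
            continue
        new_parent = None
        if c["parent_id"].startswith("t1_"):
            new_parent = comment_id_map.get(c["parent_id"][3:])
            if new_parent is None:
                skipped.append(c["id"])  # parent comment missing from archive
                continue
        comment_id_map[c["id"]] = post_comment(new_post, new_parent, c["body"])
    return comment_id_map, skipped


# Fake poster so the sketch runs without a server.
posted = []


def fake_post_comment(post_id, parent_id, body):
    posted.append((post_id, parent_id, body))
    return 1000 + len(posted)


archive = [
    {"id": "c2", "link_id": "t3_p1", "parent_id": "t1_c1", "created_utc": 20, "body": "reply"},
    {"id": "c1", "link_id": "t3_p1", "parent_id": "t3_p1", "created_utc": 10, "body": "top level"},
    {"id": "c3", "link_id": "t3_p1", "parent_id": "t1_gone", "created_utc": 30, "body": "orphan"},
]
id_map, skipped = upload_comments(archive, {"t3_p1": 42}, fake_post_comment)
```

Skipping replies whose parent is missing is only one option; attaching them at the top level of the thread would keep more text at the cost of losing context. Persisting `id_map` between runs would also let an interrupted upload resume.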

  • hoglegOP
    4 upvotes · 1 day ago

    Reposting here in case it gets lost in the sauce:

    There should be more; they may be split into several differential files. I’m going to work on getting the others right now, but I’ve got limited time before I have to leave for work. If I can’t get them up in time, I’ve been using resources like https://academictorrents.com/details/ba051999301b109eab37d16f027b3f49ade2de13/tech&filelist=1 (if URLs can’t be posted, it’s academictorrents dot com, posts by Watchful1). I could have sworn I grabbed newer data for the fosscad subreddit specifically from a different site, but I may have to go through the monthly diffs in this link and pull out anything under the /fosscad id.

    These are much bigger, since they’re the entirety of reddit (or top 20,000 subs, something like that), so download and parse is going to take a lot longer. Several hundred gigs to pull and parse. On review, no way I can do it all this morning but I can at least get some downloads cooking. Going to start pulling comments/subs for 2023, but won’t be able to check on any of this until roughly 6pm eastern.

  • hoglegOP
    3 upvotes · 2 days ago

    Ah dude, you rock! Sorry, fell out early last night. There should be updated archives, up to 2024 at least. Let me see if I can find them; may end up being several differential files.

    I’ve got an hour before I have to do meatspace stuff, but I’ll see what I can knock out.

  • Grey Summit Gear@fosscad.io
    3 upvotes · edited · 2 days ago

    I’m just starting on this.

    I started with all i.reddit.com links. There are roughly 6k urls. I wrote a script to bulk download them but save the name as a hash of the url + ext.
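For anyone following along, the "hash of the url + ext" naming could look something like this. It is a sketch, not the actual script: SHA-256 and the `.bin` fallback for extensionless links are assumptions, since the comment doesn't say which hash was used.

```python
import hashlib
import os
from urllib.parse import urlparse


def hashed_filename(url: str, default_ext: str = ".bin") -> str:
    """Build a stable, filesystem-safe name: sha256(url) plus the URL's extension."""
    # Take the extension from the path only, so query strings are ignored.
    path = urlparse(url).path
    ext = os.path.splitext(path)[1].lower() or default_ext
    digest = hashlib.sha256(url.encode("utf-8")).hexdigest()
    return digest + ext


name_a = hashed_filename("https://example.com/pics/abc.JPG?width=640")
name_b = hashed_filename("https://example.com/gallery/xyz")
```

Hashing the full URL means the same link always maps to the same file, so a rerun can skip anything already on disk, and a post's image can be found later from nothing but its original URL.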

    Got 4.5k already - a boatload of 404s.

    How old is this archive? There could be a lot of deleted posts that we retained over the years.

    -----
    Total URLs seen: 6317
    OK:              4531
    Missing:         1786
    Zero-bytes:      0
    Not image:       0
    

    6317 urls in the input. We got 71%
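A sketch of how a tally like the one above might be produced. The meaning of each bucket is an assumption on my part (Missing = any non-200 response, Zero-bytes = empty body, Not image = a non-`image/*` content type); the real script may draw the lines differently.

```python
from collections import Counter


def classify(status: int, content_type: str, size: int) -> str:
    """Sort one download result into a summary bucket (bucket rules are assumed)."""
    if status != 200:
        return "Missing"
    if size == 0:
        return "Zero-bytes"
    if not content_type.lower().startswith("image/"):
        return "Not image"
    return "OK"


def summarize(results):
    """results: iterable of (status, content_type, size) tuples."""
    counts = Counter(classify(*r) for r in results)
    total = sum(counts.values())
    pct_ok = 100.0 * counts["OK"] / total if total else 0.0
    return counts, pct_ok


demo = [
    (200, "image/png", 10),
    (404, "text/html", 0),
    (200, "text/html", 5),
    (200, "image/jpeg", 0),
]
counts, pct_ok = summarize(demo)

# Recomputing the first run reported in the thread: 4531 OK out of 6317.
thread_pct = round(100 * 4531 / 6317, 1)  # 71.7
```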


    Imgur and others next :)

    EDIT:

    We got 100% of the Imgur links

    -----
    Total URLs seen: 629
    OK:              629
    Missing:         0
    Zero-bytes:      0
    Not image:       0
    

    Going for the rest of the image links from all over :)

    Going to spin up !ark@fosscad.io - and create a bot to reupload a lot of the posts.

    EDIT 2:

    !ark@fosscad.io now exists.

    I finished off the post images (other domains than i.reddit.com and imgur)

    Total URLs seen: 11628
    OK:              9574
    Missing:         2051
    Zero-bytes:      0
    Not image:       3
    

    82.3% Pretty good.

    Gonna finish up with comments and then work on a script to reupload under a bot account.

    • Grey Summit Gear@fosscad.io
      2 upvotes · 2 days ago

      Total URLs seen: 1103
      OK:              970
      Missing:         105
      Zero-bytes:      0
      Not image:       28
      

      88% on the rest of the comment images. Now to wire up a bot.

    • The ShittinatorMA
      2 upvotes · 2 days ago

      This is fucking sick. Once we have the images, we have the full archive, and it’ll just be a matter of importing the data.

  • Grey Summit Gear@fosscad.io
    8 upvotes · 3 days ago

    I’d be interested in figuring out a way to either re-upload them onto Lemmy or similar.

    We should definitely download the pics while we still can.

    I have a bridge I wrote that pulls r/gundeals to !bazaar@fosscad.io - could do something similar with a bot to post to a community.

    • hoglegOP
      2 upvotes · 2 days ago

      Oh, that’s awesome. If you’ve got a bot that can already parse and push to fosscad.io, we should definitely be able to tweak that. I’m not active on discord (any social media really), but I imagine that’s the place to organize an effort like this. If there is an alternative that others prefer (I’ve heard about matrix and element), I’m open to suggestion.

      I’ve been fighting the flu the last couple of days, but on the upswing now. Dunno how much I’ll be able to dig in today, but I’ll get a github set up; I can dump the zst’s there and some pseudocode and notes. If I can see what your bot is ingesting, I can try to match output from the zst’s to it.

      I think there will be a bit of work marrying comments to submissions; they’re split up into two separate archives. Since the pictures are time sensitive (potentially), maybe the move is trying to focus on looping through the submissions and grabbing the pics from their urls, then rebuilding after the fact.
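A minimal sketch of marrying the two archives, assuming each comment record has a `link_id` holding the submission's ID with a `t3_` prefix (the usual Reddit fullname convention; worth confirming against a real record before relying on it).

```python
from collections import defaultdict


def group_comments(submissions, comments):
    """Return ({submission_id: [comments]}, [comments with no known submission])."""
    by_fullname = defaultdict(list)
    for c in comments:
        by_fullname[c["link_id"]].append(c)

    # Submissions with no comments still get an (empty) entry.
    threads = {s["id"]: by_fullname.get("t3_" + s["id"], []) for s in submissions}

    known = {"t3_" + s["id"] for s in submissions}
    orphans = [
        c
        for fullname, group in by_fullname.items()
        if fullname not in known
        for c in group
    ]
    return threads, orphans


subs = [{"id": "p1", "title": "first"}, {"id": "p2", "title": "second"}]
coms = [
    {"id": "c1", "link_id": "t3_p1", "body": "hello"},
    {"id": "c2", "link_id": "t3_p1", "body": "again"},
    {"id": "c3", "link_id": "t3_zz", "body": "thread not in archive"},
]
threads, orphans = group_comments(subs, coms)
```

The orphan list is a cheap sanity check: a large one would suggest the two archives cover different date ranges.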

      I’ve got plenty of local storage for pics or if we can dump straight to lemmy, that would be great. I’m completely ignorant to this platform as far as rate-limiting, storage, any of that fun stuff. I don’t know how big a whole subreddit will end up being, but I imagine it’s not inconsequential.

      • Grey Summit Gear@fosscad.io
        2 upvotes · 2 days ago

        If you upload the ZSTs I can make a script that loops over image links and downloads them - I can also probably figure out a way to coordinate what section of the archive you run it on and I run it on and then we can kinda re-combine from there.
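One way to coordinate who grabs what without passing lists back and forth: hash each URL and take it modulo the number of people downloading. A sketch under that assumption; `shard_for` and `my_share` are made-up names, and nothing here is from the actual script.

```python
import hashlib


def shard_for(url: str, workers: int) -> int:
    """Deterministically map a URL to a worker index in [0, workers)."""
    digest = hashlib.sha256(url.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % workers


def my_share(urls, worker_index: int, workers: int):
    """Return only the URLs this worker is responsible for."""
    return [u for u in urls if shard_for(u, workers) == worker_index]


all_urls = [f"https://example.com/img/{n}.jpg" for n in range(100)]
share_0 = my_share(all_urls, 0, 2)
share_1 = my_share(all_urls, 1, 2)
```

Each person runs the same archive with a different `worker_index`; the shares never overlap and together cover every link, so recombining is just merging the download folders.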

        • hoglegOP
          2 upvotes · 2 days ago

          Awesome! https://github.com/hoglegcc/rebuildtheark

          I was able to add the submissions zst, but I’m running into a file size limit when adding the comments archive. Compressed, the comments are around 35MB (480MB uncompressed) and github has a limit of 25MB. Worst case scenario I can get it to you another way.

          I’m technically supposed to be doing something else right now, but I should be able to get the comments issue squared in the next couple of hours. Also, dunno if you want a private repo for the actual bot; don’t know if you have keys you’re worried about or anything like that. I can dump a private repo, or if you want to make one and add me, either way works. Or if public repo works for you, that works for me, too.

          Thanks!

          • Grey Summit Gear@fosscad.io
            2 upvotes · 2 days ago

            Thanks. I gotta maybe make a new GitHub account if I don’t wanna doxx myself.

            But I will clone it in the meantime.

            • hoglegOP
              2 upvotes · 2 days ago

              I was able to get the comments uploaded after splitting them in 7zip. After combining with 7zip, hashes still match and first couple of items looked good to go.
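The same split, recombine, and hash check can be done in plain Python if 7zip isn't handy. This is a stand-in for the idea, not what 7zip does internally; the part naming and the 30,000-byte part size are arbitrary choices for the demo.

```python
import hashlib
import os
import tempfile


def sha256_of(path, chunk=1 << 20):
    """Hash a file in chunks so large archives don't need to fit in memory."""
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for block in iter(lambda: f.read(chunk), b""):
            h.update(block)
    return h.hexdigest()


def split_file(path, part_size):
    """Write consecutive byte ranges of `path` to numbered part files."""
    parts = []
    with open(path, "rb") as src:
        index = 1
        while True:
            data = src.read(part_size)
            if not data:
                break
            part = f"{path}.{index:03d}"
            with open(part, "wb") as out:
                out.write(data)
            parts.append(part)
            index += 1
    return parts


def join_files(parts, dest):
    """Concatenate part files, in order, back into one file."""
    with open(dest, "wb") as out:
        for part in parts:
            with open(part, "rb") as src:
                out.write(src.read())


# Demo on random bytes standing in for the comments archive.
with tempfile.TemporaryDirectory() as tmp:
    original = os.path.join(tmp, "comments.zst")
    with open(original, "wb") as f:
        f.write(os.urandom(100_000))
    parts = split_file(original, part_size=30_000)
    rebuilt = os.path.join(tmp, "rebuilt.zst")
    join_files(parts, rebuilt)
    part_count = len(parts)
    hashes_match = sha256_of(original) == sha256_of(rebuilt)
```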

              Also threw a python module and example loop in there, and an example comment and submission item in the readme.

              I was thinking of making one class for threads and another for comments: each submission would be an instance of the thread class, with its comments held as instances in a list on their parent thread, so we can track comment author and utc for chronology.
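A bare-bones version of that class layout, as a sketch. The field names (`author`, `created_utc`, `body`, `title`) are assumptions about what the archive records contain and may need adjusting to match the example items in the readme.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Comment:
    id: str
    author: str
    created_utc: int
    body: str


@dataclass
class Thread:
    id: str
    title: str
    author: str
    created_utc: int
    comments: List[Comment] = field(default_factory=list)

    def add(self, comment: Comment) -> None:
        """Attach a comment to this thread."""
        self.comments.append(comment)

    def chronological(self) -> List[Comment]:
        """Comments ordered oldest to newest by created_utc."""
        return sorted(self.comments, key=lambda c: c.created_utc)


thread = Thread(id="p1", title="example build", author="someone", created_utc=100)
thread.add(Comment(id="c2", author="b", created_utc=300, body="later reply"))
thread.add(Comment(id="c1", author="a", created_utc=200, body="first reply"))
ordered_ids = [c.id for c in thread.chronological()]
```

A flat list per thread keeps things simple; if reply nesting matters for the rebuild, each comment would also need to remember its parent's ID.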

              I can build out classes and assigning comments to threads. I imagine you’re going to be much faster at all of this than I am, so whenever you’ve got the image scrape ready, let me know and I’ll start grabbing my share of the links.

              • The ShittinatorMA
                2 upvotes · 2 days ago

                Is there more to this archive? Posts seem to cut off right at Sat Dec 31 2022 23:42:15 GMT+0000, which is suspiciously close to the end of a year.
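A quick way to check where an archive ends, assuming each record has a `created_utc` field in Unix seconds (stored as either a number or a string, so the sketch coerces to int). The sample values are illustrative only.

```python
from datetime import datetime, timezone


def latest_timestamp(records):
    """Return the newest created_utc in the records as an aware UTC datetime."""
    newest = max(int(r["created_utc"]) for r in records)
    return datetime.fromtimestamp(newest, tz=timezone.utc)


sample = [
    {"created_utc": 1672530135},
    {"created_utc": "1420070400"},
]
cutoff = latest_timestamp(sample)
print(cutoff.isoformat())  # 2022-12-31T23:42:15+00:00
```

A cutoff that lands within minutes of a year or month boundary usually means the dump was sliced by date, not that posting stopped.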

                • hoglegOP
                  3 upvotes · 2 days ago

                  There should be more; they may be split into several differential files. I’m going to work on getting the others right now, but I’ve got limited time before I have to leave for work. If I can’t get them up in time, I’ve been using resources like https://academictorrents.com/details/ba051999301b109eab37d16f027b3f49ade2de13/tech&filelist=1 (if URLs can’t be posted, it’s academictorrents dot com, posts by Watchful1). I could have sworn I grabbed newer data for the fosscad subreddit specifically, but I may have to go through the monthly diffs in this link and pull out anything under the /fosscad id.

  • woodsman240
    6 upvotes · 3 days ago

    Not sure how much I could actually help with development, but it is certainly something that is needed. There was a lot of info that will be lost to time otherwise.

    • hoglegOP
      3 upvotes · 2 days ago

      Any help is greatly appreciated. I’m not tied in to any social media, but if you’re fluent in discord, matrix, element, any of the other popular IRC-likes, getting a room/discord set up would be super helpful. Or maybe if there is an existing discord for the community, point me in the direction and I’ll try to join.

      If you’re not on social media, either, no worries (believe me, I understand). The encouragement helps, too.