dplavery92
227 karma

  1. Both the title of Tencent's paper [0] and their homepage for the model [1] use the term "Open-Source", so I think they are making the claim.

    [0] https://arxiv.org/pdf/2411.02265 [1] https://llm.hunyuan.tencent.com/

  2. Eight is a nice power of two.
  3. You can also construct multiple hypothesis trackers from multiple Kalman Filters, but there is a little more machinery. For example, Interacting Multiple Models (IMM) trackers may use Kalman Filters or Particle Filters, and a lot of the foundational work by Bar-Shalom and others focuses on Kalman Filters.
  4. The Kalman filter has a family of generalizations in the Extended Kalman Filter (EKF) and the Unscented Kalman Filter (UKF).

    Also common in robotics applications is the Particle Filter, which uses a Monte Carlo approximation of the uncertainty in the state, rather than enforcing a (Gaussian) distribution as in the traditional Kalman filter. This can be useful when the mechanics are highly nonlinear and/or your measurement uncertainties are, well, very non-Gaussian. Sebastian Thrun (a Stanford robotics professor in the DARPA "Grand Challenge" days of self-driving cars) made an early Udacity course covering Particle Filters.
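
    Since the contrast with the Kalman filter is the key point, here is a minimal toy sketch (my own, not from the Udacity course) of a bootstrap particle filter for a 1-D state observed through a nonlinear measurement; all of the models and noise levels below are made-up illustrative values.

      import numpy as np

      rng = np.random.default_rng(0)

      # Toy problem: the state drifts by +1 per step with Gaussian process noise,
      # but we only observe a nonlinear function of it (z = x^2 + noise).
      n = 1000
      particles = rng.normal(0.0, 1.0, size=n)      # the belief is a set of samples
      weights = np.full(n, 1.0 / n)

      true_x = 0.0
      for t in range(20):
          true_x += 1.0
          z = true_x**2 + rng.normal(0.0, 2.0)      # nonlinear, noisy measurement

          # Predict: push every sample through the motion model.
          particles += 1.0 + rng.normal(0.0, 0.5, size=n)

          # Update: reweight each sample by the measurement likelihood p(z | x).
          weights *= np.exp(-0.5 * ((z - particles**2) / 2.0) ** 2)
          weights /= weights.sum() + 1e-300

          # Resample when the effective sample size collapses.
          if 1.0 / np.sum(weights**2) < n / 2:
              idx = rng.choice(n, size=n, p=weights)
              particles, weights = particles[idx], np.full(n, 1.0 / n)

          estimate = np.sum(weights * particles)    # posterior mean for this step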

  5. I was encountering the same problem on my Intel MBP, and, per another one of the comments here, I find that switching from Chrome to Safari lets me view the whole page smoothly, without my CPU utilization spiking or my fans spinning up.
  6. I don't think anyone in this thread knows what happened, but since we're in a thread speculating why the CEO of the leading AI company was suddenly sacked, the possibility of an unacceptable interpersonal scandal isn't any more outlandish than others' suggestions of fraud, legal trouble for OpenAI, or foundering financials. The suggestion here is simply that Altman having done something "big and dangerous" is not a foregone conclusion.

    In the words of Brandt, "well, Dude, we just don't know."

  7. Correct, the article places UHZ1 at 13.2 billion light-years away, so roughly 500 million years after the Big Bang in our 13.7-billion-year-old universe.
  8. From the captioned art in the article: "Siege, from the Peterborough Psalter, early 14th century, via the KBR Museum, Belgium. Yes, those defenders are all women."
  9. I think a great place to start is https://www.bzarg.com/p/how-a-kalman-filter-works-in-picture...

    Unlike the OP article, it does make use of the math formalism for Kalman filters, but it is a relatively gentle introduction that does a very good job visualizing and explaining the intuition of each term. I have gotten positive feedback (no pun intended!) from interns or junior hires using this resource to familiarize themselves with the topic.

    If you are making a deeper study and are ready to dive into a textbook that more thoroughly explores theory and application, there is a book by Gibbs[1] that I have used in the past and that is well-regarded in some segments of industry that rely on these techniques for state estimation and GNC (guidance, navigation, and control).

    [1] https://onlinelibrary.wiley.com/doi/book/10.1002/97804708900...
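
    If it helps to see those terms next to code, here is a minimal sketch of the standard linear Kalman predict/update step in NumPy; the constant-velocity model and all of the noise values below are toy numbers I chose for illustration, not anything from the linked resources.

      import numpy as np

      dt = 1.0
      F = np.array([[1.0, dt], [0.0, 1.0]])   # state transition (position, velocity)
      H = np.array([[1.0, 0.0]])              # we only measure position
      Q = 0.01 * np.eye(2)                    # process noise covariance
      R = np.array([[0.5]])                   # measurement noise covariance

      x = np.zeros((2, 1))                    # state estimate
      P = np.eye(2)                           # state covariance

      def kalman_step(x, P, z):
          # Predict: propagate the estimate and grow the uncertainty.
          x = F @ x
          P = F @ P @ F.T + Q
          # Update: blend the prediction with the measurement via the Kalman gain.
          y = z - H @ x                       # innovation
          S = H @ P @ H.T + R                 # innovation covariance
          K = P @ H.T @ np.linalg.inv(S)      # Kalman gain
          return x + K @ y, (np.eye(2) - K @ H) @ P

      # Feed in noisy position readings one at a time:
      for z in [1.1, 2.0, 2.9, 4.2]:
          x, P = kalman_step(x, P, np.array([[z]]))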

  10. From Sections 3 and 4 of the VQGAN paper[1] upon which this work is built: "To generate images in the megapixel regime, we ... have to work patch-wise and crop images to restrict the length of [the quantized encoding vector] s to a maximally feasible size during training. To sample images, we then use the transformer in a sliding-window manner as illustrated in Fig.3." ... "The sliding window approach introduced in Sec.3.2 enables image synthesis beyond a resolution of 256×256 pixels."

    From the Paella paper[2]: "Our proposal builds on the two-stage paradigm introduced by Esser et al. and consists of a Vector-quantized Generative Adversarial Network (VQGAN) for projecting the high dimensional images into a lower-dimensional latent space... [w]e use a pretrained VQGAN with an f=4 compression and a base resolution of 256×256×3, mapping the image to a latent resolution of 64×64 indices." After training, in describing their token predictor architecture: "Our architecture consists of a U-Net-style encoder-decoder structure based on residual blocks, employing convolutional[sic] and attention in both, the encoder and decoder pathways."

    U-Net, of course, is a convolutional neural network architecture.[3] The "down" and "up" encoder/decoder blocks in the Paella code are batch-normed CNN layers.[4]

    [1] https://arxiv.org/pdf/2012.09841.pdf [2] https://arxiv.org/pdf/2211.07292.pdf [3] https://arxiv.org/abs/1505.04597 [4] https://github.com/dome272/Paella/blob/main/src/modules.py#L...
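
    For anyone skimming, a schematic sketch of what a batch-normed convolutional "down"/"up" block pair looks like (my own illustrative PyTorch, not the actual Paella modules linked in [4], which also include residual and attention layers):

      import torch
      import torch.nn as nn

      class DownBlock(nn.Module):
          def __init__(self, c_in, c_out):
              super().__init__()
              self.conv = nn.Conv2d(c_in, c_out, kernel_size=3, stride=2, padding=1)
              self.norm = nn.BatchNorm2d(c_out)
              self.act = nn.GELU()

          def forward(self, x):
              return self.act(self.norm(self.conv(x)))   # halves spatial resolution

      class UpBlock(nn.Module):
          def __init__(self, c_in, c_out):
              super().__init__()
              self.conv = nn.ConvTranspose2d(c_in, c_out, kernel_size=4, stride=2, padding=1)
              self.norm = nn.BatchNorm2d(c_out)
              self.act = nn.GELU()

          def forward(self, x):
              return self.act(self.norm(self.conv(x)))   # doubles spatial resolution

      x = torch.randn(1, 16, 64, 64)       # e.g. a 64x64 grid of latent features
      h = DownBlock(16, 32)(x)             # -> (1, 32, 32, 32)
      y = UpBlock(32, 16)(h)               # -> (1, 16, 64, 64)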

  11. Transformers are not forced to use a specific input (or output) shape; the original ViT paper demonstrates interpolating positional embeddings to run inference with arbitrary image shapes.
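
    A rough sketch of what that interpolation looks like in practice (toy shapes and my own code, not the actual ViT implementation, and ignoring the class token): the learned grid of position embeddings is simply resized to the new token grid before being added to the patch embeddings.

      import torch
      import torch.nn.functional as F

      # Toy numbers: a model trained on a 14x14 patch grid with 768-dim embeddings,
      # run at inference on a 20x20 grid (i.e. a larger input image).
      old_grid, new_grid, dim = 14, 20, 768
      pos_embed = torch.randn(1, old_grid * old_grid, dim)   # learned during training

      # Reshape to a 2-D grid, bicubically resize, and flatten back to a sequence.
      grid = pos_embed.reshape(1, old_grid, old_grid, dim).permute(0, 3, 1, 2)
      grid = F.interpolate(grid, size=(new_grid, new_grid), mode="bicubic", align_corners=False)
      pos_embed_new = grid.permute(0, 2, 3, 1).reshape(1, new_grid * new_grid, dim)
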
  12. Presumably a transformer model (or similar) that uses positional encodings for its tokens could do that, but the U-Net decoder here uses a fixed-shape output and learns relationships between tokens (and sizes of image features) based on the positions of those tokens in a fixed-size vector. You could still apply this process convolutionally and slide the entire network around to generate an image that is an arbitrary multiple of the token size (see the sketch below), but image content in one area of the image will only be "aware" of image content in a fixed-size neighborhood (e.g. 256×256).
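
    A toy sketch of that sliding-window idea (entirely my own illustration; generate_window below is a stand-in for running the real fixed-size model on one window):

      import numpy as np

      rng = np.random.default_rng(0)

      canvas = np.zeros((96, 96), dtype=int)    # a token grid 3x larger than the model's window
      window, stride = 32, 16                   # native window size and overlap

      def generate_window(tokens):
          # Stand-in for the real model: fill in the unknown (zero) tokens,
          # conditioned only on whatever already lies inside this window.
          filled = tokens.copy()
          missing = filled == 0
          filled[missing] = rng.integers(1, 1024, size=missing.sum())
          return filled

      for i in range(0, canvas.shape[0] - window + 1, stride):
          for j in range(0, canvas.shape[1] - window + 1, stride):
              patch = canvas[i:i + window, j:j + window]
              canvas[i:i + window, j:j + window] = generate_window(patch)

      # Every token was produced while "seeing" at most a window-sized neighborhood.
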
  13. Eh, it's a little tricky. A lot of research marketed under the "AI" umbrella would be categorized under cs.LG (https://arxiv.org/list/cs.LG/recent), cs.CV (https://arxiv.org/list/cs.CV/recent), cs.CL (https://arxiv.org/list/cs.CL/recent), and to a lesser degree cs.NE (https://arxiv.org/list/cs.NE/recent). Oh, and of course, cs.AI (https://arxiv.org/list/cs.AI/recent). Not every one of those areas has grown monotonically, but the growth in CV and CL especially has been explosive over the last ten years.
  14. Alternatively, "The Entertainment" in Infinite Jest.
  15. Sure, but when one 12 GB GPU costs ~$800 new (e.g. for the 3080 LHR), "a couple of dozens" of them is a big barrier to entry for the hobbyist, student, or freelancer. And cloud computing offers an alternative route, but, as stated, distribution introduces a new engineering task, and the month-to-month bills for the compute nodes you are using can still add up surprisingly quickly.
  16. NNs are potentially very powerful arbitrary function approximators, but you have very limited control (or, arguably, insight) into the precise nature of the solutions their optimization arrives at. Because of that, they've been especially well suited to problems in vision and NLP where we have basic intuition about the phenomenology but can't practically manage a formal description of that intuition (and enumerating that description is probably not of great intellectual interest): what, in pixel space, makes a cat a cat or a dog a dog? What, in patterns of natural words, indicates sarcasm or positive/negative sentiment?

    They also get tons of use in results-oriented modeling of lots of other statistical questions in structured data (home prices, resource allocation, voter turnouts, etc.), but in this luddite's opinion, these sorts of applications tend to be pretty fraught when they trade a deeper understanding of the data phenomenology for the convenience of the model-training paradigm.

  17. Be that as it may, a number of positions at LLNL, including many of those affiliated with NIF, require that the candidate is a US person and is eligible for a DOE security clearance. Security clearance eligibility is not strictly a matter of being a US person, but a number of national-security-related positions may require not only the clearance but also that the candidate is a US person (or may outright forbid foreign nationals).
  18. This is not quite correct. LLNL is a Federally Funded Research & Development Center (FFRDC) whose facilities are owned by the government but which is managed and staffed by a non-profit contracting organization called Lawrence Livermore National Security, LLC (LLNS) under a contract funded by DOE/NNSA. The board of LLNS is made up of representatives from universities (the University of California and TAMU), other scientific non-profits (the Battelle Memorial Institute), and private nuclear ventures (e.g. Bechtel). LLNS pays, with very few exceptions, staff salaries at LLNL, and those salaries are not beholden to the government civilian pay schedule.

    https://www.llnl.gov/about/management-sponsors

  19. For what it's worth, I'm also very bad at plotting graphs with any kind of accuracy, which is why I use plotting software instead of doing it by hand.

    I get the feeling that my visual system and the language I use are respectively pretty bad at processing and conveying precise information from a plot (beyond simple descriptors like "A is larger than B" or "f(x) has a maximum"). I guess I would find it mildly surprising if any Vision-Language model were able to perform those tasks very well, because the representations in question seem pretty poorly suited to it.

    I get that popular diffusion models for image generation do a bad job of composing concepts in a scene and keeping relationships consistent across the image--even if Stable Diffusion could write in human script, it's a bad bet that the contents of a legend would match a pie chart that it drew. But other Vision-Language models, designed for image captioning or visual question answering rather than for generating diverse, stylistic images, are pretty good at that compositional information (up to, again, the "simple descriptions" level of granularity I mentioned before).

  20. From the parent article:

    >Importantly, this diffractive camera is not based on a standard point-spread function-based pixel-to-pixel mapping between the input and output FOVs, and therefore, it does not automatically result in signals within the output FOV for the transmitting input pixels that statistically overlap with the objects from the target data class. For example, the handwritten digits ‘3’ and ‘8’ in Fig. 2c were completely erased at the output FOV, regardless of the considerable amount of common (transmitting) pixels that they statistically share with the handwritten digit ‘2’. Instead of developing a spatially-invariant point-spread function, our designed diffractive camera statistically learned the characteristic optical modes possessed by different training examples, to converge as an optical mode filter, where the main modes that represent the target class of objects can pass through with minimum distortion of their relative phase and amplitude profiles, whereas the spatial information carried by the characteristic optical modes of the other data classes were scattered out.

    It seems like that may not be possible.

    Later on in the article:

    >It is important to emphasize that the presented diffractive camera system does not possess a traditional, spatially-invariant point-spread function. A trained diffractive camera system performs a learned, complex-valued linear transformation between the input and output fields that statistically represents the coherent imaging of the input objects from the target data class.

    Note here that the learned transformations are linear, and the Fourier Transform is linear, but you cannot invert from output to input because the sensor measures real-valued intensities of complex-valued fields. All the phase information is lost.
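
    A toy NumPy illustration of that last point (mine, not from the paper): two very different complex fields can produce identical measured intensities, so there is no way to go from the sensor reading back to the field.

      import numpy as np

      rng = np.random.default_rng(1)

      # Two complex fields with the same amplitudes but completely different phases.
      amps = rng.uniform(0.5, 1.5, size=64)
      field_a = amps * np.exp(1j * rng.uniform(0, 2 * np.pi, size=64))
      field_b = amps * np.exp(1j * rng.uniform(0, 2 * np.pi, size=64))

      # The sensor only records intensity |E|^2, which is identical for both...
      assert np.allclose(np.abs(field_a) ** 2, np.abs(field_b) ** 2)

      # ...even though the fields themselves (and their Fourier transforms, i.e.
      # what a linear optical system acts on) are not.
      print(np.allclose(field_a, field_b))                          # False
      print(np.allclose(np.fft.fft(field_a), np.fft.fft(field_b)))  # False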

  21. In the unCLIP/DALL-E 2 paper[0], they train the encoder/decoder with 650M/250M images respectively. The decoder alone has 3.5B parameters, and the combined priors with the encoder/decoder are in the neighborhood of ~6B parameters. This is large, but small compared to the name-brand "large language models" (GPT-3 et al.).

    This means the parameters of the trained model fit in something like 7GB (decoder only, half-precision floats) to 24GB (full model, full-precision). To actually run the model, you will need to store those parameters, as well as the intermediate activations for each image you are running, in (video) memory. To run the full model on device at inference time (rather than reading/writing to the host between each stage of the model), you would probably want an enterprise cloud/data-center GPU like an NVIDIA A100, especially if running batches of more than one image.
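
    The back-of-the-envelope weight arithmetic, using the parameter counts quoted above (weights only; activations and any optimizer state are extra):

      GB = 1e9
      decoder_params = 3.5e9
      full_params = 6.0e9              # decoder + priors + encoder, roughly

      print(decoder_params * 2 / GB)   # fp16 decoder    -> 7.0 GB
      print(full_params * 4 / GB)      # fp32 full model -> 24.0 GB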

    The training set size is ~97TB of imagery. I don't think they've shared exactly how long the model trained for, but the original CLIP dataset announcement[1] used some benchmark GPU training tasks that were 16 GPU-days each. If I were to WAG the training time for their commercial DALL-E 2 model, it'd probably be a couple of weeks of training distributed across a couple hundred GPUs. For better insight into what it takes to train (the different stages/components of) a comparable model, you can look through an open-source effort to replicate DALL-E 2.[2]

    [0] https://cdn.openai.com/papers/dall-e-2.pdf [1] https://openai.com/blog/clip/ [2] https://github.com/lucidrains/dalle2-pytorch

  22. They became very popular during the pandemic, when restaurants offered their menus as scannable QR codes instead of distributing and sanitizing hand-held paper menus. I think this use is becoming less relevant now.

    They're popular in street-level advertisements (for concerts, events, clubs, jobs) to provide a link to more information without the consumer having to type a URL into the browser. This matters more if you don't want to pay for a pithy domain name as a landing page for the thing you are advertising.

    I have also seen them used for easily entering WiFi credentials into your device on a new network.
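
    That one is just a structured string handed to any QR encoder; a minimal sketch using the third-party qrcode package (the SSID and password are obviously placeholders):

      import qrcode  # third-party package, used here only as an example encoder

      ssid, password = "GuestNetwork", "hunter2"
      payload = f"WIFI:T:WPA;S:{ssid};P:{password};;"
      qrcode.make(payload).save("wifi.png")   # scanning this prompts the phone to join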

    They can be used in museums to link to self-guided tour information on the artwork/exhibits you are looking at. This used to be done with rentable/loanable walkman-and-headset style devices, but it makes more sense to do on one's own mobile device now.

  23. It's worth mentioning that there has been some criticism[0] of the initial science behind the Nuclear Winter proposition. That said, smoke and soot can have a cooling effect on the Earth's temperature, and CO2 (and other "greenhouse gasses") a warming effect, because they interact with different wavelengths of light.

    The sun is a very hot (~5500K) blackbody that emits radiation in a broad spectrum, but that spectrum peaks in the visible. Some of that light is incident on Earth and warms it up. Earth also emits its own blackbody radiation, but it's much cooler (~300K), so it emits much less power overall and its spectrum peaks somewhere in the long infrared. The system is in equilibrium when the sun has heated the Earth enough that the total energy radiated away from Earth is equal to the fraction of the sun's radiation that is absorbed by Earth.
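
    Those two peak wavelengths fall straight out of Wien's displacement law (lambda_peak = b / T, with b ~ 2.898e-3 m*K); a quick check:

      b = 2.898e-3                     # Wien's displacement constant, m*K

      for name, T in [("Sun", 5500.0), ("Earth", 300.0)]:
          lam_um = b / T * 1e6         # peak wavelength in microns
          print(name, round(lam_um, 2))
      # Sun   ~0.53 microns (green, middle of the visible band)
      # Earth ~9.66 microns (long-wave infrared)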

    Earth's atmosphere can change this equilibrium temperature by changing the fraction of incident energy that is absorbed by Earth, or by changing its emissivity. Moreover, these changes can be wavelength dependent. Greenhouse gasses are gasses that are largely transparent in the visible spectrum but absorb (and re-emit) in the infrared, which lets in most of the sun's energy but "traps" infrared energy that is being emitted by Earth. Smoke and soot, by contrast, scatter and absorb visible light (we can see them!) and so "block" much of the sun's energy from reaching and heating the surface.

    [0] https://en.wikipedia.org/wiki/Nuclear_winter#Criticism_and_d...

  24. It seems unlikely in that situation that the Pentagon would declassify any of the instrument videos. Then again, it's weird that they've done that in any scenario.
  25. The David Fravor interview linked describes an event in 2004 seen by fighters from the USS Nimitz and captured by the USS Princeton. This is a separate event from the 2019 event near the USS Omaha described above, perhaps involving very different vehicles or phenomena.
  26. Blind deconvolution is useful in astronomy and space domain awareness, where you can make the reasonable assumption that the support for the signal pixels is limited against a blank background. This isn't the case in everyday terrestrial photography, and the Lena image on that Wikipedia page shows the sort of results you get from applying a blind deconvolution algorithm naively to this kind of imagery.
