Hello all,
I am cross-posting this from Stack Overflow, as I did not receive a good response there or it was not well received.
I have just started to learn programming using Python, so I am asking for a Pythonic way (not pandas, as it is very advanced) to approach this problem. Any tips or comments are appreciated, as they will help me learn.
http://stackoverflow.com/questions/43482716/creating-a-matrix-using-python-for-biologist
Thank you.
What exactly couldn't you manage with the `pandas` solution in that thread? It looks pretty well broken down to me and that's exactly how I'd go about doing it. If you're struggling to install pandas, try: `python -m pip install pandas`. If you don't have administrative access, consider using the Anaconda/Miniconda distributions.
While we sympathise with your novice status, there is no real substitute for just spending the time trying to figure it out - especially when you're asking for a kind of general solution to a specific task (only you know exactly what your data is going to look like each time).
hello @jrj.healey,
I am trying to learn as I go, and pandas seems way too advanced; if I can do it in a non pythonic way it would make more sense to me. Moreover, the final output I get does not match the read counts from the individual files. Maybe it has to do with differences between the files: some have 222000 lines whereas others have 300000 lines.
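One possible cause of totals that do not match, offered only as a guess since the input files are not shown: if a collapsed gene name appears more than once in a file, storing counts in a dict with plain assignment keeps only the last value. The sketch below assumes hypothetical two-column, tab-separated lines (name, count) and compares the true file total with what such a dict retains.

```python
def total_reads(lines):
    """Sum the count column of 'name<TAB>count' lines."""
    return sum(int(line.split("\t")[1]) for line in lines if line.strip())


def naive_dict(lines):
    """Plain assignment: a repeated name overwrites the earlier count."""
    counts = {}
    for line in lines:
        if line.strip():
            name, value = line.rstrip("\n").split("\t")
            counts[name] = int(value)
    return counts


# Hypothetical file content in which geneA occurs twice.
lines = ["geneA\t10\n", "geneB\t5\n", "geneA\t3\n"]

expected = total_reads(lines)            # every line counted
kept = sum(naive_dict(lines).values())   # what the dict still holds
print(expected, kept, expected - kept)
```

If the two numbers differ for one of your files, repeated names are worth checking before looking at the merge step itself.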
I'm not sure what you mean by a more 'pythonic' way? The ethos of python is this:
I'd say in this case `pandas` is the aforementioned 'obvious way' (with the possible exception of Fuzzy's solution with `csv`). It might seem like an advanced step, but what you're basically doing is giving yourself access to many clever functions which have been prewritten to allow you to manipulate dataframes. I'd wager you'll have a net saving of time investment if you spent it learning a bit of `pandas` instead of struggling through `for`, `if` & `else` statements.
You'll need to edit your question and provide reproducible input files and commands before we can address what problems you might have had with the various solutions - there isn't really enough to go on at the moment.
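For readers who want to see what the standard-library route looks like, here is a minimal sketch using the `csv` module and plain dicts. It is not Fuzzy's code from the linked thread; the two-column tab-separated format, the sample names, and the helper names `read_counts` and `build_matrix` are all assumptions made for illustration. Genes missing from a sample are filled with 0.

```python
import csv
import io


def read_counts(handle):
    """Read 'gene<TAB>count' rows into a dict, summing repeated gene names."""
    counts = {}
    for row in csv.reader(handle, delimiter="\t"):
        if not row:
            continue  # skip blank lines
        name, value = row[0], int(row[1])
        counts[name] = counts.get(name, 0) + value
    return counts


def build_matrix(samples):
    """Turn {sample_name: {gene: count}} into a header and zero-filled rows."""
    sample_names = list(samples)
    all_genes = sorted(set().union(*(c.keys() for c in samples.values())))
    header = ["gene"] + sample_names
    rows = [[gene] + [samples[s].get(gene, 0) for s in sample_names]
            for gene in all_genes]
    return header, rows


# Hypothetical in-memory files; with real data, pass open(path, newline="").
samples = {
    "sample1": read_counts(io.StringIO("geneA\t10\ngeneB\t5\n")),
    "sample2": read_counts(io.StringIO("geneB\t7\ngeneC\t2\n")),
}
header, rows = build_matrix(samples)

out = io.StringIO()
writer = csv.writer(out, delimiter="\t", lineterminator="\n")
writer.writerow(header)
writer.writerows(rows)
print(out.getvalue())
```

The union of all gene names defines the rows, so files of different lengths are handled without any special casing.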
Is there a reason this has to be in Python? In R, it's `full_join()` from the dplyr package.
BTW, in reality the genes in the files will all be present and in the same order, so you'll never have missing values.
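For comparison, a rough pandas counterpart of an outer join is sketched below. The input format (two tab-separated columns, no header), the sample names, and the helper `load_counts` are assumptions, not taken from the original question. Repeated gene names within a file are summed first so that the index is unique before joining.

```python
import io

import pandas as pd


def load_counts(handle, sample_name):
    """Read 'gene<TAB>count' rows into a Series indexed by gene."""
    table = pd.read_csv(handle, sep="\t", header=None,
                        names=["gene", sample_name])
    # Sum repeated gene names so each gene appears once in the index.
    return table.groupby("gene")[sample_name].sum()


# Hypothetical in-memory files standing in for per-sample count files.
s1 = load_counts(io.StringIO("geneA\t10\ngeneB\t5\n"), "sample1")
s2 = load_counts(io.StringIO("geneB\t7\ngeneC\t2\n"), "sample2")

# Outer join on the gene index; genes absent from a sample become 0.
matrix = (
    pd.concat([s1, s2], axis=1, join="outer")
    .fillna(0)
    .astype(int)
    .sort_index()
)
print(matrix)
```

If every file really does contain the same genes in the same order, the outer join simply returns the columns side by side and no zeros are introduced.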
Dear Devon,
Thank you for your reply. That was my assumption too, but I am dealing with gene names that have been collapsed from TopHat junction BED files. I have to make a matrix of these gene names, so there is a chance that at some places no reads map to a junction, e.g.
I need to do this to make a dataset that I can use with a tool for detecting differential junction usage.